Comparative analysis of the genomes from two Acanthamoeba castellanii strains


Description

This repository contains scripts and documentation for the analysis and comparison of the Acanthamoeba castellanii genomes of strains C3 and Neff. Analyses include: annotation statistics, BUSCO, QUAST, orthologous gene comparison with related species, a Circos plot, sequence divergence between strains, and Hi-C contact profiles at the rDNA sequences.

A frozen copy of this repository, as well as the output data, is available for download from the associated Zenodo record.

Installation

The pipeline is written with Snakemake and manages dependencies using conda. Most of the pipeline steps run inside self-contained conda environments, which are built automatically upon execution. Two dependencies (MCScanX and dnaglider) are not available through conda and need to be installed separately (see the sketch after the dependency list).

Dependencies:

  • python3.7+

    • snakemake

    • pandas

    • numpy

  • conda

  • dnaglider

  • MCScanX
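
Both non-conda tools can usually be built from their upstream repositories. The commands below are only a sketch under assumed repository layouts; check each project's README for the authoritative instructions:

# Assumption: MCScanX is a make-based project hosted on GitHub
git clone https://github.com/wyp1125/MCScanX
cd MCScanX && make   # add the resulting MCScanX binary to your PATH

# Assumption: dnaglider is a Go tool installable with the Go toolchain
go install github.com/cmdoret/dnaglider@latest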

The input data (genomes, annotations, etc.) are downloaded automatically from Zenodo when the pipeline is executed.

Usage

The analyses are separated into distinct workflows in the rules directory. The whole analysis pipeline can be run using snakemake as follows:

snakemake --use-conda -j4
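
Before launching the full pipeline, a dry run can be used to preview the jobs; both flags below are standard Snakemake options rather than anything specific to this repository:

# List the jobs that would be executed, without running anything
snakemake -n

# Run the pipeline with conda environments and up to 8 parallel jobs
snakemake --use-conda -j8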

Structure

The master script Snakefile calls each workflow one after the other. Each workflow contains rules with input and output files, which execute code or external scripts. Each rule is executed in its own conda environment, whose dependencies are downloaded on the first execution. The overall workflow can be represented as a graph:

[Figure: graph of the pipeline rules and their dependencies]

The envs directory contains conda environment build specifications for the different rules.
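
As a minimal sketch of this pattern (the rule, file and environment names below are illustrative, not taken from the pipeline), a rule ties its inputs to its outputs and points to its environment specification in envs:

rule assembly_stats:
    input: "data/genome.fa"
    output: "results/assembly_stats.txt"
    conda: "../envs/stats.yaml"
    shell: "assembly-stats {input} > {output}"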

General parameters for the pipeline are stored in the config.yaml file and can be modified. The strains to analyze, as well as the paths to their sequence files, are defined in samples.tsv (a hypothetical example is shown below). All external scripts executed by rules are stored in the scripts folder. Custom python utility libraries imported in the pipeline are stored in src.
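
A purely hypothetical samples.tsv could look as follows; the actual column names are defined by the pipeline and may differ:

strain    genome
Neff      data/Neff_genome.fa
C3        data/C3_genome.fa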

The doc directory contains Jupyter notebooks with general analyses of the pipeline results.

Code Snippets

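
# Fetch proteomes of related organisms from NCBI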
script: '../scripts/00_fetch_proteomes.py'
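
# Cluster near-identical proteins with CD-HIT at the given identity threshold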
shell: 'cd-hit -i {input} -o {output} -c {params.sim} -M 0'
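
# Retrieve the input archive from the Zenodo record and unpack it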
shell:
  """
  zenodo_get -d https://doi.org/10.5281/zenodo.5507417 -w {output.url_tbl}
  wget $(grep "shared_assets" {output.url_tbl}) -O - \
   | tar xzvf - --directory={params.in_dir} >/dev/null
  """
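
# Prefix fasta headers and GFF records with the strain name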
shell:
    """
    st={wildcards.strain}
    sed 's/^>\(.*\)/>'"$st"'_\\1/' {input.genome} > {output.genome}
    sed 's/^>\(.*\)/>'"$st"'_\\1/' {input.cds} > {output.cds}
    sed 's/^>\(.*\)/>'"$st"'_\\1/' {input.proteome} > {output.proteome}
    sed 's/^\([^#]\)/'"$st"'_\\1/' {input.annotations} |
        sed 's/ID=/ID='"$st"'_/' | 
        sed 's/Parent=/Parent='"$st"'_/' \
        > {output.annotations}
    """
shell:
    """
    cp {input.genome} {output.genome}
    cp {input.annot} {output.annot}
    """
shell: "Rscript scripts/01_annot_stats.R {input.v1} {input.neff} {input.c3} {output.tbl} {output.plt}"
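
# Compare assembly statistics between Neff v1 and v2 and plot them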
shell:
    """
    assembly-stats -t {input.neff_v2} {input.neff_v1} > {params.assembly_tbl}
    Rscript scripts/01_assembly_stats.R {params.assembly_tbl} {output}
    """
shell: 'quast -t {threads} -e -o {output} {input.neff_v2} {input.ref_fa}'
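
# Assess genome completeness with BUSCO (eukaryota lineage)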
shell:
    """
    busco -f --long --mode genome -c {threads} --lineage eukaryota -i {input} -o $(basename {output})
    mv "$(basename {output})" "$(dirname {output})/"
    """
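
# Strip comments, blank lines and percentage lines from the BUSCO short summary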
shell:
    """
    sed 's/^[\t ]*//' {input}/run_eukaryota_odb10/short_summary.txt \
      | grep -v '^[#\*]' \
      | sed '/^$/d' \
      | sed 's/[\t ]*$//' \
      | grep -v '\%' \
      > {output}
    """
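
# Plot BUSCO completeness for all strains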
script: '../scripts/01_plot_busco.py'
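
# Tabulate the length of each sequence in the genome fasta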
shell:
    """
    awk -vOFS='\t' '
        /^>/ {{if (seqlen){{print seqlen}}; printf "%s\t",substr($0, 2) ;seqlen=0;next; }}
        {{ seqlen += length($0)}}END{{print seqlen}}
        ' {params.genome} > {output}
    """
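
# Collect rDNA, HGT and virus coordinates into a single feature track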
shell:
    """
    grep -v "^#" {params.rdna} | awk -vOFS='\t' '{{print $1,$4,$5,$9}}' > {output}
    awk -vOFS='\t' '$5 == 1 {{print $1,$2,$3,"HGT"}}' {input.hgt} >> {output}
    awk -vOFS='\t' '{{print $1,$2,$3,"virus"}}' {input.vir} >> {output}
    """
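
# Draw the karyotype figure with the combined feature track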
script: "../scripts/01_plot_karyo.py"
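
# Extract HGT candidate genes by ID and translate them to protein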
shell: "seqkit grep -r -f {params.hgt_ids} {input} | seqkit translate > {output}"
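
# Extract v1 HGT gene records and their exon counts from the GFF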
script: '../scripts/02_get_v1_hgt_gff.py'
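
# Compute sliding-window sequence composition metrics with dnaglider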
shell:
    """
    dnaglider -threads {threads} \
              -fields "GC,GCSKEW,ATSKEW,ENTRO,KMER" \
              -kmers 2,3,4 \
              -window {params.win} \
              -stride {params.step} \
              -fasta {input} \
              -out {output}
    """
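
# BLAST the v1 HGT candidate proteins against the Neff v2 proteome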
shell:
    """
    makeblastdb -dbtype prot -in {params.v2_prots} -title neff -out {params.v2_db}
    blastp -evalue 1e-30 \
           -outfmt "6 qseqid sseqid pident evalue" \
           -db {params.v2_db} \
           -num_threads {threads} \
           -query {input.v1_hgt} > {output.blast}
    """
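
# Keep, for each v1 HGT query, the best-identity BLAST hit above the threshold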
run:
    df = pd.read_csv(
        input[0],
        sep='\t',
        header=None,
        names=['query', 'target', 'pident', 'evalue']
    )
    # For each query HGT, retain the match with the highest sequence identity
    keep_idx = df.groupby('query')['pident'].transform(max) == df['pident']
    df = df[keep_idx]
    # Keep only matches above the identity threshold set in params
    df.loc[df.pident > params['pident']].to_csv(output[0], sep='\t', index=False)
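
# Flag genes of the strain's annotation table that match a v1 HGT candidate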
run:
    blast = pd.read_csv(input['blast'], sep='\t')
    annot = pd.read_csv(input['annot'], sep='\t')
    strain = wildcards['strain']
    annot = annot.loc[annot.ID.str.startswith(strain), :]
    annot.ID = annot.ID.str.replace(f"{strain}"+r"_([a-zA-Z0-9_]*).*", r"\1")
    annot.chrom = annot.chrom.str.replace(f"{strain}_", "")
    blast['hgt'] = 1
    blast.target = blast.target.str.replace(r'([a-zA-Z0-9_]*).*', r'\1')
    annot = blast.merge(annot, how='right', left_on='target', right_on='ID')
    annot.hgt = annot.hgt.fillna(0).astype(int)
    annot = annot.loc[:, ['chrom', 'start', 'end', 'ID', 'hgt', 'n_exon']]
    annot.to_csv(output[0], sep='\t', header=None, index=False)
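
# Average window statistics over each gene, keeping its HGT flag and exon count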
shell:
    """
    echo -e "chrom\tstart\tend\tgeneID\tHGT\tNEXON\tGC\tGCSKEW\tATSKEW\tENTRO\t2MER\t3MER\t4MER" > {output}
    bedtools intersect -a {input.win} -b {input.hgt} -wb \
        | sort --parallel={threads} -k11,11 -k12,12n \
        | bedtools groupby -g 11,12,13,14,15,16 -c 4,5,6,7,8,9,10 -o mean \
        | sort --parallel={threads} -k1,1 -k2,2n >> {output}
    """
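
# Plot Hi-C contacts around rDNA positions at the given resolution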
shell:
    """
    python scripts/03_rdna_hic.py \
        {input.cool}::/resolutions/{params.res} \
        {input.rdna} \
        {output}
    """
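
# Concatenate the per-strain proteomes, genomes and annotations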
shell:
    """
    echo -n "" > {output.prot}
    for i in {input.prot}; do
        cat $i >> {output.prot}
    done

    echo -n "" > {output.genome}
    for i in {input.genome}; do
        cat $i >> {output.genome}
    done

    echo -n "" > {output.annot}
    for i in {input.annot}; do
        cat $i >> {output.annot}
    done
    """
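
# Prepare MCScanX inputs and run collinearity detection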
shell:
    """
    bash scripts/04_gen_mcscanx_input.sh -g {input.annot} \
                                         -o {params.out_dir} \
                                         -f {input.prot} \
                                         -c {threads}
    MCScanX -s 3 {params.out_dir}/MCScanX_in
    """
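
# Generate the circos karyotype and link files from the MCScanX output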
shell:
    """
    cp {params.cfg} {output.cfg}
    bash scripts/04_gen_circos_files.sh {input.ref} {params.mcsx_prefix} {params.dir}
    """
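
# Drop links whose ends fall on chromosomes absent from the karyotype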
run:
    karyo = pd.read_csv(join(params['circos_dir'], 'karyotype.txt'), sep=' ', header=None)
    chroms = list(karyo.iloc[:, 2])
    links = pd.read_csv(input[0], sep=' ', header=None,
        names=['c1', 's1', 'e1', 'c2', 's2', 'e2', 'col']
    )
    filtered = links.loc[np.isin(links.c1, chroms) & np.isin(links.c2, chroms), :]
    filtered.to_csv(output[0], sep=' ', header=None, index=False)
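
# Bundle collinearity links and render the circos plot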
shell:
    """
    bundlelinks -strict -min_bundle_size 1000 -max_gap 50000 \
                -links {input.mcsx} | sed 's/lgrey=$//' > {params.circos_dir}/bundles.txt
    circos -conf {input.cfg} -outputfile {output}
    """
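
# Align C3 to Neff with minimap2 and compute per-block divergence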
shell:
    """
    minimap2 -c {input.neff} {input.c3} -x map-ont \
    > {output.paf}
    python ./scripts/04_compute_seq_divergence.py \
        {output.paf} \
        {output.div}
    """
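
# Plot the distribution of Neff-C3 sequence divergence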
script: "../scripts/04_plot_seq_divergence.py"
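
# Download each virus genome assembly with genomepy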
shell: "genomepy install -g '{params.genomedir}/{wildcards.virus}' {params.assembly}"
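
# Map virus genomes against the amoeba assembly with minimap2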
shell: "minimap2 -xasm20 -t {threads} {input.amoeba} {input.virus}/*/*fa > {output}"
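
# Combine all virus PAF alignments into one table with per-hit identity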
shell:
    """
    fd ".*paf" {params.pafdir} -x awk -vOFS='\t' -vvir={{/.}} '{{print $6,$8,$9,vir,$10/$11}}' {{}} > {output}
    """
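
# Scatter plot of inserted segment length versus alignment identity per virus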
run:
    import matplotlib.pyplot as plt
    import seaborn as sns
    import matplotlib as mpl
    mpl.use("Agg")
    vir = pd.read_csv(
        input[0], sep="\t", names=["chrom", "start", "end", "virus", "ident"]
        )
    vir["seqlen"] = vir.end - vir.start
    # Amount of each virus inserted
    sns.scatterplot(data=vir, x="seqlen", y="ident", hue="virus")
    plt.savefig(output[0])
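
# Merge neighbouring virus hits, averaging their identity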
shell:
    """
    sort -k1,1 -k2,2n {input} \
        | bedtools merge -o mean -c 5 -i - -d {params.neigh_dist} \
        > {output}
    """
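
# Aggregate Hi-C signal (pileup) at candidate coordinates with coolpup.py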
shell:
    """
    # Discard coordinates with chromosomes absent from the cool file
    # and get the bed file into 2d coordinates
    cut -f1-3 {input} \
        | grep -f <(cooler dump -t chroms {params.cool} | awk '{{print $1"\t"}}') \
        | awk -vOFS='\t' '{{print $0,$1,$2+10000,$3+10000}}' \
        | coolpup.py {params.cool} \
               - \
               --nshifts 10 \
               --mindist 0 \
               --minshift 100000 \
               --outname {output.tbl} \
               --log WARNING
    plotpup.py {output.tbl} \
               --col_names {wildcards.strain} \
               --enrichment 0 \
               --vmin 1 \
               --output {output.fig}                  
    """
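
# Score insulation borders at candidate positions with chromosight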
shell:
    """
    # Convert the bed file into 2D coordinates.
    cut -f1-3 {input} | awk -vOFS='\t' '{{print $0,$0}}' > {params.tpos}
    chromosight quantify --pattern=borders \
                         --threads {threads} \
                         --perc-zero=60 \
                         --perc-undetected=60 \
                         --win-size=31 \
                         {params.tpos} \
                         {params.cool} \
                         {params.base}  
    """
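
# Keep only the longest sequence for each name in the fasta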
shell:
    """
    # Convert fasta to tabular, sort by seqname, keep longest seq
    # and convert back to fasta
    sed 's/>[^ ]* \(.*\)$/>\\1\t/' {input} \
        | tr -d '\\n' \
        | sed 's/>/\\n>/g' \
        | sed '/^$/d' \
        | awk -vFS='\t' -vOFS='\t' '{{print $2,length($2),$1}}' \
        | sort -k3,3 -k2,2n \
        | uniq -f2 \
        | awk '{{print $3,$1}}' \
        | sed 's/ /\\n/' \
        > {output}
    """
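
# Collect all proteomes and infer orthogroups with OrthoFinder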
shell:
    """
    ulimit -n 10000
    # Copy all organism proteomes to the orthofinder workdir
    mkdir -p {params.orthofinder_dir}
    rm -f {params.orthofinder_dir}/*
    cp {input.org_proteomes} {params.orthofinder_dir}
    # Copy the Ac proteomes as well, trimming '_proteome_filtered' from the filename
    for strain in {input.ac_proteomes}; do
        fname=$(basename $strain)
        new_fname="${{fname/_proteome_filtered/}}"
        cp $strain "{params.orthofinder_dir}/$new_fname"
    done
    #cp "{{input.hgt_neff_v1}}" "{params.orthofinder_dir}/"
    orthofinder -f {params.orthofinder_dir} \
                -o {output} \
                -S diamond \
                -t {threads} \
                -n "amoeba"
    """
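
# Extract the rooted species tree and convert orthogroup counts to a presence/absence matrix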
shell:
    """
    cat {input}/Results_amoeba/Species_Tree/SpeciesTree_rooted.txt > {output.tree}
    python scripts/06_orthocount_to_fasta.py \
        {input}/Results_amoeba/Orthogroups/Orthogroups.GeneCount.tsv \
        {output.matrix}
    """
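
# Run GLOOME (gainLoss) on the presence/absence matrix to infer gene family gains and losses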
shell: 
    """
    echo "_seqFile {input.matrix}" > {params.conf_file}
    # Tree seems to have incompatible format with CLI GLOOME but works in web version...
    #echo "_treeFile {input.tree}" >> {params.conf_file}
    echo "_outDir {output}" >> {params.conf_file}
    ./bin/gainLoss.VR01.266 {params.conf_file}
    """
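
# Build orthogroup presence/absence tables from the OrthoFinder results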
run:
    ortho_dir = join(input[0], "Results_amoeba", "Orthogroups")
    ortho = pd.read_csv(join(ortho_dir, 'Orthogroups.tsv'), sep='\t')
    unassigned = pd.read_csv(join(ortho_dir, 'Orthogroups_UnassignedGenes.tsv'), sep='\t')
    # Add unassigned genes as self-contained orthogroups
    ortho = pd.concat([ortho, unassigned], axis=0)
    # Get gene families absent in all of A. castellanii strains
    get_absent = lambda b: b.isnull().values
    st_abs = {strain: get_absent(ortho[strain]) for strain in params['focus_st']}
    ac_abs = np.bitwise_and.reduce([*st_abs.values()])

    # Convert count to presence/absence for all species
    pres_mat = ortho.drop(columns='Orthogroup').apply(lambda c: ~c.isnull(), axis=0)
    pres_mat.index = ortho.Orthogroup
    # Save presence matrix for further analysis (encode True/False as 0/1)
    pres_mat.astype(int).to_csv(output['pres_mat'], index=True, header=True, sep='\t')

    # Exclude strains of interest when counting
    amoeba_df = ortho.loc[:, list(params['amoeba'])]
    #bact_df = ortho.loc[:, list(params['bact'])]
    # Binarize variables: Is there any gene for species c in each orthogroup
    any_pres = lambda df: df.apply(lambda c: ~ c.isnull(), axis=0).apply(np.any, axis=1)
    #bact_pres = any_pres(bact_df)
    #bact_abs = ~bact_pres
    amoeba_pres = any_pres(amoeba_df)
    amoeba_abs = ~amoeba_pres
    # Record genes found only in amoeba but not in bacteria
    # (note: HGT events inferred in Clarke et al. are also included for comparison)
    vdf = pd.DataFrame({
        'neff'  : ~st_abs['Neff'],
        #'NEFF_v1_hgt_cds'  : ~get_absent(ortho.NEFF_v1_hgt_cds),
        'c3'    : ~st_abs['C3'],
        'ac'    : ~ac_abs,
        #'bact'  : bact_pres,
        'amoeba': amoeba_pres,
    })
    vdf.index = ortho.Orthogroup

    vdf.astype(int).to_csv(output['pres_compact'], sep='\t', index=True, header=True)
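
# Plot Venn diagrams of shared orthogroups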
script: '../scripts/06_plot_orthogroups_venn.py'
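
# scripts/00_fetch_proteomes.py: download proteomes from NCBI via Entrez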
import time
import urllib
from Bio import Entrez


def safe_request(fun):
    """
    Wraps a data-requesting function so that errors are handled safely and retried.

    Parameters
    ----------
    fun : python function
        The python function that queries a server

    Returns
    -------
    wrapped_f : python function
        A wrapped version of the input function which will call itself recursively
        every 5 seconds if the server is overloaded and will return None if the
        query record does not exist
    """

    def wrapped_f(*args, **kwargs):

        try:
            a = fun(*args, **kwargs)
            return a
        except urllib.error.HTTPError as e:
            if e.code in (429, 502):
                time.sleep(5)
                print(
                    "Sending too many requests, sleeping 5sec and retrying..."
                )
                # Return the result of the retry (otherwise it would be lost)
                return wrapped_f(*args, **kwargs)
            else:
                print("No sequence found for %s" % args[0])
                return None

    return wrapped_f


@safe_request
def name_to_proteins(
    name, db="protein", email="[email protected]", filters=""
):
    """
    Given an organism name, fetch all available protein sequences.

    Parameters
    ----------
    name : str
        Name of an organism or group of organism (e.g. Legionella).
    db : str
        Entrez db name to use for the search.
    email : str
        User email for identification.
    filters : str
        Additional filters to use for the query. E.g. " AND refseq[filter]" to
        filter results from refseq

    Returns
    -------
    seq_record : TextIOWrapper
        iterable object containing the query Fasta records.

    """
    Entrez.email = email
    query_string = name + "[Orgn]"
    time.sleep(0.1)
    # Initial search to collect all matching IDs (esearch returns only 10 by default)
    query = Entrez.read(
        Entrez.esearch(term=query_string, db=db, email=email, retmax=10 ** 9)
    )
    seqs = ""
    s = 0
    chunk_size = 1000
    # Fetch proteins in batches of chunk_size to stay under the server's query limits
    for e in range(chunk_size, len(query["IdList"]), chunk_size):
        print(f"fetching entries {s} to {e}")
        fetched = False
        while not fetched:
            try:
                seqs += Entrez.efetch(
                    id=query["IdList"][s:e], rettype="fasta", db=db
                ).read()
                fetched = True
            except Exception:
                print("Fetching failed, trying again.")
                time.sleep(5)
        s = e
        time.sleep(0.1)
    e = len(query["IdList"])
    print(f"fetching entries {s} to {e}")
    seqs += Entrez.efetch(
        id=query["IdList"][s:e], rettype="fasta", db=db
    ).read()
    return seqs


org_df = snakemake.params["org"]
organism = org_df.loc[
    org_df["clean_name"] == f"{snakemake.wildcards.organism}", "name"
]
time.sleep(0.1)  # Do not spam NCBI :)
proteome = name_to_proteins(
    organism, email="[email protected]", filters=" AND refseq[filter]"
)
try:
    print(f"Writing {proteome.count('>')} proteins for {organism}.")
    with open(snakemake.output[0], "w") as outf:
        outf.write(proteome)
except TypeError:
    print(f"No proteome found for {organism[0]}")
    pass
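
# scripts/01_annot_stats.R: compute and plot annotation statistics for the three annotations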
library(readr)
library(stringr)
library(ggplot2)
library(dplyr)
library(tidyr)
library(tibble)
library(gridExtra)
args <- commandArgs(trailingOnly=TRUE)

### GET ARGS ###
v1_gff <- args[1]
neff_gff <- args[2]
c3_gff <- args[3]
out_tbl <- args[4]
out_plot <- args[5]

### LOAD DATA ###
load_gff <- function(in_gff, name){
    gff <- read_tsv(in_gff, comment="#", col_names=F)
    colnames(gff) <- c("chrom", "source", "type", "start", "end", "score", "strand", "phase", "attributes")
    gff <- gff %>% filter(type %in% c("mRNA", "CDS", "gene", "exon"))

    ### Clean data
    gff <- gff %>% mutate(attributes=gsub("Parent=", "ID=", attributes)) %>%
        mutate(ID=str_extract(attributes, 'ID=[^;\\.]*')) %>%
        mutate(ID=gsub("=[a-zA-Z]{1,}:", "=", ID)) %>%
        mutate(ID=sapply(ID, function(x){str_split(x,regex('[=-]'))[[1]][2]})) %>%
        mutate(name=name)
    return(gff)
}
gff <- load_gff(v1_gff, "NEFF_v1") %>%
    bind_rows(load_gff(neff_gff, "Neff")) %>%
    bind_rows(load_gff(c3_gff, "C3"))

### COMPUTE STATS ###

gff_gene_len <- gff %>%
    filter(type == 'gene') %>%
    mutate(gene_len = end - start)

# N exon / gene (drop duplicated exons)
gff_exons <- gff %>% 
    filter(type != 'gene') %>%
    group_by(ID) %>%
    distinct(chrom, start, end, type, .keep_all=T) %>%
    mutate(n_exon=sum(type == 'exon')) %>%
    filter(type == 'mRNA') %>% 
    mutate(mrna_len = end - start)

# Combine information and reorder columns
gff_stats <- gff_gene_len %>%
    select(ID, gene_len) %>%
    inner_join(gff_exons, by='ID') %>%
    select(ID, chrom, start, end, type, strand, n_exon, gene_len, mrna_len, attributes)


### VISUALISE ###

# Label generator to get number of samples
n_obs <- function(x){return(data.frame(y=0, label=paste0("n=", length(x))))}

feature_len <- ggplot(data=gff %>% filter(type %in% c("CDS", "gene")), 
                      aes(x=name, y=log10(end - start), fill=name)) +
    geom_violin() +
    stat_summary(fun.data = n_obs, geom = "text") +
    theme_minimal() +
    xlab("") + 
    ylab("log10 length") + 
    ggtitle("Gene length") +
    facet_wrap(.~type) +
    scale_y_continuous(limits = c(0, 6)) + 
    geom_boxplot(width=0.1)

exon_quantiles <- gff_exons %>% 
  group_by(name) %>% 
  summarise(x=list(enframe(quantile(n_exon, probs=c(0.25,0.5,0.75)), "quantiles", "exons"))) %>% 
  unnest(x) %>%
  spread(key=quantiles, value=exons)

# Assign a quantile label to each exon number based on its strain's exon number distribution
exons_tbl <- gff_exons %>%
    inner_join(exon_quantiles, by="name") %>%
    mutate(
        n_exon_quant = case_when(
            n_exon <= `25%` ~ "25% quantile",
            n_exon >= `75%` ~ "75% quantile",
            TRUE            ~ "Median"
        )
    ) %>%
    select(name, n_exon, n_exon_quant)

exons_tbl <- exons_tbl %>% 
    mutate(
        n_exon_quant = factor(
            n_exon_quant,
            ordered=T, 
            levels=c("25% quantile", "Median", "75% quantile")
        )
    )

exon_per_gene <- ggplot(data=exons_tbl, aes(x=n_exon, fill=n_exon_quant)) +
    scale_fill_manual(values=c("#6666ee", "#999999", "#ee6666")) +
    geom_histogram(binwidth=1) +
    theme_minimal() +
    stat_summary(aes(x = 0.1, y = n_exon, xintercept = stat(y), group = name), 
               fun.y = median, geom = "vline") +
    ylab("Genes") +
    xlab("Exons per genes") +
    facet_grid(name~.) +
    scale_x_continuous(limits = c(-0, 100))

### SAVE OUTPUT ###
svg(out_plot, height=10, width=12)
grid.arrange(feature_len, exon_per_gene)
dev.off()
write_tsv(gff_stats, out_tbl)
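
# scripts/01_assembly_stats.R: plot assembly summary statistics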
library(ggplot2)
library(patchwork)
library(readr)
library(tidyr)
library(dplyr)
library(tools)
library(stringr)


args <- commandArgs(trailingOnly=T)

# No need to scale, now
#library(scales)
# minscale <- function(x){s = scale(x); return(s + abs(min(s)))}
#  mutate_at(vars(-assembly), minscale ) %>%

asb <- read_tsv(args[1]) %>%
  rename(n_contigs=number, assembly=filename) %>%
  mutate(assembly=file_path_sans_ext(basename(assembly))) %>%
  select(assembly, total_length, n_contigs, N50, N70, N90, N50n, N70n, N90n) %>%
  pivot_longer(-assembly, 'metric') %>%
  mutate(assembly = gsub("_genome", "", assembly))

svg(args[2])

len_asb <- ggplot(
             data=asb %>% filter(metric == 'total_length'),
             aes(x=assembly, y=value, fill=assembly)
           ) +
        geom_bar(stat='identity') +
        theme_minimal() +
        ylab("Length") +
        xlab("") +
        ggtitle("Assembly length") +
        theme(legend.position="none")

num_tigs <- ggplot(
             data=asb %>% filter(metric == 'n_contigs'),
             aes(x=assembly, y=value, fill=assembly)
            ) +
        geom_bar(stat='identity') +
        theme_minimal() +
        ylab("# contigs") +
        xlab("") +
        ggtitle("Total number of contigs") +
        theme(legend.position="none")

Nxx <- ggplot(
         data=asb %>% filter(str_detect(metric, '0$')),
         aes(x=assembly, y=value, fill=assembly)
       ) +
       geom_bar(stat='identity') +
       facet_wrap(~metric) +
       theme_minimal() +
       xlab("") +
       ylab("Length") +
       theme(legend.position="none")

Nxxn <- ggplot(
          data=asb %>% filter(str_detect(metric, 'n$')),
          aes(x=assembly, y=value, fill=assembly)
        ) +
        geom_bar(stat='identity') +
        xlab("") +
        ylab("# contigs") +
        theme_minimal() +
        facet_wrap(~metric)

(len_asb + num_tigs + plot_spacer()) / (Nxx) / (Nxxn)

dev.off()
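
# scripts/01_plot_busco.py: stacked barplot of BUSCO classes per strain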
from collections import OrderedDict
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt


def read_busco(tbl: str, name: str) -> pd.DataFrame:
    """
    Read a busco summary table into a
    dataframe and create relevant columns
    """
    df = pd.read_csv(tbl, names=["number", "type"], sep="\t", header=None)
    df["strain"] = name
    # Single letter busco class as a column
    df["busco"] = df["type"].str.extract(r".*\(([A-Z])\).*")
    df.loc[pd.isnull(df.busco), "busco"] = "T"
    # Transform number of buscos to percentages
    df["perc"] = 100 * np.round(
        df["number"] / df.number[df.busco == "T"].values[0], decimals=2
    )
    return df


# Concatenate busco tables from all strains and set relevant
# variables as indices
busco = (
    pd.concat([read_busco(tbl, name) for name, tbl in snakemake.input.items()])
    .reset_index(drop=True)
    .sort_values("strain", ascending=False)
    .set_index(["busco", "strain"])
)
mpl.use("Agg")
# Keep track of the order in which strains will be plotted
str_to_num = OrderedDict({"v1": 0, "neff": 1, "c3": 2})
# Map each BUSCO class to a color
col_dict = OrderedDict(
    [("M", "#F04441"), ("F", "#F0E441"), ("D", "#3592C6"), ("S", "#58B4E8"),]
)
# We plot one busco class per iteration
for i, t in enumerate(col_dict.keys()):
    # Subset the busco table for the current BUSCO class
    busco_type = busco.loc[t]
    # Retrieve x index from strains (deterministic order)
    r = [str_to_num[s] for s in busco_type.index.values]
    # Stacked barplot, uses cumulative height of previous bars
    if i:
        plt.barh(r, busco_type.perc.values, color=col_dict[t], left=cum_height)
        cum_height += busco_type.perc.values
    else:
        cum_height = busco_type.perc.values
        plt.barh(r, busco_type.perc.values, color=col_dict[t])


def busco_string(strain):
    """Use the BUSCO summary table to construct the standard busco string"""
    mis = busco.loc[("M", strain), "number"]
    fra = busco.loc[("F", strain), "number"]
    com = busco.loc[("C", strain), "number"]
    sin = busco.loc[("S", strain), "number"]
    dup = busco.loc[("D", strain), "number"]
    tot = busco.loc[("T", strain), "number"]
    return f"{strain} - C:{com}[S:{sin}, D:{dup}], F:{fra}, M:{mis}, n={tot}"


# Write the strain name and busco string besides each bar
plt.yticks(r, [busco_string(s) for s in str_to_num.keys()])
plt.xlabel("% BUSCOs")
plt.ylabel("Strain")
plt.savefig(str(snakemake.output))
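
# scripts/01_plot_karyo.py: draw the karyotype plot with annotated features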
import sys
import re
from collections import defaultdict
import csv
from Bio import SeqIO
from Bio.Graphics import BasicChromosome
from Bio.SeqFeature import SeqFeature, FeatureLocation
from reportlab.lib.units import cm
from matplotlib import cm as mcm
from matplotlib.colors import to_hex

logh = open(snakemake.log[0], "w")
# {chrom: length}
entries = {}
for rec in SeqIO.parse(snakemake.params["genome"], "fasta"):
    if len(rec.seq) > 100000:
        entries[rec.id] = len(rec.seq)
# all available colors
# cols = list(colors.getAllNamedColors().values())[8:]
cols = [to_hex(mcm.tab10(c)) for c in range(len(entries))]
# {chrom: [features]} where each feature carries a color qualifier for its type
features = defaultdict(list)
type_id = 0
ftype_to_col = {}
with open(snakemake.input["features"]) as bedh:
    bed = csv.DictReader(
        bedh, delimiter="\t", fieldnames=["chrom", "start", "end", "type"]
    )
    for feat in bed:
        start, end = int(feat["start"]), int(feat["end"])
        if start > end:
            start, end = end, start
        try:
            if end > entries[feat["chrom"]]:
                continue
        except KeyError:
            continue
        if feat["type"] not in ftype_to_col.keys():
            ftype_to_col[feat["type"]] = cols[type_id]
            logh.write(f"{feat['type']}: {cols[type_id]}\n")
            type_id += 1
        features[feat["chrom"]].append(
            SeqFeature(
                location=FeatureLocation(int(feat["start"]), int(feat["end"])),
                type=feat["type"],
                qualifiers={"color": [ftype_to_col[feat["type"]]]},
            )
        )

max_len = max(entries.values())
telomere_length = 10000  # For illustration

chr_diagram = BasicChromosome.Organism(output_format="svg")
chr_diagram.page_size = (29.7 * cm, 21 * cm)  # A4 landscape

for name, length in entries.items():
    # features = [f for f in record.features if f.type == "tRNA"]
    chrom_feat = features[name]

    chrom_num = re.sub(r"^.*_([0-9]+)$", r"\1", name)
    cur_chromosome = BasicChromosome.Chromosome(chrom_num)
    # Set the scale to the MAXIMUM length plus the two telomeres in bp;
    # the same scale is used for all chromosomes so they can be
    # compared to each other
    cur_chromosome.scale_num = max_len + 2 * telomere_length

    # Add an opening telomere
    start = BasicChromosome.TelomereSegment()
    start.scale = telomere_length
    cur_chromosome.add(start)

    # Add a body - again using bp as the scale length here.
    body = BasicChromosome.AnnotatedChromosomeSegment(length, chrom_feat)
    body.scale = length
    cur_chromosome.add(body)

    # Add a closing telomere
    end = BasicChromosome.TelomereSegment(inverted=True)
    end.scale = telomere_length
    cur_chromosome.add(end)

    # This chromosome is done
    chr_diagram.add(cur_chromosome)
chr_diagram.draw(snakemake.output[0], snakemake.params["title"])
logh.close()
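
# scripts/02_get_v1_hgt_gff.py: extract v1 HGT candidate genes and their exon counts from the GFF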
import re

hgt_ids = {i.strip() for i in open(snakemake.input[0])}
outf = open(snakemake.output[0], "w")
with open(snakemake.params["gff"], "r") as gff:
    hgt = False
    n_exons = 0
    for line in gff:
        fields = line.split("\t")
        if re.match("^#", line):
            continue
        elif fields[2] == "gene":
            # ID chrom start end type strand n_exon gene_len mrna_len attributes
            if hgt:
                outf.write(f"{curr_hgt.strip()}\t{n_exons}\n")
            gene_id = re.search("gene:([^;]+);", fields[8]).group(1)
            if gene_id in hgt_ids:
                hgt = True
                n_exons = 0
                curr_hgt = line
            else:
                hgt = False
        elif fields[2] == "exon":
            n_exons += 1
    # Flush the last gene of the file if it was an HGT candidate
    if hgt:
        outf.write(f"{curr_hgt.strip()}\t{n_exons}\n")
outf.close()
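
# scripts/03_rdna_hic.py: plot the Hi-C matrix with rDNA contact profiles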
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import patches as mpatches
import matplotlib as mpl
import matplotlib.gridspec as gridspec
import cooler

## LOAD CLI ARGS
cool_path = sys.argv[1]
rdna_path = sys.argv[2]
out_path = sys.argv[3]

# Discard contigs smaller than this size
# Note: assumes chromosomes are sorted by size in the cool file
MIN_SIZE = 100000

## LOAD INPUT FILES
c = cooler.Cooler(cool_path)
gff = pd.read_csv(rdna_path, sep="\t", comment="#", usecols=range(9))
gff.columns = [
    "seqname",
    "source",
    "feature",
    "start",
    "end",
    "score",
    "strand",
    "frame",
    "attribute",
]

## EXTRACT COOL DATA
chroms = c.chroms()[:]
# Discard all chromosomes after the first shorter than MIN_SIZE
chroms["include"] = chroms.length > MIN_SIZE
stop_bin = c.extent(chroms.loc[np.where(~chroms.include)[0][0], "name"])[1]
bins = c.bins()

# Remove chromosomes that are absent from cool or too small
gff = gff.loc[np.isin(gff.seqname, chroms.loc[chroms.include, "name"]), :]

## TRANSFORM

# Transform rDNA coords to bins
gff["minbin"] = gff.apply(
    lambda r: np.min(
        bins.fetch(f"{r.seqname}:{r.start}-{r.end}").index.values
    ),
    axis=1,
)

gff["maxbin"] = gff.apply(
    lambda r: np.max(
        bins.fetch(f"{r.seqname}:{r.start}-{r.end}").index.values
    ),
    axis=1,
)

gff = gff.drop_duplicates(subset=["attribute", "minbin", "maxbin"])

## VISUALIZE
mat = c.matrix(balance=False)[:stop_bin, :stop_bin]
# np.fill_diagonal(mat, 0)

# Define color saturation threshold
# sat_thr = np.percentile(mat[mat > 0], 85)

# Plot heatmap with rDNA overlay

plt.style.use("seaborn-white")
gs = gridspec.GridSpec(4, 1, height_ratios=[5, 1, 1, 1])
mpl.rcParams["figure.figsize"] = (5, 10)
fig = plt.figure()
ax1 = fig.add_subplot(gs[0])
ax2 = fig.add_subplot(gs[1])
ax3 = fig.add_subplot(gs[2])
ax4 = fig.add_subplot(gs[3])
fig.align_xlabels()
ax2.margins(x=0)
ax3.margins(x=0)
ax4.margins(x=0)

coldict = {
    "18s_rRNA": ("b", "-.", "18S", ax3),
    "28s_rRNA": ("g", ":", "28S", ax2),
    "8s_rRNA": ("r", "--", "5S", ax4),
}
legend_entries = [
    mpatches.Patch(color=v[0], label=v[2], ls=v[1]) for k, v in coldict.items()
]
plt.legend(handles=legend_entries, loc="upper right")
# mat[np.isnan(mat)] = 0
ax1.imshow(np.log(mat), cmap="Reds", interpolation="none", rasterized=True)

# Add vertical lines to line plots at rdna positions
for r in gff.iterrows():
    coldict[r[1]["attribute"]][3].axvline(
        x=r[1]["maxbin"],
        alpha=0.4,
        # linestyle=coldict[r[1]["attribute"]][1],
        lw=0.3,
        c=coldict[r[1]["attribute"]][0],
    )

# Add chromosome grid
for r in chroms.iterrows():
    if r[1].include:
        ax2.axvline(x=c.extent(r[1]["name"])[1], alpha=0.4, c="grey", lw=0.5)
        ax3.axvline(x=c.extent(r[1]["name"])[1], alpha=0.4, c="grey", lw=0.5)
        ax4.axvline(x=c.extent(r[1]["name"])[1], alpha=0.4, c="grey", lw=0.5)


# Subset bins containing rRNA
rdna_bins = {
    sub: np.unique(gff.maxbin[gff.attribute == sub])
    for sub in np.unique(gff.attribute)
}
# Compute contact sum per bin for each subunit
rdna_sig = {}
rdna_sig["28S"] = mat[rdna_bins["28s_rRNA"], :].mean(axis=0)
rdna_sig["18S"] = mat[rdna_bins["18s_rRNA"], :].mean(axis=0)
rdna_sig["5S"] = mat[rdna_bins["8s_rRNA"], :].mean(axis=0)

ax2.plot(
    range(mat.shape[0]), np.log10(rdna_sig["28S"]), label="28S", lw=0.5, c="g"
)
ax3.plot(
    range(mat.shape[0]), np.log10(rdna_sig["18S"]), label="18S", lw=0.5, c="b"
)
ax4.plot(
    range(mat.shape[0]), np.log10(rdna_sig["5S"]), label="5S", lw=0.5, c="r"
)

fig.savefig(out_path, dpi=2500)
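
# scripts/04_compute_seq_divergence.py: compute per-block divergence from a PAF file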
import sys
import re


def get_cigar_gaps(cigar):
    # Match all indels and capture indel size
    gap_regex = re.compile(r"([0-9]+)[ID]")
    total_gap_size = 0
    for gap_size in re.findall(gap_regex, cigar):
        total_gap_size += int(gap_size)
    return total_gap_size


with open(sys.argv[1]) as paf, open(sys.argv[2], "w") as outf:
    for line in paf:
        align = line.split("\t")
        if align[16] != "tp:A:P":  # Only consider primary alignments
            continue
        cigar = align[-1]
        n_gaps = get_cigar_gaps(cigar)
        n_ambiguous = int(align[15].lstrip("nn:i:"))
        n_bad = int(align[12].lstrip("NM:i:"))
        n_matches = int(align[9])
        n_mismatches = n_bad - (n_ambiguous + n_gaps)
        div = n_mismatches / (n_matches + n_mismatches)
        outf.write(f"{align[5]}\t{align[7]}\t{align[8]}\t{div}\n")
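
# scripts/04_gen_circos_files.sh: build circos karyotype and link files from MCScanX output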
REF="$1"
# MCscanX output prefix
MCSX="$2"
CIR_DIR=$3
CAND="$4"

# Only include scaffolds larger than 100kb
len_cutoff=100000
# Karyotype
awk '/^>/ {if (seqlen){print seqlen}; printf "%s ",$0;seqlen=0;next;}
     {seqlen += length($0)}
     END{print seqlen}' $REF |
  awk -v minlen=$len_cutoff '$2 >= minlen {print "chr","-",$1,$1,"0",$2,"Neff"}' |
  tr -d '>' \
  >${CIR_DIR}/karyotype.txt

# Aesthetics, can be commented out: list Neff scaffolds first and reverse-order
# the rest. Looks better for a 2-strain comparison.
grep "Neff_scaffold" ${CIR_DIR}/karyotype.txt > ${CIR_DIR}/neff_karyo.txt
grep -v "Neff_scaffold" ${CIR_DIR}/karyotype.txt | sort -rV > ${CIR_DIR}/other_karyo.txt
cat ${CIR_DIR}/neff_karyo.txt ${CIR_DIR}/other_karyo.txt > ${CIR_DIR}/karyotype.txt
rm  ${CIR_DIR}/neff_karyo.txt ${CIR_DIR}/other_karyo.txt
# Candidate genes
#tail -n +2  $CAND |
#  awk '{print $1,$2,$3}' \
#  >${CIR_DIR}/hgt_candidates.txt

# Parsing MCScanX collinearity output into circos links
n_genes=$(grep -v "^#" "$MCSX.collinearity" | wc -l)
n_done=0
while read line
do
  # (Disabled) Make links involving HGT candidates red
  g1="${line%% *}"
  g2="${line##* }"
#  attr=$( \
#    awk -v g1=${g1%%-*} -v g2=${g2%%-*} \
#        'BEGIN{attr="color=lgrey_a4,z=0"} \
#        {if($0 ~ g1 || $0 ~ g2){attr="color=dred,z=30";exit}} \
#        END{print attr}' \
#        ${CAND} \
#  )
  attr='lgrey'
  # Get coordinate of gene ID from GFF file
  awk -v g1=$g1 -v g2=$g2 -v attr="$attr" \
    'BEGIN{N1=0;N2=0}
     $2 ~ g1 {
       N1+=1;c1=$1;s1=$3;e1=$4}
     $2 ~ g2 {
       N2+=1;c2=$1;s2=$3;e2=$4}
     END{
       if (N1 < 1 || N2 < 1) {exit 1}
       else {print c1,s1,e1,c2,s2,e2,attr}}' "$MCSX.gff"
  echo -ne "  $n_done / $n_genes         \r" 1>&2
  (( n_done++ ))
  # Convert gene IDs to coordinates
done < <(grep -v "^#" "$MCSX.collinearity" | tr -d ' ' | tr '\t' ' ' | cut -d$' ' -f2,3)  \
     > "${CIR_DIR}/mcsx.txt"
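
# scripts/04_gen_mcscanx_input.sh: prepare the GFF and BLAST inputs for MCScanX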
function usage () {
   cat <<EOF
Usage: `basename $0` -g genes -o output_folder -f ref [-h]
   -g   gtf file with genes coordinates
   -o   output folder for MCScanX files
   -f   reference genome
   -c   number of cpus to use for BLAST
   -h   displays this help
EOF
   exit 0
}


N_CPU=1
# Parsing CL arguments
while getopts ":g:o:f:c:h" opt; do
   case $opt in
   g )  GFF=${OPTARG} ;;
   o )  OUT_F=${OPTARG} ;;
   f )  FASTA=${OPTARG};;
   c )  N_CPU=${OPTARG} ;;
   h )  usage ;;
   \?)  usage ;;
   esac
done
# Testing if mandatory arguments have been provided
if [ "x" == "x$FASTA" ] || [ "x" == "x$GFF" ] || \
   [ "x" == "x$OUT_F" ];
then
  echo "Error: A reference genome, input GFF file and output folder \
  are required."
  usage
  exit 0
fi


#1: format GFF file for MCScanX, only keeping CDS
echo "Formatting GFF..."
OUT_GFF="$OUT_F/MCScanX_genes.gff"
awk 'BEGIN{OFS="\t"}
     $3 ~ "CDS" {id_match=match($9,/ID=[^;]*/)
      id=substr($9,RSTART+3,RLENGTH-3)
      print $1,id,$4,$5}' $GFF > $OUT_GFF
echo "GFF formatted !"

#2: build blast database from sequences and all vs all blast
BASE_OUT=$(basename $FASTA)
BASE_OUT=${BASE_OUT%%.*}
makeblastdb -in $FASTA -dbtype prot -out "${OUT_F}/${BASE_OUT}"

echo "Blasting transcriptome against itself."
blastp -query $FASTA \
       -db "${OUT_F}/${BASE_OUT}" \
       -outfmt 6 \
       -max_target_seqs 5 \
       -num_threads $N_CPU \
       -out "${OUT_F}/${BASE_OUT}.blast"

#3 Renaming chromosomes according to MCScanX conventions
MC_IN="$OUT_F/MCScanX_in"
# Use two first characters of fasta file as organism abbreviation
#SP_CODE=${BASE_OUT:0:2}
#sed "s/^scaffold_\([0-9]*\)/$SP_CODE\1/g" | 
sed 's/\.cds//g' "$OUT_GFF" > "$MC_IN.gff"
#sed "s/scaffold_\([0-9]*\)/$SP_CODE\1/g" | 
sed 's/::[^	]*//g' "${OUT_F}/${BASE_OUT}.blast" > "$MC_IN.blast"


echo "All input files are ready:"
echo "  - BLAST output: $MC_IN.blast"
echo "  - GFF file: $MC_IN.gff"
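
# scripts/04_plot_seq_divergence.py: joint plot of alignment block length versus divergence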
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 12))
div = pd.read_csv(
    snakemake.input[0], sep="\t", names=["chrom", "start", "end", "div"]
)
div = div.sort_values(["chrom", "start"])
div["align_len"] = abs(div.end - div.start)
div["log_len"] = np.log10(div.align_len)
jp = sns.jointplot(
    data=div,
    x="log_len",
    y="div",
    kind="kde",
    shade=True,
    cmap="plasma",
    levels=200,
)
jp.set_axis_labels(
    "Log10 length of aligned block [bp]",
    "Per base divergence in aligned blocks",
)
avg_div = np.average(div["div"], weights=div["align_len"])
avg_log_len = np.mean(div.log_len)
jp.ax_joint.axvline(x=avg_log_len, linestyle="--")
jp.ax_joint.axhline(y=avg_div, linestyle="--")
jp.ax_marg_x.axvline(x=avg_log_len)
jp.ax_marg_y.axhline(y=avg_div)
jp.fig.suptitle(
    "Neff - C3 gap excluded sequence divergence\n"
    f"mean divergence length: {100*avg_div:.2f}%"
)
plt.savefig(snakemake.output[0])
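
# scripts/06_orthocount_to_fasta.py: convert orthogroup gene counts to a presence/absence fasta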
import sys
import pandas as pd

gene_count = sys.argv[1]
out_fasta = sys.argv[2]

gc = pd.read_csv(gene_count, sep="\t")


def get_presence(freqs):
    """
    Given an array of frequencies, return the gene presence code

    Parameters
    ----------
    freqs : numpy array of ints
        1D array of gene frequencies.

    Returns
    -------
    str :
        Gene presence code, in the form of a string of 1 (present) and 0 (absent)

    Examples
    --------
    >>> get_presence(np.array([32, 1, 0, 0, 2]))
    '11001'
    """
    freqs[freqs > 1] = 1
    freqs = freqs.astype(str)
    return "".join(freqs)


# Store species names and associated ortholog presence code in a dict
# (skipping the first column, Orthogroup, and the last, Total)
sp_presence = {
    gc.columns[i]: get_presence(gc.iloc[:, i])
    for i in range(1, gc.shape[1] - 1)
}

with open(out_fasta, "w") as fa:
    for sp, presence in sp_presence.items():
        fa.write(">" + str(sp) + "\n")
        fa.write(presence + "\n")
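
# scripts/06_plot_orthogroups_venn.py: Venn diagrams of orthogroup sharing between strains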
import pandas as pd
from matplotlib_venn import venn2, venn3
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.use("Agg")
vdf = pd.read_csv(snakemake.input["pres_compact"], sep="\t").drop(
    "Orthogroup", axis=1
)
vdf = vdf.astype(bool)
# vdf = vdf.rename(columns={'NEFF_v1_hgt_cds': 'v1'})
# Make Venn diagram. sets are A, B, AB, C, AC, BC, ABC
fig, ax = plt.subplots(1, 2, figsize=(15, 12))
venn3(
    subsets=(
        len(vdf.query("    c3 and not neff and not amoeba")),  # A
        len(vdf.query("not c3 and     neff and not amoeba")),  # B
        len(vdf.query("    c3 and     neff and not amoeba")),  # AB
        len(vdf.query("not c3 and not neff and     amoeba")),  # C
        len(vdf.query("    c3 and not neff and     amoeba")),  # AC
        len(vdf.query("not c3 and     neff and     amoeba")),  # BC
        len(vdf.query("    c3 and     neff and     amoeba")),  # ABC
    ),
    set_labels=("C3", "Neff", "Amoeba"),
    ax=ax[0],
)
venn2(
    subsets=(
        len(vdf.query("    c3 and not neff")),
        len(vdf.query("not c3 and     neff")),
        len(vdf.query("    c3 and     neff")),
    ),
    set_labels=("C3", "Neff"),
    ax=ax[1],
)
fig.savefig(snakemake.output["venn"])
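
# scripts/go_enrich.R: GO term enrichment of HGT candidates with topGO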
library(tidyverse)
library(topGO)
library(viridis)

# Parse CL args
args <- commandArgs(trailingOnly=T)

# Extract gene IDs and GO terms.
annot_tbl <- read_tsv(args[1], col_types='cciicciiicciiiiiii') %>%
  dplyr::select(ID, attributes, hgt) %>%
  mutate(GO=str_replace(attributes, '.*Ontology_term=([^;]*);', "\\1"))
hgt_candidates <- annot_tbl %>%
  filter(hgt == 1) %>%
  pull(ID)

out_file <- args[2]
tmp_mapfile <- paste(dirname(out_file), 'id2go.tsv', sep='/')
out_fig <- paste0(tools::file_path_sans_ext(out_file), ".svg")

# Write geneID - GO terms mapping to file
annot_tbl %>%
  dplyr::select(ID, GO) %>%
  mutate(GO = str_replace_all(GO, ";", ", ")) %>%
  write_tsv(tmp_mapfile)


geneID2GO <- readMappings(tmp_mapfile)
geneNames <- names(geneID2GO)
geneList <- factor(as.integer(geneNames %in% hgt_candidates))
names(geneList) <- geneNames
GOdata <- new(
  "topGOdata",
  description = "HGT analyis in A. castellanii",
  ontology = "BP",
  allGenes = geneList, 
  nodeSize = 10,
  gene2GO = geneID2GO, 
  annot = annFUN.gene2GO
)


fisher.stat <- new(
  "classicCount",
  testStatistic = GOFisherTest, 
  name = "Fisher test"
)
resultFisher <- getSigGroups(GOdata, fisher.stat)
weight.stat <- new(
  "weightCount",
  testStatistic = GOFisherTest, 
  name = "Fisher test with weight algorithm",
  sigRatio = 'ratio'
)
resultWeight <- getSigGroups(GOdata, weight.stat)

#showSigOfNodes(GOdata, score(resultWeight), firstSigNodes = 9, useInfo = 'all')

# Get 'significant' terms
pvals <- score(resultWeight)
sig <- pvals[pvals<0.01]

termStat(GOdata, names(sig))

# Report top enriched GO terms
allRes <- GenTable(
  GOdata,
  classic = resultFisher,
  weight = resultWeight, 
  orderBy = "weight",
  ranksOf = "classic",
  topNodes = 200
)

write_tsv(allRes, out_file)
allRes <- allRes %>%
	mutate(GO.ID=paste(GO.ID, Term, sep=', '))

tidy_res = as.tibble(allRes) %>%
  mutate(
    weight = as.numeric(weight),
    classic = as.numeric(classic),
    GO.ID = factor(
      GO.ID,
      ordered=T,
      levels=unique(GO.ID[ordered(weight)])
    )
  )

# Visualise top enriched GO terms
svg(out_fig)
ggplot(
  data=tidy_res %>% top_n(30, -weight), 
  aes(x=GO.ID, y=-log10(weight))
) + 
  geom_segment(
    aes(xend=GO.ID, yend=min(-log10(weight))),
    size=1.1
  ) +
  geom_point(aes(size=Annotated, color=Significant / Expected)) + 
  geom_hline(aes(yintercept=2), lty=2, col='red') +
  theme_minimal() + 
  xlab("") +
  ylab("-log10 pvalue") +
  ggtitle("GO enrichment test in HGT candidates\n(Fisher exact test, weight algorithm)") +
  scale_color_viridis() +
  coord_flip()
dev.off()
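
# Run GO-term enrichment on the annotated HGT candidates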
shell: "Rscript scripts/go_enrich.R {input.annot} {input.candidates} {output}"
Repository: https://github.com/cmdoret/Acastellanii_genome_analysis (MIT License)