Comparative analysis of the genomes from two Acanthamoeba castellanii strains


Description

This repository contains scripts and documentation for the analysis and comparison of the Acanthamoeba castellanii genomes of strains C3 and Neff. Analyses include: annotation statistics, BUSCO, QUAST, orthologous gene comparison with related species, a Circos plot, sequence divergence between strains, and Hi-C contact profiles at the rDNA sequences.

A frozen copy of this repository, as well as the output data, is available for download from the associated Zenodo record.

Installation

The pipeline is written with Snakemake and manages dependencies using conda. Most of the pipeline steps run inside self-contained conda environments, which are built automatically upon execution. Two dependencies (MCScanX and dnaglider) are not available through conda and need to be installed separately (see the sketch after the dependency list).

Dependencies:

  • python3.7+

    • snakemake

    • pandas

    • numpy

  • conda

  • dnaglider

  • MCScanX
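
Both non-conda tools can usually be built from their upstream repositories. The commands below are only a sketch under assumed repository layouts; check each project's README for the authoritative instructions:

# Assumption: MCScanX is a make-based project hosted on GitHub
git clone https://github.com/wyp1125/MCScanX
cd MCScanX && make   # add the resulting MCScanX binary to your PATH

# Assumption: dnaglider is a Go tool installable with the Go toolchain
go install github.com/cmdoret/dnaglider@latest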

The input data (genomes, annotations, etc.) are downloaded automatically from Zenodo when the pipeline is executed.

Usage

The analyses are separated into distinct workflows in the rules directory. The whole analysis pipeline can be run using snakemake as follows:

snakemake --use-conda -j4
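
Before launching the full pipeline, a dry run can be used to preview the jobs; both flags below are standard Snakemake options rather than anything specific to this repository:

# List the jobs that would be executed, without running anything
snakemake -n

# Run the pipeline with conda environments and up to 8 parallel jobs
snakemake --use-conda -j8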

Structure

The master script Snakefile calls each workflow one after the other. Each workflow contains rules with input and output files, which execute code or external scripts. Each rule is executed in its own conda environment, whose dependencies are downloaded on the first execution. The overall workflow can be represented as a graph:

[Figure: graph of the pipeline rules and their dependencies]

The envs directory contains conda environment build specifications for the different rules.
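
As a minimal sketch of this pattern (the rule, file and environment names below are illustrative, not taken from the pipeline), a rule ties its inputs to its outputs and points to its environment specification in envs:

rule assembly_stats:
    input: "data/genome.fa"
    output: "results/assembly_stats.txt"
    conda: "../envs/stats.yaml"
    shell: "assembly-stats {input} > {output}"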

General parameters for the pipeline are stored in the config.yaml file and can be modified. The strains to analyze, as well as the paths to their sequence files, are defined in samples.tsv (a hypothetical example is shown below). All external scripts executed by rules are stored in the scripts folder. Custom python utility libraries imported in the pipeline are stored in src.
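
A purely hypothetical samples.tsv could look as follows; the actual column names are defined by the pipeline and may differ:

strain    genome
Neff      data/Neff_genome.fa
C3        data/C3_genome.fa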

The doc directory contains Jupyter notebooks with general analyses of the pipeline results.

Code Snippets

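
# Fetch proteomes of related organisms from NCBI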
script: '../scripts/00_fetch_proteomes.py'
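
# Cluster near-identical proteins with CD-HIT at the given identity threshold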
shell: 'cd-hit -i {input} -o {output} -c {params.sim} -M 0'
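
# Retrieve the input archive from the Zenodo record and unpack it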
shell:
  """
  zenodo_get -d https://doi.org/10.5281/zenodo.5507417 -w {output.url_tbl}
  wget $(grep "shared_assets" {output.url_tbl}) -O - \
   | tar xzvf - --directory={params.in_dir} >/dev/null
  """
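
# Prefix fasta headers and GFF records with the strain name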
shell:
    """
    st={wildcards.strain}
    sed 's/^>\(.*\)/>'"$st"'_\\1/' {input.genome} > {output.genome}
    sed 's/^>\(.*\)/>'"$st"'_\\1/' {input.cds} > {output.cds}
    sed 's/^>\(.*\)/>'"$st"'_\\1/' {input.proteome} > {output.proteome}
    sed 's/^\([^#]\)/'"$st"'_\\1/' {input.annotations} |
        sed 's/ID=/ID='"$st"'_/' | 
        sed 's/Parent=/Parent='"$st"'_/' \
        > {output.annotations}
    """
shell:
    """
    cp {input.genome} {output.genome}
    cp {input.annot} {output.annot}
    """
shell: "Rscript scripts/01_annot_stats.R {input.v1} {input.neff} {input.c3} {output.tbl} {output.plt}"
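
# Compare assembly statistics between Neff v1 and v2 and plot them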
shell:
    """
    assembly-stats -t {input.neff_v2} {input.neff_v1} > {params.assembly_tbl}
    Rscript scripts/01_assembly_stats.R {params.assembly_tbl} {output}
    """
shell: 'quast -t {threads} -e -o {output} {input.neff_v2} {input.ref_fa}'
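
# Assess genome completeness with BUSCO (eukaryota lineage)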
shell:
    """
    busco -f --long --mode genome -c {threads} --lineage eukaryota -i {input} -o $(basename {output})
    mv "$(basename {output})" "$(dirname {output})/"
    """
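
# Strip comments, blank lines and percentage lines from the BUSCO short summary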
shell:
    """
    sed 's/^[\t ]*//' {input}/run_eukaryota_odb10/short_summary.txt \
      | grep -v '^[#\*]' \
      | sed '/^$/d' \
      | sed 's/[\t ]*$//' \
      | grep -v '\%' \
      > {output}
    """
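
# Plot BUSCO completeness for all strains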
script: '../scripts/01_plot_busco.py'
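
# Tabulate the length of each sequence in the genome fasta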
shell:
    """
    awk -vOFS='\t' '
        /^>/ {{if (seqlen){{print seqlen}}; printf "%s\t",substr($0, 2) ;seqlen=0;next; }}
        {{ seqlen += length($0)}}END{{print seqlen}}
        ' {params.genome} > {output}
    """
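
# Collect rDNA, HGT and virus coordinates into a single feature track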
shell:
    """
    grep -v "^#" {params.rdna} | awk -vOFS='\t' '{{print $1,$4,$5,$9}}' > {output}
    awk -vOFS='\t' '$5 == 1 {{print $1,$2,$3,"HGT"}}' {input.hgt} >> {output}
    awk -vOFS='\t' '{{print $1,$2,$3,"virus"}}' {input.vir} >> {output}
    """
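
# Draw the karyotype figure with the combined feature track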
script: "../scripts/01_plot_karyo.py"
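
# Extract HGT candidate genes by ID and translate them to protein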
shell: "seqkit grep -r -f {params.hgt_ids} {input} | seqkit translate > {output}"
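
# Extract v1 HGT gene records and their exon counts from the GFF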
script: '../scripts/02_get_v1_hgt_gff.py'
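
# Compute sliding-window sequence composition metrics with dnaglider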
shell:
    """
    dnaglider -threads {threads} \
              -fields "GC,GCSKEW,ATSKEW,ENTRO,KMER" \
              -kmers 2,3,4 \
              -window {params.win} \
              -stride {params.step} \
              -fasta {input} \
              -out {output}
    """
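
# BLAST the v1 HGT candidate proteins against the Neff v2 proteome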
shell:
    """
    makeblastdb -dbtype prot -in {params.v2_prots} -title neff -out {params.v2_db}
    blastp -evalue 1e-30 \
           -outfmt "6 qseqid sseqid pident evalue" \
           -db {params.v2_db} \
           -num_threads {threads} \
           -query {input.v1_hgt} > {output.blast}
    """
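
# Keep, for each v1 HGT query, the best-identity BLAST hit above the threshold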
run:
    df = pd.read_csv(
        input[0],
        sep='\t',
        header=None,
        names=['query', 'target', 'pident', 'evalue']
    )
    # For each query HGT, retain the match with the highest sequence identity
    keep_idx = df.groupby('query')['pident'].transform(max) == df['pident']
    df = df[keep_idx]
    # Keep only matches above the identity threshold set in params
    df.loc[df.pident > params['pident']].to_csv(output[0], sep='\t', index=False)
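
# Flag genes of the strain's annotation table that match a v1 HGT candidate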
run:
    blast = pd.read_csv(input['blast'], sep='\t')
    annot = pd.read_csv(input['annot'], sep='\t')
    strain = wildcards['strain']
    annot = annot.loc[annot.ID.str.startswith(strain), :]
    annot.ID = annot.ID.str.replace(f"{strain}"+r"_([a-zA-Z0-9_]*).*", r"\1")
    annot.chrom = annot.chrom.str.replace(f"{strain}_", "")
    blast['hgt'] = 1
    blast.target = blast.target.str.replace(r'([a-zA-Z0-9_]*).*', r'\1')
    annot = blast.merge(annot, how='right', left_on='target', right_on='ID')
    annot.hgt = annot.hgt.fillna(0).astype(int)
    annot = annot.loc[:, ['chrom', 'start', 'end', 'ID', 'hgt', 'n_exon']]
    annot.to_csv(output[0], sep='\t', header=None, index=False)
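
# Average window statistics over each gene, keeping its HGT flag and exon count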
shell:
    """
    echo -e "chrom\tstart\tend\tgeneID\tHGT\tNEXON\tGC\tGCSKEW\tATSKEW\tENTRO\t2MER\t3MER\t4MER" > {output}
    bedtools intersect -a {input.win} -b {input.hgt} -wb \
        | sort --parallel={threads} -k11,11 -k12,12n \
        | bedtools groupby -g 11,12,13,14,15,16 -c 4,5,6,7,8,9,10 -o mean \
        | sort --parallel={threads} -k1,1 -k2,2n >> {output}
    """
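
# Plot Hi-C contacts around rDNA positions at the given resolution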
shell:
    """
    python scripts/03_rdna_hic.py \
        {input.cool}::/resolutions/{params.res} \
        {input.rdna} \
        {output}
    """
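
# Concatenate the per-strain proteomes, genomes and annotations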
shell:
    """
    echo -n "" > {output.prot}
    for i in {input.prot}; do
        cat $i >> {output.prot}
    done

    echo -n "" > {output.genome}
    for i in {input.genome}; do
        cat $i >> {output.genome}
    done

    echo -n "" > {output.annot}
    for i in {input.annot}; do
        cat $i >> {output.annot}
    done
    """
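
# Prepare MCScanX inputs and run collinearity detection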
shell:
    """
    bash scripts/04_gen_mcscanx_input.sh -g {input.annot} \
                                         -o {params.out_dir} \
                                         -f {input.prot} \
                                         -c {threads}
    MCScanX -s 3 {params.out_dir}/MCScanX_in
    """
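
# Generate the circos karyotype and link files from the MCScanX output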
shell:
    """
    cp {params.cfg} {output.cfg}
    bash scripts/04_gen_circos_files.sh {input.ref} {params.mcsx_prefix} {params.dir}
    """
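
# Drop links whose ends fall on chromosomes absent from the karyotype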
run:
    karyo = pd.read_csv(join(params['circos_dir'], 'karyotype.txt'), sep=' ', header=None)
    chroms = list(karyo.iloc[:, 2])
    links = pd.read_csv(input[0], sep=' ', header=None,
        names=['c1', 's1', 'e1', 'c2', 's2', 'e2', 'col']
    )
    filtered = links.loc[np.isin(links.c1, chroms) & np.isin(links.c2, chroms), :]
    filtered.to_csv(output[0], sep=' ', header=None, index=False)
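
# Bundle collinearity links and render the circos plot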
shell:
    """
    bundlelinks -strict -min_bundle_size 1000 -max_gap 50000 \
                -links {input.mcsx} | sed 's/lgrey=$//' > {params.circos_dir}/bundles.txt
    circos -conf {input.cfg} -outputfile {output}
    """
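
# Align C3 to Neff with minimap2 and compute per-block divergence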
shell:
    """
    minimap2 -c {input.neff} {input.c3} -x map-ont \
    > {output.paf}
    python ./scripts/04_compute_seq_divergence.py \
        {output.paf} \
        {output.div}
    """
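
# Plot the distribution of Neff-C3 sequence divergence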
script: "../scripts/04_plot_seq_divergence.py"
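
# Download each virus genome assembly with genomepy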
shell: "genomepy install -g '{params.genomedir}/{wildcards.virus}' {params.assembly}"
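
# Map virus genomes against the amoeba assembly with minimap2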
shell: "minimap2 -xasm20 -t {threads} {input.amoeba} {input.virus}/*/*fa > {output}"
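
# Combine all virus PAF alignments into one table with per-hit identity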
shell:
    """
    fd ".*paf" {params.pafdir} -x awk -vOFS='\t' -vvir={{/.}} '{{print $6,$8,$9,vir,$10/$11}}' {{}} > {output}
    """
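
# Scatter plot of inserted segment length versus alignment identity per virus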
run:
    import matplotlib.pyplot as plt
    import seaborn as sns
    import matplotlib as mpl
    mpl.use("Agg")
    vir = pd.read_csv(
        input[0], sep="\t", names=["chrom", "start", "end", "virus", "ident"]
        )
    vir["seqlen"] = vir.end - vir.start
    # Amount of each virus inserted
    sns.scatterplot(data=vir, x="seqlen", y="ident", hue="virus")
    plt.savefig(output[0])
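
# Merge neighbouring virus hits, averaging their identity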
shell:
    """
    sort -k1,1 -k2,2n {input} \
        | bedtools merge -o mean -c 5 -i - -d {params.neigh_dist} \
        > {output}
    """
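
# Aggregate Hi-C signal (pileup) at candidate coordinates with coolpup.py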
shell:
    """
    # Discard coordinates with chromosomes absent from the cool file
    # and get the bed file into 2d coordinates
    cut -f1-3 {input} \
        | grep -f <(cooler dump -t chroms {params.cool} | awk '{{print $1"\t"}}') \
        | awk -vOFS='\t' '{{print $0,$1,$2+10000,$3+10000}}' \
        | coolpup.py {params.cool} \
               - \
               --nshifts 10 \
               --mindist 0 \
               --minshift 100000 \
               --outname {output.tbl} \
               --log WARNING
    plotpup.py {output.tbl} \
               --col_names {wildcards.strain} \
               --enrichment 0 \
               --vmin 1 \
               --output {output.fig}                  
    """
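
# Score insulation borders at candidate positions with chromosight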
shell:
    """
    # Convert the bed file into 2D coordinates.
    cut -f1-3 {input} | awk -vOFS='\t' '{{print $0,$0}}' > {params.tpos}
    chromosight quantify --pattern=borders \
                         --threads {threads} \
                         --perc-zero=60 \
                         --perc-undetected=60 \
                         --win-size=31 \
                         {params.tpos} \
                         {params.cool} \
                         {params.base}  
    """
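
# Keep only the longest sequence for each name in the fasta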
shell:
    """
    # Convert fasta to tabular, sort by seqname, keep longest seq
    # and convert back to fasta
    sed 's/>[^ ]* \(.*\)$/>\\1\t/' {input} \
        | tr -d '\\n' \
        | sed 's/>/\\n>/g' \
        | sed '/^$/d' \
        | awk -vFS='\t' -vOFS='\t' '{{print $2,length($2),$1}}' \
        | sort -k3,3 -k2,2n \
        | uniq -f2 \
        | awk '{{print $3,$1}}' \
        | sed 's/ /\\n/' \
        > {output}
    """
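
# Collect all proteomes and infer orthogroups with OrthoFinder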
shell:
    """
    ulimit -n 10000
    # Copy all organism proteomes to the orthofinder workdir
    mkdir -p {params.orthofinder_dir}
    rm -f {params.orthofinder_dir}/*
    cp {input.org_proteomes} {params.orthofinder_dir}
    # Copy the Ac proteomes as well, trimming '_proteome_filtered' from the filename
    for strain in {input.ac_proteomes}; do
        fname=$(basename $strain)
        new_fname="${{fname/_proteome_filtered/}}"
        cp $strain "{params.orthofinder_dir}/$new_fname"
    done
    #cp "{{input.hgt_neff_v1}}" "{params.orthofinder_dir}/"
    orthofinder -f {params.orthofinder_dir} \
                -o {output} \
                -S diamond \
                -t {threads} \
                -n "amoeba"
    """
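
# Extract the rooted species tree and convert orthogroup counts to a presence/absence matrix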
shell:
    """
    cat {input}/Results_amoeba/Species_Tree/SpeciesTree_rooted.txt > {output.tree}
    python scripts/06_orthocount_to_fasta.py \
        {input}/Results_amoeba/Orthogroups/Orthogroups.GeneCount.tsv \
        {output.matrix}
    """
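
# Run GLOOME (gainLoss) on the presence/absence matrix to infer gene family gains and losses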
shell: 
    """
    echo "_seqFile {input.matrix}" > {params.conf_file}
    # Tree seems to have incompatible format with CLI GLOOME but works in web version...
    #echo "_treeFile {input.tree}" >> {params.conf_file}
    echo "_outDir {output}" >> {params.conf_file}
    ./bin/gainLoss.VR01.266 {params.conf_file}
    """
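
# Build orthogroup presence/absence tables from the OrthoFinder results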
run:
    ortho_dir = join(input[0], "Results_amoeba", "Orthogroups")
    ortho = pd.read_csv(join(ortho_dir, 'Orthogroups.tsv'), sep='\t')
    unassigned = pd.read_csv(join(ortho_dir, 'Orthogroups_UnassignedGenes.tsv'), sep='\t')
    # Add unassigned genes as self-contained orthogroups
    ortho = pd.concat([ortho, unassigned], axis=0)
    # Get gene families absent in all of A. castellanii strains
    get_absent = lambda b: b.isnull().values
    st_abs = {strain: get_absent(ortho[strain]) for strain in params['focus_st']}
    ac_abs = np.bitwise_and.reduce([*st_abs.values()])

    # Convert count to presence/absence for all species
    pres_mat = ortho.drop(columns='Orthogroup').apply(lambda c: ~c.isnull(), axis=0)
    pres_mat.index = ortho.Orthogroup
    # Save presence matrix for further analysis (encode True/False as 0/1)
    pres_mat.astype(int).to_csv(output['pres_mat'], index=True, header=True, sep='\t')

    # Exclude strains of interest when counting
    amoeba_df = ortho.loc[:, list(params['amoeba'])]
    #bact_df = ortho.loc[:, list(params['bact'])]
    # Binarize variables: Is there any gene for species c in each orthogroup
    any_pres = lambda df: df.apply(lambda c: ~ c.isnull(), axis=0).apply(np.any, axis=1)
    #bact_pres = any_pres(bact_df)
    #bact_abs = ~bact_pres
    amoeba_pres = any_pres(amoeba_df)
    amoeba_abs = ~amoeba_pres
    # Record genes found only in amoeba but not in bacteria
    # (note: HGT events inferred in Clarke et al. are also included for comparison)
    vdf = pd.DataFrame({
        'neff'  : ~st_abs['Neff'],
        #'NEFF_v1_hgt_cds'  : ~get_absent(ortho.NEFF_v1_hgt_cds),
        'c3'    : ~st_abs['C3'],
        'ac'    : ~ac_abs,
        #'bact'  : bact_pres,
        'amoeba': amoeba_pres,
    })
    vdf.index = ortho.Orthogroup

    vdf.astype(int).to_csv(output['pres_compact'], sep='\t', index=True, header=True)
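
# Plot Venn diagrams of shared orthogroups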
script: '../scripts/06_plot_orthogroups_venn.py'
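
# scripts/00_fetch_proteomes.py: download proteomes from NCBI via Entrez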
import time
import urllib
from Bio import Entrez


def safe_request(fun):
    """
    Wraps a data-requesting function so that errors are handled safely and retried.

    Parameters
    ----------
    fun : python function
        The python function that queries a server

    Returns
    -------
    wrapped_f : python function
        A wrapped version of the input function which will call itself recursively
        every 5 seconds if the server is overloaded and will return None if the
        query record does not exist
    """

    def wrapped_f(*args, **kwargs):

        try:
            a = fun(*args, **kwargs)
            return a
        except urllib.error.HTTPError as e:
            if e.code in (429, 502):
                time.sleep(5)
                print(
                    "Sending too many requests, sleeping 5sec and retrying..."
                )
                # Return the result of the retry (otherwise it would be lost)
                return wrapped_f(*args, **kwargs)
            else:
                print("No sequence found for %s" % args[0])
                return None

    return wrapped_f


@safe_request
def name_to_proteins(
    name, db="protein", email="[email protected]", filters=""
):
    """
    Given an organism name, fetch all available protein sequences.

    Parameters
    ----------
    name : str
        Name of an organism or group of organism (e.g. Legionella).
    db : str
        Entrez db name to use for the search.
    email : str
        User email for identification.
    filters : str
        Additional filters to use for the query. E.g. " AND refseq[filter]" to
        filter results from refseq

    Returns
    -------
    seq_record : TextIOWrapper
        iterable object containing the query Fasta records.

    """
    Entrez.email = email
    query_string = name + "[Orgn]"
    time.sleep(0.1)
    # Initial search to collect all matching IDs (esearch returns only 10 by default)
    query = Entrez.read(
        Entrez.esearch(term=query_string, db=db, email=email, retmax=10 ** 9)
    )
    seqs = ""
    s = 0
    chunk_size = 1000
    # Fetch proteins in batches of chunk_size to stay under the server's query limits
    for e in range(chunk_size, len(query["IdList"]), chunk_size):
        print(f"fetching entries {s} to {e}")
        fetched = False
        while not fetched:
            try:
                seqs += Entrez.efetch(
                    id=query["IdList"][s:e], rettype="fasta", db=db
                ).read()
                fetched = True
            except Exception:
                print("Fetching failed, trying again.")
                time.sleep(5)
        s = e
        time.sleep(0.1)
    e = len(query["IdList"])
    print(f"fetching entries {s} to {e}")
    seqs += Entrez.efetch(
        id=query["IdList"][s:e], rettype="fasta", db=db
    ).read()
    return seqs


org_df = snakemake.params["org"]
organism = org_df.loc[
    org_df["clean_name"] == f"{snakemake.wildcards.organism}", "name"
]
time.sleep(0.1)  # Do not spam NCBI :)
proteome = name_to_proteins(
    organism, email="[email protected]", filters=" AND refseq[filter]"
)
try:
    print(f"Writing {proteome.count('>')} proteins for {organism}.")
    with open(snakemake.output[0], "w") as outf:
        outf.write(proteome)
except TypeError:
    print(f"No proteome found for {organism[0]}")
    pass
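
# scripts/01_annot_stats.R: compute and plot annotation statistics for the three annotations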
library(readr)
library(stringr)
library(ggplot2)
library(dplyr)
library(tidyr)
library(tibble)
library(gridExtra)
args <- commandArgs(trailingOnly=TRUE)

### GET ARGS ###
v1_gff <- args[1]
neff_gff <- args[2]
c3_gff <- args[3]
out_tbl <- args[4]
out_plot <- args[5]

### LOAD DATA ###
load_gff <- function(in_gff, name){
    gff <- read_tsv(in_gff, comment="#", col_names=F)
    colnames(gff) <- c("chrom", "source", "type", "start", "end", "score", "strand", "phase", "attributes")
    gff <- gff %>% filter(type %in% c("mRNA", "CDS", "gene", "exon"))

    ### Clean data
    gff <- gff %>% mutate(attributes=gsub("Parent=", "ID=", attributes)) %>%
        mutate(ID=str_extract(attributes, 'ID=[^;\\.]*')) %>%
        mutate(ID=gsub("=[a-zA-Z]{1,}:", "=", ID)) %>%
        mutate(ID=sapply(ID, function(x){str_split(x,regex('[=-]'))[[1]][2]})) %>%
        mutate(name=name)
    return(gff)
}
gff <- load_gff(v1_gff, "NEFF_v1") %>%
    bind_rows(load_gff(neff_gff, "Neff")) %>%
    bind_rows(load_gff(c3_gff, "C3"))

### COMPUTE STATS ###

gff_gene_len <- gff %>%
    filter(type == 'gene') %>%
    mutate(gene_len = end - start)

# N exon / gene (drop duplicated exons)
gff_exons <- gff %>% 
    filter(type != 'gene') %>%
    group_by(ID) %>%
    distinct(chrom, start, end, type, .keep_all=T) %>%
    mutate(n_exon=sum(type == 'exon')) %>%
    filter(type == 'mRNA') %>% 
    mutate(mrna_len = end - start)

# Combine information and reorder columns
gff_stats <- gff_gene_len %>%
    select(ID, gene_len) %>%
    inner_join(gff_exons, by='ID') %>%
    select(ID, chrom, start, end, type, strand, n_exon, gene_len, mrna_len, attributes)


### VISUALISE ###

# Label generator to get number of samples
n_obs <- function(x){return(data.frame(y=0, label=paste0("n=", length(x))))}

feature_len <- ggplot(data=gff %>% filter(type %in% c("CDS", "gene")), 
                      aes(x=name, y=log10(end - start), fill=name)) +
    geom_violin() +
    stat_summary(fun.data = n_obs, geom = "text") +
    theme_minimal() +
    xlab("") + 
    ylab("log10 length") + 
    ggtitle("Gene length") +
    facet_wrap(.~type) +
    scale_y_continuous(limits = c(0, 6)) + 
    geom_boxplot(width=0.1)

exon_quantiles <- gff_exons %>% 
  group_by(name) %>% 
  summarise(x=list(enframe(quantile(n_exon, probs=c(0.25,0.5,0.75)), "quantiles", "exons"))) %>% 
  unnest(x) %>%
  spread(key=quantiles, value=exons)

# Assign a quantile label to each exon number based on its strain's exon number distribution
exons_tbl <- gff_exons %>%
    inner_join(exon_quantiles, by="name") %>%
    mutate(
        n_exon_quant = case_when(
            n_exon <= `25%` ~ "25% quantile",
            n_exon >= `75%` ~ "75% quantile",
            TRUE            ~ "Median"
        )
    ) %>%
    select(name, n_exon, n_exon_quant)

exons_tbl <- exons_tbl %>% 
    mutate(
        n_exon_quant = factor(
            n_exon_quant,
            ordered=T, 
            levels=c("25% quantile", "Median", "75% quantile")
        )
    )

exon_per_gene <- ggplot(data=exons_tbl, aes(x=n_exon, fill=n_exon_quant)) +
    scale_fill_manual(values=c("#6666ee", "#999999", "#ee6666")) +
    geom_histogram(binwidth=1) +
    theme_minimal() +
    stat_summary(aes(x = 0.1, y = n_exon, xintercept = stat(y), group = name), 
               fun.y = median, geom = "vline") +
    ylab("Genes") +
    xlab("Exons per genes") +
    facet_grid(name~.) +
    scale_x_continuous(limits = c(-0, 100))

### SAVE OUTPUT ###
svg(out_plot, height=10, width=12)
grid.arrange(feature_len, exon_per_gene)
dev.off()
write_tsv(gff_stats, out_tbl)
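
# scripts/01_assembly_stats.R: plot assembly summary statistics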
library(ggplot2)
library(patchwork)
library(readr)
library(tidyr)
library(dplyr)
library(tools)
library(stringr)


args <- commandArgs(trailingOnly=T)

# No need to scale, now
#library(scales)
# minscale <- function(x){s = scale(x); return(s + abs(min(s)))}
#  mutate_at(vars(-assembly), minscale ) %>%

asb <- read_tsv(args[1]) %>%
  rename(n_contigs=number, assembly=filename) %>%
  mutate(assembly=file_path_sans_ext(basename(assembly))) %>%
  select(assembly, total_length, n_contigs, N50, N70, N90, N50n, N70n, N90n) %>%
  pivot_longer(-assembly, 'metric') %>%
  mutate(assembly = gsub("_genome", "", assembly))

svg(args[2])

len_asb <- ggplot(
             data=asb %>% filter(metric == 'total_length'),
             aes(x=assembly, y=value, fill=assembly)
           ) +
        geom_bar(stat='identity') +
        theme_minimal() +
        ylab("Length") +
        xlab("") +
        ggtitle("Assembly length") +
        theme(legend.position="none")

num_tigs <- ggplot(
             data=asb %>% filter(metric == 'n_contigs'),
             aes(x=assembly, y=value, fill=assembly)
            ) +
        geom_bar(stat='identity') +
        theme_minimal() +
        ylab("# contigs") +
        xlab("") +
        ggtitle("Total number of contigs") +
        theme(legend.position="none")

Nxx <- ggplot(
         data=asb %>% filter(str_detect(metric, '0$')),
         aes(x=assembly, y=value, fill=assembly)
       ) +
       geom_bar(stat='identity') +
       facet_wrap(~metric) +
       theme_minimal() +
       xlab("") +
       ylab("Length") +
       theme(legend.position="none")

Nxxn <- ggplot(
          data=asb %>% filter(str_detect(metric, 'n$')),
          aes(x=assembly, y=value, fill=assembly)
        ) +
        geom_bar(stat='identity') +
        xlab("") +
        ylab("# contigs") +
        theme_minimal() +
        facet_wrap(~metric)

(len_asb + num_tigs + plot_spacer()) / (Nxx) / (Nxxn)

dev.off()
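
# scripts/01_plot_busco.py: stacked barplot of BUSCO classes per strain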
from collections import OrderedDict
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt


def read_busco(tbl: str, name: str) -> pd.DataFrame:
    """
    Read a busco summary table into a
    dataframe and create relevant columns
    """
    df = pd.read_csv(tbl, names=["number", "type"], sep="\t", header=None)
    df["strain"] = name
    # Single letter busco class as a column
    df["busco"] = df["type"].str.extract(r".*\(([A-Z])\).*")
    df.loc[pd.isnull(df.busco), "busco"] = "T"
    # Transform number of buscos to percentages
    df["perc"] = 100 * np.round(
        df["number"] / df.number[df.busco == "T"].values[0], decimals=2
    )
    return df


# Concatenate busco tables from all strains and set relevant
# variables as indices
busco = (
    pd.concat([read_busco(tbl, name) for name, tbl in snakemake.input.items()])
    .reset_index(drop=True)
    .sort_values("strain", ascending=False)
    .set_index(["busco", "strain"])
)
mpl.use("Agg")
# Keep track of the order in which strains will be plotted
str_to_num = OrderedDict({"v1": 0, "neff": 1, "c3": 2})
# Map each BUSCO class to a color
col_dict = OrderedDict(
    [("M", "#F04441"), ("F", "#F0E441"), ("D", "#3592C6"), ("S", "#58B4E8"),]
)
# We plot one busco class per iteration
for i, t in enumerate(col_dict.keys()):
    # Subset the busco table for the current BUSCO class
    busco_type = busco.loc[t]
    # Retrieve x index from strains (deterministic order)
    r = [str_to_num[s] for s in busco_type.index.values]
    # Stacked barplot, uses cumulative height of previous bars
    if i:
        plt.barh(r, busco_type.perc.values, color=col_dict[t], left=cum_height)
        cum_height += busco_type.perc.values
    else:
        cum_height = busco_type.perc.values
        plt.barh(r, busco_type.perc.values, color=col_dict[t])


def busco_string(strain):
    """Use the BUSCO summary table to construct the standard busco string"""
    mis = busco.loc[("M", strain), "number"]
    fra = busco.loc[("F", strain), "number"]
    com = busco.loc[("C", strain), "number"]
    sin = busco.loc[("S", strain), "number"]
    dup = busco.loc[("D", strain), "number"]
    tot = busco.loc[("T", strain), "number"]
    return f"{strain} - C:{com}[S:{sin}, D:{dup}], F:{fra}, M:{mis}, n={tot}"


# Write the strain name and busco string besides each bar
plt.yticks(r, [busco_string(s) for s in str_to_num.keys()])
plt.xlabel("% BUSCOs")
plt.ylabel("Strain")
plt.savefig(str(snakemake.output))
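
# scripts/01_plot_karyo.py: draw the karyotype plot with annotated features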
import sys
import re
from collections import defaultdict
import csv
from Bio import SeqIO
from Bio.Graphics import BasicChromosome
from Bio.SeqFeature import SeqFeature, FeatureLocation
from reportlab.lib.units import cm
from matplotlib import cm as mcm
from matplotlib.colors import to_hex

logh = open(snakemake.log[0], "w")
# {chrom: length}
entries = {}
for rec in SeqIO.parse(snakemake.params["genome"], "fasta"):
    if len(rec.seq) > 100000:
        entries[rec.id] = len(rec.seq)
# all available colors
# cols = list(colors.getAllNamedColors().values())[8:]
cols = [to_hex(mcm.tab10(c)) for c in range(len(entries))]
# {chrom: [features]} where each feature carries a color qualifier for its type
features = defaultdict(list)
type_id = 0
ftype_to_col = {}
with open(snakemake.input["features"]) as bedh:
    bed = csv.DictReader(
        bedh, delimiter="\t", fieldnames=["chrom", "start", "end", "type"]
    )
    for feat in bed:
        start, end = int(feat["start"]), int(feat["end"])
        if start > end:
            start, end = end, start
        try:
            if end > entries[feat["chrom"]]:
                continue
        except KeyError:
            continue
        if feat["type"] not in ftype_to_col.keys():
            ftype_to_col[feat["type"]] = cols[type_id]
            logh.write(f"{feat['type']}: {cols[type_id]}\n")
            type_id += 1
        features[feat["chrom"]].append(
            SeqFeature(
                location=FeatureLocation(int(feat["start"]), int(feat["end"])),
                type=feat["type"],
                qualifiers={"color": [ftype_to_col[feat["type"]]]},
            )
        )

max_len = max(entries.values())
telomere_length = 10000  # For illustration

chr_diagram = BasicChromosome.Organism(output_format="svg")
chr_diagram.page_size = (29.7 * cm, 21 * cm)  # A4 landscape

for name, length in entries.items():
    # features = [f for f in record.features if f.type == "tRNA"]
    chrom_feat = features[name]

    chrom_num = re.sub(r"^.*_([0-9]+)$", r"\1", name)
    cur_chromosome = BasicChromosome.Chromosome(chrom_num)
    # Set the scale to the MAXIMUM length plus the two telomeres in bp;
    # the same scale is used for all chromosomes so they can be
    # compared to each other
    cur_chromosome.scale_num = max_len + 2 * telomere_length

    # Add an opening telomere
    start = BasicChromosome.TelomereSegment()
    start.scale = telomere_length
    cur_chromosome.add(start)

    # Add a body - again using bp as the scale length here.
    body = BasicChromosome.AnnotatedChromosomeSegment(length, chrom_feat)
    body.scale = length
    cur_chromosome.add(body)

    # Add a closing telomere
    end = BasicChromosome.TelomereSegment(inverted=True)
    end.scale = telomere_length
    cur_chromosome.add(end)

    # This chromosome is done
    chr_diagram.add(cur_chromosome)
chr_diagram.draw(snakemake.output[0], snakemake.params["title"])
logh.close()
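
# scripts/02_get_v1_hgt_gff.py: extract v1 HGT candidate genes and their exon counts from the GFF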
import re

hgt_ids = {i.strip() for i in open(snakemake.input[0])}
outf = open(snakemake.output[0], "w")
with open(snakemake.params["gff"], "r") as gff:
    hgt = False
    n_exons = 0
    for line in gff:
        fields = line.split("\t")
        if re.match("^#", line):
            continue
        elif fields[2] == "gene":
            # ID chrom start end type strand n_exon gene_len mrna_len attributes
            if hgt:
                outf.write(f"{curr_hgt.strip()}\t{n_exons}\n")
            gene_id = re.search("gene:([^;]+);", fields[8]).group(1)
            if gene_id in hgt_ids:
                hgt = True
                n_exons = 0
                curr_hgt = line
            else:
                hgt = False
        elif fields[2] == "exon":
            n_exons += 1
    # Flush the last gene of the file if it was an HGT candidate
    if hgt:
        outf.write(f"{curr_hgt.strip()}\t{n_exons}\n")
outf.close()
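
# scripts/03_rdna_hic.py: plot the Hi-C matrix with rDNA contact profiles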
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import patches as mpatches
import matplotlib as mpl
import matplotlib.gridspec as gridspec
import cooler

## LOAD CLI ARGS
cool_path = sys.argv[1]
rdna_path = sys.argv[2]
out_path = sys.argv[3]

# Discard contigs smaller than this size
# Note: assumes chromosomes are sorted by size in the cool file
MIN_SIZE = 100000

## LOAD INPUT FILES
c = cooler.Cooler(cool_path)
gff = pd.read_csv(rdna_path, sep="\t", comment="#", usecols=range(9))
gff.columns = [
    "seqname",
    "source",
    "feature",
    "start",
    "end",
    "score",
    "strand",
    "frame",
    "attribute",
]

## EXTRACT COOL DATA
chroms = c.chroms()[:]
# Discard all chromosomes after the first shorter than MIN_SIZE
chroms["include"] = chroms.length > MIN_SIZE
stop_bin = c.extent(chroms.loc[np.where(~chroms.include)[0][0], "name"])[1]
bins = c.bins()

# Remove chromosomes that are absent from cool or too small
gff = gff.loc[np.isin(gff.seqname, chroms.loc[chroms.include, "name"]), :]

## TRANSFORM

# Transform rDNA coords to bins
gff["minbin"] = gff.apply(
    lambda r: np.min(
        bins.fetch(f"{r.seqname}:{r.start}-{r.end}").index.values
    ),
    axis=1,
)

gff["maxbin"] = gff.apply(
    lambda r: np.max(
        bins.fetch(f"{r.seqname}:{r.start}-{r.end}").index.values
    ),
    axis=1,
)

gff = gff.drop_duplicates(subset=["attribute", "minbin", "maxbin"])

## VISUALIZE
mat = c.matrix(balance=False)[:stop_bin, :stop_bin]
# np.fill_diagonal(mat, 0)

# Define color saturation threshold
# sat_thr = np.percentile(mat[mat > 0], 85)

# Plot heatmap with rDNA overlay

plt.style.use("seaborn-white")
gs = gridspec.GridSpec(4, 1, height_ratios=[5, 1, 1, 1])
mpl.rcParams["figure.figsize"] = (5, 10)
fig = plt.figure()
ax1 = fig.add_subplot(gs[0])
ax2 = fig.add_subplot(gs[1])
ax3 = fig.add_subplot(gs[2])
ax4 = fig.add_subplot(gs[3])
fig.align_xlabels()
ax2.margins(x=0)
ax3.margins(x=0)
ax4.margins(x=0)

coldict = {
    "18s_rRNA": ("b", "-.", "18S", ax3),
    "28s_rRNA": ("g", ":", "28S", ax2),
    "8s_rRNA": ("r", "--", "5S", ax4),
}
legend_entries = [
    mpatches.Patch(color=v[0], label=v[2], ls=v[1]) for k, v in coldict.items()
]
plt.legend(handles=legend_entries, loc="upper right")
# mat[np.isnan(mat)] = 0
ax1.imshow(np.log(mat), cmap="Reds", interpolation="none", rasterized=True)

# Add vertical lines to line plots at rdna positions
for r in gff.iterrows():
    coldict[r[1]["attribute"]][3].axvline(
        x=r[1]["maxbin"],
        alpha=0.4,
        # linestyle=coldict[r[1]["attribute"]][1],
        lw=0.3,
        c=coldict[r[1]["attribute"]][0],
    )

# Add chromosome grid
for r in chroms.iterrows():
    if r[1].include:
        ax2.axvline(x=c.extent(r[1]["name"])[1], alpha=0.4, c="grey", lw=0.5)
        ax3.axvline(x=c.extent(r[1]["name"])[1], alpha=0.4, c="grey", lw=0.5)
        ax4.axvline(x=c.extent(r[1]["name"])[1], alpha=0.4, c="grey", lw=0.5)


# Subset bins containing rRNA
rdna_bins = {
    sub: np.unique(gff.maxbin[gff.attribute == sub])
    for sub in np.unique(gff.attribute)
}
# Compute contact sum per bin for each subunit
rdna_sig = {}
rdna_sig["28S"] = mat[rdna_bins["28s_rRNA"], :].mean(axis=0)
rdna_sig["18S"] = mat[rdna_bins["18s_rRNA"], :].mean(axis=0)
rdna_sig["5S"] = mat[rdna_bins["8s_rRNA"], :].mean(axis=0)

ax2.plot(
    range(mat.shape[0]), np.log10(rdna_sig["28S"]), label="28S", lw=0.5, c="g"
)
ax3.plot(
    range(mat.shape[0]), np.log10(rdna_sig["18S"]), label="18S", lw=0.5, c="b"
)
ax4.plot(
    range(mat.shape[0]), np.log10(rdna_sig["5S"]), label="5S", lw=0.5, c="r"
)

fig.savefig(out_path, dpi=2500)
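
# scripts/04_compute_seq_divergence.py: compute per-block divergence from a PAF file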
import sys
import re


def get_cigar_gaps(cigar):
    # Match all indels and capture indel size
    gap_regex = re.compile(r"([0-9]+)[ID]")
    total_gap_size = 0
    for gap_size in re.findall(gap_regex, cigar):
        total_gap_size += int(gap_size)
    return total_gap_size


with open(sys.argv[1]) as paf, open(sys.argv[2], "w") as outf:
    for line in paf:
        align = line.split("\t")
        if align[16] != "tp:A:P":  # Only consider primary alignments
            continue
        cigar = align[-1]
        n_gaps = get_cigar_gaps(cigar)
        n_ambiguous = int(align[15].lstrip("nn:i:"))
        n_bad = int(align[12].lstrip("NM:i:"))
        n_matches = int(align[9])
        n_mismatches = n_bad - (n_ambiguous + n_gaps)
        div = n_mismatches / (n_matches + n_mismatches)
        outf.write(f"{align[5]}\t{align[7]}\t{align[8]}\t{div}\n")
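
# scripts/04_gen_circos_files.sh: build circos karyotype and link files from MCScanX output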
REF="$1"
# MCscanX output prefix
MCSX="$2"
CIR_DIR=$3
CAND="$4"

# Only include scaffolds larger than 100kb
len_cutoff=100000
# Karyotype
awk '/^>/ {if (seqlen){print seqlen}; printf "%s ",$0;seqlen=0;next;}
     {seqlen += length($0)}
     END{print seqlen}' $REF |
  awk -v minlen=$len_cutoff '$2 >= minlen {print "chr","-",$1,$1,"0",$2,"Neff"}' |
  tr -d '>' \
  >${CIR_DIR}/karyotype.txt

# Aesthetics, can be commented out: list Neff scaffolds first and reverse-order
# the rest. Looks better for a 2-strain comparison.
grep "Neff_scaffold" ${CIR_DIR}/karyotype.txt > ${CIR_DIR}/neff_karyo.txt
grep -v "Neff_scaffold" ${CIR_DIR}/karyotype.txt | sort -rV > ${CIR_DIR}/other_karyo.txt
cat ${CIR_DIR}/neff_karyo.txt ${CIR_DIR}/other_karyo.txt > ${CIR_DIR}/karyotype.txt
rm  ${CIR_DIR}/neff_karyo.txt ${CIR_DIR}/other_karyo.txt
# Candidate genes
#tail -n +2  $CAND |
#  awk '{print $1,$2,$3}' \
#  >${CIR_DIR}/hgt_candidates.txt

# Parsing MCScanX collinearity output into circos links
n_genes=$(grep -v "^#" "$MCSX.collinearity" | wc -l)
n_done=0
while read line
do
  # (Disabled) Make links involving HGT candidates red
  g1="${line%% *}"
  g2="${line##* }"
#  attr=$( \
#    awk -v g1=${g1%%-*} -v g2=${g2%%-*} \
#        'BEGIN{attr="color=lgrey_a4,z=0"} \
#        {if($0 ~ g1 || $0 ~ g2){attr="color=dred,z=30";exit}} \
#        END{print attr}' \
#        ${CAND} \
#  )
  attr='lgrey'
  # Get coordinate of gene ID from GFF file
  awk -v g1=$g1 -v g2=$g2 -v attr="$attr" \
    'BEGIN{N1=0;N2=0}
     $2 ~ g1 {
       N1+=1;c1=$1;s1=$3;e1=$4}
     $2 ~ g2 {
       N2+=1;c2=$1;s2=$3;e2=$4}
     END{
       if (N1 < 1 || N2 < 1) {exit 1}
       else {print c1,s1,e1,c2,s2,e2,attr}}' "$MCSX.gff"
  echo -ne "  $n_done / $n_genes         \r" 1>&2
  (( n_done++ ))
  # Convert gene IDs to coordinates
done < <(grep -v "^#" "$MCSX.collinearity" | tr -d ' ' | tr '\t' ' ' | cut -d$' ' -f2,3)  \
     > "${CIR_DIR}/mcsx.txt"
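
# scripts/04_gen_mcscanx_input.sh: prepare the GFF and BLAST inputs for MCScanX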
function usage () {
   cat <<EOF
Usage: `basename $0` -g genes -o output_folder -f ref [-h]
   -g   gtf file with genes coordinates
   -o   output folder for MCScanX files
   -f   reference genome
   -c   number of cpus to use for BLAST
   -h   displays this help
EOF
   exit 0
}


N_CPU=1
# Parsing CL arguments
while getopts ":g:o:f:c:h" opt; do
   case $opt in
   g )  GFF=${OPTARG} ;;
   o )  OUT_F=${OPTARG} ;;
   f )  FASTA=${OPTARG};;
   c )  N_CPU=${OPTARG} ;;
   h )  usage ;;
   \?)  usage ;;
   esac
done
# Testing if mandatory arguments have been provided
if [ "x" == "x$FASTA" ] || [ "x" == "x$GFF" ] || \
   [ "x" == "x$OUT_F" ];
then
  echo "Error: A reference genome, input GFF file and output folder \
  are required."
  usage
  exit 0
fi


#1: format GFF file for MCScanX, only keeping CDS
echo "Formatting GFF..."
OUT_GFF="$OUT_F/MCScanX_genes.gff"
awk 'BEGIN{OFS="\t"}
     $3 ~ "CDS" {id_match=match($9,/ID=[^;]*/)
      id=substr($9,RSTART+3,RLENGTH-3)
      print $1,id,$4,$5}' $GFF > $OUT_GFF
echo "GFF formatted !"

#2: build blast database from sequences and all vs all blast
BASE_OUT=$(basename $FASTA)
BASE_OUT=${BASE_OUT%%.*}
makeblastdb -in $FASTA -dbtype prot -out "${OUT_F}/${BASE_OUT}"

echo "Blasting transcriptome against itself."
blastp -query $FASTA \
       -db "${OUT_F}/${BASE_OUT}" \
       -outfmt 6 \
       -max_target_seqs 5 \
       -num_threads $N_CPU \
       -out "${OUT_F}/${BASE_OUT}.blast"

#3 Renaming chromosomes according to MCScanX conventions
MC_IN="$OUT_F/MCScanX_in"
# Use two first characters of fasta file as organism abbreviation
#SP_CODE=${BASE_OUT:0:2}
#sed "s/^scaffold_\([0-9]*\)/$SP_CODE\1/g" | 
sed 's/\.cds//g' "$OUT_GFF" > "$MC_IN.gff"
#sed "s/scaffold_\([0-9]*\)/$SP_CODE\1/g" | 
sed 's/::[^	]*//g' "${OUT_F}/${BASE_OUT}.blast" > "$MC_IN.blast"


echo "All input files are ready:"
echo "  - BLAST output: $MC_IN.blast"
echo "  - GFF file: $MC_IN.gff"
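
# scripts/04_plot_seq_divergence.py: joint plot of alignment block length versus divergence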
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 12))
div = pd.read_csv(
    snakemake.input[0], sep="\t", names=["chrom", "start", "end", "div"]
)
div = div.sort_values(["chrom", "start"])
div["align_len"] = abs(div.end - div.start)
div["log_len"] = np.log10(div.align_len)
jp = sns.jointplot(
    data=div,
    x="log_len",
    y="div",
    kind="kde",
    shade=True,
    cmap="plasma",
    levels=200,
)
jp.set_axis_labels(
    "Log10 length of aligned block [bp]",
    "Per base divergence in aligned blocks",
)
avg_div = np.average(div["div"], weights=div["align_len"])
avg_log_len = np.mean(div.log_len)
jp.ax_joint.axvline(x=avg_log_len, linestyle="--")
jp.ax_joint.axhline(y=avg_div, linestyle="--")
jp.ax_marg_x.axvline(x=avg_log_len)
jp.ax_marg_y.axhline(y=avg_div)
jp.fig.suptitle(
    "Neff - C3 gap excluded sequence divergence\n"
    f"mean divergence length: {100*avg_div:.2f}%"
)
plt.savefig(snakemake.output[0])
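
# scripts/06_orthocount_to_fasta.py: convert orthogroup gene counts to a presence/absence fasta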
import sys
import pandas as pd

gene_count = sys.argv[1]
out_fasta = sys.argv[2]

gc = pd.read_csv(gene_count, sep="\t")


def get_presence(freqs):
    """
    Given an array of frequencies, return the gene presence code

    Parameters
    ----------
    freqs : numpy array of ints
        1D array of gene frequencies.

    Returns
    -------
    str :
        Gene presence code, in the form of a string of 1 (present) and 0 (absent)

    Examples
    --------
    >>> get_presence(np.array([32, 1, 0, 0, 2]))
    '11001'
    """
    freqs[freqs > 1] = 1
    freqs = freqs.astype(str)
    return "".join(freqs)


# Store species names and associated ortholog presence code in a dict
# (skipping the first column, Orthogroup, and the last, Total)
sp_presence = {
    gc.columns[i]: get_presence(gc.iloc[:, i])
    for i in range(1, gc.shape[1] - 1)
}

with open(out_fasta, "w") as fa:
    for sp, presence in sp_presence.items():
        fa.write(">" + str(sp) + "\n")
        fa.write(presence + "\n")
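
# scripts/06_plot_orthogroups_venn.py: Venn diagrams of orthogroup sharing between strains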
import pandas as pd
from matplotlib_venn import venn2, venn3
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.use("Agg")
vdf = pd.read_csv(snakemake.input["pres_compact"], sep="\t").drop(
    "Orthogroup", axis=1
)
vdf = vdf.astype(bool)
# vdf = vdf.rename(columns={'NEFF_v1_hgt_cds': 'v1'})
# Make Venn diagram. sets are A, B, AB, C, AC, BC, ABC
fig, ax = plt.subplots(1, 2, figsize=(15, 12))
venn3(
    subsets=(
        len(vdf.query("    c3 and not neff and not amoeba")),  # A
        len(vdf.query("not c3 and     neff and not amoeba")),  # B
        len(vdf.query("    c3 and     neff and not amoeba")),  # AB
        len(vdf.query("not c3 and not neff and     amoeba")),  # C
        len(vdf.query("    c3 and not neff and     amoeba")),  # AC
        len(vdf.query("not c3 and     neff and     amoeba")),  # BC
        len(vdf.query("    c3 and     neff and     amoeba")),  # ABC
    ),
    set_labels=("C3", "Neff", "Amoeba"),
    ax=ax[0],
)
venn2(
    subsets=(
        len(vdf.query("    c3 and not neff")),
        len(vdf.query("not c3 and     neff")),
        len(vdf.query("    c3 and     neff")),
    ),
    set_labels=("C3", "Neff"),
    ax=ax[1],
)
fig.savefig(snakemake.output["venn"])
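
# scripts/go_enrich.R: GO term enrichment of HGT candidates with topGO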
library(tidyverse)
library(topGO)
library(viridis)

# Parse CL args
args <- commandArgs(trailingOnly=T)

# Extract gene IDs and GO terms.
annot_tbl <- read_tsv(args[1], col_types='cciicciiicciiiiiii') %>%
  dplyr::select(ID, attributes, hgt) %>%
  mutate(GO=str_replace(attributes, '.*Ontology_term=([^;]*);', "\\1"))
hgt_candidates <- annot_tbl %>%
  filter(hgt == 1) %>%
  pull(ID)

out_file <- args[2]
tmp_mapfile <- paste(dirname(out_file), 'id2go.tsv', sep='/')
out_fig <- paste0(tools::file_path_sans_ext(out_file), ".svg")

# Write geneID - GO terms mapping to file
annot_tbl %>%
  dplyr::select(ID, GO) %>%
  mutate(GO = str_replace_all(GO, ";", ", ")) %>%
  write_tsv(tmp_mapfile)


geneID2GO <- readMappings(tmp_mapfile)
geneNames <- names(geneID2GO)
geneList <- factor(as.integer(geneNames %in% hgt_candidates))
names(geneList) <- geneNames
GOdata <- new(
  "topGOdata",
  description = "HGT analyis in A. castellanii",
  ontology = "BP",
  allGenes = geneList, 
  nodeSize = 10,
  gene2GO = geneID2GO, 
  annot = annFUN.gene2GO
)


fisher.stat <- new(
  "classicCount",
  testStatistic = GOFisherTest, 
  name = "Fisher test"
)
resultFisher <- getSigGroups(GOdata, fisher.stat)
weight.stat <- new(
  "weightCount",
  testStatistic = GOFisherTest, 
  name = "Fisher test with weight algorithm",
  sigRatio = 'ratio'
)
resultWeight <- getSigGroups(GOdata, weight.stat)

#showSigOfNodes(GOdata, score(resultWeight), firstSigNodes = 9, useInfo = 'all')

# Get 'significant' terms
pvals <- score(resultWeight)
sig <- pvals[pvals<0.01]

termStat(GOdata, names(sig))

# Report top enriched GO terms
allRes <- GenTable(
  GOdata,
  classic = resultFisher,
  weight = resultWeight, 
  orderBy = "weight",
  ranksOf = "classic",
  topNodes = 200
)

write_tsv(allRes, out_file)
allRes <- allRes %>%
	mutate(GO.ID=paste(GO.ID, Term, sep=', '))

tidy_res = as.tibble(allRes) %>%
  mutate(
    weight = as.numeric(weight),
    classic = as.numeric(classic),
    GO.ID = factor(
      GO.ID,
      ordered=T,
      levels=unique(GO.ID[ordered(weight)])
    )
  )

# Visualise top enriched GO terms
svg(out_fig)
ggplot(
  data=tidy_res %>% top_n(30, -weight), 
  aes(x=GO.ID, y=-log10(weight))
) + 
  geom_segment(
    aes(xend=GO.ID, yend=min(-log10(weight))),
    size=1.1
  ) +
  geom_point(aes(size=Annotated, color=Significant / Expected)) + 
  geom_hline(aes(yintercept=2), lty=2, col='red') +
  theme_minimal() + 
  xlab("") +
  ylab("-log10 pvalue") +
  ggtitle("GO enrichment test in HGT candidates\n(Fisher exact test, weight algorithm)") +
  scale_color_viridis() +
  coord_flip()
dev.off()
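
# Run GO-term enrichment on the annotated HGT candidates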
shell: "Rscript scripts/go_enrich.R {input.annot} {input.candidates} {output}"
Repository: https://github.com/cmdoret/Acastellanii_genome_analysis (MIT License)