Snakemake workflow for Illumina RNA-sequencing experiments - extract population genomic signals from RNA-Seq data

Version: v2.0.1

RNA-Seq-Pop

Documentation: https://sanjaynagi.github.io/rna-seq-pop/

Walkthrough: https://www.youtube.com/watch?v=5QQe7DLHO4M

RNA-Seq-Pop is a computational pipeline for analysing Illumina RNA-sequencing data of any organism. As well as performing standard transcriptomic analyses, such as differential expression, RNA-Seq-Pop also calls and analyses genetic polymorphisms, extracting population genomic signals. The workflow can perform the following analyses of Illumina single- or paired-end RNA-sequencing data:

  • Quality control of fastq reads with fastqc; QC metrics are integrated into a final report with MultiQC.

  • Differential expression analysis with Kallisto at the gene level (DESeq2) and transcript level (Sleuth), plus gene set enrichment analyses.

  • Variant calling with freebayes.

  • Fst and the population branch statistic (PBS), both in windows across contigs and at the gene level (scikit-allel); see the sketch after this list.

  • Allele frequencies at variants of interest (pre-specified loci of choice).

  • Various summary statistics (Watterson's theta, sequence diversity, Dxy, etc.).

  • Analysis of Ancestry Informative Markers (AIMs) to determine the relative ancestry of An. gambiae/coluzzii/arabiensis (Anopheles gambiae s.l.).

  • Karyotyping of chromosomal inversions using compkaryo (Anopheles gambiae s.l.).
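
To give a flavour of the windowed statistics above, here is a minimal, illustrative sketch of windowed Hudson's Fst with scikit-allel. The toy genotypes, the 5/5 sample split and the 100 kb window size are invented for the example; this is not RNA-Seq-Pop's exact code. PBS is computed analogously from three populations (allel.pbs).

import numpy as np
import allel

# toy data: 1,000 biallelic SNPs, 10 diploid samples in two populations of 5
rng = np.random.default_rng(42)
geno = allel.GenotypeArray(rng.integers(0, 2, size=(1000, 10, 2)))
pos = np.sort(rng.choice(1_000_000, size=1000, replace=False))

ac1 = geno.take(np.arange(0, 5), axis=1).count_alleles()   # population 1
ac2 = geno.take(np.arange(5, 10), axis=1).count_alleles()  # population 2

# Hudson's Fst averaged in 100 kb windows along the contig
fst, windows, counts = allel.windowed_hudson_fst(pos, ac1, ac2, size=100_000)
print(fst)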

The workflow is generalised and will function with any Illumina single- or paired-end RNA-sequencing data. However, certain modules, such as ancestry and karyotyping, are only appropriate for An. gambiae s.l.; these can be activated in the configuration file (config.yaml), as sketched below. The workflow works with pooled samples, diploid individuals, or haploid individuals.
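
As a rough illustration of how such toggles work (the key names below are hypothetical; consult the documentation for the real config schema), a module is switched on or off with a boolean flag in config.yaml, which the workflow reads before deciding which rules to run:

import yaml

# hypothetical config.yaml excerpt; real key names may differ
config = yaml.safe_load("""
ancestry:
  activate: false   # AIM analysis, An. gambiae s.l. only
karyotype:
  activate: false   # compkaryo inversion karyotyping, An. gambiae s.l. only
""")

if config["ancestry"]["activate"]:
    print("ancestry (AIM) module enabled")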

If you have any feedback on how the workflow may be improved, please do get in touch, or feel free to fork the GitHub repo and create a pull request for any additional features you would like to implement. If you are using the workflow and would like to give feedback or troubleshoot, consider joining the Discord server; otherwise, email or log an issue on GitHub.

Citation

Sanjay C Nagi, Ambrose Oruni, David Weetman, Martin J Donnelly (2022). RNA-Seq-Pop: Exploiting the sequence in RNA-Seq - a Snakemake workflow reveals patterns of insecticide resistance in the malaria vector Anopheles gambiae. Molecular Ecology Resources; doi: https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13759

If you use this workflow in a paper, please give credit to the authors by citing the manuscript and its DOI: https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13759

Usage

Please see the [documentation](https://sanjaynagi.github.io/rna-seq-pop/) for instructions on how to run and contribute to RNA-Seq-Pop.

Release notes

  • 1.0.4 - Config files have been slightly restructured to make them neater. FDR control in GSEA. The example config is now up to date.

  • 1.0.3 - Support for single-end reads; Venn diagrams and custom heatmaps added; updated results folder structure. Fastqs can now be specified in the sample metadata.

  • 1.0.2 - Changed Pi and theta calculations to be performed in windows across the genome, and removed plotting from SummaryStats.py. Changed use of 'chrom' to 'contig'.

  • 1.0.1 - New feature to plot a heatmap of various gene families.

Code Snippets

wrapper:
    "v1.15.0/bio/kallisto/index"
wrapper:
    "v1.15.0/bio/kallisto/quant"
shell:
    "gzip -d -c {input} > {output} 2> {log}"
wrapper:
    "v1.15.0/bio/samtools/faidx"
shell:
    "hisat2-build -p {threads} {input.fasta} {params.prefix}  2> {log}"
shell:
    """
    hisat2 {params.extra} --threads {threads} -x {params.idx} {params.readflags} 2> {log.align} | 
    samblaster {params.samblaster} 2> {log.sort} | samtools sort -@{threads} - -o {output} 2>> {log.sort}
    """
wrapper:
    "v1.15.0/bio/samtools/index"
shell: 
    """
    python -m ipykernel install --user --name=pythonGenomics 2> {log}
    """
shell:
    """
    papermill -k pythonGenomics {input.input_nb} {output.out_nb} -p wkdir {params.wkdir} 2> {log}
    """
script:
    "../scripts/DeseqGeneDE.R"
script:
    "../scripts/SleuthIsoformsDE.R"
script:
    "../scripts/ProgressiveGenes.R"
shell:
    """
    papermill {input.nb} {output.nb} -k pythonGenomics -p go_path {input.gaf} -p metadata_path {input.metadata} -p config_path {params.config} -p selection {params.selection} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
script:
    "../scripts/Ag1000gSweepsDE.py"
shell:
    """
    papermill {input.nb} {output.nb} -k pythonGenomics -p metadata_path {input.metadata} -p config_path {params.config_path} -p go_path {input.eggnog} -p normcounts_path {input.normcounts} -p pfam_path {input.pfam} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
shell:
    """
    papermill {input.nb} {output.nb} -k pythonGenomics -p wkdir {params.wd} -p metadata_path {input.metadata} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
shell:
    """
    papermill {input.nb} {output.nb} -k pythonGenomics -p wkdir {params.wd} -p dataset {params.dataset} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
shell:
    "bcftools concat {input.calls} | vcfuniq > {output} 2> {log}"
wrapper:
    "v1.15.0/bio/bcftools/view"
shell:
    "snpEff download {params.ref} -dataDir {params.dir} 2> {log}"
shell:
    """
    snpEff eff {params.db} -dataDir {params.dir} -csvStats {output.csvStats} {input.calls} > {params.prefix} 2> {log}
    bgzip {params.prefix}
    """
shell:
    """
    jupyter-book build --all docs/rna-seq-pop-results --path-output results/rna-seq-pop-results 2> {log}
    ln -sf docs/rna-seq-pop-results/_build/html/index.html {params.dataset}-results.html 2>> {log}
    """
script:
    "../scripts/VennDiagrams.py"
From line 23 of rules/misc.smk
script:
    "../scripts/checkInputs.py"
From line 22 of rules/qc.smk
wrapper:
    "v2.2.1/bio/fastp"
wrapper:
    "v1.15.0/bio/samtools/flagstat"
From line 54 of rules/qc.smk
shell:
    "mosdepth --threads {threads} --fast-mode {params.prefix} {input.bam}"
shell:
    """
    bcftools stats {input} > {output} 2> {log}
    """
wrapper:
    "v2.2.1/bio/multiqc"
shell:
    """
    papermill {input.nb} {output.nb} -k pythonGenomics -p wkdir {params.wd} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
script:
    "../scripts/SNPstatistics.py"
shell:
    """
    papermill {input.nb} {output.nb} -k pythonGenomics -p metadata_path {input.metadata} -p dataset {params.dataset} -p config_path {params.config_path} -p ploidy {params.ploidy} -p missingprop {params.missingprop} -p qualflt {params.qualflt} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
shell:
    """
    papermill {input.nb} {output.nb} -k pythonGenomics -p metadata_path {input.metadata} -p dataset {params.dataset} -p config_path {params.config_path} -p ploidy {params.ploidy} -p missingprop {params.missingprop} -p qualflt {params.qualflt} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
shell:
    """
    papermill {input.nb} {output.nb} -k pythonGenomics -p config_path {params.config} -p pbs {params.pbs} -p pbscomps {params.pbscomps} -p metadata_path {input.metadata} -p dataset {params.dataset} -p ploidy {params.ploidy} -p missingprop {params.missingprop} -p qualflt {params.qualflt} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
script:
    "../scripts/PerGeneFstPBS.py"
script:
    "../scripts/AncestryInformativeMarkers.py"
shell:
    """
    paste <(bcftools query -l {input.vcf}) \
    <(python {params.basedir}/scripts/compkaryo/compkaryo/compkaryo.py {input.vcf} {wildcards.karyo} -p {params.ploidy}) | 
    column -s $'\\t' -t | sort -k 1 > {output}
    """
shell:
    """
    papermill {input.nb} {output.nb} -k pythonGenomics -p config_path {params.config_path} -p metadata_path {params.metadata} -p dataset {params.dataset} -p ploidy {params.ploidy} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
script:
    "../scripts/GenerateFreebayesParams.R"
shell:
    "freebayes -f {input.ref} -t {input.regions} --ploidy {params.ploidy} --populations {input.pops} --pooled-discrete --use-best-n-alleles 5 -L {input.samples} > {output} 2> {log}"
shell:
    """
    samtools mpileup {input.bam} -r {params.region} -f {params.ref} 2> {log} | 
    python {params.basedir}/scripts/BaseParser.py > {output} 2>> {log}
    """
script:
    "../scripts/VariantsOfInterestAlleleBalance.R"
shell:
    """
    papermill {input.nb} {output.nb} -k pythonGenomics -p voi_path {input.VariantsOfInterest} 2> {log}
    cp {output.nb} {output.docs_nb} 2>> {log}
    """
import sys
sys.stderr = open(snakemake.log[0], "w")

import pandas as pd
import numpy as np
from collections import defaultdict

# Read in parameters 
pval_threshold = snakemake.params['pval']
upper_fc = snakemake.params['fc']
lower_fc = 1/upper_fc # an upper FC threshold of 2 implies a lower threshold of 0.5
# Read in list of contrasts
comparisons = pd.DataFrame(snakemake.params['DEcontrasts'], columns=['contrast'])

# Read in .csv file containing selection signals and associated metadata
signals = pd.read_csv("resources/signals.csv")

for comp in comparisons['contrast']:

    # Load DE results
    DEgenes = pd.read_csv(f"results/genediff/{comp}.csv")
    # Filter to significantly DE genes, either up- or down-regulated beyond the FC threshold
    sigup = DEgenes[np.logical_and(DEgenes['padj'] < pval_threshold, np.logical_or(DEgenes['FC'] > upper_fc, DEgenes['FC'] < lower_fc))]

    sweep = {}
    nswept = {}

    # Loop through each signal, recording if any DE genes are found in one
    for i, cols in signals.iterrows():

        if pd.isnull(cols['overlapping_genes']): # Skip signal if no overlapping genes
            continue

        sweptgenes = np.array(cols['overlapping_genes'].split(" "))

        # Get boolean array - if list of swept genes isin our DE genes
        overlap = np.isin(sweptgenes, sigup['GeneID'])

        sweep[cols['uid']] = sweptgenes[overlap]
        nswept[cols['uid']] = sweptgenes

    genes = np.concatenate(list(sweep.values()))
    swept = sigup[np.isin(sigup['GeneID'], genes)]

    for k,v in sweep.items():
        sweep[k] = ' '.join(v)

    # Build dataframe and add sweep metadata columns
    sweptDE = pd.DataFrame.from_dict(sweep, orient='index', columns=['overlapping_DE_genes'])
    sweptDE = sweptDE.reset_index().rename(columns={'index': 'Ag1000g_sweep'})
    sweptDE['overlapping_genes'] = signals['overlapping_genes'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True)
    sweptDE['chromosome'] = signals['peak_end_seqid'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True)
    sweptDE['epicenter'] = signals['epicenter_coord'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True)
    sweptDE['known_loci'] = signals['known_loci'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True)

    wheresweep = defaultdict(dict)
    whatsweep = defaultdict(list)

    # Now loop through each swept gene to find name of sweeps it lies under
    # And location of sweep
    for gene in genes:

        for i, cols in sweptDE.iterrows():

            sweptgenes = np.array(cols['overlapping_DE_genes'].split(" "))

            if np.isin(sweptgenes, gene).any():
                wheresweep[gene]['contig'] = cols['chromosome']
                wheresweep[gene]['epicenter'] = cols['epicenter']
                wheresweep[gene]['known_loci'] = cols['known_loci']

                whatsweep[gene].append(cols['Ag1000g_sweep'])

    # Join name of sweeps column into a single string, so it fits in one column of data frame
    for k,v in whatsweep.items():
        whatsweep[k] = ' '.join(v)

    dfwhere = pd.DataFrame.from_dict(wheresweep, orient='index')
    dfwhat = pd.DataFrame.from_dict(whatsweep, orient='index', columns=['Ag1000g_sweeps'])

    df = pd.concat([dfwhat, dfwhere], axis=1)
    df = df.reset_index().rename(columns={'index': 'GeneID'})

    swept = swept.merge(df)
    swept.to_csv(f"results/genediff/ag1000gSweeps/{comp}_swept.tsv", sep="\t")
import sys
sys.stderr = open(snakemake.log[0], "w")

import rnaseqpoptools as rnaseqpop
import pandas as pd
import numpy as np
import zarr
import allel
from collections import defaultdict

### AIMS ###
dataset = snakemake.params['dataset']
metadata = rnaseqpop.load_metadata(snakemake.input['metadata'])
metadata = metadata.sort_values(by='species').reset_index(drop=True)
contigs = snakemake.params['contigs']
ploidy = snakemake.params['ploidy']
numbers = rnaseqpop.get_numbers_dict(ploidy)
qualflt = snakemake.params['qualflt']
missingprop = snakemake.params['missingprop']

# read AIMs
aims = zarr.open(snakemake.input['aims_zarr_gambcolu'], mode='r')

## initialize dicts
ancestryPerAim = {}
aims_chrom_gamb = {}
aims_chrom_colu = {}
all_gamb = defaultdict(list)
all_colu = defaultdict(list)
n_aims_per_chrom = {}

for contig in contigs:

    # read in and filter data
    path = f"results/variantAnalysis/vcfs/{dataset}.{contig}.vcf.gz"
    vcf, geno, acsubpops, pos, depth, snpeff, subpops, pops =  rnaseqpop.readAndFilterVcf(path=path,
                                                               contig=contig,
                                                               samples=metadata,
                                                               numbers=numbers,
                                                               ploidy=ploidy,
                                                               qualflt=qualflt,
                                                               missingfltprop=missingprop)
    aimspos = aims[contig]['POS'][:]

    # get intersection of aims and our SNPs
    aims_pos_mask, aims_mask_2 = pos.locate_intersection(aimspos)
    our_aims = pos[aims_pos_mask]
    print(f"\n In the data, across all samples there are {our_aims.shape[0]} Ancestry Informative markers on Chromosome {contig}")

    # get gamb and colu alleles, and subset to aims that we have in the rna-seq data 
    aimscolu = aims[contig]['colu_allele'][:][aims_mask_2]
    aimsgamb = aims[contig]['gamb_allele'][:][aims_mask_2]

    # get mask that was used in readAndFilterVcf()
    mask = pos.locate_intersection(vcf['variants/POS'])[1]
    ref  = vcf['variants/REF'][mask][aims_pos_mask]
    alt = vcf['variants/ALT'][mask][aims_pos_mask]

    # filter geno array to set of aims
    geno_aims = geno.compress(aims_pos_mask, axis=0)

    totalgambscore = {}
    totalcoluscore = {}

    for aim in our_aims:

        gambscore = {}
        coluscore = {}

        # filter arrays 
        mask = our_aims == aim
        ref_ = ref[mask]
        alt_ = alt[mask]
        aimscolu_ = aimscolu[mask]
        aimsgamb_ = aimsgamb[mask]

        gn_aim = geno_aims.compress(mask, axis=0)

        # convert genotypes to nucleotides
        gn2nucleotide = {0:ref_[0],
                        1:alt_[0][0],
                         2:alt_[0][1],
                         3:alt_[0][2],
                        -1:float("nan")}
        gn = rnaseqpop.replace_with_dict2_generic(gn_aim, gn2nucleotide)

        # for each sample, get proportion of gambiae/coluzzii alleles
        # alleles that are different to both will be missed here
        for sample in metadata.treatment.unique():
            alleles = gn.take(subpops[sample], axis=1).flatten()

            # at each AIM, do we have gamb or colu alleles
            gamb = alleles[alleles != 'nan'] == aimsgamb_
            colu = alleles[alleles != 'nan'] == aimscolu_

            # get proportion of gamb v colu alleles at each locus
            gambscore[sample] = np.mean(gamb)
            coluscore[sample] = np.mean(colu)

        totalgambscore[aim] = dict(gambscore)
        totalcoluscore[aim] = dict(coluscore)

        gambscores = rnaseqpop.flip_dict(totalgambscore)
        coluscores = rnaseqpop.flip_dict(totalcoluscore)

        prop_gambiae = {}
        prop_colu = {}
        n_aims_per_sample = {}

        for sample in metadata.treatment.unique():

            prop_gambiae[sample] = np.nanmean(np.array(list(gambscores[sample].values())))
            all_gamb[sample].append(np.nanmean(np.array(list(gambscores[sample].values()))))
            prop_colu[sample] = np.nanmean(np.array(list(coluscores[sample].values())))
            all_colu[sample].append(np.nanmean(np.array(list(coluscores[sample].values()))))

            arr = np.array(list(gambscores[sample].values()))
            dim = arr.shape[0]
            n_aims_per_sample[sample] = dim-np.sum(np.isnan(arr))

    # store AIM fractions for each chromosome in outer dict 
    aims_chrom_gamb[contig] = dict(prop_gambiae)
    aims_chrom_colu[contig] = dict(prop_colu)
    n_aims_per_chrom[contig] = dict(n_aims_per_sample)

    # Store ancestry score per aim
    ancestryPerAim[contig] = pd.concat([pd.DataFrame(gambscores).add_suffix("_gamb"), pd.DataFrame(coluscores).add_suffix("_colu")], axis=1)
    ancestryPerAim[contig]['contig'] = contig

    # plot and store for each chromosome
    coludf = pd.DataFrame.from_dict(prop_colu, orient='index', columns=['AIM_fraction_coluzzii'])
    gambdf = pd.DataFrame.from_dict(prop_gambiae, orient='index', columns=['AIM_fraction_gambiae'])
    perchromdf = gambdf.merge(coludf, left_index=True, right_index=True)
    aimsperchromdf = pd.DataFrame.from_dict(n_aims_per_sample, orient='index', columns=['n_AIMs'])

    perchromdf.to_csv(f"results/variantAnalysis/ancestry/AIM_fraction_{contig}.tsv", sep="\t", index=True)
    rnaseqpop.plot_aims(perchromdf, aimsperchromdf, species1="coluzzii", species2="gambiae", figtitle=f"AIM_fraction_{contig}", total=False)


aims_chrom_gamb = rnaseqpop.flip_dict(aims_chrom_gamb)
aims_chrom_colu = rnaseqpop.flip_dict(aims_chrom_colu)
n_aims_per_chrom = rnaseqpop.flip_dict(n_aims_per_chrom)

# get ancestry per aim for later plotting on chromosome
ancestryPerAim = pd.concat(ancestryPerAim, axis=0)
ancestryPerAim.to_csv("results/variantAnalysis/ancestry/ancestryScorePerAim.tsv", sep="\t")

# get genome wide average AIM fractions
for k in all_gamb:
    all_gamb[k] = np.nanmean(all_gamb[k])
    all_colu[k] = np.nanmean(all_colu[k])

df1 = pd.DataFrame.from_dict(all_gamb, orient='index',columns=['AIM_fraction_gambiae'])
df2 = pd.DataFrame.from_dict(all_colu, orient='index', columns=['AIM_fraction_coluzzii'])
n_aimsdf = pd.DataFrame.from_dict(n_aims_per_chrom)
n_aimsdf.to_csv(f"results/variantAnalysis/ancestry/n_AIMS_per_chrom.tsv", sep="\t", index=True)

df = df1.merge(df2, left_index=True, right_index=True)
df.to_csv(f"results/variantAnalysis/ancestry/AIMs_summary.tsv", sep="\t", index=True)

rnaseqpop.plot_aims(df, n_aimsdf, species1="coluzzii", species2="gambiae", figtitle="AIM_fraction_whole_genome", total=True)




########### The same analysis, but for arabiensis v gambiae/coluzzii
# this section should be modularised; for now it repeats the logic above

if metadata['species'].isin(['arabiensis']).any():

    aims = zarr.open(snakemake.input['aims_zarr_arab'], mode='r')

    # re-initialize dicts, including ancestryPerAim, so per-contig scores
    # are not appended to the gamb/colu results computed above
    ancestryPerAim = {}
    aims_chrom_gamb = {}
    aims_chrom_arab = {}
    all_gamb = defaultdict(list)
    all_arab = defaultdict(list)
    n_aims_per_chrom = {}

    for contig in contigs:

        # read in and filter data
        path = f"results/variantAnalysis/vcfs/{dataset}.{contig}.vcf.gz"
        vcf, geno, acsubpops, pos, depth, snpeff, subpops, pops = rnaseqpop.readAndFilterVcf(path=path,
                                                                contig=contig,
                                                                samples=metadata,
                                                                numbers=numbers,
                                                                ploidy=ploidy,
                                                                qualflt=qualflt,
                                                                missingfltprop=missingprop)
        aimspos = aims[contig]['POS'][:]

        # get intersection of aims and our SNPs
        aims_pos_mask, aims_mask_2 = pos.locate_intersection(aimspos)
        our_aims = pos[aims_pos_mask]
        print(f"\n In the data, across all samples there are {our_aims.shape[0]} gamb-arab Ancestry Informative markers on Chromosome {contig}")

        # get gamb and colu alleles, and subset to aims that we have in the rna-seq data 
        aimsgamb = aims[contig]['gambcolu_allele'][:][aims_mask_2]
        aimsarab = aims[contig]['arab_allele'][:][aims_mask_2]

        # get mask that was used in readAndFilterVcf()
        mask = pos.locate_intersection(vcf['variants/POS'])[1]
        ref  = vcf['variants/REF'][mask][aims_pos_mask]
        alt = vcf['variants/ALT'][mask][aims_pos_mask]

        # filter geno array to set of aims
        geno_aims = geno.compress(aims_pos_mask, axis=0)

        totalgambscore = {}
        totalarabscore = {}

        for aim in our_aims:

            gambscore = {}
            arabscore = {}

            # filter arrays 
            mask = our_aims == aim
            ref_ = ref[mask]
            alt_ = alt[mask]
            aimsarab_ = aimsarab[mask]
            aimsgamb_ = aimsgamb[mask]

            gn_aim = geno_aims.compress(mask, axis=0)

            #convert genotypes to nucleotides
            gn2nucleotide = {0:ref_[0],
                            1:alt_[0][0],
                            2:alt_[0][1],
                            3:alt_[0][2],
                            -1:float("nan")}
            gn = rnaseqpop.replace_with_dict2_generic(gn_aim, gn2nucleotide)

            # for each sample, get proportion of gambiae/arabiensis alleles
            # alleles that are different to both will be missed here
            for sample in metadata.treatment.unique():
                alleles = gn.take(subpops[sample], axis=1).flatten()

                # at each AIM, do we have gamb or arab alleles
                gamb = alleles[alleles != 'nan'] == aimsgamb_
                arab = alleles[alleles != 'nan'] == aimsarab_

                # get proportion of gamb v arab alleles at each locus
                gambscore[sample] = np.mean(gamb)
                arabscore[sample] = np.mean(arab)

            totalgambscore[aim] = dict(gambscore)
            totalarabscore[aim] = dict(arabscore)

            gambscores = rnaseqpop.flip_dict(totalgambscore)
            arabscores = rnaseqpop.flip_dict(totalarabscore)

            prop_gambiae = {}
            prop_arab = {}
            n_aims_per_sample = {}

            for sample in metadata.treatment.unique():

                prop_gambiae[sample] = np.nanmean(np.array(list(gambscores[sample].values())))
                all_gamb[sample].append(np.nanmean(np.array(list(gambscores[sample].values()))))
                prop_arab[sample] = np.nanmean(np.array(list(arabscores[sample].values())))
                all_arab[sample].append(np.nanmean(np.array(list(arabscores[sample].values()))))

                arr = np.array(list(gambscores[sample].values()))
                dim = arr.shape[0]
                n_aims_per_sample[sample] = dim-np.sum(np.isnan(arr))

        aims_chrom_gamb[contig] = dict(prop_gambiae)
        aims_chrom_arab[contig] = dict(prop_arab)
        n_aims_per_chrom[contig] = dict(n_aims_per_sample)

        # Store ancestry score per aim
        ancestryPerAim[contig] = pd.concat([pd.DataFrame(gambscores).add_suffix("_gamb"), pd.DataFrame(arabscores).add_suffix("_arab")], axis=1)
        ancestryPerAim[contig]['contig'] = contig

        # plot and store for each chromosome
        gambdf = pd.DataFrame.from_dict(prop_gambiae, orient='index', columns=['AIM_fraction_gambiae'])
        arabdf = pd.DataFrame.from_dict(prop_arab, orient='index', columns=['AIM_fraction_arabiensis'])
        perchromdf = gambdf.merge(arabdf, left_index=True, right_index=True)
        aimsperchromdf = pd.DataFrame.from_dict(n_aims_per_sample, orient='index', columns=['n_AIMs'])

        perchromdf.to_csv(f"results/variantAnalysis/ancestry/AIM_fraction_{contig}.arab.tsv", sep="\t", index=True)
        rnaseqpop.plot_aims(perchromdf, aimsperchromdf, species1="arabiensis", species2="gambiae", figtitle=f"AIM_fraction_arab_{contig}", total=False)

    aims_chrom_gamb = rnaseqpop.flip_dict(aims_chrom_gamb)
    aims_chrom_arab = rnaseqpop.flip_dict(aims_chrom_arab)
    n_aims_per_chrom = rnaseqpop.flip_dict(n_aims_per_chrom)

    # get ancestry per aim for later plotting on chromosome
    ancestryPerAim = pd.concat(ancestryPerAim, axis=0)
    ancestryPerAim.to_csv("results/variantAnalysis/ancestry/ancestryScorePerAim.tsv", sep="\t")

    # get genome wide average AIM fractions
    for k in all_gamb:
        all_gamb[k] = np.nanmean(all_gamb[k])
        all_arab[k] = np.nanmean(all_arab[k])

    df1 = pd.DataFrame.from_dict(all_gamb, orient='index', columns=['AIM_fraction_gambiae'])
    df2 = pd.DataFrame.from_dict(all_arab, orient='index', columns=['AIM_fraction_arabiensis'])
    n_aimsdf = pd.DataFrame.from_dict(n_aims_per_chrom)
    n_aimsdf.to_csv(f"results/variantAnalysis/ancestry/n_AIMS_per_chrom_arab.tsv", sep="\t", index=True)

    df = df1.merge(df2, left_index=True, right_index=True)
    df.to_csv(f"results/variantAnalysis/ancestry/AIMs_summary_arab.tsv", sep="\t", index=True)

    rnaseqpop.plot_aims(df, n_aimsdf, species1="arabiensis", species2="gambiae", figtitle="AIM_fraction_whole_genome", total=True)
import re
import sys

class parseString(object):

    def __init__(self, ref, string):
        self.ref = ref.upper()
        self.string = string.upper()
        self.types = {'A':0,'G':0,'C':0,'T':0,'-':[],'*':0,'+':[],'X':[]}
        self.process()

    def process(self):
        # remove end of read character
        self.string = self.string.replace('$','')
        while self.string != '':
            if self.string[0] == '^':
                # skip two characters when encountering '^' as it indicates
                # a read start mark and the read mapping quality
                self.string = self.string[2:]
            elif self.string[0] == '*':
                self.types['*'] += 1
                # skip to next character
                self.string = self.string[1:]

            elif self.string[0] in ['.',',']:
                if (len(self.string)== 1) or (self.string[1] not in ['+','-']):
                    # a reference base
                    self.types[self.ref] += 1
                    self.string = self.string[1:]
                elif self.string[1] == '+': 
                    # insertion: the length may be more than one digit (e.g. '.+12AC...')
                    m = re.match(r'\d+', self.string[2:])
                    insertionLength = int(m.group())
                    start = 2 + len(m.group())
                    insertionSeq = self.string[start:start + insertionLength]
                    self.types['+'].append(insertionSeq)
                    self.string = self.string[start + insertionLength:]
                elif self.string[1] == '-':
                    # deletion: the length may also be more than one digit
                    m = re.match(r'\d+', self.string[2:])
                    deletionLength = int(m.group())
                    start = 2 + len(m.group())
                    deletionSeq = self.string[start:start + deletionLength]
                    self.types['-'].append(deletionSeq)
                    self.string = self.string[start + deletionLength:]

            elif self.string[0] in self.types and\
                 ((len(self.string)==1) or (self.string[1] not in ['-','+'])):
                # one of the four bases
                self.types[self.string[0]] += 1
                self.string = self.string[1:]
            else:
                # unrecognized character
                # or a read that reports a substitution followed by an insertion/deletion
                self.types['X'].append(self.string[0])
                self.string = self.string[1:]
        return
    def __repr__(self):
        types = self.types
        return '\t'.join(list(map(str,[types['A'], types['C'], types['G'],types['T'],\
                                  types['*']])) +\
                         list(map(','.join, [types['-'],types['+'],types['X']])))


def main():
    print("chrom\tpos\tref\tcov\tA\tC\tG\tT\t*\t-\t+\tX", file=sys.stdout)
    for line in sys.stdin:
        toks = line.strip('\n').split('\t')
        ref = toks[2].upper()
        cov = toks[3]
        print('\t'.join([toks[0], toks[1],ref, cov]) + '\t' + \
            parseString(ref, toks[4]).__repr__(), file=sys.stdout)

if __name__ == '__main__':
    main()
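
As a quick, hand-made illustration of the parser above (the pileup string is hypothetical, not a real workflow output): four reads match the reference base, one read carries a 2 bp insertion, one a 1 bp deletion, and '^F' is a read-start marker that is skipped.

# parse one pileup column against reference base 'A'
ps = parseString(ref="a", string="..,.+2AG,-1c^F.")
print(ps)   # tab-separated: A=4, C=0, G=0, T=0, *=0, -=C, +=AG, X=
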
import sys
sys.stderr = open(snakemake.log[0], "w")
import os
import pandas as pd
import numpy as np
import rnaseqpoptools as rnaseqpop
import re
import allel

# Read in parameters 
metadata = rnaseqpop.load_metadata(snakemake.params['metadata'])
ref = snakemake.input['ref']
gffpath = snakemake.params['gffpath']
contigs = snakemake.params['contigs']
autofastq = snakemake.params['fastq']
sweeps = snakemake.params['sweeps']
signalpath = "resources/signals.csv"


## Read fasta file and grep chromosome names, check if they match config
file = open(ref, "r")
refchroms = []
for line in file:
    if re.search(">", line):
        refchroms.append(line.split()[0].replace(">", ""))
assert (np.isin(contigs, refchroms)).all(), f"Not all chromosome names specified in config.yaml are found in the reference fasta file ({ref})"


# Check that the contrasts specified in config.yaml are all in the treatment column
comparisons = pd.DataFrame(snakemake.params['contrasts'], columns=['contrast'])
comparisons = comparisons.contrast.str.split("_", expand=True) # split contrasts into case v control 
comparisons = comparisons.values.ravel() # make 1d array 
assert (np.isin(comparisons, metadata['treatment']).all()), f"treatments specified in config.yaml contrasts do not exist in samples.tsv. {comparisons}"


# Check that fastq files exist for each sample in the metadata
if autofastq:
    for sample in metadata['sampleID']:
        for n in [1, 2]:
            fqpath = f"resources/reads/{sample}_{n}.fastq.gz"
            assert os.path.isfile(fqpath), f"sample {sample} has no matching .fastq.gz file at {fqpath}; please rename your fastqs or consider using the fastq table option"


# Check column names of gene_names.tsv
gene_names = pd.read_csv(snakemake.params['gene_names'], sep="\t")
colnames = ['GeneID', 'TranscriptID', 'GeneDescription', 'GeneName']
assert (gene_names.columns.isin(colnames)).all(), f"Column names of gene_names.tsv should be {colnames}"


# Check that the chromosomes are present in the gff file 
gff = allel.gff3_to_dataframe(gffpath)
assert np.isin(contigs, gff.seqid).all(), f"Not all provided chromosome names ({contigs}) are present in the reference GFF file"
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

#' This script performs differential expression analysis at the 
#' gene level using DESeq2. It outputs heatmaps, PCAs, gene counts,
#' as well as DE results. Final DE results for all contrasts are
#' saved in an .xlsx report

## Differential expression
library(DESeq2)
library(pheatmap)
library(data.table)
library(ggrepel)
library(openxlsx)
library(glue)
library(RColorBrewer)
library(tidyverse)
library(jsonlite)

load_metadata <- function(metadata_path) {
  # Check the file extension and load metadata accordingly
  if (tools::file_ext(metadata_path) == "xlsx") {
    metadata <- readxl::read_excel(metadata_path)
  } else if (tools::file_ext(metadata_path) == "tsv") {
    metadata <- data.table::fread(metadata_path, sep = "\t")
  } else if (tools::file_ext(metadata_path) == "csv") {
    metadata <- data.table::fread(metadata_path, sep = ",")
  } else {
    stop("Metadata file must be .xlsx, .tsv, or .csv")
  }
  return(metadata)
}

#read metadata and get contrasts
metadata = load_metadata(snakemake@input[['metadata']]) %>% as.data.frame()
gene_names = fread(snakemake@input[['genes2transcripts']], sep="\t") %>% distinct()

contrastsdf = data.frame("contrast" = snakemake@params[['DEcontrasts']])
contrasts = contrastsdf$contrast

##### define functions ######
round_df = function(df, digits) {

  #' This function rounds all the numeric columns of a data.frame
  nums = vapply(df, is.numeric, FUN.VALUE = logical(1))
  df[,nums] = round(df[,nums], digits = digits)
  (df)
}

vst_pca = function(counts, samples, colourvar, name="PCA", st="", comparison=""){

  #' This function takes counts and sample metadata as input and builds a DESeq2 dataset
  #' Returning normalised and variance-stabilised counts, and performs PCA on the data
  #' Plotting and saving to pdf

  # make DESeq dataset
  dds = DESeqDataSetFromMatrix(countData = counts, 
                               colData = samples, 
                               design = ~ treatment)
  ###### estimate paramters and normalise 
  dds = estimateSizeFactors(dds)
  dds = estimateDispersions(dds)
  vsd = varianceStabilizingTransformation(dds)
  normcounts = counts(dds, normalized=TRUE)
  vstcounts = assay(vsd)
  vstcounts = vstcounts[order(apply(vstcounts,1,sum),decreasing=TRUE),]

  #### write pca of samples to pdf
  pca2 = prcomp(t(vstcounts),center=TRUE)
  pc = data.frame(pca2$x) %>% rownames_to_column("sampleID")
  pc = left_join(pc, samples)

  pdf(glue("results/counts/{name}{st}{comparison}.pdf"))  
  plt = ggplot(data=pc,aes(x=PC1, y=PC2, colour=treatment)) + 
    geom_point(size=6, alpha=0.8) + 
    geom_text_repel(aes(label=sampleID), colour="black") + 
    theme_light() + 
    labs(title=glue("{name} {st} {comparison}"),
         x=glue("PC1 - Variance explained - {round(summary(pca2)$importance[2,][1], 3)}"),
         y=glue("PC2 - Variance explained - {round(summary(pca2)$importance[2,][2], 3)}"))  + 
    theme(legend.position = "bottom",
          legend.title = element_blank(),
          legend.background=element_blank(),
          panel.grid.minor = element_blank(),
          panel.grid.major = element_blank(),
          panel.background = element_blank(),
          plot.title = element_text(hjust = 0.5),
          axis.text.x = element_text(colour="black", size=18),
          axis.text.y = element_text(colour="black", size=18))
  print(plt)
  null = dev.off()
  ggsave(glue("results/counts/{name}{st}{comparison}.png"), plot=plt)
  return(list(vstcounts, dds, normcounts))
}

volcano = function(data, title){

  data = data %>% mutate(col = case_when(padj < 0.05 & abs(log2FoldChange) > 1 ~ "#ff6666",
                                         padj > 0.05 & abs(log2FoldChange) > 1 ~ "#39e600",
                                         padj < 0.05 & abs(log2FoldChange) < 1 ~ "#3385ff",
                                         padj > 0.05 & abs(log2FoldChange) < 1 ~ "#b3b3b3"))

  data$labels = data %>% dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ GeneID,
                                                               is.na(GeneName) ~ GeneID,
                                                               TRUE ~ GeneName)) %>% select(Gene_name) %>% deframe()

  plot = ggplot(data=data, aes(x=log2FoldChange, y=-log10(padj), color=col, alpha=0.4), size=2) + 
    geom_point() + 
    scale_color_identity() +
    xlim(-10, 10) +
    ylab("-log10(P)") +
    xlab("log2 Fold Change") +
    ggtitle(title) + 
    theme_light() + 
    theme(legend.position = "none")  +
    geom_vline(xintercept = log2(2), linetype='dashed', colour='grey') + 
    geom_vline(xintercept = log2(0.5), linetype='dashed', colour='grey') + 
    geom_hline(yintercept = -log10(0.05), linetype='dashed', colour='grey')# + 
    #geom_text_repel(data = subset(data, data[['padj']] < 0.01 & abs(data[['log2FoldChange']]) > 2), aes(label = subset(data, data[['padj']] < 0.01 & abs(data[['log2FoldChange']]) > 2)[["labels"]], colour='black'))

  print(plot)
}

#### main ####
cat("\n", "------------- Kallisto - DESeq2 - RNASeq Differential expression ---------", "\n")

#### Read counts for each sample
df = list()
for (sample in metadata$sampleID){
  df[[sample]]= fread(glue("results/counts/{sample}/abundance.tsv"), sep = "\t")
}

counts = data.frame('TranscriptID' = df[[1]]$target_id)
# Get read counts for each gene and fill table
for (sample in metadata$sampleID){
  reads = df[[sample]]$est_counts
  counts = cbind(counts, reads)
}

# Rename columns
colnames(counts) = c("TranscriptID", metadata$sampleID)
## Aggregate to gene level
counts = left_join(counts, gene_names) %>% select(GeneID, metadata$sampleID)
counts = counts %>% group_by(GeneID) %>% summarise_all(sum)
gene_names = gene_names %>% select(-TranscriptID)

### Count total reads counted
counts = counts[!is.na(counts$GeneID),]
counts = counts %>% column_to_rownames('GeneID')
#### Get count statistics for each sample and plot ####
count_stats = apply(counts, 2, sum) %>% enframe(name="Sample", value="total_counts") # total counts
ngenes = nrow(counts)
count_stats$genes_zerocounts = apply(counts, 2, function(x){sum(x==0)}) # genes with zero counts 
count_stats$genes_lessthan10counts = apply(counts, 2, function(x){sum(x<10)}) # genes with less than 10 counts
count_stats = count_stats %>% dplyr::mutate("proportion_zero" = genes_zerocounts/ngenes,
                                     "proportion_low" = genes_lessthan10counts/ngenes)
count_stats %>% fwrite(., "results/counts/countStatistics.tsv",sep="\t")

print("Counting and plotting total reads per sample...")
pdf("results/counts/total_reads_counted.pdf")
ggplot(count_stats, aes(x=Sample, y=total_counts, fill=metadata$treatment)) + 
  geom_bar(stat='identity') + 
  theme_light() +
  ggtitle("Total reads counted to genes") +
  theme(axis.text.x = element_text(angle=45),
        plot.title = element_text(hjust = 0.5))
null = dev.off() 

# round counts to whole numbers; they are non-integer due to averaging across transcripts
counts = counts %>% rownames_to_column('GeneID') %>% 
  dplyr::mutate_if(is.numeric, round) %>% column_to_rownames('GeneID')

############ Plots PCA with all data, and performs DESeq2 normalisation ########################
res = vst_pca(counts, metadata, colourvar = 'strain', name="PCA")
vstcounts = res[[1]]
dds = res[[2]]
normcounts = res[[3]]

### write out raw and normalised counts 
counts %>% 
  rownames_to_column("GeneID") %>% 
  fwrite(., "results/counts/rawcounts.tsv", sep="\t", row.names = FALSE)

normcounts %>% 
  as.data.frame() %>% 
  rownames_to_column("GeneID") %>% 
  round_df(., 1) %>% 
  fwrite(., "results/counts/normCounts.tsv", sep="\t", row.names = FALSE)

# calculate correlations between samples based on the count data, and plot heatmap
correlations = cor(vstcounts)
pdf("results/counts/heatmap_correlations.pdf")
pheatmap(correlations)
garbage = dev.off()

png("results/counts/heatmap_correlations.png")
pheatmap(correlations)
garbage = dev.off()

# add 'lab' column if it doesn't exist
if(!("lab" %in% colnames(metadata)))
{
  metadata$lab = 'FALSE'
}

# for each strain in the dataset, do a PCA plot
# analysis of case v control with DESEq2 and make PCAs and volcano plots.
if ("strain" %in% colnames(metadata)){
  for (sp in unique(metadata$species)){
    df_samples = metadata %>% filter(species == sp)
    df_samples = df_samples %>% filter(lab == 'FALSE') #remove lab samples
    for (st in unique(df_samples$strain)){
      ai_list = list()
      ai_list[[st]] = df_samples[df_samples$strain == st,]
      # if the strain has both case and control (i.e. exposed v unexposed)
      if (length(unique(ai_list[[st]]$cohort)) > 1){
        cat(glue("\n Running PCA for all {st} samples \n"))
        # subset counts data to our strains of interest
        subcounts = counts[colnames(counts) %in% ai_list[[st]]$sampleID]
        # perform PCA on the data at strain level
        vstcounts = vst_pca(subcounts, ai_list[[st]], 'treatment', "PCA_", st=st)[[1]]
      }
    }
  }
}

results_list = list()
names_list = list()
ngenes_list = list()
######### subset data and run DESeq for each combo we need, store in xlsx ########
for (cont in contrasts){
  control = str_split(cont, "_")[[1]][1] # get first of string, which is control 
  case = str_split(cont, "_")[[1]][2] # get case 
  controls = which(metadata$treatment %in% control)
  cases = which(metadata$treatment %in% case)

  ## Perform PCA for each comparison
  subcounts = counts[,c(controls, cases)]
  subsamples = metadata[c(controls, cases),]
  # make treatment a factor with the 'susceptible' as reference
  subsamples$treatment = as.factor(as.character(subsamples$treatment))
  subsamples$treatment = relevel(subsamples$treatment, control)
  null = vst_pca(subcounts, subsamples, colourvar='treatment', name="PCA_", comparison=cont)[[1]]

  # perform DE with each comparison, using DESeq dataset (dds) from whole library
  cat("\n", glue("--- Running DESeq2 differential expression analysis on {cont} ---"), "\n")
  cds = nbinomWaldTest(dds)
  results = results(cds, contrast = c("treatment", case, control)) %>% as.data.frame() 
  results = results[order(results$padj),] #order by pvalue 
  results = results %>% rownames_to_column("GeneID") %>% dplyr::mutate("FC" = (2^log2FoldChange))

  ### absolute difference
  #### Get rowsums of counts, grouping by case/control. Then get difference of counts and join with DE results
  readdiff = data.frame(t(rowsum(t(subcounts), group = subsamples$treatment, na.rm = T))) #transpose and get rowsums for each group
  readdiff$absolute_diff = readdiff[,case] - readdiff[,control] #get difference
  readdiff = data.frame(readdiff) %>% rownames_to_column('GeneID')
  results = unique(left_join(results, readdiff[,c('GeneID','absolute_diff')]))

  # join DE results with normal gene names
  results = unique(left_join(results, gene_names))
  fwrite(results, glue("results/genediff/{cont}.csv")) #write to csv 
  # volcano plot for each comparison, using the volcano() function above. First make a vector of labels, which are AGAP IDs unless a gene name exists
  results$labels = results %>% dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ GeneID,
                                     is.na(GeneName) ~ GeneID,
                                     TRUE ~ GeneName)) %>% select(Gene_name) %>% deframe()

  #get number of sig genes 
  res1 = results %>% filter(padj < 0.05) %>% 
    count("direction" = FC > 1) %>% 
    dplyr::mutate("direction" = case_when(direction == FALSE ~ "Downregulated, padj = 0.05",
                                   direction == TRUE ~ "Upregulated, padj = 0.05")) %>%
    dplyr::rename(!!glue("{cont}_ngenes") := "n")

  res2 = results %>% filter(padj < 0.001) %>% 
    count("direction" = FC > 1) %>% 
    dplyr::mutate("direction" = case_when(direction == FALSE ~ "Downregulated, padj = 0.001",
                                   direction == TRUE ~ "Upregulated, padj = 0.001")) %>%
    dplyr::rename(!!glue("{cont}_ngenes") := "n")

  ngenes_list[[cont]] = bind_rows(res1, res2)

  # store names of comparisons for xlsx report sheets
  results_list[[cont]] = results
  names_list[[cont]] = cont

  pdf(glue("results/genediff/Volcano_plot_{cont}.pdf"))
  volcano(data = results_list[[cont]], title = cont)
  null = dev.off()
  cat("\n", glue("{cont} complete!"), "\n")
}

# Join different comparisons together and write out number of sig genes 
purrr::reduce(ngenes_list, inner_join) %>% fwrite("results/genediff/nsig_genes.tsv", sep="\t", col.names = TRUE)

#### write to excel file on diff sheets #### 
sheets = unlist(names_list)
wb <- createWorkbook("Workbook")

for (i in 1:length(sheets)){
  addWorksheet(wb, glue("{sheets[[i]]}"))
  writeData(wb, sheets[i], results_list[[i]], rowNames = FALSE, colNames = TRUE)
}
#### save workbook to disk once all worksheets and data have been added ####
saveWorkbook(wb,file=snakemake@output[['xlsx']], overwrite = TRUE)


### Get kallisto statistics ###

totalReads = c()
totalAligned = c()

# open kallisto json info and store stats, save to file 
for (sample in metadata$sampleID){
  run = fromJSON(glue("results/counts/{sample}/run_info.json"), flatten=TRUE)

  totalReads = c(totalReads, run$n_processed)
  totalAligned = c(totalAligned, run$n_pseudoaligned)

}

df = data.frame("sample" = c(metadata$sampleID, "Total"),
                "totalReads" = c(totalReads, sum(totalReads)),
                "totalAligned" = c(totalAligned, sum(totalAligned))) %>% 
  mutate("percentage" = (totalAligned/totalReads)*100) %>% 
  fwrite("results/counts/KallistoQuantSummary.tsv", sep="\t", col.names = TRUE)


sessionInfo()
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Generate freebayes params ##
# 1) make bed for parallelising freebayes
library(dplyr)
library(data.table)
library(glue)


load_metadata <- function(metadata_path) {
  # Check the file extension and load metadata accordingly
  if (tools::file_ext(metadata_path) == "xlsx") {
    metadata <- readxl::read_excel(metadata_path)
  } else if (tools::file_ext(metadata_path) == "tsv") {
    metadata <- data.table::fread(metadata_path, sep = "\t")
  } else if (tools::file_ext(metadata_path) == "csv") {
    metadata <- data.table::fread(metadata_path, sep = ",")
  } else {
    stop("Metadata file must be .xlsx, .tsv, or .csv")
  }
  return(metadata)
}

# read inputs
contigs = snakemake@params[['contigs']]
chunks = snakemake@params[['chunks']]
fai = fread(snakemake@input[['index']])

# select contigs we want, and start, end columns
fai = fai[fai$V1 %in% contigs, c(1,2)]

# for each chromosome
for (contig in contigs){
   #subset index to desired contig
   f = fai[fai$V1 == contig]
   #get sequence of n chunks from 0 to length of contig
   bedseq = round(seq(0, f$V2, length.out = chunks))

   #for each chunk
   for (i in 1:(chunks-1)){
      #write bed file, one for each chunk/interval, which will be passed as input to freebayes
      row = c(contig, bedseq[i], bedseq[i+1])
      data.frame(row) %>% t() %>% fwrite(., glue("resources/regions/genome.{contig}.region.{i}.bed"), sep="\t", col.names = FALSE)
   }
}



# 2) Make bamlist and populations.tsv file

metadata = load_metadata(snakemake@params[['metadata']])
metadata$bams = paste0("results/alignments/", metadata$sampleID,".bam")

metadata %>% 
   select(bams, strain) %>% 
   fwrite(., snakemake@output[['pops']], sep="\t", row.names = FALSE, col.names = FALSE)

metadata %>% 
   select(bams) %>% 
   fwrite(.,snakemake@output[['bamlist']], sep="\t", row.names = FALSE, col.names = FALSE)

sessionInfo()
import sys
sys.stderr = open(snakemake.log[0], "w")

import rnaseqpoptools as rnaseqpop
import pandas as pd
import numpy as np
import allel
from scipy import stats
from functools import partial, reduce
import warnings
warnings.filterwarnings('ignore') # suppress numpy runtime warnings; a blunt approach, ideally the source of the warnings would be resolved

# snakemake inputs and params
dataset = snakemake.params['dataset']
metadata_path = snakemake.input['metadata']
metadata = rnaseqpop.load_metadata(metadata_path)
gffpath = snakemake.input['gff']
pbs = snakemake.params['pbs']
pbscomps = snakemake.params['pbscomps']
contigs = snakemake.params['contigs']
ploidy = snakemake.params['ploidy']
numbers = rnaseqpop.get_numbers_dict(ploidy)
missingprop = snakemake.params['missingprop']

# gff
features = allel.gff3_to_dataframe(gffpath,
                       attributes=["ID", "description"])
gff = features[features.type == 'gene']
# gene names file, rename
gene_names = pd.read_csv(snakemake.input['geneNames'], sep="\t").drop(columns=['TranscriptID']).drop_duplicates()

### main ####
# Read in list of contrasts
comparisons = pd.DataFrame(snakemake.params['DEcontrasts'], columns=['contrast'])
comparisons = comparisons.contrast.str.split("_", expand=True)
comparisons.columns = ['sus', 'res']
comparisons = [list(row) for i,row in comparisons.iterrows()]

print(f"The pairwise comparisons for Fst are {comparisons}")

fstbychrom={}
if pbs: pbsbychrom={}
tajdbychrom={}
gdivbychrom = {}
dxybychrom = {}

for contig in contigs:

    # path to vcf
    path = f"results/variantAnalysis/vcfs/{dataset}.{contig}.vcf.gz"
    # function to read in vcfs and associated SNP data
    vcf, geno, acsubpops, pos, depth, snpeff, subpops, pops = rnaseqpop.readAndFilterVcf(path=path,
                                                               contig=contig, 
                                                               samples=metadata,
                                                               numbers=numbers,
                                                               ploidy=ploidy,
                                                               qualflt=30,
                                                               missingfltprop=missingprop)
    # subset gff to appropriate contig
    genes = gff[gff.seqid == contig].sort_values('start').reset_index(drop=True)

    ### Average Fst, pbs, tajima d for each gene
    fst_per_comp = {}
    fst_per_gene = {}
    pbs_per_gene = {}
    pbs_per_comp = {}
    tajd_per_pop = {}
    tajd_per_gene = {}
    gdiv_per_pop = {}
    gdiv_per_gene = {}
    se_per_comp = {}
    se_per_gene = {}
    dxy_per_comp = {}
    dxy_per_gene = {}
    pos_dict = {}
    n_dict = {}

    # loop through each gene and calculate fst, pbs, tajimas d, or sequence diversity for each comparison
    for i, gene in genes.iterrows():
        ID = gene.ID
        # locate_ranges() to get a boolean, needed as locate_range() will throw errors if no snps found in gene
        gene_bool = pos.locate_ranges([gene['start']], [gene['end']], strict=False)
        nsnps = gene_bool.sum()

        # if there are fewer than 2 SNPs in this gene, skip to the next
        if nsnps < 2:
            continue

        # store number of snps per gene
        n_dict[ID] = nsnps
        # store midpoint positions of gene
        pos_dict[ID] = (gene['start'] + gene['end'])/2

        # fst and dxy per gene between each comparison
        for comp1,comp2 in comparisons:
            name = comp1 + "_" + comp2
            ac1 = acsubpops[comp1].compress(gene_bool, axis=0)
            ac2 = acsubpops[comp2].compress(gene_bool, axis=0)

            fst_per_comp[name], se_per_comp[name],_,_= allel.average_hudson_fst(ac1, ac2, blen=1)

            dxy_per_comp[name] = allel.sequence_divergence(pos[gene_bool], ac1, ac2)

        # tajimas d and sequence diversity per gene for each subpop(i.e treatment)
        for subpop in subpops:
            ac = acsubpops[subpop].compress(gene_bool)
            genepos = pos[gene_bool]
            tajd_per_pop[subpop] = allel.tajima_d(ac=ac, pos=genepos)
            gdiv_per_pop[subpop] = allel.sequence_diversity(ac=ac, pos=genepos)

        # pbs for each gene for each pbc comparison as defined in config.yaml
        if pbs is True:
            for pbscomp in pbscomps:
                pop1, pop2, outpop = pbscomp.split("_")
                pbs_per_comp[pbscomp],se,_,_ = rnaseqpop.meanPBS(acsubpops[pop1].compress(gene_bool, axis=0),
                                          acsubpops[pop2].compress(gene_bool, axis=0),
                                          acsubpops[outpop].compress(gene_bool, axis=0),
                                                     window_size=1,
                                                    normalise=True)
        # store inner dict in outer dicts
        fst_per_gene[ID] = dict(fst_per_comp)
        se_per_gene[ID] = dict(se_per_comp)
        if pbs is True: pbs_per_gene[ID] = dict(pbs_per_comp)
        tajd_per_gene[ID] = dict(tajd_per_pop)
        gdiv_per_gene[ID] = dict(gdiv_per_pop)
        dxy_per_gene[ID] = dict(dxy_per_comp)

    #reverse the dicts so the comparisons/subpops are on the outer dict
    fst_per_gene = rnaseqpop.flip_dict(fst_per_gene)
    se_per_gene = rnaseqpop.flip_dict(se_per_gene)
    if pbs is True: pbs_per_gene = rnaseqpop.flip_dict(pbs_per_gene)
    tajd_per_gene = rnaseqpop.flip_dict(tajd_per_gene)
    gdiv_per_gene = rnaseqpop.flip_dict(gdiv_per_gene)
    dxy_per_gene = rnaseqpop.flip_dict(dxy_per_gene)

    print(f"Chromosome {contig} complete...\n")
    for comp1,comp2 in comparisons:
        name = comp1 + "_" + comp2
        a = np.array(list(fst_per_gene[name].values()))
        print(f"Overall Fst (averaged across genes) for chromosome {contig} between {name} is {np.nanmean(a)}")

    # Make dataframe of number of snps per gene (that pass quality and missingness filters)
    ndf = pd.DataFrame.from_dict(n_dict, orient='index').reset_index(drop=False)
    ndf.columns = ['GeneID', 'nSNPs']

    # Make dataframe of midpoints of each gene
    posdf = pd.DataFrame.from_dict(pos_dict, orient='index').reset_index(drop=False)
    posdf.columns = ['GeneID', 'Gene_midpoint']

    # Make dataframes of Fst, SE and Dxy for each comparison
    fst_dfs = {}
    se_dfs = {}
    dxy_dfs = {}
    for comp1,comp2 in comparisons:
        name = comp1 + "_" + comp2
        fst_df = pd.DataFrame.from_dict(fst_per_gene[name], orient='index').reset_index(drop=False)
        fst_df.columns = ['GeneID', (name + '_zFst')]
        fst_dfs[name] = fst_df

        se_df = pd.DataFrame.from_dict(se_per_gene[name], orient='index').reset_index(drop=False)
        se_df.columns = ['GeneID', (name + '_SE')]
        se_dfs[name] = se_df

        dxy_df = pd.DataFrame.from_dict(dxy_per_gene[name], orient='index').reset_index(drop=False)
        dxy_df.columns = ['GeneID', (name + '_dxy')]
        dxy_dfs[name] = dxy_df


    my_reduce = partial(pd.merge, on='GeneID', how='outer')
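    # my_reduce outer-merges a list of dataframes on GeneID, keeping genes
    # present in any comparison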
    fst_allcomparisons = reduce(my_reduce, fst_dfs.values())
    se_allcomparisons = reduce(my_reduce, se_dfs.values())
    fst_allcomparisons = reduce(my_reduce, [fst_allcomparisons, se_allcomparisons])
    fst_allcomparisons['contig'] = contig
    dxy_allcomparisons = reduce(my_reduce, dxy_dfs.values())
    dxy_allcomparisons['contig'] = contig

    tajd_dfs = {}
    gdiv_dfs = {}
    # Store sequence diversity and tajimas d for each gene and each subpop
    for subpop in subpops:
        tajd_df = pd.DataFrame.from_dict(tajd_per_gene[subpop], orient='index').reset_index(drop=False)
        tajd_df.columns = ['GeneID', (subpop+"_Tajima_d")]
        tajd_dfs[subpop] = tajd_df
        gdiv_df = pd.DataFrame.from_dict(gdiv_per_gene[subpop], orient='index').reset_index(drop=False)
        gdiv_df.columns = ['GeneID', (subpop+"_SeqDiv")]
        gdiv_dfs[subpop] = gdiv_df

    # Combine tajimas d and sequence diversity for each sample
    tajdall = reduce(my_reduce, tajd_dfs.values())
    gdivall = reduce(my_reduce, gdiv_dfs.values())
    tajdall['contig'] = contig
    gdivall['contig'] = contig

    if pbs is True:
        # PBS store as dataframes 
        pbs_dfs = {}
        for pbscomp in pbscomps:
            pbs_df = pd.DataFrame.from_dict(pbs_per_gene[pbscomp], orient='index').reset_index(drop=False)
            pbs_df.columns = ['GeneID', (pbscomp + "_PBS")]
            pbs_dfs[pbscomp] = pbs_df

        pbs_allcomparisons = reduce(my_reduce, pbs_dfs.values())
        pbs_allcomparisons['contig'] = contig

    fstbychrom[contig] = reduce(lambda left, right: pd.merge(left,right, on=['GeneID'],
                                                how='inner'), [fst_allcomparisons, gene_names, ndf, posdf])
    tajdbychrom[contig] = reduce(lambda left, right: pd.merge(left,right, on=['GeneID'],
                                                how='inner'), [tajdall, gene_names, ndf,posdf])
    gdivbychrom[contig] = reduce(lambda left, right: pd.merge(left,right, on=['GeneID'],
                                                how='inner'), [gdivall, gene_names, ndf, posdf])
    dxybychrom[contig] = reduce(lambda left, right: pd.merge(left, right, on=['GeneID'],
                                                how='inner'), [dxy_allcomparisons, gene_names, ndf, posdf])
    if pbs is True:
        pbsbychrom[contig] = reduce(lambda left,right: pd.merge(left,right,on=['GeneID'],
                                                    how='inner'), [pbs_allcomparisons, gene_names, ndf, posdf])


## Concatenate chromosome dfs to one big data frame and remove duplicates
fstall = pd.concat(fstbychrom.values(), ignore_index=True).drop_duplicates()
tajdall = pd.concat(tajdbychrom.values(), ignore_index=True).drop_duplicates()
gdivall = pd.concat(gdivbychrom.values(), ignore_index=True).drop_duplicates()
dxyall = pd.concat(dxybychrom.values(), ignore_index=True).drop_duplicates()

# Write to TSV
fstall.to_csv("results/variantAnalysis/selection/FstPerGene.tsv", index=False, sep="\t")
tajdall.to_csv("results/variantAnalysis/selection/TajimasDPerGene.tsv", index=False, sep="\t")
gdivall.to_csv("results/variantAnalysis/diversity/SequenceDivPerGene.tsv", index=False, sep="\t")
dxyall.to_csv("results/variantAnalysis/diversity/DxyPerGene.tsv", index=False, sep="\t")

if pbs is True:
    pbsall = pd.concat(pbsbychrom.values(), ignore_index=True).drop_duplicates()
    pbsall.to_csv("results/variantAnalysis/selection/PbsPerGene.tsv", index=False, sep="\t")
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")


#' Finds genes that are differentially expressed in the same direction
#' across two DE comparisons. For example, a gene may be significantly
#' upregulated in the control vs intermediate-phenotype comparison,
#' and also in the intermediate-phenotype vs full-phenotype comparison.
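#' For instance, for a contrast string "Kisumu_BusiaParental_BusiaSelected"
#' (illustrative names), a gene upregulated in both Kisumu vs BusiaParental
#' and BusiaParental vs BusiaSelected is written to the ".up.progressive.tsv"
#' output.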

library(data.table)
library(tidyverse)
library(glue)

# read parameters from snakemake
comps = snakemake@params[['comps']]
padj_threshold = as.numeric(snakemake@params[['pval']])
isoform_level = snakemake@params[['isoform_level']]
gene_level = snakemake@params[['gene_level']]
upper_fc = as.numeric(snakemake@params[['fc']])
lower_fc = 1/upper_fc # a FC threshold of 2 implies a lower threshold of 0.5 for downregulation

for (cont in comps){
  sus = str_split(cont, "_")[[1]][1]          # control/susceptible name (e.g. Kisumu)
  intermediate = str_split(cont, "_")[[1]][2] # intermediate phenotype
  res = str_split(cont, "_")[[1]][3]          # resistant/case

  if (gene_level == TRUE){

    #### Gene diff ####
    one = fread(glue("results/genediff/{intermediate}_{res}.csv"))
    up1  = one %>% filter(FC > upper_fc, padj < padj_threshold)
    down1 = one %>% filter(FC < lower_fc, padj < padj_threshold)

    two = fread(glue("results/genediff/{sus}_{intermediate}.csv"))
    up2  = two %>% filter(FC > upper_fc, padj < padj_threshold)
    down2 = two %>% filter(FC < lower_fc, padj < padj_threshold)

    intersectdown = inner_join(down1, down2, by="GeneID", suffix=c("_field", "_lab"))
    intersectup = inner_join(up1, up2, by="GeneID", suffix=c("_field", "_lab"))

    fwrite(intersectup, glue("results/genediff/{sus}_{intermediate}_{res}.up.progressive.tsv"), sep="\t")
    fwrite(intersectdown, glue("results/genediff/{sus}_{intermediate}_{res}.down.progressive.tsv"), sep="\t")
  }

  if (isoform_level == TRUE){
    #### Isoforms #### 
    one = fread(glue("results/isoformdiff/{intermediate}_{res}.csv"))
    up1  = one %>% filter(FC > upper_fc, qval < padj_threshold)
    down1 = one %>% filter(FC < lower_fc, qval < padj_threshold)

    two = fread(glue("results/isoformdiff/{sus}_{intermediate}.csv"))
    up2  = two %>% filter(FC > upper_fc, qval < padj_threshold)
    down2 = two %>% filter(FC < lower_fc, qval < padj_threshold)

    intersectdown = inner_join(down1, down2, by="TranscriptID", suffix=c("_field", "_lab"))
    intersectup = inner_join(up1, up2, by="TranscriptID", suffix=c("_field", "_lab"))

    fwrite(intersectup, glue("results/isoformdiff/{sus}_{intermediate}_{res}.up.progressive.tsv"), sep="\t")
    fwrite(intersectdown, glue("results/isoformdiff/{sus}_{intermediate}_{res}.down.progressive.tsv"), sep="\t")
  }

}


sessionInfo()
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

#' This script uses the sleuth R package to perform differential isoform analysis.
#' This will be performed for each contrast specified in the config.
#' Sleuth uses the kallisto bootstraps to estimate quantification uncertainty.

## Differential expression
library(tidyverse)
library(sleuth)
library(data.table)
library(ggrepel)
library(openxlsx)
library(glue)
library(RColorBrewer)
library(EnhancedVolcano)

print("------------- Kallisto - Sleuth - RNASeq isoform Differential expression ---------")
#### read data ####

load_metadata <- function(metadata_path) {
  # Check the file extension and load metadata accordingly
  if (tools::file_ext(metadata_path) == "xlsx") {
    metadata <- readxl::read_excel(metadata_path)
  } else if (tools::file_ext(metadata_path) == "tsv") {
    metadata <- data.table::fread(metadata_path, sep = "\t")
  } else if (tools::file_ext(metadata_path) == "csv") {
    metadata <- data.table::fread(metadata_path, sep = ",")
  } else {
    stop("Metadata file must be .xlsx, .tsv, or .csv")
  }
  return(metadata)
}

metadata = load_metadata(snakemake@input[['metadata']]) %>% as.data.frame() %>% dplyr::rename('sample' = "sampleID")
#add path column for sleuth object
metadata$path = paste0("results/counts/", metadata$sample)

# read gene-to-transcript mapping, used to attach gene names to results
gene_names = fread(snakemake@input[['genes2transcripts']], sep="\t") %>% select(-GeneID)

#contrasts
contrastsdf = data.frame("contrast" = snakemake@params[['DEcontrasts']])
contrasts = contrastsdf$contrast

######### subset data and run sleuth for each contrast we need, store in xlsx ########
results_list = list()
names_list = list()
#loop through each chosen contrast and perform DE 
for (cont in contrasts){
  control = str_split(cont, "_")[[1]][1] #get first of string, which is control (or susceptible)
  case = str_split(cont, "_")[[1]][2] #get second of string, which is case (or resistant)
  controls = which(metadata$treatment %in% control) #get indices of case/control
  cases = which(metadata$treatment %in% case)

  ##subset to subcounts of our contrast
  subsamples = metadata[c(controls, cases),]

  #make treatment a factor with the 'susceptible' as reference
  subsamples$treatment = as.factor(as.character(subsamples$treatment))
  subsamples$treatment = relevel(subsamples$treatment, control)

  print(glue("Running Sleuth differential isoform expression analysis on {cont}"))
  so = sleuth_prep(subsamples, extra_bootstrap_summary = TRUE, 
                    transformation_function = function(x) log2(x + 0.5))
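  # with log2(x + 0.5) as the transformation, the fitted beta ('b') is a log2
  # fold change, so the fold change is recovered below as FC = 2^b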
  so = sleuth_fit(so, ~treatment, 'full')
  # fit reduced (intercept-only) model
  so = sleuth_fit(so, ~1, 'reduced')
  # Wald test on the treatment coefficient
  so = sleuth_wt(so, which_beta = paste0("treatment", case))

  results = sleuth_results(so, test = paste0("treatment", case), test_type = 'wt', show_all = FALSE) %>% 
    dplyr::rename("TranscriptID" = "target_id") %>% 
    dplyr::mutate("FC" = (2^b))

  # Join DE results with normal gene names
  results = unique(left_join(results, gene_names))
  fwrite(results, glue("results/isoformdiff/{cont}.csv")) # write to csv 

  # Store names of contrasts and results tables for xlsx report sheets
  results_list[[cont]] = results
  names_list[[cont]] = cont

  # Volcano plot for each comparison, using EnhancedVolcano. First build a vector of
  # labels: the gene name where one exists, otherwise the transcript ID (e.g. AGAP identifiers)
  labels = results %>% dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ TranscriptID,
                                     is.na(GeneName) ~ TranscriptID,
                                     TRUE ~ GeneName)) %>% select(Gene_name) %>% deframe()
  pdf(glue("results/isoformdiff/Volcano_plot_{cont}.pdf"))
  print(EnhancedVolcano(results_list[[cont]],
                  lab=labels,
                  x='b',
                  y='pval',
                  title = cont))
  garbage = dev.off()
  print(glue("{cont} complete!"))
}

#### write to excel file on diff sheets #### 
sheets = unlist(names_list)
wb <- createWorkbook("Workbook")

for (i in 1:length(sheets)){
  addWorksheet(wb, glue("{sheets[[i]]}"))
  writeData(wb, sheets[i], results_list[[i]], rowNames = FALSE, colNames = TRUE)
}
#### save workbook to disk once all worksheets and data have been added ####
saveWorkbook(wb,file=snakemake@output[['xlsx']], overwrite = TRUE)

sessionInfo()
import sys
sys.stderr = open(snakemake.log[0], "w")

import pandas as pd
import numpy as np
import allel
import rnaseqpoptools as rnaseqpop

# Read in parameters from snakemake
dataset = snakemake.params['dataset']
metadata = rnaseqpop.load_metadata(snakemake.input['metadata'])
metadata = metadata.sort_values(by='species')
contigs = snakemake.params['contigs']
ploidy = snakemake.params['ploidy']
numbers = rnaseqpop.get_numbers_dict(ploidy)
qualflt = snakemake.params['qualflt']
missingprop = snakemake.params['missingprop']
gffpath = snakemake.input['gff']

# load gff
features = allel.gff3_to_dataframe(gffpath,
                        attributes=["ID", "Parent", "description"])

# define functions
def isnotmissing(gn):
    """
    Return True if no genotype calls are missing (-1) at a SNP
    """
    return (gn != -1).all()
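# e.g. isnotmissing(np.array([0, 1])) -> True; isnotmissing(np.array([-1, 1])) -> False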

#initialise dicts
snpsPerGff = {}
total_snps_per_chrom = {}
snps_per_gene_allchroms = {}
snpeffdict = {}

seqdivdictchrom = {}
thetadictchrom = {}
coefdictchrom= {}

for i, contig in enumerate(contigs):

    # Read in and Filter VCF
    path = f"results/variantAnalysis/vcfs/{dataset}.{contig}.vcf.gz"
    vcf, geno, acsubpops, pos, depth, snpeff, subpops, populations = rnaseqpop.readAndFilterVcf(path=path,
                                                           contig=contig,
                                                           samples=metadata,
                                                           numbers=numbers,
                                                           ploidy=ploidy,
                                                           qualflt=qualflt,
                                                           missingfltprop=missingprop)
    # Store total SNPs per chromosome
    # And summaries from SnpEff
    total_snps_per_chrom[contig] = geno.shape[0]
    snpeffdict[contig] = snpeff[1].value_counts(normalize=True)

    # Plot SNP density
    rnaseqpop.plot_density(pos, 
                window_size=100000, 
                title=f"Variant Density | {dataset} | Chromosome {contig}", 
                path=f"results/variantAnalysis/diversity/{dataset}_SNPdensity_{contig}.svg")

    ######## SNP counts per gene ########
    # Subset GFF to appropriate chromosome
    gff = features.query("seqid == @contig").sort_values('start').reset_index(drop=True)
    genes = gff.query("type == 'gene'")
    exons = features.query("type == 'exon'").reset_index(drop=True)

    ## Proportion SNPs per GFF feature
    snpsPerGff[contig] = rnaseqpop.getSNPGffstats(gff,  pos)
    snpsPerGff[contig]['chromosome'] = contig

    ## Calculate missing SNPs per sample, SNPs per gene etc
    snpsnotmissing = {}
    missing_array = {}
    snps_per_gene = {}
    snps_per_sample = {}

    # For each sample in the metadata, subset the genotype array and record which calls are not missing
    for sample in metadata['sampleID']:

        bool_ = sample == populations
        gn = geno.compress(bool_, axis=1)

        res = map(isnotmissing, gn[:,0])
        missing_array[sample] = np.fromiter(res, dtype=bool)
        snpsnotmissing[sample] = missing_array[sample].sum()

    # For each gene, count how many non-missing SNP calls each sample has
    for i, gene in genes.iterrows():
        # locate_ranges() to get a boolean, needed as locate_range() will throw errors if no snps found in gene
        gene_bool = pos.locate_ranges([gene['start']], [gene['end']], strict=False)

        for sample in metadata['sampleID']:
            presentSNPs = missing_array[sample].compress(gene_bool, axis=0) # subset to each gene
            snps_per_sample[sample] = presentSNPs.sum()

        snps_per_gene[gene['ID']] = dict(snps_per_sample)

    snps_per_gene_allchroms[contig] = pd.DataFrame.from_dict(rnaseqpop.flip_dict(snps_per_gene))

snpsPerGff = pd.concat(snpsPerGff).reset_index().rename(columns={'level_1':'feature'})
snpsPerGff = snpsPerGff.groupby('feature').agg({'ReferenceGenomeProportion': 'mean', 'called':'sum', 'proportion': 'mean', 'total':'sum'})
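# (proportions are averaged across chromosomes, while the raw counts
# 'called' and 'total' are summed)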
snpsPerGff.to_csv(snakemake.output['snpsPerGenomicFeature'], sep="\t", index=True)

# Total SNPs per contig, all samples
totalsnpsdf = pd.DataFrame.from_dict(total_snps_per_chrom, orient='index', columns=['Total_SNPs'])
totalsnpsdf.to_csv("results/variantAnalysis/SNPstats/totalSNPs.tsv", sep="\t", index=True)

# SNPcount per gene
snpcountsdf = pd.concat(snps_per_gene_allchroms)  
snpcountsdf.to_csv("results/variantAnalysis/SNPstats/nSNPsPerGene.tsv", sep="\t", index=True)
genesNoSNPs = pd.DataFrame((snpcountsdf == 0).sum(axis=0), columns=['Genes with zero SNPs'])
genesNoSNPs.to_csv("results/variantAnalysis/SNPstats/nGenesZeroSNPs.tsv", sep="\t", index=True)

# snpEff summary
snpeffdf = pd.concat(snpeffdict)
snpeffdf.to_csv("results/variantAnalysis/SNPstats/snpEffProportions.tsv", sep="\t", index=True)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(tidyverse)
library(data.table)
library(glue)
library(openxlsx)

load_metadata <- function(metadata_path) {
  # Check the file extension and load metadata accordingly
  if (tools::file_ext(metadata_path) == "xlsx") {
    metadata <- readxl::read_excel(metadata_path)
  } else if (tools::file_ext(metadata_path) == "tsv") {
    metadata <- data.table::fread(metadata_path, sep = "\t")
  } else if (tools::file_ext(metadata_path) == "csv") {
    metadata <- data.table::fread(metadata_path, sep = ",")
  } else {
    stop("Metadata file must be .xlsx, .tsv, or .csv")
  }
  return(metadata)
}

#### allele imbalance ####
metadata = load_metadata(snakemake@input[['metadata']])
samples = metadata$sampleID
# Read IR mutation data 
mutation_data = fread(snakemake@input[['mutations']], sep="\t")

all_list = list()
mean_list = list()

# Loop through each mutation, finding allele coverage at each position
for (m in mutation_data$Name){

  base = mutation_data[mutation_data$Name == m]$ALT
  propstring = glue("proportion{base}")
  base2 = mutation_data[mutation_data$Name == m]$ALT2 # second alternate allele, if one exists
  propstring2 = glue("proportion{base2}")

  #### load allele balance data ####
  allele_list = list()
  # read data for each sample and subset to what we want
  for (sample in samples){
    allele_list[[sample]] = fread(glue("results/variantAnalysis/variantsOfInterest/counts/{sample}_{m}_allele_counts.tsv"))[,c(1:8)] #for each sample read data, and store first 8 columns 
    allele_list[[sample]]$sample = sample                                            #add sample column
    allele_list[[sample]]$treatment = metadata$treatment[samples == sample]         #add treatment column
    allele_list[[sample]]$mutation = m
    allele_list[[sample]]$gene = mutation_data[mutation_data$Name == m]$Gene

    cover = allele_list[[sample]] %>% select(A,C,G,T) %>% rowSums()
    allele_list[[sample]] = allele_list[[sample]] %>% mutate(!!propstring := (!!sym(base))/cover) #new column, proportion of Alts to total.  
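    # `!!propstring :=` names the new column dynamically via tidy evaluation
    # (e.g. "proportionT" when the ALT base is T)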

    if (!base2 %in% c("", NA)){
      allele_list[[sample]] = allele_list[[sample]] %>% mutate(!!propstring2 := (!!sym(base2))/cover) #new column, proportion of Alts to total.  
    }
  }

  # We now have one dataframe per sample in a list; bind them into one dataframe
  alleles = rbindlist(allele_list, fill = TRUE)

  # now calculate population-level allele proportions with 95% confidence intervals
  alleles_per_pop_list = list()
  for (pop in unique(metadata$treatment)){

    alleles_per_pop = alleles %>% filter(treatment == pop) 

    pop_prop = sum(alleles_per_pop[, ..base])/ sum(alleles_per_pop[, 'cov'])
    error = sqrt((pop_prop*(1-pop_prop))/nrow(alleles_per_pop))*1.96
    lower = pmax(pop_prop - error, 0)
    upper = pmin(pop_prop + error, 1)
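    # Wald 95% CI for a binomial proportion: p +/- 1.96 * sqrt(p * (1 - p) / n),
    # with n taken here as the number of samples in the population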

    alleles_per_pop_list[[pop]] = alleles_per_pop %>% mutate(!!propstring := pop_prop, lowerCI = lower, upperCI = upper)

    # repeat for the second alternate allele, if present
    if (!base2 %in% c("", NA)){
      pop_prop2 = sum(alleles_per_pop[, ..base2])/ sum(alleles_per_pop[, 'cov'])
      error2 = sqrt((pop_prop2*(1-pop_prop2))/nrow(alleles_per_pop))*1.96
      lower2 = pmax(pop_prop2 - error2, 0)
      upper2 = pmin(pop_prop2 + error2, 1)

      alleles_per_pop_list[[pop]] = alleles_per_pop_list[[pop]] %>% mutate(!!propstring2 := pop_prop2, lowerCI_2 = lower2, upperCI_2 = upper2)
    }
  }

  mean_alleles = rbindlist(alleles_per_pop_list, fill=TRUE)

  # average across replicates
  if (!base2 %in% c("", NA)){
    mean_alleles = mean_alleles %>% 
      group_by(chrom, pos, ref, mutation, treatment, !!sym(propstring), lowerCI, upperCI, !!sym(propstring2), lowerCI_2, upperCI_2) %>% 
      summarise_at(.vars = c("cov","A","C","G","T"), .funs = c(mean="mean"), na.rm = TRUE) %>% 
      select(chrom, pos, ref, mutation, treatment, cov_mean, A_mean, C_mean, G_mean, T_mean, lowerCI, upperCI, !!propstring, !!propstring2, lowerCI_2, upperCI_2)
  } else {
    mean_alleles = mean_alleles %>% 
      group_by(chrom, pos, ref, mutation, treatment, !!sym(propstring), lowerCI, upperCI) %>% 
      summarise_at(.vars = c("cov","A","C","G","T"), .funs = c(mean="mean"), na.rm = TRUE) %>% 
      select(chrom,pos, ref, mutation, treatment, cov_mean, A_mean, C_mean, G_mean, T_mean, lowerCI, upperCI, !!propstring)
  }

  # write per-sample and population-mean allele balance to file
  fwrite(alleles, glue("results/variantAnalysis/variantsOfInterest/csvs/{m}_alleleBalance.csv"))
  fwrite(mean_alleles, glue("results/variantAnalysis/variantsOfInterest/csvs/mean_{m}_alleleBalance.csv"))

  all_list[[m]] = alleles
  mean_list[[m]] = mean_alleles
}

#### write to excel file on diff sheets #### 
results_list = all_list
sheets = unique(mutation_data$Name)
wb <- createWorkbook("Workbook")

for (i in 1:length(sheets)){
  addWorksheet(wb, glue("{sheets[[i]]}"))
  writeData(wb, sheets[i], results_list[[i]], rowNames = TRUE, colNames = TRUE)
}
#### save workbook to disk once all worksheets and data have been added ####
saveWorkbook(wb,file=snakemake@output[['alleleBalance']], overwrite = TRUE)


### mean balance ####
#### write to excel file on diff sheets #### 
results_list = mean_list
sheets = unique(mutation_data$Name)
wb <- createWorkbook("Workbook")

for (i in 1:length(sheets)){
  addWorksheet(wb, glue("{sheets[[i]]}"))
  writeData(wb, sheets[i], results_list[[i]], rowNames = TRUE, colNames = TRUE)
}
#### save workbook to disk once all worksheets and data have been added ####
saveWorkbook(wb,file=snakemake@output[['mean_alleleBalance']], overwrite = TRUE)

sessionInfo()
import sys
sys.stderr = open(snakemake.log[0], "w")

import itertools
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib_venn import venn2, venn3


direction = snakemake.wildcards['dir_']
pval = snakemake.params['padj_threshold']
dataset = snakemake.params['dataset']
comparisons = snakemake.params['comparisons'] #['BusiaParental_BusiaSelected', 'Kisumu_BusiaParental', 'Kisumu_BusiaSelected'] 

assert len(comparisons) > 1, "Only one differential expression comparison is specified, cannot run venn analysis. Please disable venn in config.yaml or specify more comparisons"

n_comps = len(comparisons)

de_data = {}
de_genes = {}
for comp in comparisons:
    de_data[comp] = pd.read_csv(f"results/genediff/{comp}.csv")
    de_genes[comp] = de_data[comp].query(f"padj < {pval}")
    if direction == 'up':
        de_genes[comp] = de_genes[comp].query("FC > 1")['GeneID'].to_numpy()
    elif direction == 'down':
        de_genes[comp] = de_genes[comp].query("FC < 1")['GeneID'].to_numpy()

for comp1, comp2 in itertools.combinations(comparisons, 2):
    all_de_comps = [comp1, comp2]
    all_de_comps_string = '.'.join(all_de_comps)
    all_de_comps = [a.replace("_", " v ") for a in all_de_comps]

    de1 = set(de_genes[comp1])
    de2 = set(de_genes[comp2])

    s = [de1, de2]
    venn2(s, all_de_comps)
    plt.title(f"Venn - {all_de_comps_string} - {direction}")
    plt.savefig(f"results/genediff/venn/{all_de_comps_string}-{direction}-Venn.png")
    plt.close()

if n_comps >= 3:
    for all_de_comps in itertools.combinations(comparisons, 3):
        comp1, comp2, comp3 = all_de_comps
        all_de_comps_string = '.'.join(all_de_comps)
        all_de_comps = [a.replace("_", " v ") for a in all_de_comps]

        de_genes2 = dict((k, de_genes[k]) for k in (comp1, comp2, comp3))

        s = [set(v) for v in de_genes2.values()]

        venn3(s, all_de_comps)
        plt.title(f"Venn - {all_de_comps_string} - {direction}")
        plt.savefig(f"results/genediff/venn/{all_de_comps_string}-{direction}-Venn.png")
        plt.close()
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.bcftools import get_bcftools_opts

bcftools_opts = get_bcftools_opts(snakemake, parse_ref=False, parse_memory=False)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell("bcftools view {bcftools_opts} {extra} {snakemake.input[0]} {log}")
__author__ = "Joël Simoneau"
__copyright__ = "Copyright 2019, Joël Simoneau"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")

# Allowing for multiple FASTA files
fasta = snakemake.input.get("fasta")
assert fasta is not None, "input-> a FASTA-file is required"
fasta = " ".join(fasta) if isinstance(fasta, list) else fasta

shell(
    "kallisto index "  # Tool
    "{extra} "  # Optional parameters
    "--index={snakemake.output.index} "  # Output file
    "{fasta} "  # Input FASTA files
    "{log}"  # Logging
)
__author__ = "Joël Simoneau"
__copyright__ = "Copyright 2019, Joël Simoneau"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")

# Allowing for multiple FASTQ files
fastq = snakemake.input.get("fastq")
assert fastq is not None, "input-> a FASTQ-file is required"
fastq = " ".join(fastq) if isinstance(fastq, list) else fastq
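# Note: for single-end data, kallisto quant additionally requires
# '--single', '-l' and '-s'; these would need to be supplied via params.extra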

shell(
    "kallisto quant "  # Tool
    "{extra} "  # Optional parameters
    "--threads={snakemake.threads} "  # Number of threads
    "--index={snakemake.input.index} "  # Input file
    "--output-dir={snakemake.output} "  # Output directory
    "{fastq} "  # Input FASTQ files
    "{log}"  # Logging
)
__author__ = "Michael Chambers"
__copyright__ = "Copyright 2019, Michael Chambers"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts

samtools_opts = get_samtools_opts(
    snakemake, parse_threads=False, parse_write_index=False, parse_output_format=False
)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell("samtools faidx {samtools_opts} {extra} {snakemake.input[0]} {log}")
__author__ = "Christopher Preusch"
__copyright__ = "Copyright 2017, Christopher Preusch"
__email__ = "cpreusch[at]ust.hk"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts

samtools_opts = get_samtools_opts(
    snakemake, parse_write_index=False, parse_output=False, parse_output_format=False
)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "samtools flagstat {samtools_opts} {extra} {snakemake.input[0]} > {snakemake.output[0]} {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Samtools takes additional threads through its option -@
# One thread is used by samtools index itself;
# any remaining threads are passed as *additional* threads via '-@'
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools index {threads} {extra} {snakemake.input[0]} {snakemake.output[0]} {log}"
)
__author__ = "Sebastian Kurscheid"
__copyright__ = "Copyright 2019, Sebastian Kurscheid"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell
import re

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)


# Assert input
n = len(snakemake.input.sample)
assert (
    n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."


# Input files
if n == 1:
    reads = "--in1 {}".format(snakemake.input.sample)
else:
    reads = "--in1 {} --in2 {}".format(*snakemake.input.sample)


# Output files
trimmed_paths = snakemake.output.get("trimmed", None)
if trimmed_paths:
    if n == 1:
        trimmed = "--out1 {}".format(snakemake.output.trimmed)
    else:
        trimmed = "--out1 {} --out2 {}".format(*snakemake.output.trimmed)

        # Output unpaired files
        unpaired = snakemake.output.get("unpaired", None)
        if unpaired:
            trimmed += f" --unpaired1 {unpaired} --unpaired2 {unpaired}"
        else:
            unpaired1 = snakemake.output.get("unpaired1", None)
            if unpaired1:
                trimmed += f" --unpaired1 {unpaired1}"
            unpaired2 = snakemake.output.get("unpaired2", None)
            if unpaired2:
                trimmed += f" --unpaired2 {unpaired2}"

        # Output merged PE reads
        merged = snakemake.output.get("merged", None)
        if merged:
            if not re.search(r"--merge\b", extra):
                raise ValueError(
                    "output.merged specified but '--merge' option missing from params.extra"
                )
            trimmed += f" --merged_out {merged}"
else:
    trimmed = ""


# Output failed reads
failed = snakemake.output.get("failed", None)
if failed:
    trimmed += f" --failed_out {failed}"


# Stats
html = "--html {}".format(snakemake.output.html)
json = "--json {}".format(snakemake.output.json)


shell(
    "(fastp --thread {snakemake.threads} "
    "{extra} "
    "{adapters} "
    "{reads} "
    "{trimmed} "
    "{json} "
    "{html} ) {log}"
)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
# Set this to False if multiqc should use the actual input directly
# instead of parsing the folders where the provided files are located
use_input_files_only = snakemake.params.get("use_input_files_only", False)

if not use_input_files_only:
    input_data = set(path.dirname(fp) for fp in snakemake.input)
else:
    input_data = set(snakemake.input)

output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {extra}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_data}"
    " {log}"
)