RNA-Seq-Pop: Analysis Workflow for Illumina Paired-End RNA Sequencing Data

Version: v0.8.0

RNA-Seq-Pop

This Snakemake workflow performs various analyses of Illumina paired-end RNA-sequencing data:

  • Quality control of fastq reads with FastQC

  • QC metrics integrated into one final QC report with MultiQC

  • Differential expression analysis with Kallisto, at the gene level (DESeq2) and the transcript level (Sleuth)

  • Variant calling with freebayes, plus Fst and population branch statistic (PBS) analyses, both in windows and at the gene level (scikit-allel); a PBS sketch is given below this list

  • Allele counts and frequencies at variants of interest (pre-specified loci of choice)

  • Various summary statistics (Watterson's theta, sequence diversity, Dxy, etc.)

  • Differential SNP testing with the R package kissDE, which accounts for allele-specific expression

  • Gene set enrichment analyses

  • Anopheles gambiae s.l. - Analysis of ancestry informative markers (AIMs) to determine the relative ancestry of An. gambiae/coluzzii/arabiensis

  • Anopheles gambiae s.l. - Reports whether differentially expressed genes fall underneath known selective sweep signals in the Ag1000G

  • Anopheles gambiae s.l. - Determines the karyotype of chromosome 2 inversions using compkaryo
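
For the Fst and PBS bullet above: the population branch statistic compares a focal population against a close population and an outgroup, PBS = (T_focal,close + T_focal,outgroup - T_close,outgroup) / 2, where T = -log(1 - Fst). Below is a minimal, illustrative sketch of a windowed PBS calculation with scikit-allel; the toy data, population names and window size are arbitrary, and this is not the workflow's exact code:

import allel
import numpy as np

# toy genotype data with shape (variants, samples, ploidy)
geno = allel.GenotypeArray(np.random.choice(2, size=(1000, 30, 2)))
# map each population name to its sample column indices (illustrative)
subpops = {"focal": list(range(10)), "close": list(range(10, 20)), "outgroup": list(range(20, 30))}

# allele counts per population at each variant
ac = {pop: geno.count_alleles(subpop=idx) for pop, idx in subpops.items()}

# PBS in non-overlapping windows of 100 variants; elevated values suggest
# selection on the branch leading to the focal population
pbs = allel.pbs(ac["focal"], ac["close"], ac["outgroup"], window_size=100)
print(pbs)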

The workflow is generalised and will work with any trimmed Illumina paired-end RNA-sequencing data. However, certain modules, such as the AIMs analysis, are only appropriate for specific species; these can be activated in the configuration file (config.yaml). The workflow supports pooled samples as well as diploid or haploid individuals.

The workflow is still under construction and not yet ready for release. If you have any feedback on how the workflow could be improved, please get in touch, or feel free to fork the GitHub repo and create a pull request for any additional features you would like to implement. If you are using the workflow and would like to give feedback or troubleshoot, consider joining the Discord server.

Authors

Usage

If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI.

Workflow

Step 1: Obtain a copy of this workflow

  1. Create a new GitHub repository using this workflow as a template.

  2. Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

  3. As the workflow contains submodules (compkaryo & mpileup2readcounts), use git clone --recursive to clone the repo.

  4. If you want to run the differential SNPs analysis, compile mpileup2readcounts by navigating to its folder in workflow/scripts and running: g++ -std=c++11 -O3 mpileup2readcounts.cc -o mpileup2readcounts

Step 2: Configure workflow

Configure the workflow according to your needs by editing the files in the config/ folder. Adjust the example config.yaml to configure the workflow execution, and samples.tsv to specify your sample setup.
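
As a rough illustration only (the authoritative column set is defined by the example samples.tsv shipped in config/), the workflow's scripts reference metadata columns such as sampleID, treatment and strain, and by default expect reads at resources/reads/{sampleID}_1.fastq.gz and resources/reads/{sampleID}_2.fastq.gz. A minimal sample sheet could be generated with pandas; the sample and treatment names here are hypothetical:

import pandas as pd

# Hypothetical sample sheet; check the bundled example samples.tsv for
# the authoritative column names required by the workflow.
samples = pd.DataFrame({
    "sampleID": ["ContTia1", "ContTia2", "MalaTia1", "MalaTia2"],
    "treatment": ["ContTia", "ContTia", "MalaTia", "MalaTia"],
    "strain": ["Tiassale", "Tiassale", "Tiassale", "Tiassale"],
})
samples.to_csv("config/samples.tsv", sep="\t", index=False)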

Step 3: Install Snakemake

Install Snakemake using conda :

conda create -c bioconda -c conda-forge -n snakemake snakemake

For installation details, see the instructions in the Snakemake documentation.

Step 4: Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

or

snakemake --use-conda --drmaa --jobs 100

If you not only want to fix the software stack but also the underlying OS, use

snakemake --use-conda --use-singularity

in combination with any of the modes above. See the Snakemake documentation for further details.

Step 5: Commit changes

Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:

git commit -a
git push

Step 6: Obtain updates from upstream

Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.

  1. Once, register the upstream repository in your local copy: git remote add -f upstream git@github.com:snakemake-workflows/rna-seq-pop.git, or git remote add -f upstream https://github.com/snakemake-workflows/rna-seq-pop.git if you do not have ssh keys set up.

  2. Update the upstream version: git fetch upstream.

  3. Create a diff with the current version: git diff HEAD upstream/master workflow > upstream-changes.diff.

  4. Investigate the changes: vim upstream-changes.diff.

  5. Apply the modified diff via: git apply upstream-changes.diff.

  6. Carefully check whether you need to update the config files: git diff HEAD upstream/master config. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and sample sheets.

Step 7: Contribute back

In case you have also changed or added steps, please consider contributing them back to the original repository:

  1. Fork the original repo to a personal or lab account.

  2. Clone the fork to your local system, to a different place than where you ran your analysis.

  3. Copy the modified files from your analysis to the clone of your fork, e.g., cp -r workflow path/to/fork. Make sure to not accidentally copy config file contents or sample sheets. Instead, manually update the example config files if necessary.

  4. Commit and push your changes to your fork.

  5. Create a pull request against the original repository.

Testing

Test cases are in the subfolder .test. They are automatically executed via continuous integration with GitHub Actions.

Code Snippets

wrapper:
    "0.66.0/bio/kallisto/index"
wrapper:
    "0.66.0/bio/kallisto/quant"
wrapper:
    "v0.69.0/bio/samtools/faidx"
shell:
    "hisat2-build -p {threads} {input.fasta} {params.prefix}  2> {log}"
shell:
    """
    hisat2 {params.extra} --threads {threads} -x {params.idx} {params.readflags} 2> {log.align} | 
    samblaster 2> {log.sort} | samtools sort -@{threads} - -o {output} 2>> {log.sort}
    """
wrapper:
    "0.65.0/bio/samtools/index"
shell:
    "samtools dict {input} > {params.output} 2> {log} "
wrapper:
    "v1.4.0/bio/gatk/splitncigarreads"
wrapper:
    "v1.4.0/bio/gatk/baserecalibrator"
wrapper:
    "v1.4.0/bio/gatk/applybqsr"
wrapper:
    "0.65.0/bio/samtools/index"
wrapper:
    "0.65.0/bio/samtools/index"
script:
    "../scripts/DeseqGeneDE.R"
script:
    "../scripts/SleuthIsoformsDE.R"
script:
    "../scripts/ProgressiveDE.R"
script:
    "../scripts/GeneSetEnrichment.R"
script:
    "../scripts/Ag1000gSweepsDE.py"
script:
    "../scripts/GeneCategoryContribution.R"
shell:
    "bcftools concat {input.calls} | vcfuniq > {output} 2> {log}"
wrapper:
    "v1.0.0/bio/bcftools/view"
shell:
    "snpEff download {params.ref} -dataDir {params.dir} 2> {log}"
shell:
    """
    snpEff eff {params.db} -dataDir {params.dir} -csvStats {output.csvStats} {input.calls} > {params.prefix} 2> {log}
    bgzip {params.prefix}
    """
shell:
    """
    SnpSift filter "{params.expression}" {input.vcf} > {output} 2> {log}
    """
script:
    "../scripts/ExtractBedVCF.py"
shell:
    """
    samtools mpileup -f {input.ref} -l {input.bed} {input.bam} 2> {log} |
    {params.basedir}/scripts/mpileup2readcounts/mpileup2readcounts 0 {params.baseflt} {params.ignore_indels} {params.min_alt} {params.min_af} > {output} 2>> {log}
    """
script:
    "../scripts/DifferentialSNPs.R"
Snakemake, from line 58 of rules/misc.smk
script:
    "../scripts/VennDiagrams.py"
Snakemake, from line 91 of rules/misc.smk
script:
    "../scripts/checkInputs.py"
Snakemake, from line 26 of rules/qc.smk
wrapper:
    "0.74.0/bio/fastqc"
wrapper:
    "v0.86.0/bio/cutadapt/pe"
Snakemake, from line 64 of rules/qc.smk
wrapper:
    "0.70.0/bio/samtools/flagstat"
Snakemake, from line 82 of rules/qc.smk
shell:
    "mosdepth --threads {threads} --fast-mode --by {params.windowsize} --no-per-base {params.prefix} {input.bam}"
shell:
    """
    bcftools stats {input} > {output} 2> {log}
    """
shell:
    "qualimap bamqc -bam {input.bam} -c -gff {input.gff} -outfile {output} 2> {log}"
wrapper:
    "0.74.0/bio/multiqc"
script:
    "../scripts/SNPstatistics.py"
script:
    "../scripts/pca.py"
script:
    "../scripts/SummaryStats.py"
script:
    "../scripts/WindowedFstPBS.py"
script:
    "../scripts/PerGeneFstPBS.py"
script:
    "../scripts/AncestryInformativeMarkers.py"
shell:
    """
    paste <(bcftools query -l {input.vcf}) \
    <(python {params.basedir}/scripts/compkaryo/compkaryo/compkaryo.py {input.vcf} {wildcards.karyo} -p {params.ploidy}) | 
    column -s $'\\t' -t | sort -k 1 > {output}
    """
script:
    "../scripts/KaryoPlots.py"
script:
    "../scripts/GenerateFreebayesParams.R"
shell:
    "freebayes -f {input.ref} -t {input.regions} --ploidy {params.ploidy} --populations {params.pops} --pooled-discrete --use-best-n-alleles 5 -L {input.samples} > {output} 2> {log}"
shell:
    """
    octopus -R {input.reference} -I {input.bam} -T {wildcards.chrom} --organism-ploidy {params.ploidy} --downsample-above {params.max_coverage} \
     --sequence-error-model {params.err_model} --forest {params.forest} -o {output} --threads {threads} 2> {log}
    """
shell:
    """
    samtools mpileup {input.bam} -r {params.region} -f {params.ref} 2> {log} | 
    python2 {params.basedir}/scripts/BaseParser.py > {output} 2>> {log}
    """
script:
    "../scripts/VariantsOfInterestAlleleBalance.R"
script:
    "../scripts/VariantsOfInterestPlot.py"
import sys
sys.stderr = open(snakemake.log[0], "w")

import pandas as pd
import numpy as np
from collections import defaultdict

# Read in parameters 
pval_threshold = snakemake.params['pval']
upper_fc = snakemake.params['fc']
lower_fc = 1/upper_fc # a fold-change threshold of 2 implies a lower threshold of 0.5 for downregulated genes
# Read in list of contrasts
comparisons = pd.DataFrame(snakemake.params['DEcontrasts'], columns=['contrast'])

# Read in .csv file containing selection signals and associated metadata
signals = pd.read_csv("resources/signals.csv")

for comp in comparisons['contrast']:

    # Load DE results
    DEgenes = pd.read_csv(f"results/genediff/{comp}.csv")
    # Filter to significantly differentially expressed genes (up- or downregulated)
    sigup = DEgenes[np.logical_and(DEgenes['padj'] < pval_threshold, np.logical_or(DEgenes['FC'] > upper_fc, DEgenes['FC'] < lower_fc))]

    sweep = {}
    nswept = {}

    # Loop through each signal, recording if any DE genes are found in one
    for i, cols in signals.iterrows():

        if pd.isnull(cols['overlapping_genes']): # Skip signal if no overlapping genes
            continue

        sweptgenes = np.array(cols['overlapping_genes'].split(" "))

        # Get boolean array - if list of swept genes isin our DE genes
        overlap = np.isin(sweptgenes, sigup['GeneID'])

        sweep[cols['uid']] = sweptgenes[overlap]
        nswept[cols['uid']] = sweptgenes

    genes = np.concatenate(list(sweep.values()))
    swept = sigup[np.isin(sigup['GeneID'], genes)]

    for k,v in sweep.items():
        sweep[k] = ' '.join(v)

    # Build dataframe and add sweep metadata columns
    sweptDE = pd.DataFrame.from_dict(sweep, orient='index', columns=['overlapping_DE_genes'])
    sweptDE = sweptDE.reset_index().rename(columns={'index': 'Ag1000g_sweep'})
    sweptDE['overlapping_genes'] = signals['overlapping_genes'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True)
    sweptDE['chromosome'] = signals['peak_end_seqid'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True)
    sweptDE['epicenter'] = signals['epicenter_coord'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True)
    sweptDE['known_loci'] = signals['known_loci'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True)

    wheresweep = defaultdict(dict)
    whatsweep = defaultdict(list)

    # Now loop through each swept gene to find name of sweeps it lies under
    # And location of sweep
    for gene in genes:

        for i, cols in sweptDE.iterrows():

            sweptgenes = np.array(cols['overlapping_DE_genes'].split(" "))

            if np.isin(sweptgenes, gene).any():
                wheresweep[gene]['chrom'] = cols['chromosome']
                wheresweep[gene]['epicenter'] = cols['epicenter']
                wheresweep[gene]['known_loci'] = cols['known_loci']

                whatsweep[gene].append(cols['Ag1000g_sweep'])

    # Join name of sweeps column into a single string, so it fits in one column of data frame
    for k,v in whatsweep.items():
        whatsweep[k] = ' '.join(v)

    dfwhere = pd.DataFrame.from_dict(wheresweep, orient='index')
    dfwhat = pd.DataFrame.from_dict(whatsweep, orient='index', columns=['Ag1000g_sweeps'])

    df = pd.concat([dfwhat, dfwhere], axis=1)
    df = df.reset_index().rename(columns={'index': 'GeneID'})

    swept = swept.merge(df)
    swept.to_csv(f"results/genediff/ag1000gSweeps/{comp}_swept.tsv", sep="\t")
import sys
sys.stderr = open(snakemake.log[0], "w")

import rnaseqpoptools as rnaseqpop
import pandas as pd
import numpy as np
import zarr
import allel
from collections import defaultdict

### AIMS ###
dataset = snakemake.params['dataset']
metadata = pd.read_csv(snakemake.input['metadata'], sep="\t")
metadata = metadata.sort_values(by='species').reset_index(drop=True)
chroms = snakemake.params['chroms']
ploidy = snakemake.params['ploidy']
numbers = rnaseqpop.get_numbers_dict(ploidy)
qualflt = snakemake.params['qualflt']
missingprop = snakemake.params['missingprop']

# read AIMs
aims = zarr.open(snakemake.input['aims_zarr_gambcolu'], mode='r')

## initialize dicts
ancestryPerAim = {}
aims_chrom_gamb = {}
aims_chrom_colu = {}
all_gamb = defaultdict(list)
all_colu = defaultdict(list)
n_aims_per_chrom = {}

for chrom in chroms:

    # read in and filter data
    path = f"results/variantAnalysis/vcfs/{dataset}.{chrom}.vcf.gz"
    vcf, geno, acsubpops, pos, depth, snpeff, subpops, pops =  rnaseqpop.readAndFilterVcf(path=path,
                                                               chrom=chrom,
                                                               samples=metadata,
                                                               numbers=numbers,
                                                               ploidy=ploidy,
                                                               qualflt=qualflt,
                                                               missingfltprop=missingprop)
    aimspos = aims[chrom]['POS'][:]

    # get intersection of aims and our SNPs
    aims_pos_mask, aims_mask_2 = pos.locate_intersection(aimspos)
    our_aims = pos[aims_pos_mask]
    print(f"\n In the data, across all samples there are {our_aims.shape[0]} Ancestry Informative markers on Chromosome {chrom}")

    # get gamb and colu alleles, and subset to aims that we have in the rna-seq data 
    aimscolu = aims[chrom]['colu_allele'][:][aims_mask_2]
    aimsgamb = aims[chrom]['gamb_allele'][:][aims_mask_2]

    # get mask that was used in readAndFilterVcf()
    mask = pos.locate_intersection(vcf['variants/POS'])[1]
    ref  = vcf['variants/REF'][mask][aims_pos_mask]
    alt = vcf['variants/ALT'][mask][aims_pos_mask]

    # filter geno array to set of aims
    geno_aims = geno.compress(aims_pos_mask, axis=0)

    totalgambscore = {}
    totalcoluscore = {}

    for aim in our_aims:

        gambscore = {}
        coluscore = {}

        # filter arrays 
        mask = our_aims == aim
        ref_ = ref[mask]
        alt_ = alt[mask]
        aimscolu_ = aimscolu[mask]
        aimsgamb_ = aimsgamb[mask]

        gn_aim = geno_aims.compress(mask, axis=0)

        # convert genotypes to nucleotides
        gn2nucleotide = {0:ref_[0],
                        1:alt_[0][0],
                         2:alt_[0][1],
                         3:alt_[0][2],
                        -1:float("nan")}
        gn = rnaseqpop.replace_with_dict2_generic(gn_aim, gn2nucleotide)

        # for each sample, get proportion of gambiae/coluzzii alleles
        # alleles that are different to both will be missed here
        for sample in metadata.treatment.unique():
            alleles = gn.take(subpops[sample], axis=1).flatten()

            # at each AIM, do we have gamb or colu alleles
            gamb = alleles[alleles != 'nan'] == aimsgamb_
            colu = alleles[alleles != 'nan'] == aimscolu_

            # get proportion of gamb v colu alleles at each locus
            gambscore[sample] = np.mean(gamb)
            coluscore[sample] = np.mean(colu)

        totalgambscore[aim] = dict(gambscore)
        totalcoluscore[aim] = dict(coluscore)

        gambscores = rnaseqpop.flip_dict(totalgambscore)
        coluscores = rnaseqpop.flip_dict(totalcoluscore)

        prop_gambiae = {}
        prop_colu = {}
        n_aims_per_sample = {}

        for sample in metadata.treatment.unique():

            prop_gambiae[sample] = np.nanmean(np.array(list(gambscores[sample].values())))
            all_gamb[sample].append(np.nanmean(np.array(list(gambscores[sample].values()))))
            prop_colu[sample] = np.nanmean(np.array(list(coluscores[sample].values())))
            all_colu[sample].append(np.nanmean(np.array(list(coluscores[sample].values()))))

            arr = np.array(list(gambscores[sample].values()))
            dim = arr.shape[0]
            n_aims_per_sample[sample] = dim-np.sum(np.isnan(arr))

    # store AIM fractions for each chromosome in outer dict 
    aims_chrom_gamb[chrom] = dict(prop_gambiae)
    aims_chrom_colu[chrom] = dict(prop_colu)
    n_aims_per_chrom[chrom] = dict(n_aims_per_sample)

    # Store ancestry score per aim
    ancestryPerAim[chrom] = pd.concat([pd.DataFrame(gambscores).add_suffix("_gamb"), pd.DataFrame(coluscores).add_suffix("_colu")], axis=1)
    ancestryPerAim[chrom]['contig'] = chrom

    # plot and store for each chromosome
    coludf = pd.DataFrame.from_dict(prop_colu, orient='index', columns=['AIM_fraction_coluzzii'])
    gambdf = pd.DataFrame.from_dict(prop_gambiae, orient='index', columns=['AIM_fraction_gambiae'])
    perchromdf = gambdf.merge(coludf, left_index=True, right_index=True)
    aimsperchromdf = pd.DataFrame.from_dict(n_aims_per_sample, orient='index', columns=['n_AIMs'])

    perchromdf.to_csv(f"results/variantAnalysis/ancestry/AIM_fraction_{chrom}.tsv", sep="\t", index=True)
    rnaseqpop.plot_aims(perchromdf, aimsperchromdf, species1="coluzzii", species2="gambiae", figtitle=f"AIM_fraction_{chrom}", total=False)


aims_chrom_gamb = rnaseqpop.flip_dict(aims_chrom_gamb)
aims_chrom_colu = rnaseqpop.flip_dict(aims_chrom_colu)
n_aims_per_chrom = rnaseqpop.flip_dict(n_aims_per_chrom)

# get ancestry per aim for later plotting on chromosome
ancestryPerAim = pd.concat(ancestryPerAim, axis=0)
ancestryPerAim.to_csv("results/variantAnalysis/ancestry/ancestryScorePerAim.tsv", sep="\t")

# get genome wide average AIM fractions
for k in all_gamb:
    all_gamb[k] = np.nanmean(all_gamb[k])
    all_colu[k] = np.nanmean(all_colu[k])

df1 = pd.DataFrame.from_dict(all_gamb, orient='index',columns=['AIM_fraction_gambiae'])
df2 = pd.DataFrame.from_dict(all_colu, orient='index', columns=['AIM_fraction_coluzzii'])
n_aimsdf = pd.DataFrame.from_dict(n_aims_per_chrom)
n_aimsdf.to_csv(f"results/variantAnalysis/ancestry/n_AIMS_per_chrom.tsv", sep="\t", index=True)

df = df1.merge(df2, left_index=True, right_index=True)
df.to_csv(f"results/variantAnalysis/ancestry/AIMs_summary.tsv", sep="\t", index=True)

rnaseqpop.plot_aims(df, n_aimsdf, species1="coluzzii", species2="gambiae", figtitle="AIM_fraction_whole_genome", total=True)




########### The same analysis, but for arabiensis v gambiae/coluzzii ###########
# NOTE: this section repeats the logic above and could be modularised

if metadata['species'].isin(['arabiensis']).any():

    aims = zarr.open(snakemake.input['aims_zarr_arab'], mode='r')

    # re-initialise dicts for the arabiensis comparison (ancestryPerAim was
    # converted to a DataFrame above, so it must be reset to a dict here)
    ancestryPerAim = {}
    aims_chrom_gamb = {}
    aims_chrom_arab = {}
    all_gamb = defaultdict(list)
    all_arab = defaultdict(list)
    n_aims_per_chrom = {}

    for chrom in chroms:

        # read in and filter data
        path = f"results/variantAnalysis/vcfs/{dataset}.{chrom}.vcf.gz"
        vcf, geno, acsubpops, pos, depth, snpeff, subpops, pops = rnaseqpop.readAndFilterVcf(path=path,
                                                                chrom=chrom,
                                                                samples=metadata,
                                                                numbers=numbers,
                                                                ploidy=ploidy,
                                                                qualflt=qualflt,
                                                                missingfltprop=missingprop)
        aimspos = aims[chrom]['POS'][:]

        # get intersection of aims and our SNPs
        aims_pos_mask, aims_mask_2 = pos.locate_intersection(aimspos)
        our_aims = pos[aims_pos_mask]
        print(f"\n In the data, across all samples there are {our_aims.shape[0]} gamb-arab Ancestry Informative markers on Chromosome {chrom}")

        # get gamb and colu alleles, and subset to aims that we have in the rna-seq data 
        aimsgamb = aims[chrom]['gambcolu_allele'][:][aims_mask_2]
        aimsarab = aims[chrom]['arab_allele'][:][aims_mask_2]

        # get mask that was used in readAndFilterVcf()
        mask = pos.locate_intersection(vcf['variants/POS'])[1]
        ref  = vcf['variants/REF'][mask][aims_pos_mask]
        alt = vcf['variants/ALT'][mask][aims_pos_mask]

        # filter geno array to set of aims
        geno_aims = geno.compress(aims_pos_mask, axis=0)

        totalgambscore = {}
        totalarabscore = {}

        for aim in our_aims:

            gambscore = {}
            arabscore = {}

            # filter arrays 
            mask = our_aims == aim
            ref_ = ref[mask]
            alt_ = alt[mask]
            aimsarab_ = aimsarab[mask]
            aimsgamb_ = aimsgamb[mask]

            gn_aim = geno_aims.compress(mask, axis=0)

            #convert genotypes to nucleotides
            gn2nucleotide = {0:ref_[0],
                            1:alt_[0][0],
                            2:alt_[0][1],
                            3:alt_[0][2],
                            -1:float("nan")}
            gn = rnaseqpop.replace_with_dict2_generic(gn_aim, gn2nucleotide)

            # for each sample, get proportion of gambiae/arabiensis alleles
            # alleles that are different to both will be missed here
            for sample in metadata.treatment.unique():
                alleles = gn.take(subpops[sample], axis=1).flatten()

                # at each AIM, do we have gamb or arab alleles
                gamb = alleles[alleles != 'nan'] == aimsgamb_
                arab = alleles[alleles != 'nan'] == aimsarab_

                # get proportion of gamb v arab alleles at each locus
                gambscore[sample] = np.mean(gamb)
                arabscore[sample] = np.mean(arab)

            totalgambscore[aim] = dict(gambscore)
            totalarabscore[aim] = dict(arabscore)

            gambscores = rnaseqpop.flip_dict(totalgambscore)
            arabscores = rnaseqpop.flip_dict(totalarabscore)

            prop_gambiae = {}
            prop_arab = {}
            n_aims_per_sample = {}

            for sample in metadata.treatment.unique():

                prop_gambiae[sample] = np.nanmean(np.array(list(gambscores[sample].values())))
                all_gamb[sample].append(np.nanmean(np.array(list(gambscores[sample].values()))))
                prop_arab[sample] = np.nanmean(np.array(list(arabscores[sample].values())))
                all_arab[sample].append(np.nanmean(np.array(list(arabscores[sample].values()))))

                arr = np.array(list(gambscores[sample].values()))
                dim = arr.shape[0]
                n_aims_per_sample[sample] = dim-np.sum(np.isnan(arr))

        aims_chrom_gamb[chrom] = dict(prop_gambiae)
        aims_chrom_arab[chrom] = dict(prop_arab)
        n_aims_per_chrom[chrom] = dict(n_aims_per_sample)

        # Store ancestry score per aim
        ancestryPerAim[chrom] = pd.concat([pd.DataFrame(gambscores).add_suffix("_gamb"), pd.DataFrame(arabscores).add_suffix("_arab")], axis=1)
        ancestryPerAim[chrom]['contig'] = chrom

        # plot and store for each chromosome
        gambdf = pd.DataFrame.from_dict(prop_gambiae, orient='index', columns=['AIM_fraction_gambiae'])
        arabdf = pd.DataFrame.from_dict(prop_arab, orient='index', columns=['AIM_fraction_arabiensis'])
        perchromdf = gambdf.merge(arabdf, left_index=True, right_index=True)
        aimsperchromdf = pd.DataFrame.from_dict(n_aims_per_sample, orient='index', columns=['n_AIMs'])

        perchromdf.to_csv(f"results/variantAnalysis/ancestry/AIM_fraction_{chrom}.arab.tsv", sep="\t", index=True)
        rnaseqpop.plot_aims(perchromdf, aimsperchromdf, species1="arabiensis", species2="gambiae", figtitle=f"AIM_fraction_arab_{chrom}", total=False)

    aims_chrom_gamb = rnaseqpop.flip_dict(aims_chrom_gamb)
    aims_chrom_arab = rnaseqpop.flip_dict(aims_chrom_arab)
    n_aims_per_chrom = rnaseqpop.flip_dict(n_aims_per_chrom)

    # get ancestry per aim for later plotting on chromosome
    ancestryPerAim = pd.concat(ancestryPerAim, axis=0)
    ancestryPerAim.to_csv("results/variantAnalysis/ancestry/ancestryScorePerAim.tsv", sep="\t")

    # get genome wide average AIM fractions
    for k in all_gamb:
        all_gamb[k] = np.nanmean(all_gamb[k])
        all_arab[k] = np.nanmean(all_arab[k])

    df1 = pd.DataFrame.from_dict(all_gamb, orient='index', columns=['AIM_fraction_gambiae'])
    df2 = pd.DataFrame.from_dict(all_arab, orient='index', columns=['AIM_fraction_arabiensis'])
    n_aimsdf = pd.DataFrame.from_dict(n_aims_per_chrom)
    n_aimsdf.to_csv(f"results/variantAnalysis/ancestry/n_AIMS_per_chrom_arab.tsv", sep="\t", index=True)

    df = df1.merge(df2, left_index=True, right_index=True)
    df.to_csv(f"results/variantAnalysis/ancestry/AIMs_summary_arab.tsv", sep="\t", index=True)

    rnaseqpop.plot_aims(df, n_aimsdf, species1="arabiensis", species2="gambiae", figtitle="AIM_fraction_whole_genome", total=True)
import sys
sys.stderr = open(snakemake.log[0], "w")
import os
import pandas as pd
import numpy as np
import re
import allel

# Read in parameters 
metadata = pd.read_csv(snakemake.params['metadata'], sep="\t")
ref = snakemake.input['ref']
gffpath = snakemake.params['gffpath']
chroms = snakemake.params['chroms']
autofastq = snakemake.params['fastq']
sweeps = snakemake.params['sweeps']
signalpath = "resources/signals.csv"


## Read fasta file and grep chromosome names, check if they match config
file = open(ref, "r")
refchroms = []
for line in file:
    if re.search(">", line):
        refchroms.append(line.split()[0].replace(">", ""))
assert (np.isin(chroms, refchroms)).all(), f"Not all chromosome names specified in config.yaml are found in the reference fasta file ({ref})"


# Check that the contrasts specified in config.yaml are all in the treatment column
comparisons = pd.DataFrame(snakemake.params['contrasts'], columns=['contrast'])
comparisons = comparisons.contrast.str.split("_", expand=True) # split contrasts into case v control 
comparisons = comparisons.values.ravel() # make 1d array 
assert (np.isin(comparisons, metadata['treatment']).all()), f"treatments specified in config.yaml contrasts do not exist in samples.tsv. {comparisons}"


# Check that samples in fastq.tsv match those in metadata.tsv
if autofastq is False:
    fastq = pd.read_csv(snakemake.params['table'], sep="\t")
    assert np.isin(fastq['sampleID'], metadata['sampleID']).all(), "Not all samples in the fastq table match those in samples.tsv; please check your fastq.tsv file"
else:
    # Check to see if fastq files match the metadata
    for sample in metadata['sampleID']:
        for n in [1,2]:
            fqpath = f"resources/reads/{sample}_{n}.fastq.gz"
            assert os.path.isfile(fqpath), f"Sample {sample} in 'samples.tsv' has no matching .fastq.gz file at {fqpath}; please rename your fastq files or consider using the fastq table option"


# Check column names of gene_names.tsv
gene_names = pd.read_csv(snakemake.params['gene_names'], sep="\t")
colnames = ['GeneID', 'TranscriptID', 'GeneDescription', 'GeneName']
assert (gene_names.columns.isin(colnames)).all(), f"Column names of gene_names.tsv should be {colnames}"


# Check that the chromosomes are present in the gff file 
gff = allel.gff3_to_dataframe(gffpath)
assert np.isin(chroms, gff.seqid).all(), f"Not all provided chromosome names ({chroms}) are present in the reference GFF file"
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

#' This script performs differential expression analysis at the 
#' gene level using DESeq2. It outputs heatmaps, PCAs, gene counts,
#' as well as DE results. Final DE results for all contrasts are 
#' saved in an .xlsx report

## Differential expression
library(DESeq2)
library(pheatmap)
library(data.table)
library(ggrepel)
library(openxlsx)
library(glue)
library(RColorBrewer)
library(tidyverse)
library(jsonlite)

#read metadata and get contrasts
metadata = fread(snakemake@input[['metadata']], sep="\t") %>% as.data.frame()
gene_names = fread(snakemake@input[['genes2transcripts']], sep="\t") %>% distinct()

contrastsdf = data.frame("contrast" = snakemake@params[['DEcontrasts']])
contrasts = contrastsdf$contrast

##### define functions ######
round_df = function(df, digits) {

  #' This function rounds all the numeric columns of a data.frame
  nums = vapply(df, is.numeric, FUN.VALUE = logical(1))
  df[,nums] = round(df[,nums], digits = digits)
  (df)
}

vst_pca = function(counts, samples, colourvar, name="PCA", st="", comparison=""){

  #' This function takes counts and sample metadata as input and builds a DESeq2 dataset
  #' Returning normalised and variance-stabilised counts, and performs PCA on the data
  #' Plotting and saving to pdf

  # make DESeq dataset
  dds = DESeqDataSetFromMatrix(countData = counts, 
                               colData = samples, 
                               design = ~ treatment)
  ###### estimate paramters and normalise 
  dds = estimateSizeFactors(dds)
  dds = estimateDispersions(dds)
  vsd = varianceStabilizingTransformation(dds)
  normcounts = counts(dds, normalized=TRUE)
  vstcounts = assay(vsd)
  vstcounts = vstcounts[order(apply(vstcounts,1,sum),decreasing=TRUE),]

  #### write pca of samples to pdf
  pca2=prcomp(t(vstcounts),center=TRUE)
  pc = data.frame(pca2$x) %>% rownames_to_column("sampleID")
  pc = left_join(pc, samples)

  pdf(glue("results/plots/{name}{st}{comparison}.pdf"))  
  print(ggplot(data=pc,aes(x=PC1, y=PC2, colour=treatment)) + 
    geom_point(size=6, alpha=0.8) + 
    geom_text_repel(aes(label=sampleID), colour="black") + 
    theme_light() + 
    labs(title=glue("{name} {st} {comparison}"),
         x=glue("PC1 - Variance explained - {round(summary(pca2)$importance[2,][1], 3)}"),
         y=glue("PC2 - Variance explained - {round(summary(pca2)$importance[2,][2], 3)}"))  + 
    theme(legend.position = "bottom",
          legend.title = element_blank(),
          legend.background=element_blank(),
          panel.grid.minor = element_blank(),
          panel.grid.major = element_blank(),
          panel.background = element_blank(),
          plot.title = element_text(hjust = 0.5),
          axis.text.x = element_text(colour="black", size=18),
          axis.text.y = element_text(colour="black", size=18)))
  null = dev.off()

  return(list(vstcounts, dds, normcounts))
}

volcano = function(data, title){

  data = data %>% mutate(col = case_when(padj < 0.05 & abs(log2FoldChange) > 1 ~ "#ff6666",
                                         padj > 0.05 & abs(log2FoldChange) > 1 ~ "#39e600",
                                         padj < 0.05 & abs(log2FoldChange) < 1 ~ "#3385ff",
                                         padj > 0.05 & abs(log2FoldChange) < 1 ~ "#b3b3b3"))

  data$labels = data %>% dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ GeneID,
                                                               is.na(GeneName) ~ GeneID,
                                                               TRUE ~ GeneName)) %>% select(Gene_name) %>% deframe()

  plot = ggplot(data=data, aes(x=log2FoldChange, y=-log10(padj), color=col, alpha=0.4), size=2) + 
    geom_point() + 
    scale_color_identity() +
    xlim(-10, 10) +
    ylab("-log10(P)") +
    xlab("log2 Fold Change") +
    ggtitle(title) + 
    theme_light() + 
    theme(legend.position = "none")  +
    geom_vline(xintercept = log2(2), linetype='dashed', colour='grey') + 
    geom_vline(xintercept = log2(0.5), linetype='dashed', colour='grey') + 
    geom_hline(yintercept = -log10(0.05), linetype='dashed', colour='grey')# + 
    #geom_text_repel(data = subset(data, data[['padj']] < 0.01 & abs(data[['log2FoldChange']]) > 2), aes(label = subset(data, data[['padj']] < 0.01 & abs(data[['log2FoldChange']]) > 2)[["labels"]], colour='black'))

  print(plot)
}

#### main ####
cat("\n", "------------- Kallisto - DESeq2 - RNASeq Differential expression ---------", "\n")

#### Read counts for each sample
df = list()
for (sample in metadata$sampleID){
  df[[sample]]= fread(glue("results/quant/{sample}/abundance.tsv"), sep = "\t")
}

counts = data.frame('TranscriptID' = df[[1]]$target_id)
# Get read counts for each gene and fill table
for (sample in metadata$sampleID){
  reads = df[[sample]]$est_counts
  counts = cbind(counts, reads)
}

# Rename columns
colnames(counts) = c("TranscriptID", metadata$sampleID)
## Aggregate to gene level
counts = left_join(counts, gene_names) %>% select(GeneID, metadata$sampleID)
counts = counts %>% group_by(GeneID) %>% summarise_all(sum)
gene_names = gene_names %>% select(-TranscriptID)

### Count total reads counted
counts = counts[!is.na(counts$GeneID),]
counts = counts %>% column_to_rownames('GeneID')
#### Get count statistics for each sample and plot ####
count_stats = apply(counts, 2, sum) %>% enframe(name="Sample", value="total_counts") # total counts
ngenes = nrow(counts)
count_stats$genes_zerocounts = apply(counts, 2, function(x){sum(x==0)}) # genes with zero counts 
count_stats$genes_lessthan10counts = apply(counts, 2, function(x){sum(x<10)}) # genes with less than 10 counts
count_stats = count_stats %>% dplyr::mutate("proportion_zero" = genes_zerocounts/ngenes,
                                     "proportion_low" = genes_lessthan10counts/ngenes)
count_stats %>% fwrite(., "results/quant/countStatistics.tsv",sep="\t")

print("Counting and plotting total reads per sample...")
pdf("results/quant/total_reads_counted.pdf")
ggplot(count_stats, aes(x=Sample, y=total_counts, fill=metadata$treatment)) + 
  geom_bar(stat='identity') + 
  theme_light() +
  ggtitle("Total reads counted (mapped to Ag transcriptome (PEST))") +
  theme(axis.text.x = element_text(angle=45),
        plot.title = element_text(hjust = 0.5))
null = dev.off() 

# round counts to whole numbers; they are non-integer due to averaging across transcripts
counts = counts %>% rownames_to_column('GeneID') %>% 
  dplyr::mutate_if(is.numeric, round) %>% column_to_rownames('GeneID')

############ Plots PCA with all data, and performs DESeq2 normalisation ########################
res = vst_pca(counts, metadata, colourvar = 'strain', name="PCA")
vstcounts = res[[1]]
dds = res[[2]]
normcounts = res[[3]]

### write out raw and normalised counts 
counts %>% 
  rownames_to_column("GeneID") %>% 
  fwrite(., "results/quant/rawcounts.tsv", sep="\t", row.names = FALSE)

normcounts %>% 
  as.data.frame() %>% 
  rownames_to_column("GeneID") %>% 
  round_df(., 1) %>% 
  fwrite(., "results/quant/normCounts.tsv", sep="\t", row.names = FALSE)

# calculate correlations between samples based on the count data, and plot heatmap
correlations = cor(vstcounts)
pdf("results/plots/heatmap_correlations.pdf")
pheatmap(correlations)
garbage = dev.off()

# add 'lab' column if it doesn't exist
if(!("lab" %in% colnames(metadata)))
{
  metadata$lab = 'FALSE'
}

# for each strain in the dataset, do a PCA plot
# analysis of case v control with DESEq2 and make PCAs and volcano plots.
if ("strain" %in% colnames(metadata)){
  for (sp in unique(metadata$species)){
    df_samples = metadata %>% filter(species == sp)
    df_samples = df_samples %>% filter(lab == 'FALSE') #remove lab samples
    for (st in unique(df_samples$strain)){
      ai_list = list()
      ai_list[[st]] = df_samples[df_samples$strain == st,]
      # if the strain has both case and control (i.e. exposed v unexposed)
      if (length(unique(ai_list[[st]]$cohort)) > 1){
        cat(glue("\n Running PCA for all {st} samples \n"))
        # subset counts data to our strains of interest
        subcounts = counts[colnames(counts) %in% ai_list[[st]]$sampleID]
        # perform PCA on the data at strain level
        vstcounts = vst_pca(subcounts, ai_list[[st]], 'treatment', "PCA_", st=st)[[1]]
      }
    }
  }
}

results_list = list()
names_list = list()
ngenes_list = list()
######### subset data and run DESeq for each combo we need, store in xlsx ########
for (cont in contrasts){
  control = str_split(cont, "_")[[1]][1] # get first of string, which is control 
  case = str_split(cont, "_")[[1]][2] # get case 
  controls = which(metadata$treatment %in% control)
  cases = which(metadata$treatment %in% case)

  ## Perform PCA for each comparison
  subcounts = counts[,c(controls, cases)]
  subsamples = metadata[c(controls, cases),]
  # make treatment a factor with the 'susceptible' as reference
  subsamples$treatment = as.factor(as.character(subsamples$treatment))
  subsamples$treatment = relevel(subsamples$treatment, control)
  null = vst_pca(subcounts, subsamples, colourvar='treatment', name="PCA_", comparison=cont)

  # perform DE with each comparison, using DESeq dataset (dds) from whole library
  cat("\n", glue("--- Running DESeq2 differential expression analysis on {cont} ---"), "\n")
  cds = nbinomWaldTest(dds)
  results = results(cds, contrast = c("treatment", case, control)) %>% as.data.frame() 
  results = results[order(results$padj),] #order by pvalue 
  results = results %>% rownames_to_column("GeneID") %>% dplyr::mutate("FC" = (2^log2FoldChange))

  ### absolute difference
  #### Get rowsums of counts, grouping by case/control. Then get difference of counts and join with DE results
  readdiff = data.frame(t(rowsum(t(subcounts), group = subsamples$treatment, na.rm = T))) #transpose and get rowsums for each group
  readdiff$absolute_diff = readdiff[,case] - readdiff[,control] #get difference
  readdiff = data.frame(readdiff) %>% rownames_to_column('GeneID')
  results = unique(left_join(results, readdiff[,c('GeneID','absolute_diff')]))

  # join DE results with normal gene names
  results = unique(left_join(results, gene_names))
  fwrite(results, glue("results/genediff/{cont}.csv")) #write to csv 
  # volcano plot for each comparison, using the custom volcano() function. First make a vector of labels, which is the AGAP ID unless a gene name exists
  results$labels = results %>% dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ GeneID,
                                     is.na(GeneName) ~ GeneID,
                                     TRUE ~ GeneName)) %>% select(Gene_name) %>% deframe()

  #get number of sig genes 
  res1 = results %>% filter(padj < 0.05) %>% 
    count("direction" = FC > 1) %>% 
    dplyr::mutate("direction" = case_when(direction == FALSE ~ "Downregulated, padj = 0.05",
                                   direction == TRUE ~ "Upregulated, padj = 0.05")) %>%
    dplyr::rename(!!glue("{cont}_ngenes") := "n")

  res2 = results %>% filter(padj < 0.001) %>% 
    count("direction" = FC > 1) %>% 
    dplyr::mutate("direction" = case_when(direction == FALSE ~ "Downregulated, padj = 0.001",
                                   direction == TRUE ~ "Upregulated, padj = 0.001")) %>%
    dplyr::rename(!!glue("{cont}_ngenes") := "n")

  ngenes_list[[cont]] = bind_rows(res1, res2)

  # store names of comparisons for xlsx report sheets
  results_list[[cont]] = results
  names_list[[cont]] = cont

  pdf(glue("results/genediff/Volcano_plot_{cont}.pdf"))
  volcano(data = results_list[[cont]], title = cont)
  null = dev.off()
  cat("\n", glue("{cont} complete!"), "\n")
}

# Join different comparisons together and write out number of sig genes 
purrr::reduce(ngenes_list, inner_join) %>% fwrite("results/genediff/nsig_genes.tsv", sep="\t", col.names = TRUE)

#### write to excel file on diff sheets #### 
sheets = unlist(names_list)
wb <- createWorkbook("Workbook")

for (i in 1:length(sheets)){
  addWorksheet(wb, glue("{sheets[[i]]}"))
  writeData(wb, sheets[i], results_list[[i]], rowNames = FALSE, colNames = TRUE)
}
#### save workbook to disk once all worksheets and data have been added ####
saveWorkbook(wb,file=snakemake@output[['xlsx']], overwrite = TRUE)


### Get kallisto statistics ###

totalReads = c()
totalAligned = c()

# open kallisto json info and store stats, save to file 
for (sample in metadata$sampleID){
  run = fromJSON(glue("results/quant/{sample}/run_info.json"), flatten=TRUE)

  totalReads = c(totalReads, run$n_processed)
  totalAligned = c(totalAligned, run$n_pseudoaligned)

}

df = data.frame("sample" = c(metadata$sampleID, "Total"),
                "totalReads" = c(totalReads, sum(totalReads)),
                "totalAligned" = c(totalAligned, sum(totalAligned))) %>% 
  mutate("percentage" = (totalAligned/totalReads)*100) %>% 
  fwrite("results/quant/KallistoQuantSummary.tsv", sep="\t", col.names = TRUE)


sessionInfo()
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")


#' This script reads in tables of allele counts, produced with samtools mpileup.
#' Using the R package kissDE, it performs a differential SNP testing 
#' analysis, testing if certain alleles are biased to be found in one 
#' treatment group 

library(tidyverse)
library(data.table)
library(kissDE)
library(glue)
library(rtracklayer)


######## parse inputs #############
chroms = snakemake@params[['chroms']]
metadata = fread(snakemake@input[['metadata']])
contrasts = data.frame("contrast" = snakemake@params[['DEcontrasts']])
comparisons = contrasts %>% separate(contrast, into = c("control", "case"), sep = "_")
mincounts = snakemake@params[['mincounts']]
pval = snakemake@params[['pval_flt']]
gene_names = fread(snakemake@input[['geneNames']], sep="\t") %>% select(-TranscriptID) %>% distinct()

gffpath = snakemake@input[['gff']]
#load gff with rtracklayer and filter to genes only 
gff = rtracklayer::import(gffpath) %>% 
  as.data.frame() %>% 
  filter(type == 'gene') %>% 
  select(seqnames, start, end, ID) %>% 
  dplyr::rename("chrom" = 'seqnames') %>% 
  arrange(chrom, start) %>% 
  as.data.table()
#remove prefix for chrom column, so it matches the bed file 

#for each contrast/comparison, perform differential SNP analysis 
for (i in 1:nrow(comparisons)){
  name = contrasts[i,]
  control = comparisons$control[i]
  case = comparisons$case[i]

  nrepscontrol =  sum(metadata$treatment %in% control)
  nrepscase = sum(metadata$treatment %in% case)

  samples = metadata[metadata$treatment %in% c(case, control),]$sampleID
  print(glue("Extracting allele tables for {case}, {control}"))

  sample_list = list()
  for (sample in samples){

    chrom_list = list()
    for (chrom in chroms){
      #read in data 
      alleles  = fread(glue("results/variantAnalysis/alleleTables/{sample}.chr{chrom}.allele.table"))
      #sum lowercase and uppercase alleles, make new column (refcount)
      chrom_list[[chrom]] = alleles %>% 
        mutate("A" = A+a,"T" = T+t,"G" = G+g,"C" = C+c) %>% 
        select(-c(a,t,c,g,V15)) %>% 
        mutate(refcount = case_when(ref == 'T' ~ T,
                                    ref == 'G' ~ G, 
                                    ref == 'A' ~ A,
                                    ref == 'C' ~ C))
    }
    #bind chroms together
    alleles = rbindlist(chrom_list)
    # this step finds the most abundant ALT base, whether A, T, C or G, and assigns its count as 'altcount'
    list_ = list()
    for (base in c("A", "C", "G", "T")){
      a = alleles %>% filter(ref == base) %>% select(A,T,C,G) %>% select(-all_of(base))
      b = alleles %>% filter(ref == base)
      col = colnames(a)[apply(a, 1 , which.max)]
      b$alt = col
      b$altcount = apply(a, 1, max)
      list_[[base]] = b
    }

    alleles = rbindlist(list_) %>% arrange(chr, loc)
    #pivot to get separate rows for refcount and altcount as preferred by kissDE package
    sample_list[[sample]] = alleles %>% select(chr, loc, ref,alt, refcount, altcount) %>% 
      pivot_longer(c(refcount, altcount),names_to="type", values_to=sample) %>% 
      mutate(eventsName = paste0("chr",chr,"_",loc,"_",ref,">",alt)) %>% 
      select(eventsName,type, sample)

  }

  #merge all counts across samples 
  counts = sample_list %>% 
    purrr::reduce(full_join, by=c("eventsName", "type")) %>% 
    mutate("eventsLength" = 1) %>% 
    replace(is.na(.), 0) %>% 
    select(-type) %>% 
    relocate(eventsLength, .after=eventsName) %>% 
    as.data.frame()

  #sum refs and alts together for each snp to use as filter
  print(glue("Filtering SNP count data at a minimum of {mincounts} summed across samples"))
  summed = counts %>% 
    group_by(eventsName, eventsLength) %>% 
    summarise_all(sum)
  #get rowsums of counts, to use as filter
  rowcounts = summed %>% ungroup %>% select(-c(eventsName, eventsLength)) %>% rowSums()
  summed = summed[rowcounts >= mincounts,]
  print(glue("Retaining {nrow(summed)} SNPs, out of {length(rowcounts)}"))
  #remove snps that have a count across samples less than mincounts(100)
  counts = counts[counts$eventsName %in% summed$eventsName,]

  conditionsde = c(rep(control, nrepscontrol), rep(case, nrepscase))
  #rename counts columns for kissde
  #colnames(counts) = colnames(counts) %>% 
  #  str_replace("1", "rep1") %>% 
  #  str_replace("2", "rep2") %>% 
  #  str_replace("3", "rep3")
  #run kissde quality control
  print(glue("Running kissDE QC for {name}"))
  qualityControl(counts, conditionsde, storeFigs = glue("results/variantAnalysis/diffsnps/kissDEfigs_{name}"))

  #Run kissde algorithm to find differentially expressed variants
  de_Vars = diffExpressedVariants(counts, conditions = conditionsde)                    

  #parse results to more readable, filterable form 
  results = de_Vars[[1]] %>% separate(ID, into = c("chrom", "pos", "REF>ALT"), sep="_") %>% 
    mutate("pos" = as.numeric(pos), "chrom" = str_remove(chrom, "chr")) %>% 
    arrange(chrom, pos)

  ##### intersect (data.table::foverlap) gff and results to get gene names and descriptions of snps #######
  bed = results %>% 
    select(chrom, pos, Adjusted_pvalue, `Deltaf/DeltaPSI`, `REF>ALT`) %>% 
    mutate("chrom" = as.character(chrom), "start" = as.numeric(pos) -1, "end" = as.numeric(pos)) %>% 
    select(chrom, start, end, Adjusted_pvalue, `Deltaf/DeltaPSI`,`REF>ALT`, -pos) %>% 
    as.data.table()

  # Data table fast overlaps, useful to find intersections and join  
  setkey(gff, chrom, start, end)
  de_variants = foverlaps(bed, gff, 
                          by.x=c("chrom", "start", "end"), 
                          by.y=c("chrom", "start", "end"),
                          type = "within",
                          nomatch = 0L) %>% 
    dplyr::rename("GeneID" = "ID")

    # Write to file
  print(glue("Writing {name} results to results/variantAnalysis/diffsnps/"))
  results %>% fwrite(., glue("results/variantAnalysis/diffsnps/{name}.normcounts.tsv"), sep="\t", row.names=FALSE)

  de_variants = left_join(de_variants, gene_names) %>% distinct()

  fwrite(de_variants, glue("results/variantAnalysis/diffsnps/{name}.kissDE.tsv"), sep="\t", row.names=FALSE)

  de_variants %>% 
    filter(Adjusted_pvalue <= pval) %>% 
    fwrite(., glue("results/variantAnalysis/diffsnps/{name}.sig.kissDE.tsv"), sep="\t", row.names=FALSE)
}

sessionInfo()
import sys
sys.stderr = open(snakemake.log[0], "w")

import numpy as np
import pandas as pd
import allel

chroms = snakemake.params['chroms']

for chrom in chroms:
    vcf = allel.read_vcf(f"results/variantAnalysis/vcfs/annot.missense.{chrom}.vcf")

    pos = vcf['variants/POS']
    pos1 = pos+1

    data = {'chrom':chrom, 
        'start':pos,
        'stop':pos1}

    bed = pd.DataFrame(data)
    bed.to_csv(f"resources/regions/missense.pos.{chrom}.bed", sep="\t", header=None, index=None)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

### Contribution of IR gene categories to overall gene expression
library(tidyverse)
library(data.table)
library(glue)

##### define functions ######
round_df = function(df, digits) {

  #' This function rounds all the numeric columns of a data.frame
  nums = vapply(df, is.numeric, FUN.VALUE = logical(1))
  df[,nums] = round(df[,nums], digits = digits)
  (df)
}

metadata = fread(snakemake@input[['metadata']], sep="\t") %>% as.data.frame()
#samples = fread("config/samples.tsv", sep="\t") %>% as.data.frame()
gene_names = fread("resources/gene_names.tsv", sep="\t")[1:14459,] %>% distinct()

# Read in normalised counts and sum biological replicates
counts = fread("results/quant/normcounts.tsv", sep="\t") %>% column_to_rownames("GeneID")
counts = data.frame(t(rowsum(t(counts), group = metadata$treatment, na.rm = T)))

apply(counts, 2, var)

colSums(counts)

# Get total counts for each treatment
totalCounts = colSums(counts)

# Get lists of P450s, COEs and GSTs based on gene description
p450genes = gene_names[str_detect(gene_names$Gene_description, "P450"),] %>% select(Gene_stable_ID, Gene_name)
p450genes = p450genes[!p450genes$Gene_name %in% c("CYP4G16", "CYP4G17")] %>% distinct() # remove these genes from the P450 set, as they are counted in the cuticular group
coegenes = gene_names[str_detect(gene_names$Gene_description, "carboxylesterase"),] %>% select(Gene_stable_ID)
coegenes = bind_rows(coegenes, data.frame("Gene_stable_ID" = "AGAP006227"))
gstgenes = gene_names[str_detect(gene_names$Gene_description, c("glutathione S"))] %>% select(Gene_stable_ID, Gene_name, Gene_description)


genegrps = list()
for (file in list.files("resources/annotation/")){
  group = str_remove(file, ".txt") 

  genegrps[[group]] = fread(glue("resources/annotation/{file}"), sep="\t") %>% 
    rename("Gene_stable_ID" = "Gene stable ID") %>% 
    select(`Gene_stable_ID`) %>% distinct()

}

# Add manually searched groups
genegrps[["P450s"]] = p450genes
genegrps[["COEs"]] = coegenes
genegrps[["GSTs"]] = gstgenes


# Loop through groups and calculate % contribution to total counts
perc_contrib = list()
for (group in names(genegrps)){
  subcounts = counts %>% 
    rownames_to_column("GeneID") %>% 
    filter(GeneID %in% genegrps[[group]]$`Gene_stable_ID`) %>% 
    column_to_rownames("GeneID")

  perc_contrib[[group]] = data.frame((colSums(subcounts) / totalCounts) *100) %>% 
    rownames_to_column("sample")
  colnames(perc_contrib[[group]]) = c("sample", group)

}

# Join dfs
perc_contrib_df = purrr::reduce(perc_contrib, merge) %>% round_df(3)

# Loop through each category and plot % contribution to total counts
for (group in names(genegrps)){
  plt =  ggplot(perc_contrib_df, aes_string(x="sample", y=glue("{group}"), fill="sample")) + 
    geom_bar(stat="identity") + 
    ggtitle(glue("% of read counts {group}")) +
    theme_light()

  pdf(glue("results/plots/percentageContribution_{group}.pdf"))
  print(plt)
  dev.off()
}

# Write to file 
fwrite(perc_contrib_df, 
       "results/quant/percentageContributionGeneCategories.tsv", sep="\t")



sessionInfo()
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Generate freebayes params ##
# 1) make bed for parallelising freebayes
library(dplyr)
library(data.table)
library(glue)

# read inputs
chroms = snakemake@params[['chroms']]
chunks = snakemake@params[['chunks']]
fai = fread(snakemake@input[['index']])

# select chroms we want, and start, end columns
fai = fai[fai$V1 %in% chroms, c(1,2)]

# for each chromosome
for (chrom in chroms){
   #subset index to desired chrom
   f = fai[fai$V1 == chrom]
   #get sequence of n chunks from 0 to length of chrom
   bedseq = round(seq(0, f$V2, length.out = chunks))

   #for each chunk
   for (i in 1:(chunks-1)){
      #write bed file, one for each chunk/interval, which will be passed as input to freebayes
      row = c(chrom, bedseq[i], bedseq[i+1])
      data.frame(row) %>% t() %>% fwrite(., glue("resources/regions/genome.{chrom}.region.{i}.bed"), sep="\t", col.names = FALSE)
   }
}



# 2) Make bamlist and populations.tsv file

metadata = fread(snakemake@params[['metadata']], sep="\t")

metadata$bams = paste0("results/alignments/", metadata$sampleID,".split.bq.bam")

metadata %>% 
   select(bams, strain) %>% 
   fwrite(., snakemake@output[['pops']], sep="\t", row.names = FALSE, col.names = FALSE)

metadata %>% 
   select(bams) %>% 
   fwrite(.,snakemake@output[['bamlist']], sep="\t", row.names = FALSE, col.names = FALSE)

sessionInfo()
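
The interval logic above is worth spelling out: seq(0, chrom_length, length.out = chunks) produces `chunks` evenly spaced breakpoints, so the inner loop writes chunks-1 BED intervals that tile the chromosome for parallel freebayes jobs. A minimal Python sketch of the same arithmetic, with a hypothetical chromosome name and length:

import numpy as np

chrom, chrom_len, chunks = "2L", 50_000_000, 5  # hypothetical values
breakpoints = np.round(np.linspace(0, chrom_len, num=chunks)).astype(int)

for i in range(chunks - 1):
    # one BED interval per chunk, e.g. "2L	0	12500000"
    print(f"{chrom}\t{breakpoints[i]}\t{breakpoints[i + 1]}")
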
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")


#' This script runs GSEA on each contrast in the workflow, using gene lists
#' ranked by adjusted p-value from the DE data, by Fst, by PBS (optional),
#' and by the differential SNPs analysis

# GSEA RNA-Seq Ag
library(GO.db)
library(gage)
library(fgsea)
library(data.table)
library(glue)
library(tidyverse)

## functions ##
runEnrich = function(rankedList, GeneSetList, outName){

  for (set in c("kegg", "GO")){
    # pathway enrichment with fgsea for both KEGG and GO terms
    fgseaRes = fgsea(pathways = GeneSetList[[set]], 
                     stats = na.omit(rankedList),
                     minSize  = 15,
                     maxSize  = 500,
                     nperm=10000)

    fgseaRes %>% arrange(padj) %>% fwrite(., file = glue("results/gsea/{outName}.{set}.tsv"), sep = "\t")

    topPathwaysUp = fgseaRes[ES > 0][head(order(pval), n=10), pathway]
    topPathwaysDown = fgseaRes[ES < 0][head(order(pval), n=10), pathway]
    topPathways = c(topPathwaysUp, rev(topPathwaysDown))
    pdf(glue("results/gsea/{outName}.{set}.fgsea.pdf"))
    plotGseaTable(GeneSetList[[set]][topPathways], na.omit(rankedList), fgseaRes)
    null = dev.off()
    print(glue("{set} complete"))
  }
}


###### configuration - metadata and parameters ######
metadata = fread(snakemake@input[['metadata']]) %>% as.data.frame()
comparisons = data.frame("contrast" = snakemake@params[['DEcontrasts']])
gaffile = snakemake@input[['gaf']]
variantCalling = snakemake@params[['VariantCalling']]
pbs = snakemake@params[['pbs']]
diffsnps = snakemake@params[['diffsnps']]
pbscomps = snakemake@params[['pbscomps']]
replaceString = snakemake@params[['replaceStringKegg']]
speciesID = snakemake@params[['KeggSpeciesID']]


GeneSetList = list()
######### GO Terms #########
# read in gaf file and select to appropriate columns and rename
go = fread(gaffile, sep = "\t", skip = 1, header = FALSE) %>% 
  as_tibble() %>% dplyr::select(c(2,5)) %>% distinct() %>% dplyr::rename("GeneID" = V2, "GO.id" = V5)
# download GO terms and descriptions as we need descriptions
goterms = Term(GOTERM) %>% enframe() %>% dplyr::rename("GO.id" = "name", "description" = "value")
# join with our go terms
go = inner_join(go, goterms) %>% dplyr::select("description", "GeneID")
# turn into a large list of terms and all the genes with that term
GOlist = go %>% pivot_wider(names_from="description", values_from="GeneID") %>% transpose()
GeneSetList[["GO"]] = GOlist[[1]]

#### KEGG Pathways ####
kg.geneset = kegg.gsets(speciesID)
GeneSetList[['kegg']] = kg.geneset$kg.sets[kg.geneset$sigmet.idx]
# optionally strip a string from gene IDs due to strange KEGG naming
if (!is.null(replaceString)){
  GeneSetList[['kegg']] = lapply(GeneSetList[['kegg']], str_remove, replaceString)
}


##### gene DE ####
for (comp in comparisons$contrast){
  print(glue("Running KEGG and GO enrichment analyses for {comp}"))
  # make ranked list from the DE results: upregulated genes (FC > 1), ranked by adjusted p-value
  rank = fread(glue("results/genediff/{comp}.csv")) %>% filter(FC > 1) %>%
    arrange(padj) %>% 
    dplyr::select(c('GeneID', all_of("padj"))) %>% 
    deframe()
  runEnrich(rankedList = rank, GeneSetList = GeneSetList, outName = glue("genediff/{comp}.DE"))
}


### Fst ####
if (variantCalling == TRUE){
  for (comp in comparisons$contrast){
  print(glue("Running KEGG and GO enrichment analyses for Fst {comp}"))
  # make ranked list of genes, ranked by zFst
  rank = fread("results/variantAnalysis/selection/FstPerGene.tsv") %>% 
    distinct() %>% 
    dplyr::select(GeneID, glue("{comp}_zFst")) %>% 
    deframe()

  rank = rank %>% sort(decreasing = TRUE)
  runEnrich(rankedList = rank, GeneSetList = GeneSetList, outName = glue("fst/{comp}.FST"))
  }
}



### PBS ####
if (pbs == TRUE){
  for (comp in pbscomps){
  print(glue("Running KEGG and GO enrichment analyses for PBS {comp}"))
  # make ranked list of genes, ranked by absolute PBS
  rank = fread("results/variantAnalysis/selection/PbsPerGene.tsv") %>% 
    distinct() %>% 
    drop_na() %>% 
    dplyr::select(GeneID, glue("{comp}PBS")) %>% 
    deframe()

  rank = rank %>% abs() %>% sort(decreasing = TRUE)

  runEnrich(rankedList = rank, GeneSetList = GeneSetList, outName = glue("pbs/{comp}.PBS"))
  }
}


#### diff SNPs ####
if (diffsnps == TRUE){
  for (comp in comparisons$contrast){
  print(glue("Running KEGG and GO enrichment analyses for diffSNPs {comp}"))
  # make ranked list of genes, ranked by Deltaf/DeltaPSI from the kissDE results
  rank = fread(glue("results/variantAnalysis/diffsnps/{comp}.kissDE.tsv"), sep="\t") %>% 
    arrange(`Deltaf/DeltaPSI`) %>% 
    dplyr::select(c('GeneID', all_of("Deltaf/DeltaPSI"))) %>% 
    deframe()
  runEnrich(rankedList = rank, GeneSetList = GeneSetList, outName = glue("diffsnps/{comp}.diffsnps"))
  }
}

sessionInfo()
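
For intuition about what fgsea does with these ranked lists: preranked GSEA walks down the list, stepping a running sum up (weighted by the ranking metric) at gene-set members and down elsewhere, and takes the largest deviation as the enrichment score. A simplified toy sketch of that core idea, not the fgsea implementation (gene names and scores are invented):

# running-sum enrichment score, weighted by the ranking metric (weight p = 1)
ranked = {"geneA": 4.0, "geneB": 3.1, "geneC": 2.5, "geneD": 0.4, "geneE": 0.1}
gene_set = {"geneA", "geneC"}

hit_total = sum(abs(s) for g, s in ranked.items() if g in gene_set)
miss_step = 1 / (len(ranked) - len(gene_set))  # equal down-steps for non-members

running, es = 0.0, 0.0
for gene, score in ranked.items():  # dict preserves rank order
    running += abs(score) / hit_total if gene in gene_set else -miss_step
    es = max(es, running, key=abs)  # keep the largest absolute deviation
print(f"enrichment score = {es:.2f}")  # ~0.67 here
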
import sys
sys.stderr = open(snakemake.log[0], "w")

import rnaseqpoptools as rnaseqpop
import pandas as pd 
import matplotlib.pyplot as plt

# Read in parameters from snakemake
ploidy = snakemake.params['ploidy']
invs = snakemake.params['inversions']
metadata = pd.read_csv(snakemake.params['metadata'], sep="\t")
metadata = metadata.sort_values(by='species')
dataset = snakemake.params['dataset']


karyo = {}
for inv in invs:
    df = pd.read_csv(f"results/karyotype/{inv}.{dataset}.karyo.txt", sep=r"\s+", header=None)
    df = df.rename(columns={0:'sampleID', 1:'KaryoScore', 2:'n_SNPtags'})
    df[inv] = df['KaryoScore']/ploidy
    df.rename(columns={inv:f'{inv} frequency'}).to_csv(f"results/karyotype/{inv}.{dataset}.karyo.txt", sep="\t")

    karyo[inv] = df[['sampleID', inv]]

# concat all the dfs in the dict and remove duplicate cols
karyo = pd.concat(karyo.values(), axis=1).T.drop_duplicates().T.set_index("sampleID")

## transpose and round to 2 decimals
karyo = karyo.T.astype("float64").round(2)
rnaseqpop.plotRectangular(karyo, path="results/karyotype/karyoFreqs.svg" , cmap='mako_r', ylab='Inversion', figsize=[10,5])

# Compute average karyotype frequencies per treatment
df = karyo.T.reset_index()
df = df.merge(metadata[['sampleID', 'treatment']])
df = df.groupby("treatment").agg('mean').T.astype("float64").round(2)
rnaseqpop.plotRectangular(df, path="results/karyotype/karyoOverallFreqs.svg", ylab='Inversion', cbar=False, figsize=[8,4])
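
The per-treatment averaging above is a plain groupby-mean over replicates. A toy sketch (sample names and frequencies are hypothetical):

import pandas as pd

freqs = pd.DataFrame({
    "sampleID": ["res1", "res2", "con1", "con2"],
    "treatment": ["resistant", "resistant", "control", "control"],
    "2La": [0.75, 0.25, 0.00, 0.50],  # inversion frequency = KaryoScore / ploidy
})
# mean inversion frequency across the replicates of each treatment
print(freqs.groupby("treatment")["2La"].mean().round(2))  # control 0.25, resistant 0.50
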
import sys
sys.stderr = open(snakemake.log[0], "w")

import rnaseqpoptools as rnaseqpop
import pandas as pd 
import numpy as np
import allel
from adjustText import adjust_text


# Read in parameters from snakemake
dataset = snakemake.params['dataset']
metadata = pd.read_csv(snakemake.input['metadata'], sep="\t")
metadata = metadata.sort_values(by='species')
chroms = snakemake.params['chroms']
ploidy = snakemake.params['ploidy']
numbers = rnaseqpop.get_numbers_dict(ploidy)
qualflt = snakemake.params['qualflt']
missingprop = snakemake.params['missingprop']


for i, chrom in enumerate(chroms):

    # Read in and Filter VCF
    path = f"results/variantAnalysis/vcfs/{dataset}.{chrom}.vcf.gz"
    vcf, geno, acsubpops, pos, depth, snpeff, subpops, populations = rnaseqpop.readAndFilterVcf(path=path,
                                                           chrom=chrom,
                                                           samples=metadata,
                                                           numbers=numbers,
                                                           ploidy=ploidy,
                                                           qualflt=qualflt,
                                                           missingfltprop=missingprop)


    #### Principal Components Analysis (PCA) ####
    # Set up dict to store indices for colours
    d={}
    for name, inds in subpops.items():
        for n in range(len(inds)):
            p = inds[n]
            d[p] = name

    # Store dict as a dataframe and get colours 
    treatment_indices = pd.DataFrame.from_dict(d, orient='index').reset_index()
    treatment_indices = treatment_indices.rename(columns = {'index':'sample_index', 0:"name"})
    pop_colours = rnaseqpop.get_colour_dict(treatment_indices['name'], "viridis")

    # Run PCA function defined in rnaseqpoptools
    print(f"Performing PCA on {dataset} chromosome {chrom}")
    rnaseqpop.pca(geno, chrom, ploidy, dataset, populations, metadata, pop_colours, prune=True, scaler=None)
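
The PCA itself is delegated to rnaseqpoptools. A minimal sketch of the standard scikit-allel recipe such a helper is typically built on, using random toy genotypes (the pruning parameters here are illustrative assumptions, not the workflow's settings):

import numpy as np
import allel

rng = np.random.default_rng(42)
# toy diploid genotypes: 1000 variants x 10 samples x 2 alleles
g = allel.GenotypeArray(rng.integers(0, 2, size=(1000, 10, 2)))
gn = g.to_n_alt()  # alternate-allele counts, shape (variants, samples)

# remove SNPs in strong local LD before running PCA
unlinked = allel.locate_unlinked(gn, size=100, step=20, threshold=0.1)
coords, model = allel.pca(gn[unlinked], n_components=2, scaler='patterson')
print(coords.shape)  # (10, 2): one point per sample
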
import sys
sys.stderr = open(snakemake.log[0], "w")

import rnaseqpoptools as rnaseqpop
import pandas as pd
import numpy as np
import allel
from scipy import stats
from functools import partial, reduce
import warnings
warnings.filterwarnings('ignore') # suppress numpy runtime warnings, this is a bit dangerous, should be removed for release or resolve source of warnings

# snakemake inputs and params
dataset = snakemake.params['dataset']
metadata_path = snakemake.input['metadata']
metadata = pd.read_csv(metadata_path, sep=r"\s+")
gffpath = snakemake.input['gff']
pbs = snakemake.params['pbs']
pbscomps = snakemake.params['pbscomps']
chroms = snakemake.params['chroms']
ploidy = snakemake.params['ploidy']
numbers = rnaseqpop.get_numbers_dict(ploidy)
missingprop = snakemake.params['missingprop']

# gff
features = allel.gff3_to_dataframe(gffpath,
                       attributes=["ID", "description"])
gff = features[features.type == 'gene']
# gene names file, rename
gene_names = pd.read_csv(snakemake.input['geneNames'], sep="\t").drop(columns=['TranscriptID']).drop_duplicates()

### main ####
# Read in list of contrasts
comparisons = pd.DataFrame(snakemake.params['DEcontrasts'], columns=['contrast'])
comparisons = comparisons.contrast.str.split("_", expand=True)
comparisons.columns = ['sus', 'res']
comparisons = [list(row) for i,row in comparisons.iterrows()]

print(f"The pairwise comparisons for Fst are {comparisons}")

fstbychrom={}
if pbs: pbsbychrom={}
tajdbychrom={}
gdivbychrom = {}
dxybychrom = {}

for chrom in chroms:

    # path to vcf
    path = f"results/variantAnalysis/vcfs/{dataset}.{chrom}.vcf.gz"
    # function to read in vcfs and associated SNP data
    vcf, geno, acsubpops, pos, depth, snpeff, subpops, pops = rnaseqpop.readAndFilterVcf(path=path,
                                                               chrom=chrom, 
                                                               samples=metadata,
                                                               numbers=numbers,
                                                               ploidy=ploidy,
                                                               qualflt=30,
                                                               missingfltprop=missingprop)
    # subset gff to appropriate chrom
    genes = gff[gff.seqid == chrom].sort_values('start').reset_index(drop=True)

    ### Average Fst, pbs, tajima d for each gene
    fst_per_comp = {}
    fst_per_gene = {}
    pbs_per_gene = {}
    pbs_per_comp = {}
    tajd_per_pop = {}
    tajd_per_gene = {}
    gdiv_per_pop = {}
    gdiv_per_gene = {}
    se_per_comp = {}
    se_per_gene = {}
    dxy_per_comp = {}
    dxy_per_gene = {}
    pos_dict = {}
    n_dict = {}

    # loop through each gene and calculate fst, pbs, tajimas d, or sequence diversity for each comparison
    for i, gene in genes.iterrows():
        ID = gene.ID
        # locate_ranges() to get a boolean, needed as locate_range() will throw errors if no snps found in gene
        gene_bool = pos.locate_ranges([gene['start']], [gene['end']], strict=False)
        nsnps = gene_bool.sum()

        # if there are fewer than 2 SNPs in this gene, skip to the next
        if nsnps < 2:
            continue

        # store number of snps per gene
        n_dict[ID] = nsnps
        # store midpoint positions of gene
        pos_dict[ID] = (gene['start'] + gene['end'])/2

        # fst and dxy per gene between each comparison
        for comp1,comp2 in comparisons:
            name = comp1 + "_" + comp2
            ac1 = acsubpops[comp1].compress(gene_bool, axis=0)
            ac2 = acsubpops[comp2].compress(gene_bool, axis=0)

            fst_per_comp[name], se_per_comp[name],_,_= allel.average_hudson_fst(ac1, ac2, blen=1)

            dxy_per_comp[name] = allel.sequence_divergence(pos[gene_bool], ac1, ac2)

        # tajimas d and sequence diversity per gene for each subpop (i.e. treatment)
        for subpop in subpops:
            ac = acsubpops[subpop].compress(gene_bool)
            genepos = pos[gene_bool]
            tajd_per_pop[subpop] = allel.tajima_d(ac=ac, pos=genepos)
            gdiv_per_pop[subpop] = allel.sequence_diversity(ac=ac, pos=genepos)

        # pbs for each gene for each PBS comparison defined in config.yaml
        if pbs is True:
            for pbscomp in pbscomps:
                pop1, pop2, outpop = pbscomp.split("_")
                pbs_per_comp[pbscomp], se, _, _ = rnaseqpop.meanPBS(acsubpops[pop1].compress(gene_bool, axis=0),
                                                                    acsubpops[pop2].compress(gene_bool, axis=0),
                                                                    acsubpops[outpop].compress(gene_bool, axis=0),
                                                                    window_size=1,
                                                                    normalise=True)
        # store inner dict in outer dicts
        fst_per_gene[ID] = dict(fst_per_comp)
        se_per_gene[ID] = dict(se_per_comp)
        if pbs is True : pbs_per_gene[ID] = dict(pbs_per_comp)
        tajd_per_gene[ID] = dict(tajd_per_pop)
        gdiv_per_gene[ID] = dict(gdiv_per_pop)
        dxy_per_gene[ID] = dict(dxy_per_comp)

    # invert the dicts so the comparisons/subpops are the outer keys
    fst_per_gene = rnaseqpop.flip_dict(fst_per_gene)
    se_per_gene = rnaseqpop.flip_dict(se_per_gene)
    if pbs is True : pbs_per_gene = rnaseqpop.flip_dict(pbs_per_gene)
    tajd_per_gene = rnaseqpop.flip_dict(tajd_per_gene)
    gdiv_per_gene = rnaseqpop.flip_dict(gdiv_per_gene)
    dxy_per_gene = rnaseqpop.flip_dict(dxy_per_gene)

    print(f"Chromosome {chrom} complete...\n")
    for comp1,comp2 in comparisons:
        name = comp1 + "_" + comp2
        a = np.array(list(fst_per_gene[name].values()))
        print(f"Overall Fst (averaged across genes) for chromosome {chrom} between {name} is {np.nanmean(a)}")

    # Make dataframe of number of snps per gene (that pass quality and missingness filters)
    ndf = pd.DataFrame.from_dict(n_dict, orient='index').reset_index(drop=False)
    ndf.columns = ['GeneID', 'nSNPs']

    # Make dataframe of midpoints of each gene
    posdf = pd.DataFrame.from_dict(pos_dict, orient='index').reset_index(drop=False)
    posdf.columns = ['GeneID', 'Gene_midpoint']

    # Make dataframe of fst for each comparison
    fst_dfs = {}
    se_dfs = {}
    dxy_dfs = {}
    for comp1,comp2 in comparisons:
        name = comp1 + "_" + comp2
        fst_df = pd.DataFrame.from_dict(fst_per_gene[name], orient='index').reset_index(drop=False)
        fst_df.columns = ['GeneID', (name + '_zFst')]
        fst_dfs[name] = fst_df

        se_df = pd.DataFrame.from_dict(se_per_gene[name], orient='index').reset_index(drop=False)
        se_df.columns = ['GeneID', (name + '_SE')]
        se_dfs[name] = se_df

        dxy_df = pd.DataFrame.from_dict(dxy_per_gene[name], orient='index').reset_index(drop=False)
        dxy_df.columns = ['GeneID', (name + '_dxy')]
        dxy_dfs[name] = dxy_df


    my_reduce = partial(pd.merge, on='GeneID', how='outer')
    fst_allcomparisons = reduce(my_reduce, fst_dfs.values())
    se_allcomparisons = reduce(my_reduce, se_dfs.values())
    fst_allcomparisons = reduce(my_reduce, [fst_allcomparisons, se_allcomparisons])
    fst_allcomparisons['chrom'] = chrom
    dxy_allcomparisons = reduce(my_reduce, dxy_dfs.values())
    dxy_allcomparisons['chrom'] = chrom

    tajd_dfs = {}
    gdiv_dfs = {}
    # Store sequence diversity and tajimas d for each gene and each subpop
    for subpop in subpops:
        tajd_df = pd.DataFrame.from_dict(tajd_per_gene[subpop], orient='index').reset_index(drop=False)
        tajd_df.columns = ['GeneID', (subpop+"_Tajima_d")]
        tajd_dfs[subpop] = tajd_df
        gdiv_df = pd.DataFrame.from_dict(gdiv_per_gene[subpop], orient='index').reset_index(drop=False)
        gdiv_df.columns = ['GeneID', (subpop+"_SeqDiv")]
        gdiv_dfs[subpop] = gdiv_df

    # Combine tajimas d and sequence diversity for each sample
    tajdall = reduce(my_reduce, tajd_dfs.values())
    gdivall = reduce(my_reduce, gdiv_dfs.values())
    tajdall['chrom'] = chrom
    gdivall['chrom'] = chrom

    if pbs is True:
        # PBS store as dataframes 
        pbs_dfs = {}
        for pbscomp in pbscomps:
            pbs_df = pd.DataFrame.from_dict(pbs_per_gene[pbscomp], orient='index').reset_index(drop=False)
            pbs_df.columns = ['GeneID', (pbscomp+"PBS")]
            pbs_dfs[pbscomp] = pbs_df

        pbs_allcomparisons = reduce(my_reduce, pbs_dfs.values())
        pbs_allcomparisons['chrom'] = chrom

    fstbychrom[chrom] = reduce(lambda left, right: pd.merge(left,right, on=['GeneID'],
                                                how='inner'), [fst_allcomparisons, gene_names, ndf, posdf])
    tajdbychrom[chrom] = reduce(lambda left, right: pd.merge(left,right, on=['GeneID'],
                                                how='inner'), [tajdall, gene_names, ndf,posdf])
    gdivbychrom[chrom] = reduce(lambda left, right: pd.merge(left,right, on=['GeneID'],
                                                how='inner'), [gdivall, gene_names, ndf, posdf])
    dxybychrom[chrom] = reduce(lambda left, right: pd.merge(left, right, on=['GeneID'],
                                                how='inner'), [dxy_allcomparisons, gene_names, ndf, posdf])
    if pbs is True:
        pbsbychrom[chrom] = reduce(lambda left,right: pd.merge(left,right,on=['GeneID'],
                                                    how='inner'), [pbs_allcomparisons, gene_names, ndf, posdf])


## Concatenate chromosome dfs to one big data frame and remove duplicates
fstall = pd.concat(fstbychrom.values(), ignore_index=True).drop_duplicates()
tajdall = pd.concat(tajdbychrom.values(), ignore_index=True).drop_duplicates()
gdivall = pd.concat(gdivbychrom.values(), ignore_index=True).drop_duplicates()
dxyall = pd.concat(dxybychrom.values(), ignore_index=True).drop_duplicates()

# Write to csv
fstall.to_csv(f"results/variantAnalysis/selection/FstPerGene.tsv", index=False, sep="\t")
tajdall.to_csv(f"results/variantAnalysis/selection/TajimasDPerGene.tsv", index=False, sep="\t")
gdivall.to_csv(f"results/variantAnalysis/diversity/SequenceDivPerGene.tsv", index=False, sep="\t")
dxyall.to_csv(f"results/variantAnalysis/diversity/DxyPerGene.tsv", index=False, sep="\t")

if pbs is True:
    pbsall = pd.concat(pbsbychrom.values(), ignore_index=True).drop_duplicates()
    pbsall.to_csv(f"results/variantAnalysis/selection/PbsPerGene.tsv", index=False, sep="\t")
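
The per-gene Fst above comes from allel.average_hudson_fst, which averages Hudson's Fst over blocks and returns a block-jackknife standard error alongside it; with blen=1, as in the script, each SNP is its own block. A toy sketch with invented allele counts:

import allel

# allele counts at 4 SNPs (rows) for two hypothetical populations
ac1 = [[10, 0], [8, 2], [5, 5], [0, 10]]
ac2 = [[0, 10], [2, 8], [5, 5], [10, 0]]

fst, se, vb, vj = allel.average_hudson_fst(ac1, ac2, blen=1)
print(f"Fst = {fst:.2f} +/- {se:.2f}")
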
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")


#' Finds genes that are differentially expressed in the same direction
#' across two DE comparisons. For example, a gene may be significantly
#' upregulated both in the control v intermediate phenotype comparison
#' and in the intermediate v full phenotype comparison.

library(data.table)
library(tidyverse)
library(glue)

# read contrasts and significance/fold-change thresholds from snakemake params
comps = snakemake@params['comps'][[1]]
padj_threshold = snakemake@params['pval']
upper_fc = snakemake@params['fc']
lower_fc = 1/as.numeric(upper_fc) # if someone wants a FC threshold of 2, need to have lower threshold of 0.5.

for (cont in comps){
  sus = str_split(cont, "_")[[1]][1]           # get control/susceptible name (such as Kisumu)
  intermediate = str_split(cont, "_")[[1]][2]     # get intermediate
  res = str_split(cont, "_")[[1]][3]              # get last of string, which is resistant/case 

  #### Gene diff ####
  one = fread(glue("results/genediff/{intermediate}_{res}.csv"))
  up1  = one %>% filter(FC > upper_fc, padj < padj_threshold)
  down1 = one %>% filter(FC < lower_fc, padj < padj_threshold)

  two = fread(glue("results/genediff/{sus}_{intermediate}.csv"))
  up2  = two %>% filter(FC > upper_fc, padj < padj_threshold)
  down2 = two %>% filter(FC < lower_fc, padj < padj_threshold)

  intersectdown = inner_join(down1, down2, by="GeneID", suffix=c("_field", "_lab"))
  intersectup = inner_join(up1, up2, by="GeneID", suffix=c("_field", "_lab"))

  fwrite(intersectup, glue("results/genediff/{sus}_{intermediate}_{res}.up.progressive.tsv"), sep="\t")
  fwrite(intersectdown, glue("results/genediff/{sus}_{intermediate}_{res}.down.progressive.tsv"), sep="\t")

  #### Isoforms #### 

  one = fread(glue("results/isoformdiff/{intermediate}_{res}.csv"))
  up1  = one %>% filter(FC > upper_fc, qval < padj_threshold)
  down1 = one %>% filter(FC < lower_fc, qval < padj_threshold)

  two = fread(glue("results/isoformdiff/{sus}_{intermediate}.csv"))
  up2  = two %>% filter(FC > upper_fc, qval < padj_threshold)
  down2 = two %>% filter(FC < lower_fc, qval < padj_threshold)

  intersectdown = inner_join(down1, down2, by="TranscriptID", suffix=c("_field", "_lab"))
  intersectup = inner_join(up1, up2, by="TranscriptID", suffix=c("_field", "_lab"))

  fwrite(intersectup, glue("results/isoformdiff/{sus}_{intermediate}_{res}.up.progressive.tsv"), sep="\t")
  fwrite(intersectdown, glue("results/isoformdiff/{sus}_{intermediate}_{res}.down.progressive.tsv"), sep="\t")
}


sessionInfo()
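
The reciprocal threshold above keeps the up and down filters symmetric: with upper_fc = 2, a gene that halves its expression (FC = 0.5) is treated as exactly as extreme as one that doubles. A toy pandas sketch of the filtering step (gene IDs and values are invented):

import pandas as pd

upper_fc = 2.0
lower_fc = 1 / upper_fc  # 0.5

de = pd.DataFrame({"GeneID": ["g1", "g2", "g3"],
                   "FC": [2.5, 0.4, 1.1],
                   "padj": [0.01, 0.02, 0.90]})
up = de[(de.FC > upper_fc) & (de.padj < 0.05)]    # g1
down = de[(de.FC < lower_fc) & (de.padj < 0.05)]  # g2
print(up.GeneID.tolist(), down.GeneID.tolist())
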
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")


#' This script uses the sleuth R package to perform differential isoform analysis.
#' This will be performed for each contrast specified in the config.
#' Sleuth uses the bootstraps of kallisto to estimate uncertainty.

## Differential expression
library(tidyverse)
library(sleuth)
library(data.table)
library(ggrepel)
library(openxlsx)
library(glue)
library(RColorBrewer)
library(EnhancedVolcano)

print("------------- Kallisto - Sleuth - RNASeq isoform Differential expression ---------")
#### read data ####

metadata = fread(snakemake@input[['metadata']], sep="\t") %>% 
  as.data.frame() %>% 
  dplyr::rename('sample' = "sampleID")

#add path column for sleuth object
metadata$path = paste0("results/quant/", metadata$sample)

#read gene-to-transcript mapping for annotating results
gene_names = fread(snakemake@input[['genes2transcripts']], sep="\t") %>% select(-GeneID)

#contrasts
contrastsdf = data.frame("contrast" = snakemake@params[['DEcontrasts']])
contrasts = contrastsdf$contrast

######### subset data and run DESeq for each combo we need, store in xlsx ########
results_list = list()
names_list = list()
#loop through each chosen contrast and perform DE 
for (cont in contrasts){
  control = str_split(cont, "_")[[1]][1] #get first of string, which is control (or susceptible)
  case = str_split(cont, "_")[[1]][2] #get second of string, which is case (or resistant)
  controls = which(metadata$treatment %in% control) #get indices of case/control
  cases = which(metadata$treatment %in% case)

  ##subset to subcounts of our contrast
  subsamples = metadata[c(controls, cases),]

  #make treatment a factor with the 'susceptible' as reference
  subsamples$treatment = as.factor(as.character(subsamples$treatment))
  subsamples$treatment = relevel(subsamples$treatment, control)

  print(glue("Running Sleuth differential isoform expression analysis on {cont}"))
  so = sleuth_prep(subsamples, extra_bootstrap_summary = TRUE, 
                    transformation_function = function(x) log2(x + 0.5))
  so = sleuth_fit(so, ~treatment, 'full')
  #fit reduced model necessary for isoform analysis
  so = sleuth_fit(so, ~1, 'reduced')
  #Wald test on the treatment coefficient
  so = sleuth_wt(so, which_beta = paste0("treatment",case))

  results = sleuth_results(so, test =paste0("treatment",case), 'reduced:full', show_all = FALSE) %>% 
    dplyr::rename("TranscriptID" = "target_id") %>% 
    dplyr::mutate("FC" = (2^b))

  # Join DE results with normal gene names
  results = unique(left_join(results, gene_names))
  fwrite(results, glue("results/isoformdiff/{cont}.csv")) # write to csv 

  # Store names of contrasts and results tables for xlsx report sheets
  results_list[[cont]] = results
  names_list[[cont]] = cont

    # volcano plot for each comparison, using EnhancedVolcano. First make a vector of labels: transcript IDs unless a gene name exists
  labels = results %>% dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ TranscriptID,
                                     is.na(GeneName) ~ TranscriptID,
                                     TRUE ~ GeneName)) %>% select(Gene_name) %>% deframe()
  pdf(glue("results/isoformdiff/Volcano_plot_{cont}.pdf"))
  print(EnhancedVolcano(results_list[[cont]],
                  lab=labels,
                  x='b',
                  y='pval',
                  title = cont))
  garbage = dev.off()
  print(glue("{cont} complete!"))
}

#### write to excel file on diff sheets #### 
sheets = unlist(names_list)
wb <- createWorkbook("Workbook")

for (i in 1:length(sheets)){
  addWorksheet(wb, glue("{sheets[[i]]}"))
  writeData(wb, sheets[i], results_list[[i]], rowNames = FALSE, colNames = TRUE)
}
#### save workbook to disk once all worksheets and data have been added ####
saveWorkbook(wb,file=snakemake@output[['xlsx']], overwrite = TRUE)

sessionInfo()
import sys
sys.stderr = open(snakemake.log[0], "w")

import pandas as pd
import numpy as np
import allel
import rnaseqpoptools as rnaseqpop

# Read in parameters from snakemake
dataset = snakemake.params['dataset']
metadata = pd.read_csv(snakemake.input['metadata'], sep="\t")
metadata = metadata.sort_values(by='species')
chroms = snakemake.params['chroms']
ploidy = snakemake.params['ploidy']
numbers = rnaseqpop.get_numbers_dict(ploidy)
qualflt = snakemake.params['qualflt']
missingprop = snakemake.params['missingprop']
gffpath = snakemake.input['gff']

# load gff
features = allel.gff3_to_dataframe(gffpath,
                        attributes=["ID", "Parent", "description"])

# define functions
def isnotmissing(gn):
    """
    Returns True if no genotype calls are missing at a SNP
    """
    return((gn != -1).all())

#initialise dicts
snpsPerGff = {}
total_snps_per_chrom = {}
snps_per_gene_allchroms = {}
snpeffdict = {}

seqdivdictchrom = {}
thetadictchrom = {}
coefdictchrom= {}

for i, chrom in enumerate(chroms):

    # Read in and Filter VCF
    path = f"results/variantAnalysis/vcfs/{dataset}.{chrom}.vcf.gz"
    vcf, geno, acsubpops, pos, depth, snpeff, subpops, populations = rnaseqpop.readAndFilterVcf(path=path,
                                                           chrom=chrom,
                                                           samples=metadata,
                                                           numbers=numbers,
                                                           ploidy=ploidy,
                                                           qualflt=qualflt,
                                                           missingfltprop=missingprop)
    # Store total SNPs per chromosome
    # And summaries from SnpEff
    total_snps_per_chrom[chrom] = geno.shape[0]
    snpeffdict[chrom] = snpeff[1].value_counts(normalize=True)

    # Plot SNP density
    rnaseqpop.plot_density(pos, 
                window_size=100000, 
                title=f"Variant Density | {dataset} | Chromosome {chrom}", 
                path=f"results/variantAnalysis/diversity/{dataset}_SNPdensity_{chrom}.svg")

    ######## SNP counts per gene ########
    # Subset GFF to appropriate chromosome
    gff = features.query("seqid == @chrom").sort_values('start').reset_index(drop=True)
    genes = gff.query("type == 'gene'")
    exons = features.query("type == 'exon'").reset_index(drop=True)

    ## Proportion SNPs per GFF feature
    snpsPerGff[chrom] = rnaseqpop.getSNPGffstats(gff,  pos)
    snpsPerGff[chrom]['chromosome'] = chrom

    ## Calculate missing SNPs per sample, SNPs per gene etc
    snpsnotmissing = {}
    missing_array = {}
    snps_per_gene = {}
    snps_per_sample = {}

    # For each sample in metadata, compress the genotype array and check if data is missing
    for sample in metadata['sampleID']:

        bool_ = sample == populations
        gn = geno.compress(bool_, axis=1)

        res = map(isnotmissing, gn[:,0])
        missing_array[sample] = np.fromiter(res, dtype=bool)
        snpsnotmissing[sample] = missing_array[sample].sum()

    # For each gene, find out how many SNPs per gene 
    for i, gene in genes.iterrows():
        # locate_ranges() to get a boolean, needed as locate_range() will throw errors if no snps found in gene
        gene_bool = pos.locate_ranges([gene['start']], [gene['end']], strict=False)

        for sample in metadata['sampleID']:
            presentSNPs = missing_array[sample].compress(gene_bool, axis=0) # subset to each gene
            snps_per_sample[sample] = presentSNPs.sum()

        snps_per_gene[gene['ID']] = dict(snps_per_sample)

    snps_per_gene_allchroms[chrom] = pd.DataFrame.from_dict(rnaseqpop.flip_dict(snps_per_gene))

snpsPerGff = pd.concat(snpsPerGff).reset_index().rename(columns={'level_1':'feature'})
snpsPerGff = snpsPerGff.groupby('feature').agg({'ReferenceGenomeProportion': 'mean', 'called':'sum', 'proportion': 'mean', 'total':'sum'})
snpsPerGff.to_csv(snakemake.output['snpsPerGenomicFeature'], sep="\t", index=True)

# Total SNPs per chrom, all samples
totalsnpsdf = pd.DataFrame.from_dict(total_snps_per_chrom, orient='index', columns=['Total_SNPs'])
totalsnpsdf.to_csv(f"results/variantAnalysis/SNPstats/totalSNPs.tsv", sep="\t", index=True)

# SNPcount per gene
snpcountsdf = pd.concat(snps_per_gene_allchroms)  
snpcountsdf.to_csv("results/variantAnalysis/SNPstats/nSNPsPerGene.tsv", sep="\t", index=True)
genesNoSNPs = pd.DataFrame((snpcountsdf == 0).sum(axis=0), columns=['Genes with zero SNPs'])
genesNoSNPs.to_csv("results/variantAnalysis/SNPstats/nGenesZeroSNPs.tsv", sep="\t", index=True)

# snpEff summary
snpeffdf = pd.concat(snpeffdict)
snpeffdf.to_csv("results/variantAnalysis/SNPstats/snpEffProportions.tsv", sep="\t", index=True)
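
The missingness bookkeeping above hand-rolls the per-sample check; scikit-allel also exposes it directly via GenotypeArray.is_missing() (-1 encodes a missing allele call). A toy sketch, not the workflow's code:

import allel

g = allel.GenotypeArray([[[0, 1], [-1, -1]],
                         [[0, 0], [0, 1]],
                         [[-1, -1], [1, 1]]])  # 3 SNPs x 2 samples
called = ~g.is_missing()  # boolean array, shape (variants, samples)
print(called.sum(axis=0))  # non-missing SNPs per sample: [2 2]
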
import sys
sys.stderr = open(snakemake.log[0], "w")
import pandas as pd
import numpy as np
import allel
from collections import defaultdict
import rnaseqpoptools as rnaseqpop

# Read in parameters from snakemake
dataset = snakemake.params['dataset']
metadata = pd.read_csv(snakemake.input['metadata'], sep="\t")
metadata = metadata.sort_values(by='species')
chroms = snakemake.params['chroms']
ploidy = snakemake.params['ploidy']
numbers = rnaseqpop.get_numbers_dict(ploidy)
qualflt = snakemake.params['qualflt']
missingprop = snakemake.params['missingprop']


#initialise dicts
total_snps_per_chrom = {}
snps_per_gene_allchroms = {}
snpeffdict = {}

seqdivdictchrom = {}
thetadictchrom = {}
coefdictchrom = {}
allcoef = defaultdict(list)  # accumulates inbreeding coefficients across chromosomes

for i, chrom in enumerate(chroms):

    # Read in and Filter VCF
    path = f"results/variantAnalysis/vcfs/{dataset}.{chrom}.vcf.gz"
    vcf, geno, acsubpops, pos, depth, snpeff, subpops, populations = rnaseqpop.readAndFilterVcf(path=path,
                                                           chrom=chrom,
                                                           samples=metadata,
                                                           numbers=numbers,
                                                           ploidy=ploidy,
                                                           qualflt=qualflt,
                                                           missingfltprop=missingprop)


    #### Genome-wide statistics (sequence diversity, Wattersons theta, inbreeding coefficient) ####
    seqdivdict = {}
    thetadict = {}
    coefdict = {}

    for pop in metadata['treatment'].unique():

        # Sequence diversity 
        seqdivdict[pop] = allel.sequence_diversity(pos, acsubpops[pop])

        # Wattersons theta
        thetadict[pop] = allel.watterson_theta(pos, acsubpops[pop])

        # Inbreeding coefficient
        if ploidy > 1:
            gn = geno.take(subpops[pop], axis=1)
            coef = allel.moving_statistic(gn, statistic=allel.inbreeding_coefficient, 
                                                size=1000, step=100)
            coef = np.nanmean(coef, axis=1)
            coefdict[pop] = np.mean(coef)
            allcoef[pop].append(np.array(coef))

        print(f"{pop} | {chrom} | Nucleotide Diversity (Pi) =", seqdivdict[pop])
        print(f"{pop} | {chrom} | Wattersons Theta =", thetadict[pop])
        if ploidy > 1: print(f"{pop} | {chrom} | Inbreeding Coef =", np.mean(coef), "\n")

    seqdivdictchrom[chrom] = dict(seqdivdict)
    thetadictchrom[chrom] = dict(thetadict)
    if ploidy > 1: coefdictchrom[chrom] = dict(coefdict)

seqdivdictchrom= rnaseqpop.flip_dict(seqdivdictchrom)
thetadictchrom = rnaseqpop.flip_dict(thetadictchrom)
if ploidy > 1: coefdictchrom = rnaseqpop.flip_dict(coefdictchrom)

# Get stats per chromosome and plot heatmap
pidf = pd.DataFrame.from_dict(seqdivdictchrom)
pidf.to_csv("results/variantAnalysis/diversity/SequenceDiversity.tsv", sep="\t", index=True)
rnaseqpop.plotRectangular(pidf.round(4), path="results/variantAnalysis/diversity/piPerChrom.svg", ylab="Chromosome", xlab="Treatment", figsize=[5,5], title=r'$\pi$')
thetadf = pd.DataFrame.from_dict(thetadictchrom)
rnaseqpop.plotRectangular(thetadf.round(4), path="results/variantAnalysis/diversity/thetaPerChrom.svg", ylab="Chromosome", xlab="Treatment", figsize=[5,5], title=r'$\theta$')
thetadf.to_csv("results/variantAnalysis/diversity/WattersonsTheta.tsv", sep="\t", index=True)

thetamean = thetadf.apply(np.mean, axis=0).round(4)
pimean = pidf.apply(np.mean, axis=0).round(4)
summaryStats = pd.DataFrame({r'$\theta$':thetamean, r'$\pi$':pimean})
rnaseqpop.plotRectangular(summaryStats, path="results/variantAnalysis/diversity/pi_theta.overall.svg", ylab="", xlab="Statistic", figsize=[5,5], rotate=False)

theta = pd.DataFrame(summaryStats.iloc[:,0]).round(4)
pi = pd.DataFrame(summaryStats.iloc[:,1]).round(4)
rnaseqpop.plotTwoRectangular(pi, True, theta, True, path="results/variantAnalysis/diversity/piThetaBoth.svg", cmap="Greys", figsize=[5,5], annotFontsize=20, ytickfontsize=12, ylab="")

if ploidy > 1: pd.DataFrame.from_dict(coefdictchrom).to_csv("results/variantAnalysis/diversity/inbreedingCoef.tsv", sep="\t", index=True)

# Get genome wide average stats
if ploidy > 1:
    for pop in allcoef.keys():
        allcoef[pop] = np.nanmean(allcoef[pop])

    coefdf = pd.DataFrame.from_dict(allcoef, orient='index', columns=['InbreedingCoefficient'])
    coefdf.to_csv(f"results/variantAnalysis/diversity/inbreedingCoef.mean.tsv", sep="\t", index=True)
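
For reference, Watterson's theta as computed by allel.watterson_theta is the number of segregating sites S divided by the harmonic number a_n = sum_{i=1}^{n-1} 1/i, expressed per accessible base. A toy sketch with invented positions and allele counts (n = 10 chromosomes):

import allel

pos = [5, 10, 20, 50]                   # SNP positions
ac = [[8, 2], [9, 1], [5, 5], [10, 0]]  # the last site is not segregating
theta = allel.watterson_theta(pos, ac)
print(f"Watterson's theta per base = {theta:.4f}")
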
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(tidyverse)
library(data.table)
library(glue)
library(openxlsx)

#### allele imbalance ####
metadata = fread(snakemake@input[['metadata']], sep="\t")
samples = metadata$sampleID
# Read IR mutation data 
mutation_data = fread(snakemake@input[['mutations']], sep="\t")

all_list = list()
mean_list = list()

# Loop through each mutation, finding allele coverage at each position
for (m in mutation_data$Name){

  base = mutation_data[mutation_data$Name == m]$ALT
  propstring = glue("proportion{base}")
  base2 = mutation_data[mutation_data$Name == m]$ALT2 # second ALT allele, if present
  propstring2 = glue("proportion{base2}")

  #### load allele balance data ####
  allele_list = list()
  # read data for each sample and subset to what we want
  for (sample in samples){
    allele_list[[sample]] = fread(glue("results/variantAnalysis/variantsOfInterest/counts/{sample}_{m}_allele_counts.tsv"))[,c(1:8)] #for each sample read data, and store first 8 columns 
    allele_list[[sample]]$sample = sample                                            #add sample column
    allele_list[[sample]]$treatment = metadata$treatment[samples == sample]         #add treatment column
    allele_list[[sample]]$mutation = m
    allele_list[[sample]]$gene = mutation_data[mutation_data$Name == m]$Gene

    cover = allele_list[[sample]] %>% select(A,C,G,T) %>% rowSums()
    allele_list[[sample]] = allele_list[[sample]] %>% mutate(!!propstring := (!!sym(base))/cover) #new column, proportion of Alts to total.  

    if (!base2 %in% c("", NA)){
      allele_list[[sample]] = allele_list[[sample]] %>% mutate(!!propstring2 := (!!sym(base2))/cover) #new column, proportion of Alts to total.  
    }
  }

  # Bind the per-sample dataframes in the list into one dataframe
  alleles = rbindlist(allele_list, fill = TRUE)

  # now lets calculate population means and lower and upper CIs
  alleles_per_pop_list = list()
  for (pop in unique(metadata$treatment)){

    alleles_per_pop = alleles %>% filter(treatment == pop) 

    pop_prop = sum(alleles_per_pop[, ..base])/ sum(alleles_per_pop[, 'cov'])
    error = sqrt((pop_prop*(1-pop_prop))/nrow(alleles_per_pop))*1.96
    lower = pmax(pop_prop - error, 0)
    upper = pmin(pop_prop + error, 1)

    alleles_per_pop_list[[pop]] = alleles_per_pop %>% mutate(!!propstring := pop_prop, lowerCI = lower, upperCI = upper)

    # repeat for the second ALT allele, if present
    if (!base2 %in% c("", NA)){
      pop_prop2 = sum(alleles_per_pop[, ..base2])/ sum(alleles_per_pop[, 'cov'])
      error2 = sqrt((pop_prop2*(1-pop_prop2))/nrow(alleles_per_pop))*1.96
      lower2 = pmax(pop_prop2 - error2, 0)
      upper2 = pmin(pop_prop2 + error2, 1)

      alleles_per_pop_list[[pop]] = alleles_per_pop_list[[pop]] %>% mutate(!!propstring2 := pop_prop2, lowerCI_2 = lower2, upperCI_2 = upper2)
    }
  }

  mean_alleles = rbindlist(alleles_per_pop_list, fill=TRUE)

  # average across replicates
  if (!base2 %in% c("", NA)){
    mean_alleles = mean_alleles %>% 
      group_by(chrom, pos, ref, mutation, treatment, !!sym(propstring), lowerCI, upperCI, !!sym(propstring2), lowerCI_2, upperCI_2) %>% 
      summarise_at(.vars = c("cov","A","C","G","T"), .funs = c(mean="mean"), na.rm = TRUE) %>% 
      select(chrom, pos, ref, mutation, treatment, cov_mean, A_mean, C_mean, G_mean, T_mean, lowerCI, upperCI, !!propstring, !!propstring2, lowerCI_2, upperCI_2)
  } else {
    mean_alleles = mean_alleles %>% 
      group_by(chrom, pos, ref, mutation, treatment, !!sym(propstring), lowerCI, upperCI) %>% 
      summarise_at(.vars = c("cov","A","C","G","T"), .funs = c(mean="mean"), na.rm = TRUE) %>% 
      select(chrom,pos, ref, mutation, treatment, cov_mean, A_mean, C_mean, G_mean, T_mean, lowerCI, upperCI, !!propstring)
  }

  #write per-mutation allele balance and population means to file
  fwrite(alleles, glue("results/variantAnalysis/variantsOfInterest/csvs/{m}_alleleBalance.csv"))
  fwrite(mean_alleles, glue("results/variantAnalysis/variantsOfInterest/csvs/mean_{m}_alleleBalance.csv"))

  all_list[[m]] = alleles
  mean_list[[m]] = mean_alleles
}

#### write to excel file on diff sheets #### 
results_list = all_list
sheets = unique(mutation_data$Name)
wb <- createWorkbook("Workbook")

for (i in 1:length(sheets)){
  addWorksheet(wb, glue("{sheets[[i]]}"))
  writeData(wb, sheets[i], results_list[[i]], rowNames = TRUE, colNames = TRUE)
}
#### save workbook to disk once all worksheets and data have been added ####
saveWorkbook(wb,file=snakemake@output[['alleleBalance']], overwrite = TRUE)


### mean balance ####
#### write to excel file on diff sheets #### 
results_list = mean_list
sheets = unique(mutation_data$Name)
wb <- createWorkbook("Workbook")

for (i in 1:length(sheets)){
  addWorksheet(wb, glue("{sheets[[i]]}"))
  writeData(wb, sheets[i], results_list[[i]], rowNames = TRUE, colNames = TRUE)
}
#### save workbook to disk once all worksheets and data have been added ####
saveWorkbook(wb,file=snakemake@output[['mean_alleleBalance']], overwrite = TRUE)

sessionInfo()
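
The confidence intervals above are plain Wald intervals on the pooled allele proportion, p +/- 1.96 * sqrt(p(1-p)/n), clamped to [0, 1], with n taken as the number of samples in the treatment. A toy sketch of the arithmetic (counts are invented):

import math

alt_reads, coverage, n_samples = 180, 240, 4  # hypothetical pooled counts
p = alt_reads / coverage
error = 1.96 * math.sqrt(p * (1 - p) / n_samples)
lower, upper = max(p - error, 0), min(p + error, 1)
print(f"freq = {p:.2f} (95% CI {lower:.2f}-{upper:.2f})")  # 0.75 (0.33-1.00)
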
import sys
sys.stderr = open(snakemake.log[0], "w")

import pandas as pd
import matplotlib
import rnaseqpoptools as rnaseqpop


### Variants of Interest path ###
voiPath = snakemake.input['VariantsOfInterest']
## Read VOI data
muts = pd.read_csv(voiPath, sep="\t")

## separate chrom and pos data and sort 
muts['chrom'] = muts['Location'].str.split(":").str.get(0)
muts['pos'] = muts['Location'].str.split(":").str.get(1).str.split("-").str.get(0)
muts = muts.sort_values(['chrom', 'pos'])


## Run for all samples
df, annot = rnaseqpop.getAlleleFreqTable(muts, "results/variantAnalysis/variantsOfInterest/csvs/{mut}_alleleBalance.csv", var="sample")
rnaseqpop.plotRectangular(df, annot=annot, path="results/variantAnalysis/variantsOfInterest/VOI.heatmapPerSample.svg")
rnaseqpop.plotRectangular(df, annot=annot, path="results/variantAnalysis/variantsOfInterest/VOI.heatmapPerSample.pdf")


## Run for average frequencies across treatments
df2, annot2 = rnaseqpop.getAlleleFreqTable(muts, "results/variantAnalysis/variantsOfInterest/csvs/mean_{mut}_alleleBalance.csv", var="treatment", mean_=True)
rnaseqpop.plotRectangular(df2, annot=annot2, path="results/variantAnalysis/variantsOfInterest/VOI.heatmapPerTreatment.svg", xlab="strain")
rnaseqpop.plotRectangular(df2, annot=annot2, path="results/variantAnalysis/variantsOfInterest/VOI.heatmapPerTreatment.pdf", xlab="strain")

# Join both plots
rnaseqpop.plotTwoRectangular(df, annot, df2, annot2, path="results/variantAnalysis/variantsOfInterest/VOI.heatmapBothPlots.svg", ratio='auto')
rnaseqpop.plotTwoRectangular(df, annot, df2, annot2, path="results/variantAnalysis/variantsOfInterest/VOI.heatmapBothPlots.pdf", ratio='auto')
import sys
sys.stderr = open(snakemake.log[0], "w")

import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
from matplotlib_venn import *
import pandas as pd
import numpy as np
from pathlib import Path

def plotvenn2(name, group1, group2, nboth,stat="DE_PBS", group1name='Significant up DE genes', group2name='High PBS'):

    print(f"There are {group2} high Fst genes in {name}") 
    print(f"There are {nboth} shared in {name}")

    venn2(subsets = (group1, group2, nboth), set_labels = (group1name, group2name), 
          set_colors=('r', 'g'), 
          alpha = 0.5);
    venn2_circles(subsets = (group1, group2, nboth))
    plt.title(f"{name}")
    plt.savefig(f"results/venn/{name}_{stat}.venn.png")
    plt.close()

def intersect2(one, two, df, write=True, path=None):
    """Intersect two GeneID lists and subset df to the shared genes."""
    inter = [x for x in list(one.GeneID) if x in list(two.GeneID)]
    length = len(inter)
    intersected_df = df[df.GeneID.isin(inter)]
    if write:
        intersected_df.to_csv(f"{path}", sep="\t")

    return(length, intersected_df)


def add_columns_xlsx(name, de, fst, highfst, diffsnps, diffsnpsdf=None):

    rnaxlsx = pd.read_excel("results/genediff/RNA-Seq_diff.xlsx", 
                       sheet_name=name)

    highfst_bool = de.GeneID.isin(highfst.GeneID).astype(str)

    rnaxlsx['HighFst'] = highfst_bool

    if diffsnps:
        diffsnps_bool = de.GeneID.isin(diffsnpsdf.GeneID).astype(str)
        rnaxlsx['DiffSNPs'] = diffsnps_bool

    # add column of number of SNPs
    merged = pd.merge(de, fst, how="outer")
    rnaxlsx['nSNPs'] = merged['nSNPs']

    return(rnaxlsx)


#### Main ####
# Read contrasts in and other snakemake params
comparisons = pd.DataFrame(snakemake.params['DEcontrasts'], columns=['contrast'])
comparisons = comparisons.contrast.str.split("_", expand=True)
comparisons = [list(row) for i,row in comparisons.iterrows()]

percentile = snakemake.params['percentile']
diffsnps = snakemake.params['diffsnps']
# significance and fold-change thresholds; assumed to be supplied via the rule's
# params under the same keys used by the progressive-genes script
pval_threshold = snakemake.params['pval']
upper_fc = snakemake.params['fc']
lower_fc = 1 / float(upper_fc)

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('results/RNA-Seq-full.xlsx', engine='xlsxwriter')

#### Differential expression v Fst venn diagram
for comp1,comp2 in comparisons:
    name = comp1 + "_" + comp2
    print(f"\n-------------- Venn Diagram for {name} --------------")
    de = pd.read_csv(f"results/genediff/{name}.csv")
    fst = pd.read_csv("results/variantAnalysis/selection/FstPerGene.tsv", sep="\t")
    # compare significantly DE genes with the top percentile of Fst genes
    # get significantly up- and downregulated genes
    sigde = de[de['padj'] < pval_threshold]
    sigde_up = sigde[sigde['FC'] > upper_fc]
    sigde_down = sigde[sigde['FC'] < lower_fc]

    #take top percentile of fst genes
    highfst = fst.nlargest(int(fst.shape[0]*percentile),f"{name}_zFst")

    #how many fst? how many sig de up and down?
    nfst = highfst.shape[0]
    nde_up = sigde_up.shape[0]
    nde_down = sigde_down.shape[0]

    print(f"There are {nde_up} significantly upregulated genes in {name}") 
    print(f"There are {nde_down} significantly downregulated genes in {name}")

    nboth, _ = intersect2(sigde_up, 
               highfst, 
               de, 
               write=True, 
               path=f"results/venn/{name}.DE.Fst.intersection.tsv")


    ###### XLSX file ######
    if diffsnps:
        diffsnpsDE = pd.read_csv(f"results/diffsnps/{name}.sig.kissDE.tsv", sep="\t")
        sheet = add_columns_xlsx(name, de, fst, highfst, diffsnps, diffsnpsDE)
    else:
        sheet = add_columns_xlsx(name, de, fst, highfst, diffsnps, diffsnpsDE=None)

    # Write each dataframe to a different worksheet.
    sheet.to_excel(writer, sheet_name=name)

# Close the Pandas Excel writer and output the Excel file.
writer.save()
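
Both selections in the loop above reduce to one-liners: the top Fst percentile via nlargest, and the DE/Fst overlap via isin. A toy sketch (gene IDs are invented, and the 50% percentile is chosen only for illustration):

import pandas as pd

fst = pd.DataFrame({"GeneID": ["g1", "g2", "g3", "g4"],
                    "zFst": [3.2, 0.1, 2.9, -0.5]})
sigde_up = pd.DataFrame({"GeneID": ["g1", "g4"]})

highfst = fst.nlargest(int(fst.shape[0] * 0.5), "zFst")   # top 50% of genes by zFst
overlap = sigde_up[sigde_up.GeneID.isin(highfst.GeneID)]  # genes in both sets
print(overlap.GeneID.tolist())  # ['g1']
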
import sys
sys.stderr = open(snakemake.log[0], "w")
import pandas as pd
import numpy as np
import allel
import rnaseqpoptools as rnaseqpop

dataset = snakemake.params['dataset']
metadata = pd.read_csv(snakemake.input['metadata'], sep="\t")
metadata = metadata.sort_values(by='species')
chroms = snakemake.params['chroms']
ploidy = snakemake.params['ploidy']
numbers = rnaseqpop.get_numbers_dict(ploidy)
pbs = snakemake.params['pbs']
pbscomps = snakemake.params['pbscomps']
qualflt = snakemake.params['qualflt']
missingprop = snakemake.params['missingprop']

#Fst/PBS window size
windowsizes = snakemake.params['windowsizes']
windowsteps = snakemake.params['windowsteps']
windownames = snakemake.params['windownames']

# Read in list of contrasts
comparisons = pd.DataFrame(snakemake.params['DEcontrasts'], columns=['contrast'])
comparisons = comparisons.contrast.str.split("_", expand=True)
comparisons.columns = ['sus', 'res']
comparisons = [list(row) for i,row in comparisons.iterrows()]

for i, chrom in enumerate(chroms):

    path = f"results/variantAnalysis/vcfs/{dataset}.{chrom}.vcf.gz"
    vcf, geno, acsubpops, pos, depth, snpeff, subpops, populations = rnaseqpop.readAndFilterVcf(path=path,
                                                           chrom=chrom,
                                                           samples=metadata,
                                                           numbers=numbers,
                                                           ploidy=ploidy,
                                                           qualflt=qualflt,
                                                           missingfltprop=missingprop)

    #### Fst in windows #### 
    for sus, res in comparisons:
        name = sus + "_" + res
        cohortText = f"{sus} v {res}"
        print(f"Calculating Fst values in sliding windows for {name}\n")

        for wname, size, step in zip(windownames, windowsizes, windowsteps):
            FstArray = allel.moving_hudson_fst(acsubpops[sus], 
                            acsubpops[res], 
                            size=size, step=step)
            midpoint = allel.moving_statistic(pos, np.median, size=size, step=step)
            print(FstArray)
            print(midpoint)

            cohortNoSpaceText = name + "." + wname
            rnaseqpop.plotWindowed(statName="Fst",
                        cohortText=cohortText,
                        cohortNoSpaceText=cohortNoSpaceText,
                        values=FstArray, 
                        midpoints=midpoint,
                        colour='dodgerblue',
                        prefix="results/variantAnalysis/selection/fst", 
                        chrom=chrom, 
                        ylim=0.5, 
                        save=True)


    #### Population Branch Statistic (PBS) in windows ####
    if pbs:
        for pbscomp in pbscomps:
            pop1, pop2, outpop = pbscomp.split("_")
            cohortText = f"(({pop1}, {pop2}), {outpop})"
            print(f"Calculating PBS values in sliding window for {pbscomp}\n")

            for wname, size, step in zip(windownames, windowsizes, windowsteps):
                pbsArray = allel.pbs(acsubpops[pop1], 
                                acsubpops[pop2], 
                                acsubpops[outpop], 
                                window_size=size, window_step=step, normed=True)
                midpoint = allel.moving_statistic(pos, np.median, size=size, step=step)

                cohortNoSpaceText = pbscomp + "." + wname
                rnaseqpop.plotWindowed(statName="PBS", 
                            cohortText=cohortText,
                            cohortNoSpaceText=cohortNoSpaceText,
                            values=pbsArray, 
                            midpoints=midpoint, 
                            colour='dodgerblue',
                            prefix="results/variantAnalysis/selection/pbs",
                            chrom=chrom, 
                            ylim=0.5, 
                            save=True)
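
The windowed scan pairs each moving-window Fst value with a window midpoint computed by the same moving statistic over SNP positions, so the two arrays align one-to-one. A toy sketch with random genotypes (window size and step here are illustrative, not the workflow's configured values):

import numpy as np
import allel

pos = np.arange(1, 101) * 1000  # 100 evenly spaced SNP positions
rng = np.random.default_rng(0)
g = allel.GenotypeArray(rng.integers(0, 2, size=(100, 12, 2)))
ac1 = g.take(np.arange(6), axis=1).count_alleles()      # first 6 samples
ac2 = g.take(np.arange(6, 12), axis=1).count_alleles()  # last 6 samples

fst = allel.moving_hudson_fst(ac1, ac2, size=25, step=10)
mid = allel.moving_statistic(pos, np.median, size=25, step=10)
print(len(fst), len(mid))  # one Fst value per window, aligned to its midpoint
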
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")
__author__ = "Joël Simoneau"
__copyright__ = "Copyright 2019, Joël Simoneau"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")

# Allowing for multiple FASTA files
fasta = snakemake.input.get("fasta")
assert fasta is not None, "input-> a FASTA-file is required"
fasta = " ".join(fasta) if isinstance(fasta, list) else fasta

shell(
    "kallisto index "  # Tool
    "{extra} "  # Optional parameters
    "--index={snakemake.output.index} "  # Output file
    "{fasta} "  # Input FASTA files
    "{log}"  # Logging
)
__author__ = "Joël Simoneau"
__copyright__ = "Copyright 2019, Joël Simoneau"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")

# Allowing for multiple FASTQ files
fastq = snakemake.input.get("fastq")
assert fastq is not None, "input-> a FASTQ-file is required"
fastq = " ".join(fastq) if isinstance(fastq, list) else fastq

shell(
    "kallisto quant "  # Tool
    "{extra} "  # Optional parameters
    "--threads={snakemake.threads} "  # Number of threads
    "--index={snakemake.input.index} "  # Input file
    "--output-dir={snakemake.output} "  # Output directory
    "{fastq} "  # Input FASTQ files
    "{log}"  # Logging
)
__author__ = "Christopher Preusch"
__copyright__ = "Copyright 2017, Christopher Preusch"
__email__ = "cpreusch[at]ust.hk"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools flagstat {snakemake.input[0]} > {snakemake.output[0]}")
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)

    return base


# Run fastqc in a temporary directory, since there can be race conditions
# if multiple jobs write to the same fastqc output directory.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
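
MultiQC is pointed at the directories containing the QC outputs, which is why the script collapses the inputs to their dirnames. An illustrative rule (the sample list and paths are placeholders):

rule multiqc:
    input:
        expand("results/qc/fastqc/{sample}_1_fastqc.zip", sample=["A", "B"]),
    output:
        "results/qc/multiqc.html",
    params:
        "",  # extra multiqc flags
    log:
        "logs/multiqc.log",
    wrapper:
        "v1.25.0/bio/multiqc"
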
__author__ = "Michael Chambers"
__copyright__ = "Copyright 2019, Michael Chambers"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools faidx {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]}")
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

assert (
    extra != "" or adapters != ""
), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='."

shell(
    "cutadapt"
    " {adapters}"
    " {extra}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " -j {snakemake.threads}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}"
)
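
The assertions above require exactly two input FASTQs and at least one of params.adapters or params.extra. A sketch of a matching rule; the adapter sequence shown is the standard Illumina TruSeq prefix, included purely as an example:

rule cutadapt_pe:
    input:
        ["resources/reads/{sample}_1.fastq.gz", "resources/reads/{sample}_2.fastq.gz"],
    output:
        fastq1="results/trimmed/{sample}_1.fastq.gz",
        fastq2="results/trimmed/{sample}_2.fastq.gz",
        qc="results/trimmed/{sample}.qc.txt",
    params:
        adapters="-a AGATCGGAAGAGC -A AGATCGGAAGAGC",
        extra="--minimum-length 20",
    threads: 4
    log:
        "logs/cutadapt/{sample}.log",
    wrapper:
        "v1.25.0/bio/cutadapt/pe"
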
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.bcftools import get_bcftools_opts

bcftools_opts = get_bcftools_opts(snakemake)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "bcftools view {bcftools_opts} "
    "{extra} "
    "{snakemake.input[0]} "
    "-o {snakemake.output} "
    "{log}"
)
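
The get_bcftools_opts helper assembles common bcftools options from the rule itself, typically threads and an output type based on the declared output's extension, so a .vcf.gz output yields compressed VCF without extra flags. An illustrative rule:

rule bcftools_view:
    input:
        "results/variantAnalysis/calls.bcf",
    output:
        "results/variantAnalysis/calls.vcf.gz",
    params:
        extra="",  # e.g. region or filter options
    threads: 2
    log:
        "logs/bcftools/view.log",
    wrapper:
        "v1.25.0/bio/bcftools/view"
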
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "[email protected]"
__license__ = "MIT"


import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)

log = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True)

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "gatk --java-options '{java_opts}' ApplyBQSR"
        " --input {snakemake.input.bam}"
        " --bqsr-recal-file {snakemake.input.recal_table}"
        " --reference {snakemake.input.ref}"
        " {extra}"
        " --tmp-dir {tmpdir}"
        " --output {snakemake.output.bam}"
        " {log}"
    )
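
The named inputs bam, recal_table and ref are all required, and get_java_opts derives JVM memory from the rule's resources (mem_mb). A sketch under those assumptions, with placeholder paths:

rule gatk_applybqsr:
    input:
        bam="results/alignments/{sample}.bam",
        recal_table="results/alignments/{sample}.recal.table",
        ref="resources/reference/genome.fa",
    output:
        bam="results/alignments/{sample}.bqsr.bam",
    params:
        extra="",
    resources:
        mem_mb=4096,
    log:
        "logs/gatk/applybqsr/{sample}.log",
    wrapper:
        "v1.25.0/bio/gatk/applybqsr"
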
__author__ = "Christopher Schröder"
__copyright__ = "Copyright 2020, Christopher Schröder"
__email__ = "[email protected]"
__license__ = "MIT"


import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
known = snakemake.input.get("known", "")
if known:
    if isinstance(known, str):
        known = [known]
    known = list(map("--known-sites {}".format, known))

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "gatk --java-options '{java_opts}' BaseRecalibrator"
        " --input {snakemake.input.bam}"
        " --reference {snakemake.input.ref}"
        " {known}"
        " {extra}"
        " --tmp-dir {tmpdir}"
        " --output {snakemake.output.recal_table}"
        " {log}"
    )
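
Here known may be a single VCF or a list; each entry becomes its own --known-sites flag. An illustrative rule (the known-sites file is a placeholder):

rule gatk_baserecalibrator:
    input:
        bam="results/alignments/{sample}.bam",
        ref="resources/reference/genome.fa",
        known="resources/known_sites.vcf.gz",
    output:
        recal_table="results/alignments/{sample}.recal.table",
    params:
        extra="",
    resources:
        mem_mb=4096,
    log:
        "logs/gatk/baserecalibrator/{sample}.log",
    wrapper:
        "v1.25.0/bio/gatk/baserecalibrator"
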
__author__ = "Jan Forster"
__copyright__ = "Copyright 2019, Jan Forster"
__email__ = "[email protected]"
__license__ = "MIT"

import os
import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "gatk --java-options '{java_opts}' SplitNCigarReads"
        " --reference {snakemake.input.ref}"
        " --input {snakemake.input.bam}"
        " {extra}"
        " --tmp-dir {tmpdir}"
        " --output {snakemake.output[0]}"
        " {log}"
    )
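
SplitNCigarReads splits reads spanning introns (N CIGAR operations), a standard step before variant calling on RNA-seq alignments. A minimal rule sketch with placeholder paths and an illustrative wrapper version:

rule gatk_splitncigarreads:
    input:
        bam="results/alignments/{sample}.bam",
        ref="resources/reference/genome.fa",
    output:
        "results/alignments/{sample}.split.bam",
    params:
        extra="",
    resources:
        mem_mb=4096,
    log:
        "logs/gatk/splitncigarreads/{sample}.log",
    wrapper:
        "v1.25.0/bio/gatk/splitncigarreads"
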