Clinical Whole Genome Sequencing Pipeline


genome-seek 🔬

Whole Genome Clinical Sequencing Pipeline.


This is the home of the pipeline, genome-seek. Its long-term goals: to accurately call germline and somatic variants, to infer SVs & CNVs, and to boldly annotate variants like no pipeline before!

Overview

Welcome to genome-seek! Before getting started, we highly recommend reading through genome-seek's documentation.

The ./genome-seek pipeline is composed of several inter-related sub-commands to set up and run the pipeline across different systems. Each of the available sub-commands performs a different function.

genome-seek is a comprehensive clinical WGS pipeline that is focused on speed. Each tool in the pipeline was benchmarked and selected for its low run time without sacrificing accuracy or precision. The pipeline relies on technologies like Singularity1 to maintain the highest level of reproducibility, and it consists of a series of data processing and quality-control steps orchestrated by Snakemake2, a flexible and scalable workflow management system, to submit jobs to a cluster.

The pipeline is compatible with data generated from Illumina short-read sequencing technologies. As input, it accepts a set of FastQ files and can be run locally on a compute instance, on-premises using a cluster, or on the cloud (feature coming soon!). A user can define the method or mode of execution: the pipeline can submit jobs to a cluster using a job scheduler like SLURM, or run on AWS using Tibanna (feature coming soon!). This hybrid approach ensures the pipeline is accessible to all users.
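
For orientation, a typical launch might look like the sketch below. The sub-command and option names shown are illustrative assumptions (modeled on other OpenOmics pipelines), not the authoritative interface; always check ./genome-seek -h and the usage docs for the real options.

# Illustrative sketch only: the sub-command and flags below are
# assumptions, not the documented interface (see ./genome-seek -h)
./genome-seek run \
    --input /path/to/fastqs/*.R?.fastq.gz \
    --output /path/to/genome-seek_output \
    --mode slurm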

Before getting started, we highly recommend reading through the usage section of each available sub-command.

For more information about issues or troubleshooting a problem, please check out our FAQ prior to opening an issue on GitHub.

Dependencies

Requires: singularity>=3.5 snakemake>=7.8

At the moment, the pipeline uses a mixture of environment modules and Docker images; however, this will be changing soon! In the near future, the pipeline will only use Docker images. With that being said, snakemake and singularity must be installed on the target system. Snakemake orchestrates the execution of each step in the pipeline. To guarantee the highest level of reproducibility, each step of the pipeline relies on versioned images from DockerHub. Snakemake uses singularity to pull these images onto the local filesystem prior to job execution, and as such, snakemake and singularity will be the only two dependencies in the future.
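
In practice, the only site-specific setup is making snakemake and singularity available; Snakemake handles pulling and running the containers itself. The sketch below shows the manual equivalent of what Snakemake does for a containerized step (the image name is a placeholder for illustration only, not an actual genome-seek container):

# The image below is a placeholder for illustration; the pipeline's
# real containers are versioned images hosted on DockerHub.
singularity pull docker://example/tool:1.0.0

# Run containerized rules through Singularity and cache pulled images
# under a shared prefix so each image is only downloaded once.
snakemake --use-singularity \
    --singularity-prefix "$PWD/.singularity" \
    --cores 8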

Installation

Please clone this repository to your local filesystem using the following command:

# Clone Repository from Github
git clone https://github.com/OpenOmics/genome-seek.git
# Change your working directory
cd genome-seek/
# Add dependencies to $PATH
# Biowulf users should run
module load snakemake singularity
# Get usage information
./genome-seek -h

Contribute

This site is a living document, created for and by members like you. genome-seek is maintained by the members of NCBR and is improved by continuous feedback! We encourage you to contribute new content and make improvements to existing content via pull requests to our GitHub repository.

References

1. Kurtzer GM, Sochat V, Bauer MW (2017). Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459.
2. Köster J, Rahmann S (2018). Snakemake: a scalable bioinformatics workflow engine. Bioinformatics 34(20): 3600.

Code Snippets

shell: """
vcftools \\
    --gzvcf {input.vcf} \\
    --plink \\
    --out {params.intermediate} \\
    --chr {params.peddy_chr}
cut -f1-6 {params.intermediate}.ped \\
> {output.ped}
peddy -p {threads} \\
    --prefix {params.prefix} \\
    {input.vcf} \\
    {output.ped}
"""
shell: """
# Get biological sex
# predicted by peddy 
predicted_sex=$(awk -F ',' \\
    '$8=="{params.sample}" \\
    {{print $7}}' \\
    {input.csv}
)

# Copy over base ploidy 
# vcf for predicted sex 
# and add name to header
if [ "$predicted_sex" == "male" ]; then 
    cp {params.male_ploidy} {output.ploidy}
else
    # prediction is female
    cp {params.female_ploidy} {output.ploidy}
fi
sed -i 's/SAMPLENAME/{params.sample}/g' \\
    {output.ploidy}

# Delete Canvas checkpoints
if [ -d {params.checkpoints} ]; then
    # Forces Canvas to start 
    # over from the beginning
    rm -rf '{params.checkpoints}'
fi

# CANVAS in Germline WGS mode
export COMPlus_gcAllowVeryLargeObjects=1
Canvas.dll Germline-WGS \\
    -b {input.bam} \\
    -n {params.sample} \\
    -o {params.outdir} \\
    -r {params.canvas_kmer} \\
    --ploidy-vcf={output.ploidy} \\
    -g {params.canvas_genome} \\
    -f {params.canvas_filter} \\
    --sample-b-allele-vcf={input.vcf}

# Filter predicted CNVs
bcftools filter \\
    --include 'FILTER="PASS" && INFO/SVTYPE="CNV"' \\
    {output.vcf} \\
> {output.filtered}

# Rank and annotate CNVs
AnnotSV \\
    -annotationsDir  {params.annotsv_annot} \\
    -genomeBuild {params.annotsv_build} \\
    -outputDir {params.outdir} \\
    -outputFile {output.annotated} \\
    -SVinputFile {output.filtered} \\
    -snvIndelFiles {input.joint} \\
    -snvIndelSamples {params.sample}

# Check if AnnotSV silently failed
if [ ! -f "{output.annotated}" ]; then
    # AnnotSV failed to process
    # provided SVinputFile file, 
    # usually due to passing an 
    # empty filtered SV file
    echo "WARNING: AnnotSV silently failed..." 1>&2
    touch {output.annotated}
fi
"""
shell: """
# Run HMF Tools AMBER to generate B-allele frequencies for PURPLE
java -Xmx{params.memory}g -cp {params.amber_jar} \\
    com.hartwig.hmftools.amber.AmberApplication \\
        -tumor {params.tumor} {params.normal_name} \\
        -tumor_bam {input.tumor}  {params.normal_bam} \\
        -output_dir {params.outdir} \\
        -threads {threads} {params.tumor_flag} \\
        -loci {params.loci_ref}
"""
From line 232 of rules/cnv.smk
shell: """
# Run HMF Tools COBALT to compute read depth ratios for PURPLE
java -Xmx{params.memory}g -cp {params.cobalt_jar} \\
    com.hartwig.hmftools.cobalt.CountBamLinesApplication \\
        -tumor {params.tumor} {params.normal_name} \\
        -tumor_bam {input.tumor}  {params.normal_bam} \\
        -output_dir {params.outdir} \\
        -threads {threads} {params.tumor_flag} \\
        -gc_profile {params.gc_profile}
"""
From line 282 of rules/cnv.smk
shell: """
# Set output directories
# for Amber and Cobalt
amber_outdir="$(dirname "{input.amber}")"
cobalt_outdir="$(dirname "{input.cobalt}")"
echo "Amber output directory: $amber_outdir"
echo "Cobalt output directory: $cobalt_outdir"

# Run Purple to find CNVs,
# purity and ploidy, and 
# cancer driver events
java -Xmx{params.memory}g -jar {params.purple_jar} \\
    -tumor {params.tumor} {params.normal_name} \\
    -output_dir {params.outdir} \\
    -amber "$amber_outdir" \\
    -cobalt "$cobalt_outdir" \\
    -circos circos \\
    -gc_profile {params.gc_profile} \\
    -ref_genome {params.genome} \\
    -ref_genome_version {params.ref_ver} \\
    -run_drivers \\
    -driver_gene_panel {params.panel} \\
    -somatic_hotspots {params.somatic_hotspot} \\
    -germline_hotspots {params.germline_hotspot} \\
    -threads {threads} {params.tumor_flag} \\
    -somatic_vcf {input.vcf} {params.sv_option}
"""
shell: """
java -Xmx{params.memory}g -jar ${{GATK_JAR}} -T RealignerTargetCreator \\
    --use_jdk_inflater \\
    --use_jdk_deflater \\
    -I {input.bam} \\
    -R {params.genome} \\
    -o {output.intervals} \\
    {params.knowns}

java -Xmx{params.memory}g -jar ${{GATK_JAR}} -T IndelRealigner \\
    -R {params.genome} \\
    -I {input.bam} \\
    {params.knowns} \\
    --use_jdk_inflater \\
    --use_jdk_deflater \\
    -targetIntervals {output.intervals} \\
    -o {output.bam}
"""
shell: """
gatk --java-options '-Xmx{params.memory}g' BaseRecalibrator \\
    --input {input.bam} \\
    --reference {params.genome} \\
    {params.knowns} \\
    --output {output.recal} \\
    {params.intervals}
"""
shell: """
# Create GatherBQSR list
find {params.bams} -iname '{params.sample}_recal*_data.grp' \\
    > {output.lsl}
# Gather per sample BQSR results
gatk --java-options '-Xmx{params.memory}g' GatherBQSRReports \\
    --use-jdk-inflater --use-jdk-deflater \\
    -I {output.lsl} \\
    -O {output.recal}
"""
shell: """
gatk --java-options '-Xmx{params.memory}g' ApplyBQSR \\
    --use-jdk-inflater --use-jdk-deflater \\
    --reference {params.genome} \\
    --bqsr-recal-file {input.recal} \\
    --input {input.bam} \\
    --output {output.bam}

samtools index -@ {threads} {output.bam} {output.bai}
"""
shell: """
# Sets up a temporary directory for
# intermediate files with a built-in
# mechanism for deletion on exit
if [ ! -d "{params.tmpdir}" ]; then mkdir -p "{params.tmpdir}"; fi
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

run_deepvariant \\
    --model_type=WGS \\
    --ref={params.genome} \\
    --reads={input.bam} \\
    --output_gvcf={output.gvcf} \\
    --output_vcf={output.vcf} \\
    --num_shards={threads} \\
    --intermediate_results_dir=${{tmp}}
"""
shell: """
# Sets up a temporary directory for
# intermediate files with a built-in
# mechanism for deletion on exit.
# The GLnexus tmpdir should NOT exist
# prior to running it; if it does
# exist, GLnexus will immediately
# error out.
if [ ! -d "{params.tmpdir}" ]; then mkdir -p "{params.tmpdir}"; fi
tmp_parent=$(mktemp -d -p "{params.tmpdir}")
tmp_dne=$(echo "${{tmp_parent}}"| sed 's@$@/GLnexus.DB@')
trap 'rm -rf "${{tmp_parent}}"' EXIT

# Avoids ARG_MAX issue which will
# limit max length of a command
find {params.gvcfdir} -iname '*.g.vcf.gz' \\
> {output.gvcfs}

glnexus_cli \\
    --dir ${{tmp_dne}} \\
    --config DeepVariant_unfiltered \\
    --list {output.gvcfs} \\
    --threads {threads} \\
    --mem-gbytes {params.memory} \\
> {output.bcf}

bcftools norm \\
    -m - \\
    -Oz \\
    --threads {threads} \\
    -f {params.genome} \\
    -o {output.norm} \\
    {output.bcf}

bcftools view \\
    -Oz \\
    --threads {threads} \\
    -o {output.jvcf} \\
    {output.bcf}

bcftools index \\
    -f -t \\
    --threads {threads} \\
    {output.norm}

bcftools index \\
    -f -t \\
    --threads {threads} \\
    {output.jvcf}
"""
shell: """
gatk --java-options '-Xmx{params.memory}g -XX:ParallelGCThreads={threads}' SelectVariants \\
    -R {params.genome} \\
    --variant {input.vcf} \\
    --sample-name {params.sample} \\
    --exclude-non-variants \\
    --output {output.vcf}
"""
shell: """
singularity exec -B {params.bind} {params.sif} \\
HLA-LA.pl \\
    --BAM {input.bam} \\
    --graph PRG_MHC_GRCh38_withIMGT \\
    --sampleID sample \\
    --maxThreads {threads} \\
    --workingDir {params.outdir}
"""
shell: """
gatk --java-options '-Xmx{params.memory}g -XX:ParallelGCThreads={threads}' SelectVariants \\
    -R {params.genome} \\
    --variant {input.vcf} \\
    -L {params.chrom} \\
    --exclude-non-variants \\
    --output {output.vcf}

tabix --force \\
    -p vcf {output.vcf}
"""
shell: """
# Environment variable for modules dir
export OC_MODULES="{params.module}"

oc run \\
    -t vcf \\
    -x \\
    --newlog \\
    --cleanrun \\
    --system-option "modules_dir={params.module}" \\
    -a {params.annot} \\
    -n {params.prefix} \\
    -l {params.genome} \\
    -d {params.outdir} \\
    --mp {threads} \\
    {input.vcf}
"""
    shell: """
# Create first filtering script,
# Filters based on AF 
mkdir -p {params.scripts}
cp {input.db} {output.filter_1} 
cat << EOF > {params.filter_1}
import sqlite3
import os

maf = str({params.maf_thres})
mafc = str(1 - float({params.maf_thres}))
so = "{params.so}"
conn = sqlite3.connect("{output.filter_1}")
conn.isolation_level = None
cursor = conn.cursor()
conn.execute('CREATE TABLE variant2 AS SELECT * FROM variant WHERE (base__so IN (' + '"' + '", "'.join(so.split(',')) + '"' + ')) AND ((gnomad__af IS NULL AND gnomad3__af IS NULL AND thousandgenomes__af IS NULL) OR (gnomad__af <= ' + maf +' OR gnomad__af >= '+mafc+' OR gnomad3__af <= '+maf+' OR gnomad3__af >= '+mafc+' OR thousandgenomes__af <= '+maf+' OR thousandgenomes__af >= '+mafc+'))')
conn.execute('DROP TABLE variant')
conn.execute('ALTER TABLE variant2 RENAME TO variant')
conn.execute('DELETE from sample WHERE base__uid NOT IN (SELECT base__uid FROM variant)')
conn.execute('DELETE FROM gene WHERE base__hugo NOT IN (SELECT base__hugo FROM variant)')
conn.execute('DELETE FROM mapping WHERE base__uid NOT IN (SELECT base__uid FROM variant)')
conn.execute('UPDATE info SET colval = (SELECT COUNT(*) FROM variant) WHERE colkey == "Number of unique input variants"')
conn.execute('VACUUM')
conn.close()
EOF

# Create second filtering script,
# Filters based on filters in config
cp {output.filter_1} {output.filter_2}
cat << EOF > {params.filter_2}
import pandas as pd
import sqlite3
import sys
import os

conn = sqlite3.connect("{output.filter_2}")
conn.isolation_level = None
cursor = conn.cursor()
filter = {params.secondary}

def keep(dd, used_annotators):
    final_annotators = {{annotator: dd[annotator] for annotator in used_annotators}}
    return final_annotators

def filtercol(dd, annot):
    if dd['relation']=='IN':
        return annot + '__' + dd['col'] + ' ' + dd['relation'] + ' ("' + '", "'.join(dd['value'].split(',')) + '")'
    elif dd['relation']=='IS' or dd['relation']=='IS NOT':
        return annot + '__' + dd['col'] + ' ' + dd['relation'] + ' ' + dd['value']
    else:
        return annot + '__'+ dd['col'] + ' ' + dd['relation'] + dd['value']    

def filterunit(annot):
    dd = filter[annot]
    if dd['col'].lower()=='multiple':
        cols = dd['cols'].split(',')
        relations = dd['relation'].split(',')
        values = dd['value'].split(',')
        return(' OR '.join([filtercol({{'col':cols[i], 'relation': relations[i], 'value': values[i]}}, annot) for i in range(len(cols))]))
    else:
        return(filtercol(dd, annot))

def filterunit_null(annot):
    dd = filter[annot]
    if dd['col'].lower()=='multiple':
        cols = dd['cols'].split(',')
        relations = dd['relation'].split(',')
        values = dd['value'].split(',')
        return(' AND '.join([annot + '__' + cols[i] + ' IS NULL' for i in range(len(cols))]))
    else:
        return(annot + '__' + dd['col'] + ' IS NULL')

# Find the intersection of annotators listed
# in the colnames of the SQLite variant table
# and annotators in the filter config. This
# step is required: if an OpenCRAVAT module
# has no annotations for the provided set of
# variants, that module's annotations may not
# exist in the SQLite table.
df = pd.read_sql_query("SELECT * FROM variant", conn)
table_var_annotators = set([col for col in df.columns])
filter_annotators = []
column2module = {{}}
for ann in set(filter.keys()):
    try:
        # Multiple column filters
        col_names = filter[ann]['cols']
        col_names = [c.strip() for c in col_names.split(',')]
    except KeyError:
        # One column filter 
        col_names = [filter[ann]['col'].strip()]
    for col in col_names:
        coln = '{{}}__{{}}'.format(ann, col)
        filter_annotators.append(coln)
        column2module[coln] = ann

filter_annotators = set(filter_annotators)
tmp_annotators = table_var_annotators.intersection(filter_annotators)
keep_annotators = set([column2module[ann] for ann in tmp_annotators])

# Sanity check 
if len(keep_annotators) == 0:
    print('WARNING: No filter annotators were provided that match oc run annotators.', file=sys.stderr)
    print('WARNING: The next filtering step may fail.', file=sys.stderr)

# Filter to avoid SQL filtering issues
filter = keep(filter, keep_annotators)
print('Apply final filters to SQLite: ', filter)

filter_query_nonnull = ' OR '.join([filterunit(annot) for annot in filter.keys()])
filter_query_null = ' AND '.join([filterunit_null(annot) for annot in filter.keys()])
filter_query = filter_query_nonnull + ' OR (' + filter_query_null + ')'
print(filter_query)
conn.execute('CREATE TABLE variant2 AS SELECT * FROM variant WHERE (' + filter_query + ')')
conn.execute('DROP TABLE variant')
conn.execute('ALTER TABLE variant2 RENAME TO variant')
conn.execute('DELETE from sample WHERE base__uid NOT IN (SELECT base__uid FROM variant)')
conn.execute('DELETE FROM gene WHERE base__hugo NOT IN (SELECT base__hugo FROM variant)')
conn.execute('DELETE FROM mapping WHERE base__uid NOT IN (SELECT base__uid FROM variant)')
conn.execute('UPDATE info SET colval = (SELECT COUNT(*) FROM variant) WHERE colkey == "Number of unique input variants"')
conn.execute('VACUUM')
conn.close()
EOF

# Create the column-fixing script, which
# adds min/max read depth and AF columns
cp {output.filter_2} {output.fixed}
cat << EOF > {params.fixed}
import sqlite3
import os
import pandas as pd

conn = sqlite3.connect("{output.fixed}")
conn.isolation_level = None
cursor = conn.cursor()
depth = pd.read_sql_query('SELECT base__uid,vcfinfo__tot_reads,vcfinfo__af from variant', conn, index_col = 'base__uid')
tot_read = depth.vcfinfo__tot_reads.str.split(';', expand = True).astype('float')
af = depth.vcfinfo__af.str.split(';', expand = True).apply(pd.to_numeric)
depth['vcfinfo__Max_read'] = tot_read.max(axis = 1)
depth['vcfinfo__Min_read'] = tot_read.min(axis = 1)
depth['vcfinfo__Max_af'] = af.max(axis = 1)
depth['vcfinfo__Min_af'] = af.min(axis = 1)
depth.to_sql('tmp', conn, if_exists = 'replace', index = True)
conn.execute('alter table variant add column vcfinfo__Max_read numeric(50)')
conn.execute('alter table variant add column vcfinfo__Min_read numeric(50)')
conn.execute('alter table variant add column vcfinfo__Max_af numeric(50)')
conn.execute('alter table variant add column vcfinfo__Min_af numeric(50)')
qry = 'update variant set vcfinfo__Max_read = (select vcfinfo__Max_read from tmp where tmp.base__uid = variant.base__uid) where vcfinfo__Max_read is NULL'
conn.execute(qry)
qry = 'update variant set vcfinfo__Min_read = (select vcfinfo__Min_read from tmp where tmp.base__uid = variant.base__uid) where vcfinfo__Min_read is NULL'
conn.execute(qry)
qry = 'update variant set vcfinfo__Max_af = (select vcfinfo__Max_af from tmp where tmp.base__uid = variant.base__uid) where vcfinfo__Max_af is NULL'
conn.execute(qry)
qry = 'update variant set vcfinfo__Min_af = (select vcfinfo__Min_af from tmp where tmp.base__uid = variant.base__uid) where vcfinfo__Min_af is NULL'
conn.execute(qry)
conn.execute('''INSERT INTO variant_header (col_name, col_def) VALUES ('vcfinfo__Max_read','{{"index": null, "name": "vcfinfo__Max_read", "title": "Max reads", "type": "float", "categories": [], "width": 70, "desc": null, "hidden": false, "category": null, "filterable": true, "link_format": null, "genesummary": false, "table": false}}')''')
conn.execute('''INSERT INTO variant_header (col_name, col_def) VALUES ('vcfinfo__Min_read','{{"index": null, "name": "vcfinfo__Min_read", "title": "Min reads", "type": "float", "categories": [], "width": 70, "desc": null, "hidden": false, "category": null, "filterable": true, "link_format": null, "genesummary": false, "table": false}}')''')
conn.execute('''INSERT INTO variant_header (col_name, col_def) VALUES ('vcfinfo__Max_af','{{"index": null, "name": "vcfinfo__Max_af", "title": "Max AF", "type": "float", "categories": [], "width": 70, "desc": null, "hidden": false, "category": null, "filterable": true, "link_format": null, "genesummary": false, "table": false}}')''')
conn.execute('''INSERT INTO variant_header (col_name, col_def) VALUES ('vcfinfo__Min_af','{{"index": null, "name": "vcfinfo__Min_af", "title": "Min AF", "type": "float", "categories": [], "width": 70, "desc": null, "hidden": false, "category": null, "filterable": true, "link_format": null, "genesummary": false, "table": false}}')''')
conn.commit()
conn.execute('drop table tmp')
conn.execute('VACUUM')
conn.close()
EOF

echo 'Running first filtering script'
python3 {params.filter_1}

echo 'Running secondary filtering script'
python3 {params.filter_2}

echo 'Running column fixing script'
python3 {params.fixed}
"""
shell: """
oc util mergesqlite \\
    -o {output.merged} \\
    {input.dbs} 
"""
shell: """
# Environment variable for modules dir
export OC_MODULES="{params.module}"

oc run \\
    -t vcf \\
    -x \\
    --newlog \\
    --cleanrun \\
    --system-option "modules_dir={params.module}" \\
    -a {params.annot} \\
    -n {params.prefix} \\
    -l {params.genome} \\
    -d {params.outdir} \\
    --mp {threads} \\
    {input.vcfs}
"""
    shell: """
# Create first filtering script,
# Filters based on AF 
mkdir -p {params.scripts}
cp {input.db} {output.filter_1} 
cat << EOF > {params.filter_1}
import sqlite3
import os

maf = str({params.maf_thres})
mafc = str(1 - float({params.maf_thres}))
so = "{params.so}"
conn = sqlite3.connect("{output.filter_1}")
conn.isolation_level = None
cursor = conn.cursor()
conn.execute('CREATE TABLE variant2 AS SELECT * FROM variant WHERE (base__so IN (' + '"' + '", "'.join(so.split(',')) + '"' + ')) AND ((gnomad__af IS NULL AND gnomad3__af IS NULL AND thousandgenomes__af IS NULL) OR (gnomad__af <= ' + maf +' OR gnomad__af >= '+mafc+' OR gnomad3__af <= '+maf+' OR gnomad3__af >= '+mafc+' OR thousandgenomes__af <= '+maf+' OR thousandgenomes__af >= '+mafc+'))')
conn.execute('DROP TABLE variant')
conn.execute('ALTER TABLE variant2 RENAME TO variant')
conn.execute('DELETE from sample WHERE base__uid NOT IN (SELECT base__uid FROM variant)')
conn.execute('DELETE FROM gene WHERE base__hugo NOT IN (SELECT base__hugo FROM variant)')
conn.execute('DELETE FROM mapping WHERE base__uid NOT IN (SELECT base__uid FROM variant)')
conn.execute('UPDATE info SET colval = (SELECT COUNT(*) FROM variant) WHERE colkey == "Number of unique input variants"')
conn.execute('VACUUM')
conn.close()
EOF

# Create second filtering script,
# Filters based on filters in config
cp {output.filter_1} {output.filter_2}
cat << EOF > {params.filter_2}
import pandas as pd
import sqlite3
import sys
import os

conn = sqlite3.connect("{output.filter_2}")
conn.isolation_level = None
cursor = conn.cursor()
filter = {params.secondary}

def keep(dd, used_annotators):
    final_annotators = {{annotator: dd[annotator] for annotator in used_annotators}}
    return final_annotators

def filtercol(dd, annot):
    if dd['relation']=='IN':
        return annot + '__' + dd['col'] + ' ' + dd['relation'] + ' ("' + '", "'.join(dd['value'].split(',')) + '")'
    elif dd['relation']=='IS' or dd['relation']=='IS NOT':
        return annot + '__' + dd['col'] + ' ' + dd['relation'] + ' ' + dd['value']
    else:
        return annot + '__'+ dd['col'] + ' ' + dd['relation'] + dd['value']    

def filterunit(annot):
    dd = filter[annot]
    if dd['col'].lower()=='multiple':
        cols = dd['cols'].split(',')
        relations = dd['relation'].split(',')
        values = dd['value'].split(',')
        return(' OR '.join([filtercol({{'col':cols[i], 'relation': relations[i], 'value': values[i]}}, annot) for i in range(len(cols))]))
    else:
        return(filtercol(dd, annot))

def filterunit_null(annot):
    dd = filter[annot]
    if dd['col'].lower()=='multiple':
        cols = dd['cols'].split(',')
        relations = dd['relation'].split(',')
        values = dd['value'].split(',')
        return(' AND '.join([annot + '__' + cols[i] + ' IS NULL' for i in range(len(cols))]))
    else:
        return(annot + '__' + dd['col'] + ' IS NULL')

# Find the intersection of annotators listed
# in the colnames of the SQLite variant table
# and annotators in the filter config. This
# step is required: if an OpenCRAVAT module
# has no annotations for the provided set of
# variants, that module's annotations may not
# exist in the SQLite table.
df = pd.read_sql_query("SELECT * FROM variant", conn)
table_var_annotators = set([col for col in df.columns])
filter_annotators = []
column2module = {{}}
for ann in set(filter.keys()):
    try:
        # Multiple column filters
        col_names = filter[ann]['cols']
        col_names = [c.strip() for c in col_names.split(',')]
    except KeyError:
        # One column filter 
        col_names = [filter[ann]['col'].strip()]
    for col in col_names:
        coln = '{{}}__{{}}'.format(ann, col)
        filter_annotators.append(coln)
        column2module[coln] = ann

filter_annotators = set(filter_annotators)
tmp_annotators = table_var_annotators.intersection(filter_annotators)
keep_annotators = set([column2module[ann] for ann in tmp_annotators])

# Sanity check 
if len(keep_annotators) == 0:
    print('WARNING: No filter annotators were provided that match oc run annotators.', file=sys.stderr)
    print('WARNING: The next filtering step may fail.', file=sys.stderr)

# Filter to avoid SQL filtering issues
filter = keep(filter, keep_annotators)
print('Apply final filters to SQLite: ', filter)

filter_query_nonnull = ' OR '.join([filterunit(annot) for annot in filter.keys()])
filter_query_null = ' AND '.join([filterunit_null(annot) for annot in filter.keys()])
filter_query = filter_query_nonnull + ' OR (' + filter_query_null + ')'
print(filter_query)
conn.execute('CREATE TABLE variant2 AS SELECT * FROM variant WHERE (' + filter_query + ')')
conn.execute('DROP TABLE variant')
conn.execute('ALTER TABLE variant2 RENAME TO variant')
conn.execute('DELETE from sample WHERE base__uid NOT IN (SELECT base__uid FROM variant)')
conn.execute('DELETE FROM gene WHERE base__hugo NOT IN (SELECT base__hugo FROM variant)')
conn.execute('DELETE FROM mapping WHERE base__uid NOT IN (SELECT base__uid FROM variant)')
conn.execute('UPDATE info SET colval = (SELECT COUNT(*) FROM variant) WHERE colkey == "Number of unique input variants"')
conn.execute('VACUUM')
conn.close()
EOF

# Create the column-fixing script, which
# adds min/max read depth and AF columns
cp {output.filter_2} {output.fixed}
cat << EOF > {params.fixed}
import sqlite3
import os
import pandas as pd

conn = sqlite3.connect("{output.fixed}")
conn.isolation_level = None
cursor = conn.cursor()
depth = pd.read_sql_query('SELECT base__uid,vcfinfo__tot_reads,vcfinfo__af from variant', conn, index_col = 'base__uid')
tot_read = depth.vcfinfo__tot_reads.str.split(';', expand = True).astype('float')
af = depth.vcfinfo__af.str.split(';', expand = True).apply(pd.to_numeric)
depth['vcfinfo__Max_read'] = tot_read.max(axis = 1)
depth['vcfinfo__Min_read'] = tot_read.min(axis = 1)
depth['vcfinfo__Max_af'] = af.max(axis = 1)
depth['vcfinfo__Min_af'] = af.min(axis = 1)
depth.to_sql('tmp', conn, if_exists = 'replace', index = True)
conn.execute('alter table variant add column vcfinfo__Max_read numeric(50)')
conn.execute('alter table variant add column vcfinfo__Min_read numeric(50)')
conn.execute('alter table variant add column vcfinfo__Max_af numeric(50)')
conn.execute('alter table variant add column vcfinfo__Min_af numeric(50)')
qry = 'update variant set vcfinfo__Max_read = (select vcfinfo__Max_read from tmp where tmp.base__uid = variant.base__uid) where vcfinfo__Max_read is NULL'
conn.execute(qry)
qry = 'update variant set vcfinfo__Min_read = (select vcfinfo__Min_read from tmp where tmp.base__uid = variant.base__uid) where vcfinfo__Min_read is NULL'
conn.execute(qry)
qry = 'update variant set vcfinfo__Max_af = (select vcfinfo__Max_af from tmp where tmp.base__uid = variant.base__uid) where vcfinfo__Max_af is NULL'
conn.execute(qry)
qry = 'update variant set vcfinfo__Min_af = (select vcfinfo__Min_af from tmp where tmp.base__uid = variant.base__uid) where vcfinfo__Min_af is NULL'
conn.execute(qry)
conn.execute('''INSERT INTO variant_header (col_name, col_def) VALUES ('vcfinfo__Max_read','{{"index": null, "name": "vcfinfo__Max_read", "title": "Max reads", "type": "float", "categories": [], "width": 70, "desc": null, "hidden": false, "category": null, "filterable": true, "link_format": null, "genesummary": false, "table": false}}')''')
conn.execute('''INSERT INTO variant_header (col_name, col_def) VALUES ('vcfinfo__Min_read','{{"index": null, "name": "vcfinfo__Min_read", "title": "Min reads", "type": "float", "categories": [], "width": 70, "desc": null, "hidden": false, "category": null, "filterable": true, "link_format": null, "genesummary": false, "table": false}}')''')
conn.execute('''INSERT INTO variant_header (col_name, col_def) VALUES ('vcfinfo__Max_af','{{"index": null, "name": "vcfinfo__Max_af", "title": "Max AF", "type": "float", "categories": [], "width": 70, "desc": null, "hidden": false, "category": null, "filterable": true, "link_format": null, "genesummary": false, "table": false}}')''')
conn.execute('''INSERT INTO variant_header (col_name, col_def) VALUES ('vcfinfo__Min_af','{{"index": null, "name": "vcfinfo__Min_af", "title": "Min AF", "type": "float", "categories": [], "width": 70, "desc": null, "hidden": false, "category": null, "filterable": true, "link_format": null, "genesummary": false, "table": false}}')''')
conn.commit()
conn.execute('drop table tmp')
conn.execute('VACUUM')
conn.close()
EOF

echo 'Running first filtering script'
python3 {params.filter_1}

echo 'Running secondary filtering script'
python3 {params.filter_2}

echo 'Running column fixing script'
python3 {params.fixed}
"""
shell: """
python3 {params.get_flowcell_lanes} \\
    {input.r1} \\
    {wildcards.name} \\
> {output.txt}
"""
From line 36 of rules/qc.smk
shell: """
fastqc \\
    {input.r1} \\
    {input.r2} \\
    -t {threads} \\
    -o {params.outdir}
"""
shell: """
fastq_screen --conf {params.fastq_screen_config} \\
    --outdir {params.outdir} \\
    --threads {threads} \\
    --subset 1000000 \\
    --aligner bowtie2 \\
    --force \\
    {input.fq1} \\
    {input.fq2}
"""
shell: """
fastqc -t {threads} \\
    -f bam \\
    --contaminants {params.adapters} \\
    -o {params.outdir} \\
    {input.bam} 
"""
shell: """
unset DISPLAY
qualimap bamqc -bam {input.bam} \\
    --java-mem-size=92G \\
    -c -ip --gd HUMAN \\
    -outdir {params.outdir} \\
    -outformat HTML \\
    -nt {threads} \\
    --skip-duplicated \\
    -nw 500 \\
    -p NON-STRAND-SPECIFIC
"""
shell: """
samtools flagstat --threads {threads} \\
    {input.bam} \\
> {output.txt}
"""
shell: """
bcftools stats \\
    {input.vcf} \\
> {output.txt}
"""
shell: """
gatk --java-options '-Xmx{params.memory}g -XX:ParallelGCThreads={threads}' VariantEval \\
    -R {params.genome} \\
    -O {output.grp} \\
    --dbsnp {params.dbsnp} \\
    --eval {input.vcf} 
"""
shell: """
java -Xmx{params.memory}g -jar ${{SNPEFF_JAR}} \\
    -v -canon -c {params.config} \\
    -csvstats {output.csv} \\
    -stats {output.html} \\
    {params.genome} \\
    {input.vcf} > {output.vcf}
"""
From line 307 of rules/qc.smk
shell: """
vcftools \\
    --gzvcf {input.vcf} \\
    --het \\
    --out {params.prefix}
"""
shell: """
java -Xmx{params.memory}g -jar ${{PICARDJARPATH}}/picard.jar \\
    CollectVariantCallingMetrics \\
    INPUT={input.vcf} \\
    OUTPUT={params.prefix} \\
    DBSNP={params.dbsnp} \\
    VALIDATION_STRINGENCY=SILENT
"""
shell: """ 
echo "Extracting sites to estimate ancestry."
somalier extract \\
    -d {params.outdir} \\
    --sites {params.sites} \\
    -f {params.genome} \\
    {input.vcf}

# Check if pedigree file exists,
# pulled from patient database
pedigree_option=""
if [ -f "{params.ped}" ]; then
    # Use PED with relate command
    pedigree_option="-p {params.ped}"
fi
echo "Estimating relatedness with pedigree $pedigree_option"
somalier relate "$pedigree_option" \\
    -i -o {params.outdir}/relatedness \\
    {output.somalier}

echo "Estimating ancestry."
somalier ancestry \\
    --n-pcs=10 \\
    -o {params.outdir}/ancestry \\
    --labels {params.ancestry_db}/ancestry-labels-1kg.tsv \\
    {params.ancestry_db}/*.somalier ++ \\
    {output.somalier} || {{
# Somalier ancestry error,
# usually due to not finding
# any sites shared with its
# references; this is expected
# with sub-sampled datasets
echo "WARNING: Somalier ancestry failed..." 1>&2
touch {output.ancestry}
}}
"""
From line 410 of rules/qc.smk
shell: """
multiqc --ignore '*/.singularity/*' \\
    --ignore 'slurmfiles/' \\
    --exclude peddy \\
    -f --interactive \\
    -n {output.report} \\
    {params.workdir}
"""
shell: """
mkdir -p '{params.tmppath}'
octopus --threads {threads} \\
    -C cancer \\
    --working-directory {params.wd} \\
    --temp-directory-prefix {params.tmpdir} \\
    -R {params.genome} \\
    -I {input.normal} {input.tumor} {params.normal_option} \\
    -o {output.vcf} \\
    --forest-model {params.g_model} \\
    --somatic-forest-model {params.s_model} \\
    --annotations AC AD DP \\
    -T {params.chunk}
"""
shell: """
# Create list of chunks to merge
find {params.octopath} -iname '{params.tumor}.vcf.gz' \\
    > {output.lsl}
# Merge octopus chunk calls,
# contains both germline and
# somatic variants
bcftools concat \\
    --threads {threads} \\
    -d exact \\
    -a \\
    -f {output.lsl} \\
    -o {output.raw} \\
    -O v
# Filter Octopus callset for 
# variants with SOMATIC tag
grep -E "#|CHROM|SOMATIC" {output.raw} \\
    > {output.vcf}
"""
shell: """
octopus --threads {threads} \\
    --working-directory {params.wd} \\
    --temp-directory-prefix {params.tmpdir} \\
    -R {params.genome} \\
    -I {input.normal} \\
    -o {output.vcf} \\
    --forest-model {params.model} \\
    --annotations AC AD AF DP \\
    -T {params.chroms}
"""
shell: """
gatk Mutect2 \\
    -R {params.genome} \\
    -I {input.tumor} {params.i_option} {params.normal_option} \\
    --panel-of-normals {params.pon} \\
    --germline-resource {params.germsource} \\
    -L {params.chrom} \\
    -O {output.vcf} \\
    --f1r2-tar-gz {output.orien} \\
    --independent-mates
"""
shell: """
# Sets up a temporary directory for
# intermediate files with a built-in
# mechanism for deletion on exit
if [ ! -d "{params.tmpdir}" ]; then mkdir -p "{params.tmpdir}"; fi
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

java -Xmx{params.memory}g -Djava.io.tmpdir=${{tmp}} \\
    -XX:ParallelGCThreads={threads} -jar $GATK_JAR -T CombineVariants \\
    --use_jdk_inflater --use_jdk_deflater \\
    -R {params.genome} \\
    --filteredrecordsmergetype KEEP_UNCONDITIONAL \\
    --assumeIdenticalSamples \\
    -o {output.vcf} \\
    {params.multi_variant_option}
"""
shell: """
# Gather Mutect2 stats
gatk MergeMutectStats \\
    {params.multi_stats_option} \\
    -O {output.stats} &
# Learn read orientation model
# for artifact filtering 
gatk LearnReadOrientationModel \\
    --output {output.orien} \\
    {params.multi_orien_option} &
wait
"""
shell: """
gatk --java-options '-Xmx{params.memory}g' GetPileupSummaries \\
    -I {input.tumor} \\
    -V {params.gsnp} \\
    -L {params.gsnp} \\
    -O {output.summary}
"""
shell: """
gatk --java-options '-Xmx{params.memory}g' GetPileupSummaries \\
    -I {input.normal} \\
    -V {params.gsnp} \\
    -L {params.gsnp} \\
    -O {output.summary}
"""
shell: """
gatk CalculateContamination \\
    -I {input.tumor} {params.normal_option} \\
    -O {output.summary}
"""
shell: """
# Mutect2 orientation bias filter
gatk FilterMutectCalls \\
    -R {params.genome} \\
    -V {input.vcf} \\
    --ob-priors {input.orien} \\
    --contamination-table {input.summary} \\
    -O {output.vcf} \\
    --stats {input.stats} 
"""
shell: """
# Prefilter and calculate position
# specific summary statistics 
MuSE call \\
    -n {threads} \\
    -f {params.genome} \\
    -O {params.tumor} \\
    {input.tumor} {params.normal_option} 
# Calculate cutoffs from a 
# sample specific error model
MuSE sump \\
    -n {threads} \\
    -G \\
    -I {output.txt} \\
    -O {output.vcf} \\
    -D {params.dbsnp}
# Renaming TUMOR/NORMAL in VCF 
# with real sample names
echo -e "TUMOR\\t{params.rename}{params.normal_header}" \\
> {output.header} 
bcftools reheader \\
    -o {output.final} \\
    -s {output.header} \\
    {output.vcf}
"""
shell: """
# Sets up a temporary directory for
# intermediate files with a built-in
# mechanism for deletion on exit
if [ ! -d "{params.tmpdir}" ]; then mkdir -p "{params.tmpdir}"; fi
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

# Delete any previous attempt's output
# directory to ensure a hard restart
if [ -d "{params.outdir}" ]; then
    rm -rf "{params.outdir}"
fi

# Configure Strelka somatic workflow
configureStrelkaSomaticWorkflow.py \\
    --referenceFasta {params.genome} \\
    --tumorBam {input.tumor} {params.normal_option} \\
    --runDir {params.outdir} \\
    --callRegions {params.regions}

# Call somatic variants with Strelka
echo "Starting Strelka workflow..."
{params.workflow} \\
    -m local \\
    -j {threads} \\
    -g {params.memory} 

# Combine and filter results
echo "Running CombineVariants..."
java -Xmx{params.memory}g -Djava.io.tmpdir=${{tmp}} \\
    -XX:ParallelGCThreads={threads} -jar $GATK_JAR -T CombineVariants \\
    --use_jdk_inflater --use_jdk_deflater \\
    -R {params.genome} \\
    --variant {output.snps} \\
    --variant {output.indels} \\
    --assumeIdenticalSamples \\
    --filteredrecordsmergetype KEEP_UNCONDITIONAL \\
    -o {output.vcf}

# Renaming TUMOR/NORMAL in 
# VCF with real sample names
echo -e "TUMOR\\t{params.tumor}{params.normal_header}" \\
> {output.header} 
bcftools reheader \\
    -o {output.final} \\
    -s {output.header} \\
    {output.vcf}
"""
shell: """
# Normalize VCF prior to SelectVariants,
# which needs multi-allelic sites
# to be split prior to running
echo "Running bcftools norm..."
bcftools norm \\
    -c w \\
    -m - \\
    -Ov \\
    --threads {threads} \\
    -f {params.genome} \\
    -o {output.norm} \\
    {input.vcf}
# Remove filtered sites and output
# variants not called in the PON
echo "Running SelectVariants..."
gatk --java-options "-Xmx{params.memory}g" SelectVariants \\
    -R {params.genome} \\
    --variant {output.norm} \\
    --discordance {params.pon} \\
    --exclude-filtered \\
    --output {output.filt}
# Fix format number metadata, gatk 
# SelectVariants converts Number
# metadata incorrectly when it
# is set to Number=.
sed -i 's/Number=R/Number=./g' \\
    {output.filt}
"""
shell: """
# Filter call set to tumor sites
bcftools view \\
    -c1 \\
    -Oz \\
    -s '{params.sample}' \\
    -o {output.tmp} \\
    {input.vcf}
# Renaming sample name in VCF 
# to contain caller name
echo -e "{params.sample}\\t{params.rename}" \\
> {output.header} 
bcftools reheader \\
    -o {output.tumor} \\
    -s {output.header} \\
    {output.tmp}
# Create a VCF index for the intersect
bcftools index \\
    -f \\
    --tbi \\
    {output.tumor} 
"""
shell: """
# Filter call set to normal sites
bcftools view \\
    --force-samples \\
    -c1 \\
    -Oz \\
    -s '{params.sample}' \\
    -o {output.tmp} \\
    {input.vcf}
# Renaming sample name in VCF 
# to contain caller name
echo -e "{params.sample}\\t{params.rename}" \\
> {output.header} 
bcftools reheader \\
    -o {output.normal} \\
    -s {output.header} \\
    {output.tmp}
# Create a VCF index for the intersect
bcftools index \\
    -f \\
    --tbi \\
    {output.normal}
"""
shell: """
# Delete any previous attempt's output
# directory to ensure a hard restart
if [ -d "{params.isec_dir}" ]; then
    rm -rf "{params.isec_dir}"
fi
# Intersect somatic callset to find
# variants in at least two callers
bcftools isec \\
    -Oz \\
    -n+2 \\
    -c none \\
    -p {params.isec_dir} \\
    {input.tumors} 
# Create list of files to merge 
find {params.isec_dir} \\
    -name '*.vcf.gz' \\
    | sort \\
> {output.lsl} 
# Merge variants found in at 
# least two callers 
bcftools merge \\
    -Oz \\
    -o {output.merged} \\
    -l {output.lsl}
# Create a VCF index for the merged set
bcftools index \\
    -f \\
    --tbi \\
    {output.merged}
"""
shell: """
# vcf2maf needs an uncompressed VCF file
zcat {input.vcf} \\
> {output.vcf}
# Run VEP and convert VCF into MAF file
vcf2maf.pl \\
    --input-vcf {output.vcf} \\
    --output-maf {output.maf} \\
    --vep-path ${{VEP_HOME}} \\
    --vep-data {params.vep_data} \\
    --cache-version {params.ref_version} \\
    --ref-fasta {params.genome} \\
    --vep-forks {threads} \\
    --tumor-id {params.tumor} {params.normal_option} \\
    --ncbi-build {params.vep_build} \\
    --species {params.vep_species}
"""
shell: """
echo "Combining MAFs..."
head -2 {input.mafs[0]} > {output.maf}
awk 'FNR>2 {{print}}' {input.mafs} >> {output.maf}
"""
shell: """
Rscript {params.script} \\
    {params.wdir} \\
    {input.maf} \\
    {output.summary} \\
    {output.oncoplot}
"""
shell: """
# SigProfiler input directory must
# only contain input MAF
mkdir -p "{params.wdir}"
ln -sf {input.maf} {params.wdir}
python3 {params.script} \\
    -i {params.wdir}/ \\
    -o {params.odir}/ \\
    -p {params.sample} \\
    -r {params.genome}
"""
From line 1040 of rules/somatic.smk
shell: """
# Merge SigProfiler PDFs
pdfunite {input.pdfs} \\
    {output.pdf}
"""
From line 1075 of rules/somatic.smk
shell: """
# Delete any previous attempt's output
# directory to ensure a hard restart
if [ -d "{params.outdir}" ]; then
    rm -rf "{params.outdir}"
fi

# Configure Manta germline SV workflow 
configManta.py \\
    --callRegions {params.regions} \\
    --bam {input.bam} \\
    --referenceFasta {params.genome} \\
    --runDir {params.outdir}

# Call germline SV with Manta workflow
echo "Starting Manta workflow..."
{params.workflow} \\
    -j {threads} \\
    -g {params.memory} 
"""
From line 55 of rules/sv.smk
shell: """
# Delete any previous attempt's output
# directory to ensure a hard restart
if [ -d "{params.outdir}" ]; then
    rm -rf "{params.outdir}"
fi

# Configure Manta somatic SV workflow 
configManta.py {params.normal_option} \\
    --callRegions {params.regions} \\
    --tumorBam {input.tumor} \\
    --referenceFasta {params.genome} \\
    --runDir {params.outdir} \\
    --outputContig

# Call somatic SV with Manta workflow
echo "Starting Manta workflow..."
{params.workflow} \\
    -j {threads} \\
    -g {params.memory} 
"""
From line 108 of rules/sv.smk
shell: """
fastp -w {threads} \\
    --detect_adapter_for_pe \\
    --in1 {input.r1} \\
    --in2 {input.r2} \\
    --out1 {output.r1} \\
    --out2 {output.r2} \\
    --json {output.json} \\
    --html {output.html}
"""
shell: """
# Sets up a temporary directory for
# intermediate files with a built-in
# mechanism for deletion on exit
if [ ! -d "{params.tmpdir}" ]; then mkdir -p "{params.tmpdir}"; fi
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

bwa-mem2 mem \\
    -t {threads} \\
    -K 100000000 \\
    -M \\
    -R \'@RG\\tID:{params.sample}\\tSM:{params.sample}\\tPL:illumina\\tLB:{params.sample}\\tPU:{params.sample}\\tCN:ncbr\\tDS:wgs\' \\
    {params.genome} \\
    {input.r1} \\
    {input.r2} \\
| samblaster -M \\
| samtools sort -@{params.sort_threads} \\
    -T ${{tmp}} \\
    --write-index \\
    -m 10G - \\
    -o {output.bam}##idx##{output.bai}
"""
URL: https://openomics.github.io/genome-seek/
License: MIT License
