An easy-to-use, flexible variant calling pipeline for use on the Biowulf cluster at NIH

public public 1yr ago Version: v1.0.0 0 bookmarks

GATK4 Whole Exome-sequencing Pipeline . This is the home of the pipeline, exome-seek. Its long-term goals: to accurately call germline and somatic variants, to infer CNVs, and to boldly annotate variants like no pipeline before!

Overview

Welcome to exome-seek! Before getting started, we highly recommend reading through exome-seek's documentation .

The ./exome-seek pipeline is composed several inter-related sub commands to setup and run the pipeline across different systems. Each of the available sub commands perform different functions:

Exome-seek is a comprehensive whole exome-sequencing pipeline following the Broad's set of best practices. It relies on technologies like Singularity1 to maintain the highest-level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by Snakemake2 , a flexible and scalable workflow management system, to submit jobs to a cluster or cloud provider.

The pipeline is compatible with data generated from Illumina short-read sequencing technologies. As input, it accepts a set of FastQ or BAM files and can be run locally on a compute instance, on-premise using a cluster, or on the cloud (feature coming soon!). A user can define the method or mode of execution. The pipeline can submit jobs to a cluster using a job scheduler like SLURM, or run on AWS using Tibanna (feature coming soon!). A hybrid approach ensures the pipeline is accessible to all users.

Before getting started, we highly recommend reading through the usage section of each available sub command.

For more information about issues or trouble-shooting a problem, please checkout our FAQ prior to opening an issue on Github .

Dependencies

Requires: singularity>=3.5 snakemake==6.X

Snakemake and singularity must be installed on the target system. Snakemake orchestrates the execution of each step in the pipeline. To guarantee the highest level of reproducibility, each step relies on versioned images from DockerHub . Snakemake uses singaularity to pull these images onto the local filesystem prior to job execution, and as so, snakemake and singularity are the only two dependencies.

Installation

Please clone this repository to your local filesystem using the following command:

# Clone Repository from Github
git clone https://github.com/mtandon09/CCBR_GATK4_Exome_Seq_Pipeline.git
# Change your working directory
cd CCBR_GATK4_Exome_Seq_Pipeline/

Contribute

This site is a living document, created for and by members like you. EXOME-seek is maintained by the members of CCBR and is improved by continous feedback! We encourage you to contribute new content and make improvements to existing content via pull request to our repository .

References

1. Kurtzer GM, Sochat V, Bauer MW (2017). Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459.
2. Koster, J. and S. Rahmann (2018). "Snakemake-a scalable bioinformatics workflow engine." Bioinformatics 34(20): 3600.

Code Snippets

27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
shell: """
myoutdir="$(dirname {output.cnvs})/{params.tumorsample}"
if [ ! -d "$myoutdir" ]; then mkdir -p "$myoutdir"; fi

perl "{params.config_script}" \\
    "$myoutdir" \\
    {params.lengths} \\
    {params.chroms} \\
    {input.tumor} \\
    {input.normal} \\
    {params.pile} \\
    {params.fasta} \\
    {params.snps} \\
    {input.targets}

freec -conf "$myoutdir/freec_exome_config.txt"

cat "{params.sig_script}" | \\
    R --slave \\
    --args $myoutdir/{params.tumorsample}.bam_CNVs \\
    $myoutdir/{params.tumorsample}.bam_ratio.txt

mv $myoutdir/{params.tumorsample}.bam_CNVs.p.value.txt {output.cnvs}
cat "{params.plot_script}" | \\
    R --slave \\
    --args 2 \\
    $myoutdir/{params.tumorsample}.bam_ratio.txt \\
    $myoutdir/{params.tumorsample}.bam_BAF.txt
"""
SnakeMake From line 27 of rules/cnv.smk
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
shell: """
myoutdir="$(dirname {output.fit})/{params.tumorsample}"
if [ ! -d "$myoutdir" ]; then mkdir -p "$myoutdir"; fi

gzip -c "$(dirname {input.freeccnvs})/{params.tumorsample}/{params.normalsample}.bam_minipileup.pileup" \\
    > "$myoutdir/{params.normalsample}.recal.bam_minipileup.pileup.gz"
gzip -c "$(dirname {input.freeccnvs})/{params.tumorsample}/{params.tumorsample}.bam_minipileup.pileup" \\
    > "$myoutdir/{params.tumorsample}.recal.bam_minipileup.pileup.gz"

sequenza-utils bam2seqz \\
    -p \\
    -gc {params.gc} \\
    -n "$myoutdir/{params.normalsample}.recal.bam_minipileup.pileup.gz" \\
    -t "$myoutdir/{params.tumorsample}.recal.bam_minipileup.pileup.gz" \\
    | gzip > "$myoutdir/{params.tumorsample}.seqz.gz"

sequenza-utils seqz_binning \\
    -w 100 \\
    -s "$myoutdir/{params.tumorsample}.seqz.gz" \\
    | tee "$myoutdir/{params.tumorsample}.bin100.seqz" \\
    | gzip > "$myoutdir/{params.tumorsample}.bin100.seqz.gz"

Rscript "{params.run_script}" \\
    "$myoutdir/{params.tumorsample}.bin100.seqz" \\
    "$myoutdir" \\
    "{params.normalsample}+{params.tumorsample}" \\
    {threads}

mv "$myoutdir/{params.normalsample}+{params.tumorsample}_alternative_solutions.txt" "{output.fit}"
rm "$myoutdir/{params.tumorsample}.bin100.seqz"
"""
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
shell: """
myoutdir="$(dirname {output.cnvs})/{params.tumorsample}"
if [ ! -d "$myoutdir" ]; then mkdir -p "$myoutdir"; fi

perl {params.config_script} \\
    "$myoutdir" \\
    {params.lengths} \\
    {params.chroms} \\
    {input.tumor} \\
    {input.normal} \\
    {params.pile} \\
    {params.fasta} \\
    {params.snps} \\
    {input.targets} \\
    {input.fit}

freec -conf "$myoutdir/freec_exome_config.txt"

cat "{params.sig_script}" | \\
    R --slave \\
    --args $myoutdir/{params.tumorsample}.bam_CNVs \\
    $myoutdir/{params.tumorsample}.bam_ratio.txt

mv $myoutdir/{params.tumorsample}.bam_CNVs.p.value.txt {output.cnvs}
cat "{params.plot_script}" | \\
    R --slave \\
    --args 2 \\
    $myoutdir/{params.tumorsample}.bam_ratio.txt \\
    $myoutdir/{params.tumorsample}.bam_BAF.txt
"""
SnakeMake From line 134 of rules/cnv.smk
 8
 9
10
11
shell: """
wget https://github.com/mikdio/SOBDetector/releases/download/v1.0.2/SOBDetector_v1.0.2.jar \\
    -O {output.SOBDetector_jar}
"""
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
shell: """
if [ ! -d "$(dirname {output.pass1_vcf})" ]; then 
    mkdir -p "$(dirname {output.pass1_vcf})"
fi

echo "Running SOBDetector..."
# Try/catch for running SOB Dectetor
# with an empty input VCF file 
java -jar {input.SOBDetector_jar} \\
    --input-type VCF \\
    --input-variants "{input.vcf}" \\
    --input-bam {input.bam} \\
    --output-variants {output.pass1_vcf} \\
    --only-passed false || {{
# Compare length of VCF header to 
# the total length of the file
header_length=$(grep '^#' "{input.vcf}" | wc -l)
file_length=$(cat "{input.vcf}" | wc -l)
if [ $header_length -eq $file_length ]; then
    # VCF file only contains header
    # File contains no variants, catch
    # problem so pipeline can continue
    cat "{input.vcf}" > {output.pass1_vcf}
else
    # SOB Dectector failed for another reason
    echo "SOB Detector Failed... exiting now!" 1>&2
    exit 1
fi
}}

bcftools query \\
    -f '%INFO/numF1R2Alt\\t%INFO/numF2R1Alt\\t%INFO/numF1R2Ref\\t%INFO/numF2R1Ref\\t%INFO/numF1R2Other\\t%INFO/numF2R1Other\\t%INFO/SOB\\n' \\
    {output.pass1_vcf} \\
    | awk '{{if ($1 != "."){{tum_alt=$1+$2; tum_depth=$1+$2+$3+$4+$5+$6; if (tum_depth==0){{tum_af=1}} else {{tum_af=tum_alt/tum_depth }}; print tum_alt,tum_depth,tum_af,$7}}}}' > {output.pass1_info} 
"""
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
shell: """
echo -e "#TUMOR.alt\\tTUMOR.depth\\tTUMOR.AF\\tSOB\\tFS\\tSOR\\tTLOD\\tReadPosRankSum" > {output.all_info_file}
cat {input.info_files} >> {output.all_info_file}

# Try/catch for running calculating
# mean and standard deviation with 
# with a set of empty input VCF files
all_length=$(tail -n+2 {output.all_info_file} | wc -l)
if [ $all_length -eq 0 ]; then 
    echo 'WARNING: All SOB Dectect pass1 samples contained no variants.' \\
    | tee {output.params_file}
else
    # Calculate mean and standard deviation
    grep -v '^#' {output.all_info_file} \\
    | awk '{{ total1 += $1; ss1 += $1^2; total2 += $2; ss2 += $2^2; total3 += $3; ss3 += $3^2; total4 += $4; ss4 += $4^2 }} END {{ print total1/NR,total2/NR,total3/NR,total4/NR; print sqrt(ss1/NR-(total1/NR)^2),sqrt(ss2/NR-(total2/NR)^2),sqrt(ss3/NR-(total3/NR)^3),sqrt(ss4/NR-(total4/NR)^2) }}' > {output.params_file}
fi
"""
SnakeMake From line 77 of rules/ffpe.smk
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
shell: """
if [ ! -d "$(dirname {output.pass2_vcf})" ]; then 
    mkdir -p "$(dirname {output.pass2_vcf})"
fi

echo "Running SOBDetector..."
# Try/catch for running SOB Dectetor
# with an empty input VCF file
bcf_annotate_option="-e 'INFO/pArtifact < 0.05' "
java -jar {input.SOBDetector_jar} \\
    --input-type VCF \\
    --input-variants "{input.vcf}" \\
    --input-bam "{input.bam}" \\
    --output-variants "{output.pass2_vcf}" \\
    --only-passed true \\
    --standardization-parameters "{input.params_file}" || {{
# Compare length of VCF header to 
# the total length of the file
header_length=$(grep '^#' "{input.vcf}" | wc -l)
file_length=$(cat "{input.vcf}" | wc -l)
if [ $header_length -eq $file_length ]; then
    # VCF file only contains header
    # File contains no variants, catch
    # problem so pipeline can continue
    cat "{input.vcf}" > {output.pass2_vcf}
else
    # SOB Dectector failed for another reason
    echo "SOB Detector Failed... exiting now!" 1>&2
    exit 1
fi
}}

echo "Making info table..."
bcftools query \\
    -f '%INFO/numF1R2Alt\\t%INFO/numF2R1Alt\\t%INFO/numF1R2Ref\\t%INFO/numF2R1Ref\\t%INFO/numF1R2Other\\t%INFO/numF2R1Other\\t%INFO/SOB\\n' \\
    "{output.pass2_vcf}" \\
    | awk '{{if ($1 != "."){{tum_alt=$1+$2; tum_depth=$1+$2+$3+$4+$5+$6; if (tum_depth==0){{tum_af=1}} else {{tum_af=tum_alt/tum_depth }}; print tum_alt,tum_depth,tum_af,$7}}}}' > "{output.pass2_info}"

echo "Filtering out artifacts..."
if [ "{wildcards.vc_outdir}" == "{config[output_params][MERGED_SOMATIC_OUTDIR]}" ]; then
    echo "Adding 'set' annotation back from merged variants..."
    bgzip --threads {threads} -c "{input.vcf}" > "{input.vcf}.gz"
    bcftools index -f -t "{input.vcf}.gz"
    bgzip --threads {threads} -c "{output.pass2_vcf}" > "{output.pass2_vcf}.gz"
    bcftools index -f -t "{output.pass2_vcf}.gz"
    bcftools annotate \\
        -a "{input.vcf}.gz" \\
        -c "INFO/set" \\
        "$bcf_annotate_option" \\
        -Oz \\
        -o {output.filtered_vcf} {output.pass2_vcf}.gz
else
    bcftools filter \\
        "$bcf_annotate_option" \\
        -Oz \\
        -o {output.filtered_vcf} {output.pass2_vcf}
    bcftools index -f -t {output.filtered_vcf}
fi
"""
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
shell: """
echo -e "#ID\\tDefaultParam\\tCohortParam\\tTotalVariants" > {output.count_table}
echo -e "#SAMPLE_ID\\tParam\\tCHROM\\tPOS\\tnumF1R2Alt\\tnumF2R1Alt\\tnumF1R2Ref\\tnumF2R1Ref\\tnumF1R2Other\\tnumF2R1Other\\tSOB\\tpArtifact\\tFS\\tSOR\\tTLOD\\tReadPosRankSum" > {output.full_metric_table}

P1FILES=({input.pass1_vcf})
P2FILES=({input.pass2_vcf})
for (( i=0; i<${{#P1FILES[@]}}; i++ )); do
    MYID=$(basename -s ".sobdetect.vcf" ${{P1FILES[$i]}})
    echo "Collecting metrics from $MYID..."

    # grep may fail if input files do not contain any variants 
    total_count=$(grep -v ^# ${{P1FILES[$i]}} | wc -l) || total_count=0
    count_1p=$(bcftools query -f '%INFO/pArtifact\n' ${{P1FILES[$i]}} | awk '{{if ($1 != "." && $1 < 0.05){{print}}}}' | wc -l)
    count_2p=$(bcftools query -f '%INFO/pArtifact\n' ${{P2FILES[$i]}} | awk '{{if ($1 != "." && $1 < 0.05){{print}}}}' | wc -l)

    echo -e "$MYID\\t$count_1p\\t$count_2p\\t$total_count" >> {output.count_table}

    bcftools query -f '%CHROM\\t%POS\\t%INFO/numF1R2Alt\\t%INFO/numF2R1Alt\\t%INFO/numF1R2Ref\\t%INFO/numF2R1Ref\\t%INFO/numF1R2Other\\t%INFO/numF2R1Other\\t%INFO/SOB\\t%INFO/pArtifact\n' ${{P1FILES[$i]}} | awk -v id=$MYID 'BEGIN{{OFS="\t"}}{{print id,"PASS_1",$0}}' >> {output.full_metric_table}
    bcftools query -f '%CHROM\\t%POS\\t%INFO/numF1R2Alt\\t%INFO/numF2R1Alt\\t%INFO/numF1R2Ref\\t%INFO/numF2R1Ref\\t%INFO/numF1R2Other\\t%INFO/numF2R1Other\\t%INFO/SOB\\t%INFO/pArtifact\n' ${{P2FILES[$i]}} | awk -v id=$MYID 'BEGIN{{OFS="\t"}}{{print id,"PASS_2",$0}}' >> {output.full_metric_table}
done
"""
229
230
231
232
233
234
235
236
237
238
239
240
shell: """
echo "Converting to MAF..."
bash {params.vcf2maf_script} \\
    --vcf {input.filtered_vcf} \\
    --maf {output.maf} \\
    --tid {params.tumorsample} \\
    --genome {params.genome} \\
    --threads {threads} \\
    --vepresourcebundlepath {params.bundle} \\
    --info "set"
echo "Done converting to MAF..."
"""
SnakeMake From line 229 of rules/ffpe.smk
252
253
254
255
256
shell: """    
echo "Combining MAFs..."
head -2 {input.mafs[0]} > {output.maf}
awk 'FNR>2 {{print}}' {input.mafs} >> {output.maf}
"""
SnakeMake From line 252 of rules/ffpe.smk
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
shell:
    """
    myoutdir="$(dirname {output.gzvcf})"
    if [ ! -d "$myoutdir" ]; then mkdir -p "$myoutdir"; fi

    gatk --java-options '-Xmx24g' HaplotypeCaller \\
        --reference {params.genome} \\
        --input {input.bam} \\
        --use-jdk-inflater \\
        --use-jdk-deflater \\
        --emit-ref-confidence GVCF \\
        --annotation-group StandardAnnotation \\
        --annotation-group AS_StandardAnnotation \\
        --dbsnp {params.snpsites} \\
        --output {output.gzvcf} \\
        --intervals {params.chrom} \\
        --max-alternate-alleles 3
    """
68
69
70
71
72
73
74
75
76
77
78
79
80
81
shell:
    """
    input_str="--variant $(echo "{input.gzvcf}" | sed -e 's/ / --variant /g')"

    gatk --java-options '-Xmx24g' CombineGVCFs \\
        --reference {params.genome} \\
        --annotation-group StandardAnnotation \\
        --annotation-group AS_StandardAnnotation \\
        $input_str \\
        --output {output.gzvcf} \\
        --intervals {wildcards.chroms} \\
        --use-jdk-inflater \\
        --use-jdk-deflater
    """
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
shell:
    """
    myoutdir="$(dirname {output.vcf})"
    if [ ! -d "$myoutdir" ]; then mkdir -p "$myoutdir"; fi

    gatk --java-options '-Xmx96g' GenotypeGVCFs \\
        --reference {params.genome} \\
        --use-jdk-inflater \\
        --use-jdk-deflater \\
        --annotation-group StandardAnnotation \\
        --annotation-group AS_StandardAnnotation \\
        --dbsnp {params.snpsites} \\
        --output {output.vcf} \\
        --variant {input.gzvcf} \\
        --intervals {params.chr}
    """
142
143
144
145
146
147
148
149
150
151
shell:
    """
    # Avoids ARG_MAX issue which limits max length of a command
    ls --color=never -d $(dirname "{output.clist}")/raw_variants.*.vcf.gz > "{output.clist}"

    gatk MergeVcfs \\
        -R {params.genome} \\
        --INPUT {output.clist} \\
        --OUTPUT {output.vcf}
    """
175
176
177
178
179
180
181
182
183
184
185
shell:
    """
    gatk SelectVariants \\
        -R {params.genome} \\
        --intervals {params.targets} \\
        --variant {input.vcf} \\
        --sample-name {params.Sname} \\
        --exclude-filtered \\
        --exclude-non-variants \\
        --output {output.vcf}
    """
25
26
27
28
29
30
31
32
33
shell: """
if [ ! -d "$(dirname {output.txt})" ]; then 
    mkdir -p "$(dirname {output.txt})"
fi

python {params.get_flowcell_lanes} \\
    {input.r1} \\
    {wildcards.samples} > {output.txt}
"""
SnakeMake From line 25 of rules/qc.smk
64
65
66
67
68
69
70
71
72
shell: """
fastq_screen --conf {params.fastq_screen_config} \\
    --outdir {params.outdir} \\
    --threads {threads} \\
    --subset 1000000 \\
    --aligner bowtie2 \\
    --force \\
    {input.fq1} {input.fq2}
"""
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.localdisk}")
trap 'rm -rf "${{tmp}}"' EXIT

# Copy kraken2 db to local node storage to reduce filesystem strain
cp -rv {params.bacdb} ${{tmp}}
kdb_base=$(basename {params.bacdb})
kraken2 --db ${{tmp}}/${{kdb_base}} \\
    --threads {threads} --report {output.taxa} \\
    --output {output.out} \\
    --gzip-compressed \\
    --paired {input.fq1} {input.fq2}
# Generate Krona Report
cut -f2,3 {output.out} | \\
    ktImportTaxonomy - -o {output.html}
"""
146
147
148
149
150
151
shell: """
fastqc -t {threads} \\
    -f bam \\
    -o {params.outdir} \\
    {input.bam} 
"""
176
177
178
179
180
181
182
shell: """
python {params.script_path_reformat_bed} \\
    --input_bed {input.targets} \\
    --output_bed {output.bed}.temp
python3 {params.script_path_correct_target_bed} {output.bed}.temp {output.bed}
rm -f {output.bed}.temp
"""
SnakeMake From line 176 of rules/qc.smk
209
210
211
212
213
214
215
216
217
218
219
220
221
shell: """
unset DISPLAY
qualimap bamqc -bam {input.bam} \\
    --java-mem-size=48G \\
    -c -ip \\
    -gff {input.bed} \\
    -outdir {params.outdir} \\
    -outformat HTML \\
    -nt {threads} \\
    --skip-duplicated \\
    -nw 500 \\
    -p NON-STRAND-SPECIFIC
"""
244
245
246
shell: """
samtools flagstat {input.bam} > {output.txt}
"""
271
272
273
shell: """
vcftools --gzvcf {input.vcf} --het --out {params.prefix}
"""
298
299
300
301
302
303
304
shell: """
java -Xmx24g -jar ${{PICARDJARPATH}}/picard.jar \\
    CollectVariantCallingMetrics \\
    INPUT={input.vcf} \\
    OUTPUT={params.prefix} \\
    DBSNP={params.dbsnp} Validation_Stringency=SILENT
"""
329
330
331
shell: """
bcftools stats {input.vcf} > {output.txt}
"""
360
361
362
363
364
365
366
shell: """
gatk --java-options '-Xmx12g -XX:ParallelGCThreads={threads}' VariantEval \\
    -R {params.genome} \\
    -O {output.grp} \\
    --dbsnp {params.dbsnp} \\
    --eval {input.vcf} 
"""
393
394
395
396
397
398
399
400
shell: """
java -Xmx12g -jar $SNPEFF_JAR \\
    -v -canon -c {params.config} \\
    -csvstats {output.csv} \\
    -stats {output.html} \\
    {params.genome} \\
    {input.vcf} > {output.vcf}
"""
SnakeMake From line 393 of rules/qc.smk
421
422
423
424
425
426
427
428
shell: """ 
echo "Extracting sites to estimate ancestry"
somalier extract \\
    -d "$(dirname {output.somalierOut})" \\
    --sites {params.sites_vcf} \\
    -f {params.genomeFasta} \\
    {input.bam}
"""
SnakeMake From line 421 of rules/qc.smk
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
shell: """ 
echo "Estimating relatedness"
somalier relate \\
    -o "$(dirname {output.relatedness})/relatedness" \\
    {input.somalier}

echo "Estimating ancestry"
somalier ancestry \\
    -o "$(dirname {output.relatedness})/ancestry" \\
    --labels {params.ancestry_db}/ancestry-labels-1kg.tsv \\
    {params.ancestry_db}/*.somalier ++ \\
    {input.somalier}

Rscript {params.script_path_gender} \\
    {output.relatednessSamples} \\
    {output.finalFileGender}    

Rscript {params.script_path_samples} \\
    {output.relatedness} \\
    {output.finalFilePairs}

Rscript {params.script_path_pca} \\
    {output.ancestry} \\
    {output.finalFilePairs} \\
    {output.ancestoryPlot} \\
    {output.pairAncestoryHist}
"""
SnakeMake From line 460 of rules/qc.smk
521
522
523
524
525
526
527
528
shell: """
multiqc --ignore '*/.singularity/*' \\
    --ignore '*/*/*/*/*/*/*/*/pyflow.data/*' \\
    --ignore 'slurmfiles/' \\
    -f --interactive \\
    -n {output.report} \\
    {params.workdir}
"""
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
shell: """
if [ ! -d "$(dirname {output.split_bam})" ]; then
  mkdir -p "$(dirname {output.split_bam})"
fi

samtools view \\
    -b \\
    -o {output.split_bam} \\
    -@ {threads} \\
    {input.bam} {wildcards.chroms}

samtools index \\
    -@ {threads} \\
    {output.split_bam} {output.split_bam_idx}

cp {output.split_bam_idx} {output.split_bam}.bai
"""
50
51
52
53
54
55
56
shell: """
input_str="--input $(echo "{input.read_orientation_file}" | sed -e 's/ / --input /g')"

gatk LearnReadOrientationModel \\
    --output {output.model} \\
    $input_str
"""
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

statfiles="--stats $(echo "{input.statsfiles}" | sed -e 's/ / --stats /g')"

gatk MergeMutectStats \\
    $statfiles \\
    -O {output.final}.stats

gatk FilterMutectCalls \\
    -R {params.genome} \\
    -V {input.vcf} \\
    --ob-priors {input.model} \\
    --contamination-table {input.summary} \\
    -O {output.marked_vcf} \\
    --stats {output.final}.stats

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {output.marked_vcf} \\
    --exclude-filtered \\
    --output {output.final}

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
135
136
137
138
139
140
141
142
shell: """
input_str="-I $(echo "{input.vcf}" | sed -e 's/ / -I /g')"

gatk --java-options "-Xmx30g" MergeVcfs \\
    -O "{output.vcf}" \\
    -D {params.genomedict} \\
    $input_str
"""
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

if [ ! -d "$(dirname {output.mergedvcf})" ]; then
  mkdir -p "$(dirname {output.mergedvcf})"
fi

input_str="--variant $(echo "{input.vcf}" | sed -e 's/ / --variant /g')"

java -Xmx60g -Djava.io.tmpdir=${{tmp}} -jar $GATK_JAR -T CombineVariants \\
    -R {params.genome} \\
    -nt {threads} \\
    --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED \\
    --genotypemergeoption PRIORITIZE \\
    --rod_priority_list {params.rodprioritylist} \\
    --minimumN 1 \\
    -o {output.mergedvcf} \\
    {params.variantsargs}
"""
202
203
204
205
206
207
208
209
210
211
212
213
shell: """
echo "Converting to MAF..."
bash {params.vcf2maf_script} \\
    --vcf {input.filtered_vcf} \\
    --maf {output.maf} \\
    --tid {params.tumorsample} \\
    --genome {params.genome} \\
    --threads {threads} \\
    --vepresourcebundlepath {params.bundle} \\
    --info "set"
echo "Done converting to MAF..."
"""
223
224
225
226
227
shell: """
echo "Combining MAFs..."
head -2 {input.mafs[0]} > {output.maf}
awk 'FNR>2 {{print}}' {input.mafs} >> {output.maf}
"""
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
shell: """
if [ ! -d "$(dirname {output.vcf})" ]; 
    then mkdir -p "$(dirname {output.vcf})";
fi
gatk Mutect2 \\
    -R {params.genome} \\
    -I {input.tumor} \\
    -I {input.normal} \\
    -normal {params.normalsample} \\
    --panel-of-normals {params.pon} \\
    --germline-resource {params.germsource} \\
    -L {params.chrom} \\
    -O {output.vcf} \\
    --f1r2-tar-gz {output.read_orientation_file} \\
    --independent-mates
"""
61
62
63
64
65
66
67
68
69
70
71
72
73
74
shell: """
# Run GetPileupSummaries in bg concurrently for a tumor/normal pair 
gatk --java-options '-Xmx48g' GetPileupSummaries \\
    -I {input.tumor} \\
    -V {params.germsource} \\
    -L {input.intervals} \\
    -O {output.tumor_summary} & \\
gatk --java-options '-Xmx48g' GetPileupSummaries \\
    -I {input.normal} \\
    -V {params.germsource} \\
    -L {input.intervals} \\
    -O {output.normal_summary} & \\
wait
"""
 94
 95
 96
 97
 98
 99
100
101
102
shell: """
gatk CalculateContamination \\
    -I {input.tumor} \\
    --matched-normal {input.normal} \\
    -O {output.tumor_summary}
gatk CalculateContamination \\
    -I {input.normal} \\
    -O {output.normal_summary}
"""
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

workdir={params.basedir}
myoutdir="$(dirname {output.vcf})/{wildcards.samples}/{wildcards.chroms}"
if [ -d "$myoutdir" ]; then rm -r "$myoutdir"; fi
mkdir -p "$myoutdir"

configureStrelkaSomaticWorkflow.py \\
    --ref={params.genome} \\
    --tumor={input.tumor} \\
    --normal={input.normal} \\
    --runDir="$myoutdir" \\
    --exome
cd "$myoutdir"
./runWorkflow.py -m local -j {threads}

java -Xmx12g -Djava.io.tmpdir=${{tmp}} -XX:ParallelGCThreads={threads} \\
    -jar $GATK_JAR -T CombineVariants \\
    -R {params.genome} \\
    --variant results/variants/somatic.snvs.vcf.gz \\
    --variant results/variants/somatic.indels.vcf.gz \\
    --assumeIdenticalSamples \\
    --filteredrecordsmergetype KEEP_UNCONDITIONAL \\
    -o "$(basename {output.vcf})"

cd $workdir
mv "$myoutdir/$(basename {output.vcf})" "{output.vcf}"
"""
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {input.vcf} \\
    --discordance {params.pon} \\
    --exclude-filtered \\
    --output {output.filtered}

echo -e "TUMOR\t{params.tumorsample}\nNORMAL\t{params.normalsample}" > "{output.samplesfile}"

echo "Reheading VCFs with sample names..."
bcftools reheader \\
    -o "{output.final}" \\
    -s "{output.samplesfile}" "{output.filtered}"

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort \\
    -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

if [ ! -d "$(dirname {output.vcf})" ]; then mkdir -p "$(dirname {output.vcf})"; fi

java -Xmx8g -Djava.io.tmpdir=${{tmp}} -jar ${{MUTECT_JAR}} \\
    --analysis_type MuTect \\
    --reference_sequence {params.genome} \\
    --normal_panel {params.pon} \\
    --vcf {output.vcf} \\
    --cosmic {params.cosmic} \\
    --dbsnp {params.dbsnp} \\
    --disable_auto_index_creation_and_locking_when_reading_rods \\
    --input_file:normal {input.normal} \\
    --input_file:tumor {input.tumor} \\
    --out {output.stats} \\
    -rf BadCigar
"""
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {input.vcf} \\
    --exclude-filtered \\
    --output {output.final}

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
shell: """
if [ ! -d "$(dirname {output.vcf})" ]; then mkdir -p "$(dirname {output.vcf})"; fi
VarDict \\
    -G {params.genome} \\
    -f 0.05 \\
    -N \"{params.tumorsample}|{params.normalsample}\" \\
    --nosv \\
    -b \"{input.tumor}|{input.normal}\" \\
    -t \\
    -Q 20 \\
    -c 1 \\
    -S 2 \\
    -E 3 {params.targets} \\
    | testsomatic.R \\
    | var2vcf_paired.pl \\
        -S \\
        -Q 20 \\
        -d 10 \\
        -M \\
        -N \"{params.tumorsample}|{params.normalsample}\" \\
        -f 0.05 > {output.vcf}
"""
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

bcftools filter \\
    --exclude \'STATUS=\"Germline\" | STATUS=\"LikelyLOH\" | STATUS=\"AFDiff\"\' \\
    {input.vcf} > {output.filtered}

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {output.filtered} \\
    --discordance {params.pon} \\
    --exclude-filtered \\
    --output {output.final}

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

if [ ! -d "$(dirname {output.vcf})" ]; then mkdir -p "$(dirname {output.vcf})"; fi

tumor_purity=$( echo "1-$(printf '%.6f' $(tail -n -1 {input.tumor_summary} | cut -f2 ))" | bc -l)
normal_purity=$( echo "1-$(printf '%.6f' $(tail -n -1 {input.normal_summary} | cut -f2 ))" | bc -l)
varscan_opts="--strand-filter 1 --min-var-freq 0.01 --min-avg-qual 30 --somatic-p-value 0.05 --output-vcf 1 --normal-purity $normal_purity --tumor-purity $tumor_purity"
dual_pileup="samtools mpileup -d 10000 -q 15 -Q 15 -f {params.genome} {input.normal} {input.tumor}"
varscan_cmd="varscan somatic <($dual_pileup) {output.vcf} $varscan_opts --mpileup 1"    
eval "$varscan_cmd"

java -Xmx12g -Djava.io.tmpdir=${{tmp}} -XX:ParallelGCThreads={threads} \\
    -jar $GATK_JAR -T CombineVariants \\
    -R {params.genome} \\
    --variant {output.vcf}.snp \\
    --variant {output.vcf}.indel \\
    --assumeIdenticalSamples \\
    --filteredrecordsmergetype KEEP_UNCONDITIONAL \\
    -o {output.vcf}     
"""
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

varscan filter \\
    {input.vcf} \\
    {params.filter_settings} > {output.filtered1}

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {output.filtered1} \\
    --discordance {params.pon} \\
    --exclude-filtered \\
    --output {output.filtered}

samplesFile="{output.samplesfile}"
echo -e "TUMOR\t{params.tumorsample}\nNORMAL\t{params.normalsample}" > "{output.samplesfile}"

bcftools reheader \\
    -o "{output.final}" \\
    -s "{output.samplesfile}" \\
    "{output.filtered}"

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
shell: """
if [ ! -d "$(dirname {output.vcf})" ]; then
    mkdir -p "$(dirname {output.vcf})"
fi

gatk Mutect2 \\
    -R {params.genome} \\
    -I {input.tumor} \\
    --panel-of-normals {params.pon} \\
    --germline-resource {params.germsource} \\
    -L {wildcards.chroms} \\
    -O {output.vcf} \\
    --f1r2-tar-gz {output.read_orientation_file} \\
    --independent-mates
"""
56
57
58
59
60
61
62
63
64
65
66
67
68
69
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

gatk --java-options "-Xmx10g -Djava.io.tmpdir=${{tmp}}" GetPileupSummaries \\
    -R {params.genome} \\
    -I {input.tumor} \\
    -V {params.germsource} \\
    -L {input.intervals} \\
    -O {output.pileup}
"""
87
88
89
90
91
shell: """
gatk CalculateContamination \\
    -I {input.pileup} \\
    -O {output.tumor_summary}
"""
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

if [ ! -d "$(dirname {output.vcf})" ]; then 
    mkdir -p "$(dirname {output.vcf})"
fi

java -Xmx8g -Djava.io.tmpdir=${{tmp}} -jar ${{MUTECT_JAR}} \\
    --analysis_type MuTect \\
    --reference_sequence {params.genome} \\
    --normal_panel {params.pon} \\
    --vcf {output.vcf} \\
    --cosmic {params.cosmic} \\
    --dbsnp {params.dbsnp} \\
    -L {wildcards.chroms} \\
    --disable_auto_index_creation_and_locking_when_reading_rods \\
    --input_file:tumor {input.tumor} \\
    --out {output.stats} \\
    -rf BadCigar
"""
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {input.vcf} \\
    --exclude-filtered \\
    --output {output.final}

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
shell: """
if [ ! -d "$(dirname {output.vcf})" ]; then 
    mkdir -p "$(dirname {output.vcf})"
fi

VarDict \\
    -G {params.genome} \\
    -f 0.05 \\
    -x 500 \\
    --nosv \\
    -b {input.tumor} \\
    -t \\
    -Q 20 \\
    -c 1 \\
    -S 2 \\
    -E 3 {params.targets} \\
    | teststrandbias.R \\
    | var2vcf_valid.pl \\
        -N {wildcards.samples} \\
        -Q 20 \\
        -d 10 \\
        -v 6 \\
        -S \\
        -E \\
        -f 0.05 > {output.vcf}
"""
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {input.vcf} \\
    --discordance {params.pon} \\
    --exclude-filtered \\
    --output {output.final}

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
285
286
287
288
289
290
291
292
293
294
295
shell: """
if [ ! -d "$(dirname {output.vcf})" ]; then
    mkdir -p "$(dirname {output.vcf})"
fi

varscan_opts="--strand-filter 0 --min-var-freq 0.01 --output-vcf 1 --variants 1"
pileup_cmd="samtools mpileup -d 100000 -q 15 -Q 15 -f {params.genome} {input.tumor}"
varscan_cmd="varscan mpileup2cns <($pileup_cmd) $varscan_opts"
eval "$varscan_cmd > {output.vcf}.gz"
eval "bcftools view -U {output.vcf}.gz > {output.vcf}"
"""
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

varscan filter \\
    {input.vcf} \\
    {params.filter_settings} > {output.filtered1}

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {output.filtered1} \\
    --discordance {params.pon} \\
    --exclude-filtered \\
    --output {output.filtered}

samplesFile="{output.samplesfile}"
echo -e "TUMOR\t{params.tumorsample}\n" > "{output.samplesfile}"

bcftools reheader \\
    -o "{output.final}" \\
    -s "{output.samplesfile}" \\
    "{output.filtered}"

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
tmp=$(mktemp -d -p "{params.tmpdir}")
trap 'rm -rf "${{tmp}}"' EXIT

mkdir -p fastqs
gatk SamToFastq \\
    --INPUT {input.bam} \\
    --FASTQ {output.r1} \\
    --SECOND_END_FASTQ {output.r2} \\
    --UNPAIRED_FASTQ {output.orphans} \\
    --TMP_DIR ${{tmp}} \\
    -R {params.genome}
"""
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
shell: """
myoutdir="$(dirname {output.one})"
if [ ! -d "$myoutdir" ]; then mkdir -p "$myoutdir"; fi
trimmomatic PE \\
    -threads {threads} \\
    -phred33 \\
    {input.r1} {input.r2} \\
    {output.one} {output.two} \\
    {output.three} {output.four} \\
    ILLUMINACLIP:{params.adapterfile}:3:30:10 \\
    LEADING:10 \\
    TRAILING:10 \\
    SLIDINGWINDOW:4:20 \\
    MINLEN:20      
"""
118
119
120
121
122
123
124
125
126
127
128
shell: """
myoutdir="$(dirname {output})"
if [ ! -d "$myoutdir" ]; then mkdir -p "$myoutdir"; fi
bwa mem -M \\
    -R \'@RG\\tID:{params.sample}\\tSM:{params.sample}\\tPL:illumina\\tLB:{params.sample}\\tPU:{params.sample}\\tCN:hgsc\\tDS:wes\' \\
    -t {threads} \\
    {params.genome} \\
    {input} | \\
samblaster -M | \\
samtools sort -@12 -m 4G - -o {output}
"""
153
154
155
shell: """
samtools index -@ 2 {input.bam} {output.bai}
"""
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
shell: """
gatk --java-options '-Xmx48g' BaseRecalibrator \\
    --input {input.bam} \\
    --reference {params.genome} \\
    {params.knowns} \\
    --output {output.re} \\
    --intervals {params.intervals}

gatk --java-options '-Xmx48g' ApplyBQSR \\
    --reference {params.genome} \\
    --input {input.bam} \\
    --bqsr-recal-file {output.re} \\
    --output {output.bam} \\
    --use-jdk-inflater \\
    --use-jdk-deflater
"""
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
shell: """
sample={wildcards.samples}
ID=$sample
PL="ILLUMINA"  # exposed as a config param
LB="na"        # exposed as a config param 

# Check if there is no header or any of the info
HEADER=`samtools view -H {input.bam} | grep ^@RG`
if [[ "$HEADER" != "" ]]; then
    t=(${{HEADER//\t/ }})
    echo ${{t[1]}}
    ID=`printf '%s\n' "${{t[@]}}" | grep -P '^ID' | cut -d":" -f2` #(${{t[1]//:/ }})
    PL=`printf '%s\n' "${{t[@]}}" | grep -P '^PL' | cut -d":" -f2` #(${{t[3]//:/ }})
    LB=`printf '%s\n' "${{t[@]}}" | grep -P '^LB' | cut -d":" -f2` #(${{t[2]//:/ }})
    if [[ "$ID" == "$sample" ]]; then
        echo "The header of the BAM file is correct"
    else
        ID=$sample
    fi
fi

gatk AddOrReplaceReadGroups \\
    --INPUT {input.bam} \\
    --OUTPUT {output.bam} \\
    --RGID ${{ID}} \\
    --RGLB ${{LB}} \\
    --RGPL ${{PL}} \\
    --RGSM ${{ID}} \\
    --RGPU na

samtools index -@ 2 {output.bam} {output.bai}
cp {output.bai} {output.bai2}
"""
ShowHide 47 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://mtandon09.github.io/CCBR_GATK4_Exome_Seq_Pipeline/
Name: ccbr_gatk4_exome_seq_pipeline
Version: v1.0.0
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...