An easy-to-use, flexible variant calling pipeline for use on the Biowulf cluster at NIH

public public 1yr ago Version: v2.0 0 bookmarks

XAVIER - eXome Analysis and Variant explorER . This is the home of the pipeline, XAVIER. Its long-term goals: to accurately call germline and somatic variants, to infer CNVs, and to boldly annotate variants like no pipeline before!

Overview

Welcome to XAVIER! Before getting started, we highly recommend reading through xavier's documentation .

The xavier pipeline is composed several inter-related sub commands to setup and run the pipeline across different systems. Each of the available sub commands perform different functions:

XAVIER is a comprehensive whole exome-sequencing pipeline following the Broad's set of best practices. It relies on technologies like Singularity1 to maintain the highest-level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by Snakemake2 , a flexible and scalable workflow management system, to submit jobs to a cluster or cloud provider.

The pipeline is compatible with data generated from Illumina short-read sequencing technologies. As input, it accepts a set of FastQ or BAM files and can be run locally on a compute instance, on-premise using a cluster, or on the cloud (feature coming soon!). A user can define the method or mode of execution. The pipeline can submit jobs to a cluster using a job scheduler like SLURM, or run on AWS using Tibanna (feature coming soon!). A hybrid approach ensures the pipeline is accessible to all users.

Before getting started, we highly recommend reading through the usage section of each available sub command.

For more information about issues or trouble-shooting a problem, please checkout our FAQ prior to opening an issue on Github .

Dependencies

Requires: singularity>=3.5 snakemake==6.X

Snakemake and singularity must be installed on the target system. Snakemake orchestrates the execution of each step in the pipeline. To guarantee the highest level of reproducibility, each step relies on versioned images from DockerHub . Snakemake uses singaularity to pull these images onto the local filesystem prior to job execution, and as so, snakemake and singularity are the only two dependencies.

Installation

Please clone this repository to your local filesystem using the following command:

# Clone Repository from Github
git clone https://github.com/CCBR/XAVIER.git
# Change your working directory
cd XAVIER/

Contribute

This site is a living document, created for and by members like you. XAVIER is maintained by the members of CCBR and is improved by continous feedback! We encourage you to contribute new content and make improvements to existing content via pull request to our repository .

References

1. Kurtzer GM, Sochat V, Bauer MW (2017). Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459.
2. Koster, J. and S. Rahmann (2018). "Snakemake-a scalable bioinformatics workflow engine." Bioinformatics 34(20): 3600.

Code Snippets

27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
shell: """
myoutdir="$(dirname {output.cnvs})/{params.tumorsample}"
if [ ! -d "$myoutdir" ]; then mkdir -p "$myoutdir"; fi

perl "{params.config_script}" \\
    "$myoutdir" \\
    {params.lengths} \\
    {params.chroms} \\
    {input.tumor} \\
    {input.normal} \\
    {params.pile} \\
    {params.fasta} \\
    {params.snps} \\
    {input.targets}

freec -conf "$myoutdir/freec_exome_config.txt"

cat "{params.sig_script}" | \\
    R --slave \\
    --args $myoutdir/{params.tumorsample}.bam_CNVs \\
    $myoutdir/{params.tumorsample}.bam_ratio.txt

mv $myoutdir/{params.tumorsample}.bam_CNVs.p.value.txt {output.cnvs}
cat "{params.plot_script}" | \\
    R --slave \\
    --args 2 \\
    $myoutdir/{params.tumorsample}.bam_ratio.txt \\
    $myoutdir/{params.tumorsample}.bam_BAF.txt
"""
SnakeMake From line 27 of rules/cnv.smk
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
shell: """
myoutdir="$(dirname {output.fit})/{params.tumorsample}"
if [ ! -d "$myoutdir" ]; then mkdir -p "$myoutdir"; fi

gzip -c "$(dirname {input.freeccnvs})/{params.tumorsample}/{params.normalsample}.bam_minipileup.pileup" \\
    > "$myoutdir/{params.normalsample}.recal.bam_minipileup.pileup.gz"
gzip -c "$(dirname {input.freeccnvs})/{params.tumorsample}/{params.tumorsample}.bam_minipileup.pileup" \\
    > "$myoutdir/{params.tumorsample}.recal.bam_minipileup.pileup.gz"

sequenza-utils bam2seqz \\
    -p \\
    -gc {params.gc} \\
    -n "$myoutdir/{params.normalsample}.recal.bam_minipileup.pileup.gz" \\
    -t "$myoutdir/{params.tumorsample}.recal.bam_minipileup.pileup.gz" \\
    | gzip > "$myoutdir/{params.tumorsample}.seqz.gz"

sequenza-utils seqz_binning \\
    -w 100 \\
    -s "$myoutdir/{params.tumorsample}.seqz.gz" \\
    | tee "$myoutdir/{params.tumorsample}.bin100.seqz" \\
    | gzip > "$myoutdir/{params.tumorsample}.bin100.seqz.gz"

Rscript "{params.run_script}" \\
    "$myoutdir/{params.tumorsample}.bin100.seqz" \\
    "$myoutdir" \\
    "{params.normalsample}+{params.tumorsample}" \\
    {threads}

mv "$myoutdir/{params.normalsample}+{params.tumorsample}_alternative_solutions.txt" "{output.fit}"
rm "$myoutdir/{params.tumorsample}.bin100.seqz"
"""
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
shell: """
myoutdir="$(dirname {output.cnvs})/{params.tumorsample}"
if [ ! -d "$myoutdir" ]; then mkdir -p "$myoutdir"; fi

perl {params.config_script} \\
    "$myoutdir" \\
    {params.lengths} \\
    {params.chroms} \\
    {input.tumor} \\
    {input.normal} \\
    {params.pile} \\
    {params.fasta} \\
    {params.snps} \\
    {input.targets} \\
    {input.fit}

freec -conf "$myoutdir/freec_exome_config.txt"

cat "{params.sig_script}" | \\
    R --slave \\
    --args $myoutdir/{params.tumorsample}.bam_CNVs \\
    $myoutdir/{params.tumorsample}.bam_ratio.txt

mv $myoutdir/{params.tumorsample}.bam_CNVs.p.value.txt {output.cnvs}
cat "{params.plot_script}" | \\
    R --slave \\
    --args 2 \\
    $myoutdir/{params.tumorsample}.bam_ratio.txt \\
    $myoutdir/{params.tumorsample}.bam_BAF.txt
"""
SnakeMake From line 134 of rules/cnv.smk
 8
 9
10
11
shell: """
wget https://github.com/mikdio/SOBDetector/releases/download/v1.0.2/SOBDetector_v1.0.2.jar \\
    -O {output.SOBDetector_jar}
"""
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
shell: """
if [ ! -d "$(dirname {output.pass1_vcf})" ]; then 
    mkdir -p "$(dirname {output.pass1_vcf})"
fi

echo "Running SOBDetector..."
# Try/catch for running SOB Dectetor
# with an empty input VCF file 
java -jar {input.SOBDetector_jar} \\
    --input-type VCF \\
    --input-variants "{input.vcf}" \\
    --input-bam {input.bam} \\
    --output-variants {output.pass1_vcf} \\
    --only-passed false || {{
# Compare length of VCF header to 
# the total length of the file
header_length=$(grep '^#' "{input.vcf}" | wc -l)
file_length=$(cat "{input.vcf}" | wc -l)
if [ $header_length -eq $file_length ]; then
    # VCF file only contains header
    # File contains no variants, catch
    # problem so pipeline can continue
    cat "{input.vcf}" > {output.pass1_vcf}
else
    # SOB Dectector failed for another reason
    echo "SOB Detector Failed... exiting now!" 1>&2
    exit 1
fi
}}

bcftools query \\
    -f '%INFO/numF1R2Alt\\t%INFO/numF2R1Alt\\t%INFO/numF1R2Ref\\t%INFO/numF2R1Ref\\t%INFO/numF1R2Other\\t%INFO/numF2R1Other\\t%INFO/SOB\\n' \\
    {output.pass1_vcf} \\
    | awk '{{if ($1 != "."){{tum_alt=$1+$2; tum_depth=$1+$2+$3+$4+$5+$6; if (tum_depth==0){{tum_af=1}} else {{tum_af=tum_alt/tum_depth }}; print tum_alt,tum_depth,tum_af,$7}}}}' > {output.pass1_info} 
"""
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
shell: """
echo -e "#TUMOR.alt\\tTUMOR.depth\\tTUMOR.AF\\tSOB\\tFS\\tSOR\\tTLOD\\tReadPosRankSum" > {output.all_info_file}
cat {input.info_files} >> {output.all_info_file}

# Try/catch for running calculating
# mean and standard deviation with 
# with a set of empty input VCF files
all_length=$(tail -n+2 {output.all_info_file} | wc -l)
if [ $all_length -eq 0 ]; then 
    echo 'WARNING: All SOB Dectect pass1 samples contained no variants.' \\
    | tee {output.params_file}
else
    # Calculate mean and standard deviation
    grep -v '^#' {output.all_info_file} \\
    | awk '{{ total1 += $1; ss1 += $1^2; total2 += $2; ss2 += $2^2; total3 += $3; ss3 += $3^2; total4 += $4; ss4 += $4^2 }} END {{ print total1/NR,total2/NR,total3/NR,total4/NR; print sqrt(ss1/NR-(total1/NR)^2),sqrt(ss2/NR-(total2/NR)^2),sqrt(ss3/NR-(total3/NR)^3),sqrt(ss4/NR-(total4/NR)^2) }}' > {output.params_file}
fi
"""
SnakeMake From line 77 of rules/ffpe.smk
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
shell: """
if [ ! -d "$(dirname {output.pass2_vcf})" ]; then 
    mkdir -p "$(dirname {output.pass2_vcf})"
fi

echo "Running SOBDetector..."
# Try/catch for running SOB Dectetor
# with an empty input VCF file
bcf_annotate_option="-e 'INFO/pArtifact < 0.05' "
java -jar {input.SOBDetector_jar} \\
    --input-type VCF \\
    --input-variants "{input.vcf}" \\
    --input-bam "{input.bam}" \\
    --output-variants "{output.pass2_vcf}" \\
    --only-passed true \\
    --standardization-parameters "{input.params_file}" || {{
# Compare length of VCF header to 
# the total length of the file
header_length=$(grep '^#' "{input.vcf}" | wc -l)
file_length=$(cat "{input.vcf}" | wc -l)
if [ $header_length -eq $file_length ]; then
    # VCF file only contains header
    # File contains no variants, catch
    # problem so pipeline can continue
    cat "{input.vcf}" > {output.pass2_vcf}
else
    # SOB Dectector failed for another reason
    echo "SOB Detector Failed... exiting now!" 1>&2
    exit 1
fi
}}

echo "Making info table..."
bcftools query \\
    -f '%INFO/numF1R2Alt\\t%INFO/numF2R1Alt\\t%INFO/numF1R2Ref\\t%INFO/numF2R1Ref\\t%INFO/numF1R2Other\\t%INFO/numF2R1Other\\t%INFO/SOB\\n' \\
    "{output.pass2_vcf}" \\
    | awk '{{if ($1 != "."){{tum_alt=$1+$2; tum_depth=$1+$2+$3+$4+$5+$6; if (tum_depth==0){{tum_af=1}} else {{tum_af=tum_alt/tum_depth }}; print tum_alt,tum_depth,tum_af,$7}}}}' > "{output.pass2_info}"

echo "Filtering out artifacts..."
if [ "{wildcards.vc_outdir}" == "{config[output_params][MERGED_SOMATIC_OUTDIR]}" ]; then
    echo "Adding 'set' annotation back from merged variants..."
    bgzip --threads {threads} -c "{input.vcf}" > "{input.vcf}.gz"
    bcftools index -f -t "{input.vcf}.gz"
    bgzip --threads {threads} -c "{output.pass2_vcf}" > "{output.pass2_vcf}.gz"
    bcftools index -f -t "{output.pass2_vcf}.gz"
    bcftools annotate \\
        -a "{input.vcf}.gz" \\
        -c "INFO/set" \\
        "$bcf_annotate_option" \\
        -Oz \\
        -o {output.filtered_vcf} {output.pass2_vcf}.gz
else
    bcftools filter \\
        "$bcf_annotate_option" \\
        -Oz \\
        -o {output.filtered_vcf} {output.pass2_vcf}
    bcftools index -f -t {output.filtered_vcf}
fi
"""
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
shell: """
echo -e "#ID\\tDefaultParam\\tCohortParam\\tTotalVariants" > {output.count_table}
echo -e "#SAMPLE_ID\\tParam\\tCHROM\\tPOS\\tnumF1R2Alt\\tnumF2R1Alt\\tnumF1R2Ref\\tnumF2R1Ref\\tnumF1R2Other\\tnumF2R1Other\\tSOB\\tpArtifact\\tFS\\tSOR\\tTLOD\\tReadPosRankSum" > {output.full_metric_table}

P1FILES=({input.pass1_vcf})
P2FILES=({input.pass2_vcf})
for (( i=0; i<${{#P1FILES[@]}}; i++ )); do
    MYID=$(basename -s ".sobdetect.vcf" ${{P1FILES[$i]}})
    echo "Collecting metrics from $MYID..."

    # grep may fail if input files do not contain any variants 
    total_count=$(grep -v ^# ${{P1FILES[$i]}} | wc -l) || total_count=0
    count_1p=$(bcftools query -f '%INFO/pArtifact\n' ${{P1FILES[$i]}} | awk '{{if ($1 != "." && $1 < 0.05){{print}}}}' | wc -l)
    count_2p=$(bcftools query -f '%INFO/pArtifact\n' ${{P2FILES[$i]}} | awk '{{if ($1 != "." && $1 < 0.05){{print}}}}' | wc -l)

    echo -e "$MYID\\t$count_1p\\t$count_2p\\t$total_count" >> {output.count_table}

    bcftools query -f '%CHROM\\t%POS\\t%INFO/numF1R2Alt\\t%INFO/numF2R1Alt\\t%INFO/numF1R2Ref\\t%INFO/numF2R1Ref\\t%INFO/numF1R2Other\\t%INFO/numF2R1Other\\t%INFO/SOB\\t%INFO/pArtifact\n' ${{P1FILES[$i]}} | awk -v id=$MYID 'BEGIN{{OFS="\t"}}{{print id,"PASS_1",$0}}' >> {output.full_metric_table}
    bcftools query -f '%CHROM\\t%POS\\t%INFO/numF1R2Alt\\t%INFO/numF2R1Alt\\t%INFO/numF1R2Ref\\t%INFO/numF2R1Ref\\t%INFO/numF1R2Other\\t%INFO/numF2R1Other\\t%INFO/SOB\\t%INFO/pArtifact\n' ${{P2FILES[$i]}} | awk -v id=$MYID 'BEGIN{{OFS="\t"}}{{print id,"PASS_2",$0}}' >> {output.full_metric_table}
done
"""
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
shell: """
filetype=$(file -b --mime-type {input.filtered_vcf})
if [ $filetype == "application/gzip" ] ; then
    zcat {input.filtered_vcf} > {output.filtered_vcf}
else 
    {input.filtered_vcf} > {output.filtered_vcf}
fi

vcf2maf.pl \\
    --input-vcf {output.filtered_vcf} \\
    --output-maf {output.maf} \\
    --tumor-id {params.tumorsample} \\
    --vep-path /opt/vep/src/ensembl-vep \\
    --vep-data {params.bundle} \\
    --ncbi-build {params.build} \\
    --species {params.species} \\
    --vep-forks {threads} \\
    --ref-fasta {params.genome} \\
    --vep-overwrite

"""
SnakeMake From line 232 of rules/ffpe.smk
264
265
266
267
268
shell: """    
echo "Combining MAFs..."
head -2 {input.mafs[0]} > {output.maf}
awk 'FNR>2 {{print}}' {input.mafs} >> {output.maf}
"""
SnakeMake From line 264 of rules/ffpe.smk
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
shell:
    """
    myoutdir="$(dirname {output.gzvcf})"
    if [ ! -d "$myoutdir" ]; then mkdir -p "$myoutdir"; fi

    gatk --java-options '-Xmx24g' HaplotypeCaller \\
        --reference {params.genome} \\
        --input {input.bam} \\
        --use-jdk-inflater \\
        --use-jdk-deflater \\
        --emit-ref-confidence GVCF \\
        --annotation-group StandardAnnotation \\
        --annotation-group AS_StandardAnnotation \\
        --dbsnp {params.snpsites} \\
        --output {output.gzvcf} \\
        --intervals {params.chrom} \\
        --max-alternate-alleles 3
    """
68
69
70
71
72
73
74
75
76
77
78
79
80
81
shell:
    """
    input_str="--variant $(echo "{input.gzvcf}" | sed -e 's/ / --variant /g')"

    gatk --java-options '-Xmx24g' CombineGVCFs \\
        --reference {params.genome} \\
        --annotation-group StandardAnnotation \\
        --annotation-group AS_StandardAnnotation \\
        $input_str \\
        --output {output.gzvcf} \\
        --intervals {wildcards.chroms} \\
        --use-jdk-inflater \\
        --use-jdk-deflater
    """
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
shell:
    """
    myoutdir="$(dirname {output.vcf})"
    if [ ! -d "$myoutdir" ]; then mkdir -p "$myoutdir"; fi

    gatk --java-options '-Xmx96g' GenotypeGVCFs \\
        --reference {params.genome} \\
        --use-jdk-inflater \\
        --use-jdk-deflater \\
        --annotation-group StandardAnnotation \\
        --annotation-group AS_StandardAnnotation \\
        --dbsnp {params.snpsites} \\
        --output {output.vcf} \\
        --variant {input.gzvcf} \\
        --intervals {params.chr}
    """
142
143
144
145
146
147
148
149
150
151
shell:
    """
    # Avoids ARG_MAX issue which limits max length of a command
    ls --color=never -d $(dirname "{output.clist}")/raw_variants.*.vcf.gz > "{output.clist}"

    gatk MergeVcfs \\
        -R {params.genome} \\
        --INPUT {output.clist} \\
        --OUTPUT {output.vcf}
    """
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
shell:
    """
    gatk SelectVariants \\
    -V {input.vcf} \\
    -select-type SNP \\
    -O snps.vcf.gz

    gatk SelectVariants \\
    -V {input.vcf} \\
    -select-type INDEL \\
    -O indels.vcf.gz

    gatk VariantFiltration \\
    -V snps.vcf.gz \\
    -filter "QD < 2.0" --filter-name "QD2" \\
    -filter "QUAL < 30.0" --filter-name "QUAL30" \\
    -filter "SOR > 3.0" --filter-name "SOR3" \\
    -filter "FS > 60.0" --filter-name "FS60" \\
    -filter "MQ < 40.0" --filter-name "MQ40" \\
    -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \\
    -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \\
    -O {output.snpvcf}

    gatk VariantFiltration \\
    -V indels.vcf.gz \\
    -filter "QD < 2.0" --filter-name "QD2" \\
    -filter "QUAL < 30.0" --filter-name "QUAL30" \\
    -filter "FS > 200.0" --filter-name "FS200" \\
    -filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \\
    -O {output.indelvcf}

    gatk MergeVcfs \\
    -R {params.genome} \\
    --INPUT {output.indelvcf} \\
    --INPUT {output.snpvcf} \\
    --OUTPUT {output.vcf}
    """
235
236
237
238
239
240
241
242
243
244
245
shell:
    """
    gatk SelectVariants \\
        -R {params.genome} \\
        --intervals {params.targets} \\
        --variant {input.vcf} \\
        --sample-name {params.Sname} \\
        --exclude-filtered \\
        --exclude-non-variants \\
        --output {output.vcf}
    """
14
15
16
17
18
19
20
21
22
23
24
25
    shell:"""
set -exo pipefail
if [ -d {params.outdir} ];then rm -rf {params.outdir};fi 
mkdir -p {params.outdir}
cd {params.outdir}
# last file in inputs is NIDAP_files.tsv ... col1 is file ... col2 is the same file hardlinked in the NIDAP folder
# this file is created in get_nidap_folder_input_files function
linking_file=$(echo {input}|awk '{{print $NF}}')
while read a b;do 
    ln $a $b
done < $linking_file
"""
21
22
23
24
25
26
27
28
shell: """ 
echo "Extracting sites to estimate ancestry"
somalier extract \\
    -d "$(dirname {output.somalierOut})" \\
    --sites {params.sites_vcf} \\
    -f {params.genomeFasta} \\
    {input.bam}
"""
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
shell: """ 
echo "Estimating relatedness"
somalier relate \\
    -o "$(dirname {output.relatedness})/relatedness" \\
    {input.somalier}

echo "Estimating ancestry"
somalier ancestry \\
    -o "$(dirname {output.relatedness})/ancestry" \\
    --labels {params.ancestry_db}/ancestry-labels-1kg.tsv \\
    {params.ancestry_db}/*.somalier ++ \\
    {input.somalier}
Rscript {params.script_path_gender} \\
    {output.relatednessSamples} \\
    {output.finalFileGender}    

Rscript {params.script_path_samples} \\
    {output.relatedness} \\
    {output.finalFilePairs}

Rscript {params.script_path_pca} \\
    {output.ancestry} \\
    {output.finalFilePairs} \\
    {output.ancestoryPlot} \\
    {output.pairAncestoryHist}
"""
121
122
123
124
125
126
127
128
shell: """
multiqc --ignore '*/.singularity/*' \\
    --ignore '*/*/*/*/*/*/*/*/pyflow.data/*' \\
    --ignore 'slurmfiles/' \\
    -f --interactive \\
    -n {output.report} \\
    {params.workdir}
"""
21
22
23
24
25
26
27
28
shell: """ 
echo "Extracting sites to estimate ancestry"
somalier extract \\
    -d "$(dirname {output.somalierOut})" \\
    --sites {params.sites_vcf} \\
    -f {params.genomeFasta} \\
    {input.bam}
"""
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
shell: """ 
echo "Estimating relatedness"
somalier relate \\
    -o "$(dirname {output.relatedness})/relatedness" \\
    {input.somalier}

Rscript {params.script_path_gender} \\
    {output.relatednessSamples} \\
    {output.finalFileGender}    

Rscript {params.script_path_samples} \\
    {output.relatedness} \\
    {output.finalFilePairs}

"""
108
109
110
111
112
113
114
115
shell: """
multiqc --ignore '*/.singularity/*' \\
    --ignore '*/*/*/*/*/*/*/*/pyflow.data/*' \\
    --ignore 'slurmfiles/' \\
    -f --interactive \\
    -n {output.report} \\
    {params.workdir}
"""
25
26
27
28
29
30
31
32
33
shell: """
if [ ! -d "$(dirname {output.txt})" ]; then 
    mkdir -p "$(dirname {output.txt})"
fi

python {params.get_flowcell_lanes} \\
    {input.r1} \\
    {wildcards.samples} > {output.txt}
"""
SnakeMake From line 25 of rules/qc.smk
64
65
66
67
68
69
70
71
72
shell: """
fastq_screen --conf {params.fastq_screen_config} \\
    --outdir {params.outdir} \\
    --threads {threads} \\
    --subset 1000000 \\
    --aligner bowtie2 \\
    --force \\
    {input.fq1} {input.fq2}
"""
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

# Copy kraken2 db to local node storage to reduce filesystem strain
cp -rv {params.bacdb} ${{tmp}}
kdb_base=$(basename {params.bacdb})
kraken2 --db ${{tmp}}/${{kdb_base}} \\
    --threads {threads} --report {output.taxa} \\
    --output {output.out} \\
    --gzip-compressed \\
    --paired {input.fq1} {input.fq2}
# Generate Krona Report
cut -f2,3 {output.out} | \\
    ktImportTaxonomy - -o {output.html}
"""
145
146
147
148
149
150
shell: """
fastqc -t {threads} \\
    -f bam \\
    -o {params.outdir} \\
    {input.bam} 
"""
175
176
177
178
179
180
181
shell: """
python {params.script_path_reformat_bed} \\
    --input_bed {input.targets} \\
    --output_bed {output.bed}.temp
python3 {params.script_path_correct_target_bed} {output.bed}.temp {output.bed}
rm -f {output.bed}.temp
"""
SnakeMake From line 175 of rules/qc.smk
208
209
210
211
212
213
214
215
216
217
218
219
220
shell: """
unset DISPLAY
qualimap bamqc -bam {input.bam} \\
    --java-mem-size=48G \\
    -c -ip \\
    -gff {input.bed} \\
    -outdir {params.outdir} \\
    -outformat HTML \\
    -nt {threads} \\
    --skip-duplicated \\
    -nw 500 \\
    -p NON-STRAND-SPECIFIC
"""
243
244
245
shell: """
samtools flagstat {input.bam} > {output.txt}
"""
270
271
272
shell: """
vcftools --gzvcf {input.vcf} --het --out {params.prefix}
"""
297
298
299
300
301
302
303
shell: """
java -Xmx24g -jar ${{PICARDJARPATH}}/picard.jar \\
    CollectVariantCallingMetrics \\
    INPUT={input.vcf} \\
    OUTPUT={params.prefix} \\
    DBSNP={params.dbsnp} Validation_Stringency=SILENT
"""
328
329
330
shell: """
bcftools stats {input.vcf} > {output.txt}
"""
359
360
361
362
363
364
365
shell: """
gatk --java-options '-Xmx12g -XX:ParallelGCThreads={threads}' VariantEval \\
    -R {params.genome} \\
    -O {output.grp} \\
    --dbsnp {params.dbsnp} \\
    --eval {input.vcf} 
"""
392
393
394
395
396
397
398
399
shell: """
java -Xmx12g -jar $SNPEFF_JAR \\
    -v -canon -c {params.config} \\
    -csvstats {output.csv} \\
    -stats {output.html} \\
    {params.genome} \\
    {input.vcf} > {output.vcf}
"""
SnakeMake From line 392 of rules/qc.smk
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
shell: """
if [ ! -d "$(dirname {output.split_bam})" ]; then
  mkdir -p "$(dirname {output.split_bam})"
fi

samtools view \\
    -b \\
    -o {output.split_bam} \\
    -@ {threads} \\
    {input.bam} {wildcards.chroms}

samtools index \\
    -@ {threads} \\
    {output.split_bam} {output.split_bam_idx}

cp {output.split_bam_idx} {output.split_bam}.bai
"""
51
52
53
54
55
56
57
shell: """
input_str="--input $(echo "{input.read_orientation_file}" | sed -e 's/ / --input /g')"

gatk LearnReadOrientationModel \\
    --output {output.model} \\
    $input_str
"""
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

statfiles="--stats $(echo "{input.statsfiles}" | sed -e 's/ / --stats /g')"

gatk MergeMutectStats \\
    $statfiles \\
    -O {output.final}.stats

gatk FilterMutectCalls \\
    -R {params.genome} \\
    -V {input.vcf} \\
    --ob-priors {input.model} \\
    --contamination-table {input.summary} \\
    -O {output.marked_vcf} \\
    --stats {output.final}.stats

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {output.marked_vcf} \\
    --exclude-filtered \\
    --output {output.final}

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
135
136
137
138
139
140
141
142
shell: """
input_str="-I $(echo "{input.vcf}" | sed -e 's/ / -I /g')"

gatk --java-options "-Xmx30g" MergeVcfs \\
    -O "{output.vcf}" \\
    -D {params.genomedict} \\
    $input_str
"""
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

if [ ! -d "$(dirname {output.mergedvcf})" ]; then
  mkdir -p "$(dirname {output.mergedvcf})"
fi

input_str="--variant $(echo "{input.vcf}" | sed -e 's/ / --variant /g')"

java -Xmx60g -Djava.io.tmpdir=${{tmp}} -jar $GATK_JAR -T CombineVariants \\
    -R {params.genome} \\
    -nt {threads} \\
    --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED \\
    --genotypemergeoption PRIORITIZE \\
    --rod_priority_list {params.rodprioritylist} \\
    --minimumN 1 \\
    -o {output.mergedvcf} \\
    {params.variantsargs}
"""
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
shell: """

vcf2maf.pl \\
    --input-vcf {input.filtered_vcf} \\
    --output-maf {output.maf} \\
    --tumor-id {params.tumorsample} {params.normalsample} \\
    --vep-path /opt/vep/src/ensembl-vep \\
    --vep-data {params.bundle} \\
    --ncbi-build {params.build} \\
    --species {params.species} \\
    --vep-forks {threads} \\
    --ref-fasta {params.genome} \\
    --vep-overwrite

"""
230
231
232
233
234
shell: """
echo "Combining MAFs..."
head -2 {input.mafs[0]} > {output.maf}
awk 'FNR>2 {{print}}' {input.mafs} >> {output.maf}
"""
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
shell: """
if [ ! -d "$(dirname {output.vcf})" ]; 
    then mkdir -p "$(dirname {output.vcf})";
fi
gatk Mutect2 \\
    -R {params.genome} \\
    -I {input.tumor} \\
    -I {input.normal} \\
    -normal {params.normalsample} \\
    --panel-of-normals {params.pon} \\
    {params.germsource} \\
    -L {params.chrom} \\
    -O {output.vcf} \\
    --f1r2-tar-gz {output.read_orientation_file} \\
    --independent-mates
"""
61
62
63
64
65
66
67
68
69
70
71
72
73
74
shell: """
# Run GetPileupSummaries in bg concurrently for a tumor/normal pair 
gatk --java-options '-Xmx48g' GetPileupSummaries \\
    -I {input.tumor} \\
    -V {params.germsource} \\
    -L {input.intervals} \\
    -O {output.tumor_summary} & \\
gatk --java-options '-Xmx48g' GetPileupSummaries \\
    -I {input.normal} \\
    -V {params.germsource} \\
    -L {input.intervals} \\
    -O {output.normal_summary} & \\
wait
"""
 94
 95
 96
 97
 98
 99
100
101
102
shell: """
gatk CalculateContamination \\
    -I {input.tumor} \\
    --matched-normal {input.normal} \\
    -O {output.tumor_summary}
gatk CalculateContamination \\
    -I {input.normal} \\
    -O {output.normal_summary}
"""
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

workdir={params.basedir}
myoutdir="$(dirname {output.vcf})/{wildcards.samples}/{wildcards.chroms}"
if [ -d "$myoutdir" ]; then rm -r "$myoutdir"; fi
mkdir -p "$myoutdir"

configureStrelkaSomaticWorkflow.py \\
    --ref={params.genome} \\
    --tumor={input.tumor} \\
    --normal={input.normal} \\
    --runDir="$myoutdir" \\
    --exome
cd "$myoutdir"
./runWorkflow.py -m local -j {threads}

java -Xmx12g -Djava.io.tmpdir=${{tmp}} -XX:ParallelGCThreads={threads} \\
    -jar $GATK_JAR -T CombineVariants \\
    -R {params.genome} \\
    --variant results/variants/somatic.snvs.vcf.gz \\
    --variant results/variants/somatic.indels.vcf.gz \\
    --assumeIdenticalSamples \\
    --filteredrecordsmergetype KEEP_UNCONDITIONAL \\
    -o "$(basename {output.vcf})"

cd $workdir
mv "$myoutdir/$(basename {output.vcf})" "{output.vcf}"
"""
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {input.vcf} \\
    --discordance {params.pon} \\
    --exclude-filtered \\
    --output {output.filtered}

echo -e "TUMOR\t{params.tumorsample}\nNORMAL\t{params.normalsample}" > "{output.samplesfile}"

echo "Reheading VCFs with sample names..."
bcftools reheader \\
    -o "{output.final}" \\
    -s "{output.samplesfile}" "{output.filtered}"

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort \\
    -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

if [ ! -d "$(dirname {output.vcf})" ]; then mkdir -p "$(dirname {output.vcf})"; fi

java -Xmx8g -Djava.io.tmpdir=${{tmp}} -jar ${{MUTECT_JAR}} \\
    --analysis_type MuTect \\
    --reference_sequence {params.genome} \\
    --normal_panel {params.pon} \\
    --vcf {output.vcf} \\
    {params.dbsnp_cosmic} \\
    --disable_auto_index_creation_and_locking_when_reading_rods \\
    --input_file:normal {input.normal} \\
    --input_file:tumor {input.tumor} \\
    --out {output.stats} \\
    -rf BadCigar
"""
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {input.vcf} \\
    --exclude-filtered \\
    --output {output.final}

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
shell: """
if [ ! -d "$(dirname {output.vcf})" ]; then mkdir -p "$(dirname {output.vcf})"; fi
VarDict \\
    -G {params.genome} \\
    -f 0.05 \\
    -N \"{params.tumorsample}|{params.normalsample}\" \\
    --nosv \\
    -b \"{input.tumor}|{input.normal}\" \\
    -t \\
    -Q 20 \\
    -c 1 \\
    -S 2 \\
    -E 3 {params.targets} \\
    | testsomatic.R \\
    | var2vcf_paired.pl \\
        -S \\
        -Q 20 \\
        -d 10 \\
        -M \\
        -N \"{params.tumorsample}|{params.normalsample}\" \\
        -f 0.05 > {output.vcf}
"""
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

bcftools filter \\
    --exclude \'STATUS=\"Germline\" | STATUS=\"LikelyLOH\" | STATUS=\"AFDiff\"\' \\
    {input.vcf} > {output.filtered}

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {output.filtered} \\
    --discordance {params.pon} \\
    --exclude-filtered \\
    --output {output.final}

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

if [ ! -d "$(dirname {output.vcf})" ]; then mkdir -p "$(dirname {output.vcf})"; fi

tumor_purity=$( echo "1-$(printf '%.6f' $(tail -n -1 {input.tumor_summary} | cut -f2 ))" | bc -l)
normal_purity=$( echo "1-$(printf '%.6f' $(tail -n -1 {input.normal_summary} | cut -f2 ))" | bc -l)
varscan_opts="--strand-filter 1 --min-var-freq 0.01 --min-avg-qual 30 --somatic-p-value 0.05 --output-vcf 1 --normal-purity $normal_purity --tumor-purity $tumor_purity"
dual_pileup="samtools mpileup -d 10000 -q 15 -Q 15 -f {params.genome} {input.normal} {input.tumor}"
varscan_cmd="varscan somatic <($dual_pileup) {output.vcf} $varscan_opts --mpileup 1"    
eval "$varscan_cmd"

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' {output.vcf}.snp \\
    | sed '/^$/d' > {output.vcf}.snp_temp
awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' {output.vcf}.indel \\
    | sed '/^$/d' > {output.vcf}.indel_temp

java -Xmx12g -Djava.io.tmpdir=${{tmp}} -XX:ParallelGCThreads={threads} \\
    -jar $GATK_JAR -T CombineVariants \\
    -R {params.genome} \\
    --variant {output.vcf}.snp_temp \\
    --variant {output.vcf}.indel_temp \\
    --assumeIdenticalSamples \\
    --filteredrecordsmergetype KEEP_UNCONDITIONAL \\
    -o {output.vcf}     
"""
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

varscan filter \\
    {input.vcf} \\
    {params.filter_settings} > {output.filtered1}

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {output.filtered1} \\
    --discordance {params.pon} \\
    --exclude-filtered \\
    --output {output.filtered}

samplesFile="{output.samplesfile}"
echo -e "TUMOR\t{params.tumorsample}\nNORMAL\t{params.normalsample}" > "{output.samplesfile}"

bcftools reheader \\
    -o "{output.final}" \\
    -s "{output.samplesfile}" \\
    "{output.filtered}"

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
shell: """
if [ ! -d "$(dirname {output.vcf})" ]; then
    mkdir -p "$(dirname {output.vcf})"
fi

gatk Mutect2 \\
    -R {params.genome} \\
    -I {input.tumor} \\
    --panel-of-normals {params.pon} \\
    {params.germsource} \\
    -L {wildcards.chroms} \\
    -O {output.vcf} \\
    --f1r2-tar-gz {output.read_orientation_file} \\
    --independent-mates
"""
56
57
58
59
60
61
62
63
64
65
66
67
68
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

gatk --java-options "-Xmx10g -Djava.io.tmpdir=${{tmp}}" GetPileupSummaries \\
    -R {params.genome} \\
    -I {input.tumor} \\
    -V {params.germsource} \\
    -L {input.intervals} \\
    -O {output.pileup}
"""
86
87
88
89
90
shell: """
gatk CalculateContamination \\
    -I {input.pileup} \\
    -O {output.tumor_summary}
"""
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

if [ ! -d "$(dirname {output.vcf})" ]; then 
    mkdir -p "$(dirname {output.vcf})"
fi

java -Xmx8g -Djava.io.tmpdir=${{tmp}} -jar ${{MUTECT_JAR}} \\
    --analysis_type MuTect \\
    --reference_sequence {params.genome} \\
    --normal_panel {params.pon} \\
    --vcf {output.vcf} \\
    {params.dbsnp_cosmic} \\
    -L {wildcards.chroms} \\
    --disable_auto_index_creation_and_locking_when_reading_rods \\
    --input_file:tumor {input.tumor} \\
    --out {output.stats} \\
    -rf BadCigar
"""
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {input.vcf} \\
    --exclude-filtered \\
    --output {output.final}

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
shell: """
if [ ! -d "$(dirname {output.vcf})" ]; then 
    mkdir -p "$(dirname {output.vcf})"
fi

VarDict \\
    -G {params.genome} \\
    -f 0.05 \\
    -x 500 \\
    --nosv \\
    -b {input.tumor} \\
    -t \\
    -Q 20 \\
    -c 1 \\
    -S 2 \\
    -E 3 {params.targets} \\
    | teststrandbias.R \\
    | var2vcf_valid.pl \\
        -N {wildcards.samples} \\
        -Q 20 \\
        -d 10 \\
        -v 6 \\
        -S \\
        -E \\
        -f 0.05 > {output.vcf}
"""
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {input.vcf} \\
    --discordance {params.pon} \\
    --exclude-filtered \\
    --output {output.final}

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
279
280
281
282
283
284
285
286
287
288
289
shell: """
if [ ! -d "$(dirname {output.vcf})" ]; then
    mkdir -p "$(dirname {output.vcf})"
fi

varscan_opts="--strand-filter 0 --min-var-freq 0.01 --output-vcf 1 --variants 1"
pileup_cmd="samtools mpileup -d 100000 -q 15 -Q 15 -f {params.genome} {input.tumor}"
varscan_cmd="varscan mpileup2cns <($pileup_cmd) $varscan_opts"
eval "$varscan_cmd > {output.vcf}.gz"
eval "bcftools view -U {output.vcf}.gz > {output.vcf}"
"""
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

varscan filter \\
    {input.vcf} \\
    {params.filter_settings} > {output.filtered1}

gatk SelectVariants \\
    -R {params.genome} \\
    --variant {output.filtered1} \\
    --discordance {params.pon} \\
    --exclude-filtered \\
    --output {output.filtered}

samplesFile="{output.samplesfile}"
echo -e "TUMOR\t{params.tumorsample}\n" > "{output.samplesfile}"

bcftools reheader \\
    -o "{output.final}" \\
    -s "{output.samplesfile}" \\
    "{output.filtered}"

# VarScan can output ambiguous IUPAC bases/codes
# the awk one-liner resets them to N, from:
# https://github.com/fpbarthel/GLASS/issues/23
bcftools sort -T ${{tmp}} "{output.final}" \\
    | bcftools norm --threads {threads} --check-ref s -f {params.genome} -O v \\
    | awk '{{gsub(/\y[W|K|Y|R|S|M]\y/,"N",$4); OFS = "\t"; print}}' \\
    | sed '/^$/d' > {output.norm}
"""
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
shell: """
# Setups temporary directory for
# intermediate files with built-in 
# mechanism for deletion on exit
{params.set_tmp}

mkdir -p fastqs
gatk SamToFastq \\
    --INPUT {input.bam} \\
    --FASTQ {output.r1} \\
    --SECOND_END_FASTQ {output.r2} \\
    --UNPAIRED_FASTQ {output.orphans} \\
    --TMP_DIR ${{tmp}} \\
    -R {params.genome}
"""
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
shell: """
myoutdir="$(dirname {output.one})"
if [ ! -d "$myoutdir" ]; then mkdir -p "$myoutdir"; fi
trimmomatic PE \\
    -threads {threads} \\
    -phred33 \\
    {input.r1} {input.r2} \\
    {output.one} {output.two} \\
    {output.three} {output.four} \\
    ILLUMINACLIP:{params.adapterfile}:3:30:10 \\
    LEADING:10 \\
    TRAILING:10 \\
    SLIDINGWINDOW:4:20 \\
    MINLEN:20      
"""
117
118
119
120
121
122
123
124
125
126
127
shell: """
myoutdir="$(dirname {output})"
if [ ! -d "$myoutdir" ]; then mkdir -p "$myoutdir"; fi
bwa mem -M \\
    -R \'@RG\\tID:{params.sample}\\tSM:{params.sample}\\tPL:illumina\\tLB:{params.sample}\\tPU:{params.sample}\\tCN:hgsc\\tDS:wes\' \\
    -t {threads} \\
    {params.genome} \\
    {input} | \\
samblaster -M | \\
samtools sort -@12 -m 4G - -o {output}
"""
152
153
154
shell: """
samtools index -@ 2 {input.bam} {output.bai}
"""
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
shell: """
gatk --java-options '-Xmx48g' BaseRecalibrator \\
    --input {input.bam} \\
    --reference {params.genome} \\
    {params.knowns} \\
    --output {output.re} \\
    --intervals {params.intervals}

gatk --java-options '-Xmx48g' ApplyBQSR \\
    --reference {params.genome} \\
    --input {input.bam} \\
    --bqsr-recal-file {output.re} \\
    --output {output.bam} \\
    --use-jdk-inflater \\
    --use-jdk-deflater
"""
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
shell: """
sample={wildcards.samples}
ID=$sample
PL="ILLUMINA"  # exposed as a config param
LB="na"        # exposed as a config param 

# Check if there is no header or any of the info
HEADER=`samtools view -H {input.bam} | grep ^@RG`
if [[ "$HEADER" != "" ]]; then
    t=(${{HEADER//\t/ }})
    echo ${{t[1]}}
    ID=`printf '%s\n' "${{t[@]}}" | grep -P '^ID' | cut -d":" -f2` #(${{t[1]//:/ }})
    PL=`printf '%s\n' "${{t[@]}}" | grep -P '^PL' | cut -d":" -f2` #(${{t[3]//:/ }})
    LB=`printf '%s\n' "${{t[@]}}" | grep -P '^LB' | cut -d":" -f2` #(${{t[2]//:/ }})
    if [[ "$ID" == "$sample" ]]; then
        echo "The header of the BAM file is correct"
    else
        ID=$sample
    fi
fi

gatk AddOrReplaceReadGroups \\
    --INPUT {input.bam} \\
    --OUTPUT {output.bam} \\
    --RGID ${{ID}} \\
    --RGLB ${{LB}} \\
    --RGPL ${{PL}} \\
    --RGSM ${{ID}} \\
    --RGPU na

samtools index -@ 2 {output.bam} {output.bai}
cp {output.bai} {output.bai2}
"""
ShowHide 52 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://ccbr.github.io/XAVIER/
Name: exome-seek
Version: v2.0
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...