Whole exome sequencing snakemake workflow based on GATK best practice

public public 1yr ago 0 bookmarks

THIS WORKFLOW IS STILL UNDER DEVELOPMENT AND IS NOT READY YET!

Navigate to:

Introduction

Note: This workflow is still under development.

WES GATK is a flexible and user-friendly whole exome sequencing workflow based on GATK best practices . It is designed for processing Illumina WES short reads data and features automatic sample table generation, Snakemake configuration file, and simplified workflow execution.

Workflow

Description:

The workflow starts by converting raw fastq file to an unmapped bam file to include the read group information. Then, Illumina adaptors are marked and reads are aligned to hg38 reference human genome using BWA and consdier the alternative haplotype mapping. Multiple BAM files of the same sample are merged then PCR and optical duplicates are marked and reads are sorted by genome coordinates. Sorted BAM files are analayzied to generate mapping QC metrics. The following steps include the Base Quality score recalibration calculation using known variants files and GATK's Machine learning algorithm to recalibrate the quality scores and apply these new qualities to the BAM file. We now can call the variants using GATK's "haplotypecaller" in GVCF format, and combine all samples in a single gvcf file. The merged file is used for a final joint variant calling step to produce a single vcf file the include the genotypes of all samples. For better annotation, the file is splitted into two files, one for indels, and the other for snps. The last steps are variant filtration and variant annotation using Nirvana and Annovar.

Features:

This workflow is an implemetation of the Gold Standard GATK best practice in addition to these features:

  • Exome implementation ( uses user provided intervals file for specefic location calling + X basepair padding (default = 100pb see below ) )

  • Merging samples run on multiple lanes

  • QUALIMAP and multiqc QC reports

  • Nirvana and Annovar Annotation (more info)

  • Joint Gentotyping for all samples

  • Automatic sample name, group, lane and read number recognition.

  • Automatic snakemake sample table and config file generation.

  • progress bar of the analysis (use argument --parse-snakemake-output see below )

Workflow Diagram:

Workflow

Quick Start

Installation

Clone the repository:

git clone https://github.com/AbdelrahmanYahia/wes_gatk.git

If you prefer to download the dependencies manually, you can find them [here].

Step-by-Step Installation Guide

Register ANNOVAR

To obtain the link, you need to register at Annovar website .

Make sure you have Conda and Mamba installed. If not, follow these steps:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh # Follow the prompts

Install Mamba:

conda install mamba -n base -c conda-forge

Restart your terminal.

Prepare the Environment

Change the permissions for the Prep_ENV.sh file:

chmod 777 wes_gatk/scripts/Prep_ENV.sh

Create the environment, install the tools, and download the annotations database:

./wes_gatk/scripts/Prep_ENV.sh $ANNOVAR_LINK

Follow the instruction here to download GATK required python env for CNV workflow This process may take some time.

You can also download all the required reference files using wes_gatk/scripts/gatk_download_data.sh :

chmod 777 wes_gatk/scripts/gatk_download_data.sh
bash wes_gatk/scripts/gatk_download_data.sh $DOWNLOAD_DIR

Running the Analysis

To start the analysis, activate the wes_gatk environment and run the wes.py file:

conda activate wes_gatk
cd wes_gatk
python3 wes.py WES --help

You can also use the Python file to generate the sample table and the config file, and then run Snakemake independently. Modify the parameters according to your needs:

conda activate wes_gatk
python3 wes.py WES \
 --input PTH/to/samples \
 --output PTH/to/outdir \
 --reference-fasta broad_hg38/Homo_sapiens_assembly38.fasta \
 --bed-file exome_bed/S07604715_Padded.bed \
 --gff-file broad_hg38/Homo_sapiens.GRCh38.109.gff3.gz \
 --nirvana-path ~/Nirvana \
 --annovar-path ~/annovar_source/annovar \
 --known-variants broad_hg38/1000G_omni2.5.hg38.vcf.gz \
 --reference-index broad_hg38/Homo_sapiens_assembly38.fasta \
 --generate-confs-only

As an alternative, you can edit the options in "prep_files.sh", then run it:

conda activate wes_gatk
bash prep_files.sh

To run a Snakemake dry-run:

conda activate wes_gatk
snakemake \ --snakefile workflow/Snakefile \ --configfile PTH/to/outdir/config.yml \ -n -j THREADS

Or you can run it by adding --dry-run argument to the python script:

conda activate wes_gatk
python3 wes.py WES \
 --input PTH/to/samples \
 --output PTH/to/outdir \
 --reference-fasta broad_hg38/Homo_sapiens_assembly38.fasta \
 --bed-file exome_bed/S07604715_Padded.bed \
 --gff-file broad_hg38/Homo_sapiens.GRCh38.109.gff3.gz \
 --nirvana-path ~/Nirvana \
 --annovar-path ~/annovar_source/annovar \
 --known-variants broad_hg38/1000G_omni2.5.hg38.vcf.gz \
 --reference-index broad_hg38/Homo_sapiens_assembly38.fasta \
 --generate-confs-only \
 --dry-run

Advanced Parameters

For advanced usage, you can refer to the following command-line options: usage: Basic Run Usage example: guap WES -i indir -o outdir --bed-file file --reference-fasta fasta.fasta --reference-index indexpath

options:
 -h, --help show this help message and exit
basic config:
 -i in path, --input in path
 Input directory path
 -o out path, --output out path
 Output directory path
Workflow configure:
 --threads N Number of total threads to use [default = all]
 --reference-fasta path/to/file.fa
 path to reference fasta file
 --bed-file path bed file path
 --gff-file path gff file path
 --nirvana-path path Path for Nirvana
 --annovar-path path Path for annovar
 --generate-confs-only
 Generate sample table and config file only
 --contig-ploidy-priors path
 Path prior ploidy file
 --call-CNV Perform GATK CNV pipeline
 --dont-use-gatk-bestparctice
 Generate sample table and config file only
QC configuration:
 --trimmomatic Use trimmomatic
 --trim-t N Number of threads to use during trim step
 --trim-min-length N trimmomatic min length [default = 30]
 --slidingwindow-size N
 trimmomatic sliding window size [default = 4]
 --slidingwindow-quality N
 trimmomatic sliding window quality score [default = 10]
 --trimmomatic-extra-args ='-args'
 A string value of extra args for trimmomatic (must be used with = with no spaces (--trimmomatic-extra-args='-arg1 -arg2'))
 --skip-QC Skipp Fastqc step
Samples pre-processing:
 --gen-ubam-threads N Number of threads to use during creating ubam and marking adaptors [default = 4]
 --gen-ubam-mem N Memory (GB) to use during creating ubam and marking adaptors [default = 5]
Aligner configuration:
 --index-threads N Number of threads to use during indexing ref [default = 4]
 --align-threads N Number of threads to use during sample alignment [default = 4]
 --align-mem N Memory (GB) to use during sample alignment [default = 32]
 --aligner-extra-args '-args'
 Extra arguments for aligner, use it with no spaces and add = ( --aligner-extra-args='-arg1 -arg2' )
 --reference-index path/to/ref
 path to reference index
 --reference-output-path path/to/ref
 path to reference index
 --reference-output-prefix path/to/ref
 path to reference index
 --index-fasta Index fasta file
Variant caller configuration:
 --padding N Interval padding to include
 --known-variants-indels path
 path to reference fasta file
 --known-variants-indels2 path
 path to reference fasta file
 --known-variants-snps path
 path to reference fasta file
 --caller-extra-args '-args'
 Extra arguments for caller, use it with no spaces and add = ( --caller-extra-args='-arg1 -arg2' )
 --calling-threads N Number of threads to use during variant calling [default = 4]
 --calling-mem N Memory in GB to use during variant calling [default = 8]
Snakemake Options:
 --dry-run performs snakemake dry run
 --export-dag performs snakemake dry run and exports DAG
 --smk-extra-args ='-args'
 A string value of extra args for snakemake(must be used with = with no spaces (--smk-extra-args='-arg1 -arg2'))
 --parse-snakemake-output
 prints progress bar instead of snakemake regular output
Annotation configuration:
 --annovar-protocol str
 Annovar Protocol defaults: refGene,avsnp150,clinvar_20221231,cosmic70,dbnsfp31a_interpro,EAS.sites.2015_08,EUR.sites.2015_08,gme,gnomad211_exome,SAS.sites.2015_08
 --annovar-operation str
 Annovar Protocol defaults: g,f,f,f,f,f,f,f,f,f
 --annovar-extra-args ='-args'
 A string value of extra args for annovar(must be used with = with no spaces (--annovar-extra-args='-arg1 -arg2'))
 --nirvana-extra-args ='-args'
 A string value of extra args for nirvana(must be used with = with no spaces (--nirvana-extra-args='-arg1 -arg2'))
 --annotation-threads N
 Number of threads to use during creating ubam and marking adaptors [default = 4]
 --annotation-mem N Memory (GB) to use during creating ubam and marking adaptors [default = 5]
Other:
 --continue continue analysis when re-run
 --overwrite overwrite output dir if exsits
 -n str, --name str Name of files [ default = guap_run[date time] ]
 --verbose verbose
 --quit print many output

Code Snippets

186
187
shell:
    "multiqc . -o multiqc/"
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
shell:
    """
    ext={params.ext}
    R1={input.R1}
    SM={wildcards.sample}
    PL="Illumina"
    LB=$SM
    if [[ "$ext" == *".gz" ]]; then
        RGID=$(head -n1 <(zcat {input.R1}) | sed 's/:/_/g' |cut -d "_" -f1,2,3,4)
        #RGID=$(zcat {input.R1} | head -n1 | sed 's/:/_/g' |cut -d "_" -f1,2,3,4)
    else
        RGID=$(head {input.R1} -n1 | sed 's/:/_/g' |cut -d "_" -f1,2,3,4)
    fi
    ## TODO: confirm to use this
    PU=$RGID.$LB 

    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        FastqToSam \
        -F1 {input.R1} \
        -F2 {input.R2} \
        -O {output} \
        -SM $SM \
        -LB $LB \
        -PL $PL \
        -RG $RGID \
        -PU $PU \
    """
253
254
255
256
257
258
259
260
261
shell:
    '''
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        MarkIlluminaAdapters \
        -I {input} \
        -O {output.bam} \
        -M {output.metrics} \
        # > {log} 2>&1
    '''
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
shell:
    '''
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        SamToFastq \
        -I {input.bam} \
        --FASTQ /dev/stdout \
        --CLIPPING_ATTRIBUTE XT --CLIPPING_ACTION 2 \
        --INTERLEAVE true -NON_PF true | bwa mem -K 100000000 \
        -v 3 {params.bwa_args} -t {threads} -p  -Y {params.index} /dev/stdin | gatk \
        --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        MergeBamAlignment \
        --VALIDATION_STRINGENCY SILENT \
        --EXPECTED_ORIENTATIONS FR \
        --ATTRIBUTES_TO_RETAIN X0 \
        --ALIGNED_BAM /dev/stdin \
        --UNMAPPED_BAM {input.bam} \
        --OUTPUT {output.bam} \
        -R {params.fa} \
        --PAIRED_RUN true \
        --SORT_ORDER "unsorted" \
        --IS_BISULFITE_SEQUENCE false \
        --ALIGNED_READS_ONLY false \
        --CLIP_ADAPTERS false \
        --MAX_RECORDS_IN_RAM 2000000 \
        --ADD_MATE_CIGAR true \
        --MAX_INSERTIONS_OR_DELETIONS -1 \
        --PRIMARY_ALIGNMENT_STRATEGY MostDistant \
        --UNMAPPED_READ_STRATEGY COPY_TO_TAG \
        --ALIGNER_PROPER_PAIR_FLAGS true \
        --UNMAP_CONTAMINANT_READS true {params.warp_workflow_params}
    '''
360
361
362
363
364
shell:
    """
    samtools depth {input} | awk '{{sum+=$3}} END {{print "Average = ",sum/NR, "No of covered Nuc = ", NR}}' > {output.cov}
    samtools flagstat {input} > {output.stats}
    """
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
shell:
    """
    qualimap \
        bamqc \
        -bam {input.bam} \
        --java-mem-size=15G \
        --paint-chromosome-limits \
        --genome-gc-distr HUMAN \
        -nt {threads} \
        -skip-duplicated \
        --skip-dup-mode 0 \
        -outdir {output} \
        --feature-file {input.bed} \
        -outformat HTML
    """
421
422
423
424
425
426
427
shell:
    """
    picard MergeSamFiles  \
        {params.bams} \
        -OUTPUT {output} \
        --SORT_ORDER "unsorted" 
    """
448
449
450
451
452
453
454
455
456
457
458
459
shell:
    """
    picard MarkDuplicates -I {input} \
        -O {output.bam} \
        -M {output.matrix} \
        --VALIDATION_STRINGENCY SILENT \
        --OPTICAL_DUPLICATE_PIXEL_DISTANCE 2500 \
        --ASSUME_SORT_ORDER "queryname" \
        --CREATE_MD5_FILE true \
        --CLEAR_DT false \
        --ADD_PG_TAG_TO_READS false
    """
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
shell:
    """
gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
  SortSam \
  --INPUT {input} \
  --OUTPUT /dev/stdout \
  --SORT_ORDER "coordinate" \
  --CREATE_INDEX false \
  --CREATE_MD5_FILE false | gatk \
  --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
  SetNmMdAndUqTags \
  --INPUT /dev/stdin \
  --OUTPUT {output} \
  --CREATE_INDEX true \
  --CREATE_MD5_FILE true \
  --REFERENCE_SEQUENCE {params.fa}
    """
516
517
518
519
520
521
522
523
524
525
526
527
528
529
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        BaseRecalibrator \
        -R {params.ref} \
        -I {input} \
        --use-original-qualities \
        --known-sites {params.known_sites} \
        --known-sites {params.known_sites2} \
        --known-sites {params.known_sites3} \
        -L {params.bed} \
        --interval-padding {params.padding} \
        -O {output} 
    """
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        ApplyBQSR  -R {params.ref} \
        -I {input.bam} \
        -bqsr {input.report} -O {output} \
        -L {params.bed} \
        --interval-padding {params.padding} \
        --static-quantized-quals 10 --static-quantized-quals 20 --static-quantized-quals 30 \
        --add-output-sam-program-record \
        --create-output-bam-md5 \
        --use-original-qualities

    samtools index {output} 
    """
584
585
586
587
588
589
590
591
592
593
594
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" BaseRecalibrator -R {params.ref} \
        -I {input} \
        --known-sites {params.known_sites} \
        --known-sites {params.known_sites2} \
        --known-sites {params.known_sites3} \
        -L {params.bed} \
        --interval-padding {params.padding} \
        -O {output} 
    """
613
614
615
616
617
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" AnalyzeCovariates -before {input.raw} \
        -after {input.bqsr} -plots {output}
    """
640
641
642
643
shell:
    """
    bedtools slop -i {input.bed} -g {params.ref}.fai -b {params.padding} > {output}
    """
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
shell:
    '''
    grep -vE "^@" {input.intervalslist} |
        awk -v OFS='\t' '$2=$2-1' |
        bedtools intersect -c -a {input.BED} -b - |
        cut -f6 > {output.target_overlap_counts}

    function restrict_to_overlaps() {{
        # print lines from whole-genome file from loci with non-zero overlap
        # with target intervals
        WGS_FILE=$1
        EXOME_FILE=$2
        paste {output.target_overlap_counts} $WGS_FILE |
            grep -Ev "^0" |
            cut -f 2- > $EXOME_FILE
        echo "Generated $EXOME_FILE"
    }}

    restrict_to_overlaps {input.UD} {output.UD}
    restrict_to_overlaps {input.BED} {output.BED}
    restrict_to_overlaps {input.MU} {output.MU}

    '''
720
721
722
723
724
725
726
727
728
729
730
shell:
    '''
    python3 {params.get_con_file} \
        --prefix {params.prefix} \
        --output {output.contamination} \
        --svdprefix {params.svdprefix} \
        --get_con_file {params.get_con_file} \
        --target_bam {input.target_bam} \
        --ref {input.ref} \
        --threads {threads} 
    '''
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
shell:
    """
    cont=$(cat {input.contamination})
    echo $cont

    gatk --java-options "-Xmx{resources.reduced}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        HaplotypeCaller -R {params.ref} \
        -G StandardAnnotation -G StandardHCAnnotation \
        -G AS_StandardAnnotation {params.extra_args} \
        -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \
        -L {params.bed} \
        --interval-padding {params.padding} \
        -I {input.bam} --native-pair-hmm-threads {threads} -ERC GVCF -O {output.vcf} \
        -contamination $cont \
        -bamout {output.bam} \

    """
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
shell:
    """
    ## IMPORTANT!
    # Note that when uncalled alleles are dropped, 
    # the original GQ may increase. Use --keep-all-alts 
    # if GQ accuracy is a concern.


    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        ReblockGVCF \
        -R {params.ref} \
        -G StandardAnnotation \
        -G StandardHCAnnotation \
        -G AS_StandardAnnotation \
        --floor-blocks -OVI \
        -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \
        -L {params.bed} \
        --interval-padding {params.padding} \
        -do-qual-approx \
        -V {input} -O {output}

    """
864
865
866
867
868
869
870
871
872
873
874
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        VariantFiltration \
        -V {input} \
        -L {params.bed} \
        --interval-padding {params.padding} \
        --filter-expression "QD < 2.0 || FS > 30.0 || SOR > 3.0 || MQ < 40.0 || MQRankSum < -3.0 || ReadPosRankSum < -3.0" \
        --filter-name "HardFiltered" \
        -O {output}
    """
899
900
901
902
903
904
905
906
907
908
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CNNScoreVariants \
        -V {input.vcf} \
        -R {params.ref} \
        -O {output} \
        -I {input.bam} \
        -tensor-type read-tensor
    """
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        FilterVariantTranches \
        -V ~{input_vcf} \
        -O ~{vcf_basename}.filtered.vcf.gz \
        --resource {params.hapmap_resource_vcf} \
        --resource {params.omni_resource_vcf} \
        --resource {params.one_thousand_genomes_resource_vcf} \
        --resource {params.dbsnp_resource_vcf} \
        --info-key CNN_1D \
        --snp-tranche 99.95 \
        --indel-tranche 99.4 \
        --create-output-variant-index true
    """
978
979
980
981
982
983
984
985
shell:
    """
    touch {output}

    for i in {params.IDs} ; do
        echo -e  "${{i}}\t04-1_gvcf-processing/reblocked/${{i}}.gvcf.gz" | cat >> {output}
    done
    """
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
shell:
    """
    sort -k1,1 -k2,2n {input.bedfile} | awk '
    {{
        # If it is a different chromosome or if it is a non-overlapping interval
        if ($1 != chrom) {{
            # If it is not the first interval
            if (chrom) {{
                print chrom, start, end;
            }}
            chrom = $1;
            start = $2;
            end = $3;
        }} else if ($3 > end) {{
            # If it is an overlapping interval
            end = $3;
        }}
    }}
    END {{
        # Print the last interval
        print chrom, start, end;
    }}
    ' > {output.newbed}
    """
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        SplitIntervals \
        -R {input.ref} \
        -L {input.intervals} \
        --scatter-count {params.scatter_count} \
        --interval-merging-rule ALL \
        -mode INTERVAL_SUBDIVISION \
        -O {output.dir} 
    """
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
shell:
    """
    # TODO: we can implement the update here: by checking and replacing:
    ## --genomicsdb-update-workspace-path path_to_dir instead of:
    ## --genomicsdb-workspace-path path_to_dir
    ## but when updating I don't need to add -L as it figures that out from the samples automatically

    # sources: https://eriqande.github.io/eca-bioinf-handbook/variant-calling.html#genomics-db
    # finally  some notes:
    ## the documentation for GenomicsDBImport is replete with warnings 
    ## about how you should backup your genomics data bases before trying
    ## to update them. Apparently there is a good chance that any little
    ## glitch during the updating process could corrupt the entire data base and render it useless. Joy!

    ## With that as a preamble, it does seem that we should try creating some data bases and then adding samples to them.


    ### use --genomicsdb-shared-posixfs-optimizations  for HPC


    mkdir -p 04-2_GenomicsDB


    gatk --java-options "-Xmx{resources.reduced}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads} -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
        GenomicsDBImport \
        --sample-name-map {input.sample_map} \
        -L {input.intervals} \
        --genomicsdb-workspace-path {output} \
        --batch-size 50 \
        --reader-threads 4 \
        --merge-input-intervals \
        --consolidate \
        --genomicsdb-shared-posixfs-optimizations true \

    """
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        GenotypeGVCFs \
        -G StandardAnnotation -G StandardHCAnnotation \
        -G AS_StandardAnnotation \
        -R {params.ref} -V gendb://{input.infile} -O {output} \
        {params.Additional_annotation} --merge-input-intervals \
        -L {input.intervals} \
        --only-output-calls-starting-in-intervals \

    """
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
shell:
    """
    gatk --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' \
        GatherVcfsCloud \
        --ignore-safety-checks \
        --gather-type BLOCK \
        {params.gvcfs} \
        -O {output}

    tabix {output}
    """
1204
1205
1206
1207
1208
1209
1210
1211
shell:
    """
    touch {output}
    gatk VariantEval \
        -R {params.ref} \
        --eval:{wildcards.sample} {input} \
        -O {output}
    """
1227
1228
1229
1230
1231
1232
1233
1234
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}"  SelectVariants \
        -R {params.ref} \
        -V {input} \
        --select-type-to-include INDEL \
        -O {output}
    """
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
shell:
    """
    gatk VariantFiltration \
        --variant {input} \
        --filter-expression "QD < 2.0"                  --filter-name "QD2" \
        --filter-expression "SOR > 10.0"                  --filter-name "SOR10" \
        --filter-expression "FS > 200.0"                --filter-name "FS200" \
        --filter-expression "ReadPosRankSum < -20.0"    --filter-name "ReadPosRankSum-20" \
        --filter-expression "InbreedingCoeff < -0.8"    --filter-name "InbreedingCoeff-0.8" \
        --create-output-variant-index true \
        --output {output}
    """
1280
1281
1282
1283
1284
1285
1286
1287
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}"  SelectVariants \
        -R {params.ref} \
        -V {input} \
        --select-type-to-include SNP \
        -O {output}
    """
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
shell:
    """
    gatk VariantFiltration \
        --variant {input} \
        --filter-expression "QD < 2.0"              --filter-name "QD2" \
        --filter-expression "SOR > 3.0"             --filter-name "SOR3" \
        --filter-expression "FS > 60.0"             --filter-name "FS60" \
        --filter-expression "MQ < 40.0"             --filter-name "MQ40" \
        --filter-expression "MQRankSum < -12.5"     --filter-name "MQRankSum-12.5" \
        --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
        --create-output-variant-index true \
        --output {output}
    """
1338
1339
1340
1341
shell:
    """
    bcftools csq -f {params.ref} -g {params.gff} {input} -Ov -o {output}
    """
1355
1356
1357
1358
shell:
    """
        bcftools stats {input} > {output}
    """
1371
1372
1373
1374
shell:
    """
    plot-vcfstats -p {output} {input}
    """
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
shell:
    """
    eval "$(conda shell.bash hook)"
    set +u; conda activate dotnet; set -u
    dotnet {params.Nirvana_cmd} \
        -i {input} \
        -o {params.file_name} \
        -r {params.Nirvana_ref} \
        --sd {params.Nirvana_supplementray} \
        -c {params.Nirvana_cache} {params.extra}
    """
1433
1434
1435
1436
1437
1438
1439
shell:
    """
    snpEff -Xmx{resources.mem_gb}G \
        -v -csvStats {output.stats} \
        -stats {output.html} \
        {params.genome} {input} > {output.vcf}
    """
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
shell:
    """
    mkdir -p {output}
    perl {params.annovar_dir}/table_annovar.pl {input} {params.annovar_dir}/humandb/ \
        -buildver hg38 \
        -out {params.output} -remove \
        -protocol {params.protocol} \
        -operation {params.operation} \
        -nastring . \
        -vcfinput \
        --thread {threads} {params.extra}
    """
1493
1494
1495
1496
1497
1498
1499
1500
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" 
        CollectQualityYieldMetrics \
        --INPUT {input} \
        --OQ true \
        --OUTPUT {output}
    """
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CollectMultipleMetrics \
        --INPUT {input} \
        --REFERENCE_SEQUENCE {params.ref} \
        --OUTPUT {params.prefix} \
        --ASSUME_SORTED true \
        --PROGRAM null \
        --PROGRAM CollectAlignmentSummaryMetrics \
        --PROGRAM CollectSequencingArtifactMetrics \
        --PROGRAM QualityScoreDistribution \
        --PROGRAM CollectGcBiasMetrics \
        --METRIC_ACCUMULATION_LEVEL null \
        --METRIC_ACCUMULATION_LEVEL SAMPLE \
        --METRIC_ACCUMULATION_LEVEL LIBRARY \
        --METRIC_ACCUMULATION_LEVEL ALL_READS 

    """
1560
1561
1562
1563
1564
1565
1566
1567
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        ConvertSequencingArtifactToOxoG \
        --INPUT_BASE {params.inPasename} \
        --OUTPUT_BASE {params.prefix} \
        --REFERENCE_SEQUENCE {params.ref} \
    """
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CrosscheckFingerprints \
        --OUTPUT {output} \
        --HAPLOTYPE_MAP {params.htdbf} \
        --EXPECT_ALL_GROUPS_TO_MATCH true \
        {params.files} \
        --LOD_THRESHOLD -10.0 \
    """
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CollectMultipleMetrics \
        --INPUT {input} \
        --REFERENCE_SEQUENCE {params.ref} \
        --OUTPUT {params.prefix} \
        --ASSUME_SORTED true \
        --PROGRAM null \
        --PROGRAM CollectBaseDistributionByCycle \
        --PROGRAM CollectInsertSizeMetrics \
        --PROGRAM MeanQualityByCycle \
        --PROGRAM QualityScoreDistribution \
        --METRIC_ACCUMULATION_LEVEL null \
        --METRIC_ACCUMULATION_LEVEL ALL_READS

    """
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CollectMultipleMetrics \
        --INPUT {input} \
        --REFERENCE_SEQUENCE {params.ref} \
        --OUTPUT {params.prefix} \
        --ASSUME_SORTED true \
        --PROGRAM null \
        --PROGRAM CollectAlignmentSummaryMetrics \
        --PROGRAM CollectGcBiasMetrics \
        --METRIC_ACCUMULATION_LEVEL null \
        --METRIC_ACCUMULATION_LEVEL READ_GROUP
    """
1680
1681
1682
1683
1684
1685
1686
1687
1688
1689
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        ValidateSamFile \
        --INPUT {input} \
        --OUTPUT {output} \
        --REFERENCE_SEQUENCE {params.ref} \
        --MODE VERBOSE \
        --IS_BISULFITE_SEQUENCED false
    """
1707
1708
1709
1710
1711
1712
1713
1714
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        ValidateSamFile \
        -V {input} \
        -R {params.ref} \
        --validation-type-to-exclude ALLELES
    """ 
1729
1730
1731
1732
1733
1734
1735
1736
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        BedToIntervalList \
        -I {input.bed} \
        -O {output} \
        -SD {input.seq_dict}
    """
1758
1759
1760
1761
1762
1763
1764
1765
1766
1767
1768
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CollectVariantCallingMetrics \
        -I {input.vcf} \
        --OUTPUT {params.prefix} \
        --DBSNP {params.dbsnp} \
        --SEQUENCE_DICTIONARY {params.refdict} \
        --TARGET_INTERVALS {input.intervals} \
        --GVCF_INPUT true
    """ 
1787
1788
1789
1790
1791
1792
1793
1794
shell:
    """
    java \
        -jar {params.path_to_tool}/DISCVRSeq-1.3.49.jar VariantQC \
        -R {params.ref} \
        -V {input} \
        -O {output}
    """
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
shell:
    """
    mkdir -p 000_CNV_Intervalsprep
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        PreprocessIntervals \
        -R {input.ref} \
        -L {params.bed} \
        --padding 250 \
        --bin-length 0 \
        -imr OVERLAPPING_ONLY \
        -O {output}
    """
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CollectReadCounts \
        -R {params.ref} \
        -L {input.intervals} \
        -imr OVERLAPPING_ONLY \
        -I {input.sample} \
        -O {output}
    """
1965
1966
1967
1968
1969
1970
1971
1972
1973
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        AnnotateIntervals \
        -L {input.intervals} \
        -R {params.ref} \
        -imr OVERLAPPING_ONLY \
        -O {output}
    """
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
shell:
    """
    echo {params.samples} 
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        FilterIntervals \
        -L {input.preprocessedIntervals}\
        --annotated-intervals {input.annotatedIntervals} \
        {params.samples} \
        -imr OVERLAPPING_ONLY \
        --minimum-gc-content 0.1 \
        --maximum-gc-content 0.9 \
        --minimum-mappability 0.9 \
        --maximum-mappability 1.0 \
        --minimum-segmental-duplication-content 0.0 \
        --maximum-segmental-duplication-content 0.5 \
        --low-count-filter-count-threshold 5 \
        --low-count-filter-percentage-of-samples 90.0 \
        --extreme-count-filter-minimum-percentile 1.0 \
        --extreme-count-filter-maximum-percentile 99.0 \
        --extreme-count-filter-percentage-of-samples 90.0 \
        -O {output}
    """
2042
2043
2044
2045
2046
2047
2048
2049
2050
2051
2052
2053
2054
2055
2056
shell:
    """
    # eval "$(conda shell.bash hook)"
    # set +u; conda activate gatk ; set -u

    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        DetermineGermlineContigPloidy  \
        -L {input.filteredIntervals}\
        {params.samples}  \
        -imr OVERLAPPING_ONLY \
        --contig-ploidy-priors {params.contigPloidyPriors} \
        -O {output.dir} \
        --output-prefix ploidy \
        --verbosity DEBUG
    """
2082
2083
2084
2085
2086
2087
2088
2089
2090
2091
2092
2093
2094
2095
2096
2097
shell:
    """
    # eval "$(conda shell.bash hook)"
    # set +u; conda activate gatk; set -u

    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        GermlineCNVCaller  \
        --run-mode COHORT \
        -L {input.interval} \
        -I {params.samples} \
        --contig-ploidy-calls {input.ploidy}/dogs-calls \
        --annotated-intervals {input.annotated} \
        --output-prefix {params.outpref} \
        --interval-merging-rule OVERLAPPING_ONLY \
        -O {params.outdir}
    """
2125
2126
2127
2128
2129
2130
2131
2132
2133
2134
2135
2136
2137
2138
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        PostprocessGermlineCNVCalls \
        --model-shard-path {input.model} \
        --calls-shard-path {input.call} \
        --allosomal-contig chrX --allosomal-contig chrY \
        --contig-ploidy-calls ploidy-calls \
        --sample-index {params.index} \
        --output-genotyped-intervals  {output.intervals} \
        --output-genotyped-segments  {output.segments} \
        --sequence-dictionary {input.seq_dict}

    """
2161
2162
2163
2164
2165
2166
2167
2168
2169
shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        VariantsToTable \
        -V {input} \
        -F CHROM -F POS -F END -GF NP -GF CN \
        -O {output}

    """
2187
2188
2189
2190
2191
2192
2193
shell:
    """

    sampleName=$(gzcat {input.vcf} | grep -v '##' | head -n1 | cut -f10)
    awk -v sampleName=$sampleName 'BEGIN {FS=OFS="\t"} {print sampleName, $0}' {input.txt} &gt; ${i}.seg; head ${i}.seg

    """
ShowHide 47 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/AbdelrahmanYahia/wes_gatk
Name: wes_gatk
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...