Whole exome sequencing snakemake workflow based on GATK best practice

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

THIS WORKFLOW IS STILL UNDER DEVELOPMENT AND IS NOT READY YET!

Navigate to:

Description
Features
Workflow Diagram
Installation
Running the Analysis

Introduction

Note: This workflow is still under development.

WES GATK is a flexible and user-friendly whole exome sequencing workflow based on GATK best practices . It is designed for processing Illumina WES short reads data and features automatic sample table generation, Snakemake configuration file, and simplified workflow execution.

Workflow

Description:

The workflow starts by converting raw fastq file to an unmapped bam file to include the read group information. Then, Illumina adaptors are marked and reads are aligned to hg38 reference human genome using BWA and consdier the alternative haplotype mapping. Multiple BAM files of the same sample are merged then PCR and optical duplicates are marked and reads are sorted by genome coordinates. Sorted BAM files are analayzied to generate mapping QC metrics. The following steps include the Base Quality score recalibration calculation using known variants files and GATK's Machine learning algorithm to recalibrate the quality scores and apply these new qualities to the BAM file. We now can call the variants using GATK's "haplotypecaller" in GVCF format, and combine all samples in a single gvcf file. The merged file is used for a final joint variant calling step to produce a single vcf file the include the genotypes of all samples. For better annotation, the file is splitted into two files, one for indels, and the other for snps. The last steps are variant filtration and variant annotation using Nirvana and Annovar.

Features:

This workflow is an implemetation of the Gold Standard GATK best practice in addition to these features:

Exome implementation ( uses user provided intervals file for specefic location calling + X basepair padding (default = 100pb see below ) )
Merging samples run on multiple lanes
QUALIMAP and multiqc QC reports
Nirvana and Annovar Annotation (more info)
Joint Gentotyping for all samples
Automatic sample name, group, lane and read number recognition.
Automatic snakemake sample table and config file generation.
progress bar of the analysis (use argument --parse-snakemake-output see below )

Workflow Diagram:

Workflow

Quick Start

Installation

Clone the repository:

git clone https://github.com/AbdelrahmanYahia/wes_gatk.git

If you prefer to download the dependencies manually, you can find them [here].

Step-by-Step Installation Guide

Register ANNOVAR

To obtain the link, you need to register at Annovar website .

Make sure you have Conda and Mamba installed. If not, follow these steps:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh # Follow the prompts

Install Mamba:

conda install mamba -n base -c conda-forge

Restart your terminal.

Prepare the Environment

Change the permissions for the Prep_ENV.sh file:

chmod 777 wes_gatk/scripts/Prep_ENV.sh

Create the environment, install the tools, and download the annotations database:

./wes_gatk/scripts/Prep_ENV.sh $ANNOVAR_LINK

Follow the instruction here to download GATK required python env for CNV workflow This process may take some time.

You can also download all the required reference files using wes_gatk/scripts/gatk_download_data.sh :

chmod 777 wes_gatk/scripts/gatk_download_data.sh
bash wes_gatk/scripts/gatk_download_data.sh $DOWNLOAD_DIR

Running the Analysis

To start the analysis, activate the wes_gatk environment and run the wes.py file:

conda activate wes_gatk
cd wes_gatk
python3 wes.py WES --help

You can also use the Python file to generate the sample table and the config file, and then run Snakemake independently. Modify the parameters according to your needs:

conda activate wes_gatk
python3 wes.py WES \
 --input PTH/to/samples \
 --output PTH/to/outdir \
 --reference-fasta broad_hg38/Homo_sapiens_assembly38.fasta \
 --bed-file exome_bed/S07604715_Padded.bed \
 --gff-file broad_hg38/Homo_sapiens.GRCh38.109.gff3.gz \
 --nirvana-path ~/Nirvana \
 --annovar-path ~/annovar_source/annovar \
 --known-variants broad_hg38/1000G_omni2.5.hg38.vcf.gz \
 --reference-index broad_hg38/Homo_sapiens_assembly38.fasta \
 --generate-confs-only

As an alternative, you can edit the options in "prep_files.sh", then run it:

conda activate wes_gatk
bash prep_files.sh

To run a Snakemake dry-run:

conda activate wes_gatk
snakemake \ --snakefile workflow/Snakefile \ --configfile PTH/to/outdir/config.yml \ -n -j THREADS

Or you can run it by adding --dry-run argument to the python script:

conda activate wes_gatk
python3 wes.py WES \
 --input PTH/to/samples \
 --output PTH/to/outdir \
 --reference-fasta broad_hg38/Homo_sapiens_assembly38.fasta \
 --bed-file exome_bed/S07604715_Padded.bed \
 --gff-file broad_hg38/Homo_sapiens.GRCh38.109.gff3.gz \
 --nirvana-path ~/Nirvana \
 --annovar-path ~/annovar_source/annovar \
 --known-variants broad_hg38/1000G_omni2.5.hg38.vcf.gz \
 --reference-index broad_hg38/Homo_sapiens_assembly38.fasta \
 --generate-confs-only \
 --dry-run

Advanced Parameters

For advanced usage, you can refer to the following command-line options: usage: Basic Run Usage example: guap WES -i indir -o outdir --bed-file file --reference-fasta fasta.fasta --reference-index indexpath

options:
 -h, --help show this help message and exit
basic config:
 -i in path, --input in path
 Input directory path
 -o out path, --output out path
 Output directory path
Workflow configure:
 --threads N Number of total threads to use [default = all]
 --reference-fasta path/to/file.fa
 path to reference fasta file
 --bed-file path bed file path
 --gff-file path gff file path
 --nirvana-path path Path for Nirvana
 --annovar-path path Path for annovar
 --generate-confs-only
 Generate sample table and config file only
 --contig-ploidy-priors path
 Path prior ploidy file
 --call-CNV Perform GATK CNV pipeline
 --dont-use-gatk-bestparctice
 Generate sample table and config file only
QC configuration:
 --trimmomatic Use trimmomatic
 --trim-t N Number of threads to use during trim step
 --trim-min-length N trimmomatic min length [default = 30]
 --slidingwindow-size N
 trimmomatic sliding window size [default = 4]
 --slidingwindow-quality N
 trimmomatic sliding window quality score [default = 10]
 --trimmomatic-extra-args ='-args'
 A string value of extra args for trimmomatic (must be used with = with no spaces (--trimmomatic-extra-args='-arg1 -arg2'))
 --skip-QC Skipp Fastqc step
Samples pre-processing:
 --gen-ubam-threads N Number of threads to use during creating ubam and marking adaptors [default = 4]
 --gen-ubam-mem N Memory (GB) to use during creating ubam and marking adaptors [default = 5]
Aligner configuration:
 --index-threads N Number of threads to use during indexing ref [default = 4]
 --align-threads N Number of threads to use during sample alignment [default = 4]
 --align-mem N Memory (GB) to use during sample alignment [default = 32]
 --aligner-extra-args '-args'
 Extra arguments for aligner, use it with no spaces and add = ( --aligner-extra-args='-arg1 -arg2' )
 --reference-index path/to/ref
 path to reference index
 --reference-output-path path/to/ref
 path to reference index
 --reference-output-prefix path/to/ref
 path to reference index
 --index-fasta Index fasta file
Variant caller configuration:
 --padding N Interval padding to include
 --known-variants-indels path
 path to reference fasta file
 --known-variants-indels2 path
 path to reference fasta file
 --known-variants-snps path
 path to reference fasta file
 --caller-extra-args '-args'
 Extra arguments for caller, use it with no spaces and add = ( --caller-extra-args='-arg1 -arg2' )
 --calling-threads N Number of threads to use during variant calling [default = 4]
 --calling-mem N Memory in GB to use during variant calling [default = 8]
Snakemake Options:
 --dry-run performs snakemake dry run
 --export-dag performs snakemake dry run and exports DAG
 --smk-extra-args ='-args'
 A string value of extra args for snakemake(must be used with = with no spaces (--smk-extra-args='-arg1 -arg2'))
 --parse-snakemake-output
 prints progress bar instead of snakemake regular output
Annotation configuration:
 --annovar-protocol str
 Annovar Protocol defaults: refGene,avsnp150,clinvar_20221231,cosmic70,dbnsfp31a_interpro,EAS.sites.2015_08,EUR.sites.2015_08,gme,gnomad211_exome,SAS.sites.2015_08
 --annovar-operation str
 Annovar Protocol defaults: g,f,f,f,f,f,f,f,f,f
 --annovar-extra-args ='-args'
 A string value of extra args for annovar(must be used with = with no spaces (--annovar-extra-args='-arg1 -arg2'))
 --nirvana-extra-args ='-args'
 A string value of extra args for nirvana(must be used with = with no spaces (--nirvana-extra-args='-arg1 -arg2'))
 --annotation-threads N
 Number of threads to use during creating ubam and marking adaptors [default = 4]
 --annotation-mem N Memory (GB) to use during creating ubam and marking adaptors [default = 5]
Other:
 --continue continue analysis when re-run
 --overwrite overwrite output dir if exsits
 -n str, --name str Name of files [ default = guap_run[date time] ]
 --verbose verbose
 --quit print many output

Code Snippets

shell:
    "multiqc . -o multiqc/"

SnakeMake MultiQC From line 186 of rules/gatk-bwa-pipeline.smk

shell:
    """
    ext={params.ext}
    R1={input.R1}
    SM={wildcards.sample}
    PL="Illumina"
    LB=$SM
    if [[ "$ext" == *".gz" ]]; then
        RGID=$(head -n1 <(zcat {input.R1}) | sed 's/:/_/g' |cut -d "_" -f1,2,3,4)
        #RGID=$(zcat {input.R1} | head -n1 | sed 's/:/_/g' |cut -d "_" -f1,2,3,4)
    else
        RGID=$(head {input.R1} -n1 | sed 's/:/_/g' |cut -d "_" -f1,2,3,4)
    fi
    ## TODO: confirm to use this
    PU=$RGID.$LB 

    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        FastqToSam \
        -F1 {input.R1} \
        -F2 {input.R2} \
        -O {output} \
        -SM $SM \
        -LB $LB \
        -PL $PL \
        -RG $RGID \
        -PU $PU \
    """

SnakeMake gatk From line 211 of rules/gatk-bwa-pipeline.smk

shell:
    '''
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        MarkIlluminaAdapters \
        -I {input} \
        -O {output.bam} \
        -M {output.metrics} \
        # > {log} 2>&1
    '''

SnakeMake gatk From line 253 of rules/gatk-bwa-pipeline.smk

shell:
    '''
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        SamToFastq \
        -I {input.bam} \
        --FASTQ /dev/stdout \
        --CLIPPING_ATTRIBUTE XT --CLIPPING_ACTION 2 \
        --INTERLEAVE true -NON_PF true | bwa mem -K 100000000 \
        -v 3 {params.bwa_args} -t {threads} -p  -Y {params.index} /dev/stdin | gatk \
        --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        MergeBamAlignment \
        --VALIDATION_STRINGENCY SILENT \
        --EXPECTED_ORIENTATIONS FR \
        --ATTRIBUTES_TO_RETAIN X0 \
        --ALIGNED_BAM /dev/stdin \
        --UNMAPPED_BAM {input.bam} \
        --OUTPUT {output.bam} \
        -R {params.fa} \
        --PAIRED_RUN true \
        --SORT_ORDER "unsorted" \
        --IS_BISULFITE_SEQUENCE false \
        --ALIGNED_READS_ONLY false \
        --CLIP_ADAPTERS false \
        --MAX_RECORDS_IN_RAM 2000000 \
        --ADD_MATE_CIGAR true \
        --MAX_INSERTIONS_OR_DELETIONS -1 \
        --PRIMARY_ALIGNMENT_STRATEGY MostDistant \
        --UNMAPPED_READ_STRATEGY COPY_TO_TAG \
        --ALIGNER_PROPER_PAIR_FLAGS true \
        --UNMAP_CONTAMINANT_READS true {params.warp_workflow_params}
    '''

SnakeMake gatk BWA From line 309 of rules/gatk-bwa-pipeline.smk

shell:
    """
    samtools depth {input} | awk '{{sum+=$3}} END {{print "Average = ",sum/NR, "No of covered Nuc = ", NR}}' > {output.cov}
    samtools flagstat {input} > {output.stats}
    """

SnakeMake SAMtools From line 360 of rules/gatk-bwa-pipeline.smk

shell:
    """
    qualimap \
        bamqc \
        -bam {input.bam} \
        --java-mem-size=15G \
        --paint-chromosome-limits \
        --genome-gc-distr HUMAN \
        -nt {threads} \
        -skip-duplicated \
        --skip-dup-mode 0 \
        -outdir {output} \
        --feature-file {input.bed} \
        -outformat HTML
    """

SnakeMake QualiMap From line 383 of rules/gatk-bwa-pipeline.smk

shell:
    """
    picard MergeSamFiles  \
        {params.bams} \
        -OUTPUT {output} \
        --SORT_ORDER "unsorted" 
    """

SnakeMake Picard From line 421 of rules/gatk-bwa-pipeline.smk

shell:
    """
    picard MarkDuplicates -I {input} \
        -O {output.bam} \
        -M {output.matrix} \
        --VALIDATION_STRINGENCY SILENT \
        --OPTICAL_DUPLICATE_PIXEL_DISTANCE 2500 \
        --ASSUME_SORT_ORDER "queryname" \
        --CREATE_MD5_FILE true \
        --CLEAR_DT false \
        --ADD_PG_TAG_TO_READS false
    """

SnakeMake Picard From line 448 of rules/gatk-bwa-pipeline.smk

shell:
    """
gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
  SortSam \
  --INPUT {input} \
  --OUTPUT /dev/stdout \
  --SORT_ORDER "coordinate" \
  --CREATE_INDEX false \
  --CREATE_MD5_FILE false | gatk \
  --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
  SetNmMdAndUqTags \
  --INPUT /dev/stdin \
  --OUTPUT {output} \
  --CREATE_INDEX true \
  --CREATE_MD5_FILE true \
  --REFERENCE_SEQUENCE {params.fa}
    """

SnakeMake gatk From line 478 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        BaseRecalibrator \
        -R {params.ref} \
        -I {input} \
        --use-original-qualities \
        --known-sites {params.known_sites} \
        --known-sites {params.known_sites2} \
        --known-sites {params.known_sites3} \
        -L {params.bed} \
        --interval-padding {params.padding} \
        -O {output} 
    """

SnakeMake gatk From line 516 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        ApplyBQSR  -R {params.ref} \
        -I {input.bam} \
        -bqsr {input.report} -O {output} \
        -L {params.bed} \
        --interval-padding {params.padding} \
        --static-quantized-quals 10 --static-quantized-quals 20 --static-quantized-quals 30 \
        --add-output-sam-program-record \
        --create-output-bam-md5 \
        --use-original-qualities

    samtools index {output} 
    """

SnakeMake SAMtools gatk From line 549 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" BaseRecalibrator -R {params.ref} \
        -I {input} \
        --known-sites {params.known_sites} \
        --known-sites {params.known_sites2} \
        --known-sites {params.known_sites3} \
        -L {params.bed} \
        --interval-padding {params.padding} \
        -O {output} 
    """

SnakeMake gatk From line 584 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" AnalyzeCovariates -before {input.raw} \
        -after {input.bqsr} -plots {output}
    """

SnakeMake gatk From line 613 of rules/gatk-bwa-pipeline.smk

shell:
    """
    bedtools slop -i {input.bed} -g {params.ref}.fai -b {params.padding} > {output}
    """

SnakeMake BEDTools From line 640 of rules/gatk-bwa-pipeline.smk

shell:
    '''
    grep -vE "^@" {input.intervalslist} |
        awk -v OFS='\t' '$2=$2-1' |
        bedtools intersect -c -a {input.BED} -b - |
        cut -f6 > {output.target_overlap_counts}

    function restrict_to_overlaps() {{
        # print lines from whole-genome file from loci with non-zero overlap
        # with target intervals
        WGS_FILE=$1
        EXOME_FILE=$2
        paste {output.target_overlap_counts} $WGS_FILE |
            grep -Ev "^0" |
            cut -f 2- > $EXOME_FILE
        echo "Generated $EXOME_FILE"
    }}

    restrict_to_overlaps {input.UD} {output.UD}
    restrict_to_overlaps {input.BED} {output.BED}
    restrict_to_overlaps {input.MU} {output.MU}

    '''

SnakeMake BEDTools From line 668 of rules/gatk-bwa-pipeline.smk

shell:
    '''
    python3 {params.get_con_file} \
        --prefix {params.prefix} \
        --output {output.contamination} \
        --svdprefix {params.svdprefix} \
        --get_con_file {params.get_con_file} \
        --target_bam {input.target_bam} \
        --ref {input.ref} \
        --threads {threads} 
    '''

SnakeMake From line 720 of rules/gatk-bwa-pipeline.smk

shell:
    """
    cont=$(cat {input.contamination})
    echo $cont

    gatk --java-options "-Xmx{resources.reduced}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        HaplotypeCaller -R {params.ref} \
        -G StandardAnnotation -G StandardHCAnnotation \
        -G AS_StandardAnnotation {params.extra_args} \
        -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \
        -L {params.bed} \
        --interval-padding {params.padding} \
        -I {input.bam} --native-pair-hmm-threads {threads} -ERC GVCF -O {output.vcf} \
        -contamination $cont \
        -bamout {output.bam} \

    """

SnakeMake gatk From line 769 of rules/gatk-bwa-pipeline.smk

shell:
    """
    ## IMPORTANT!
    # Note that when uncalled alleles are dropped, 
    # the original GQ may increase. Use --keep-all-alts 
    # if GQ accuracy is a concern.


    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        ReblockGVCF \
        -R {params.ref} \
        -G StandardAnnotation \
        -G StandardHCAnnotation \
        -G AS_StandardAnnotation \
        --floor-blocks -OVI \
        -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \
        -L {params.bed} \
        --interval-padding {params.padding} \
        -do-qual-approx \
        -V {input} -O {output}

    """

SnakeMake gatk From line 811 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        VariantFiltration \
        -V {input} \
        -L {params.bed} \
        --interval-padding {params.padding} \
        --filter-expression "QD < 2.0 || FS > 30.0 || SOR > 3.0 || MQ < 40.0 || MQRankSum < -3.0 || ReadPosRankSum < -3.0" \
        --filter-name "HardFiltered" \
        -O {output}
    """

SnakeMake gatk From line 864 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CNNScoreVariants \
        -V {input.vcf} \
        -R {params.ref} \
        -O {output} \
        -I {input.bam} \
        -tensor-type read-tensor
    """

SnakeMake gatk From line 899 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        FilterVariantTranches \
        -V ~{input_vcf} \
        -O ~{vcf_basename}.filtered.vcf.gz \
        --resource {params.hapmap_resource_vcf} \
        --resource {params.omni_resource_vcf} \
        --resource {params.one_thousand_genomes_resource_vcf} \
        --resource {params.dbsnp_resource_vcf} \
        --info-key CNN_1D \
        --snp-tranche 99.95 \
        --indel-tranche 99.4 \
        --create-output-variant-index true
    """

SnakeMake gatk From line 940 of rules/gatk-bwa-pipeline.smk

shell:
    """
    touch {output}

    for i in {params.IDs} ; do
        echo -e  "${{i}}\t04-1_gvcf-processing/reblocked/${{i}}.gvcf.gz" | cat >> {output}
    done
    """

SnakeMake From line 978 of rules/gatk-bwa-pipeline.smk

shell:
    """
    sort -k1,1 -k2,2n {input.bedfile} | awk '
    {{
        # If it is a different chromosome or if it is a non-overlapping interval
        if ($1 != chrom) {{
            # If it is not the first interval
            if (chrom) {{
                print chrom, start, end;
            }}
            chrom = $1;
            start = $2;
            end = $3;
        }} else if ($3 > end) {{
            # If it is an overlapping interval
            end = $3;
        }}
    }}
    END {{
        # Print the last interval
        print chrom, start, end;
    }}
    ' > {output.newbed}
    """

SnakeMake From line 1003 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        SplitIntervals \
        -R {input.ref} \
        -L {input.intervals} \
        --scatter-count {params.scatter_count} \
        --interval-merging-rule ALL \
        -mode INTERVAL_SUBDIVISION \
        -O {output.dir} 
    """

SnakeMake gatk From line 1048 of rules/gatk-bwa-pipeline.smk

shell:
    """
    # TODO: we can implement the update here: by checking and replacing:
    ## --genomicsdb-update-workspace-path path_to_dir instead of:
    ## --genomicsdb-workspace-path path_to_dir
    ## but when updating I don't need to add -L as it figures that out from the samples automatically

    # sources: https://eriqande.github.io/eca-bioinf-handbook/variant-calling.html#genomics-db
    # finally  some notes:
    ## the documentation for GenomicsDBImport is replete with warnings 
    ## about how you should backup your genomics data bases before trying
    ## to update them. Apparently there is a good chance that any little
    ## glitch during the updating process could corrupt the entire data base and render it useless. Joy!

    ## With that as a preamble, it does seem that we should try creating some data bases and then adding samples to them.


    ### use --genomicsdb-shared-posixfs-optimizations  for HPC


    mkdir -p 04-2_GenomicsDB


    gatk --java-options "-Xmx{resources.reduced}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads} -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
        GenomicsDBImport \
        --sample-name-map {input.sample_map} \
        -L {input.intervals} \
        --genomicsdb-workspace-path {output} \
        --batch-size 50 \
        --reader-threads 4 \
        --merge-input-intervals \
        --consolidate \
        --genomicsdb-shared-posixfs-optimizations true \

    """

SnakeMake gatk From line 1085 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        GenotypeGVCFs \
        -G StandardAnnotation -G StandardHCAnnotation \
        -G AS_StandardAnnotation \
        -R {params.ref} -V gendb://{input.infile} -O {output} \
        {params.Additional_annotation} --merge-input-intervals \
        -L {input.intervals} \
        --only-output-calls-starting-in-intervals \

    """

SnakeMake gatk From line 1141 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' \
        GatherVcfsCloud \
        --ignore-safety-checks \
        --gather-type BLOCK \
        {params.gvcfs} \
        -O {output}

    tabix {output}
    """

SnakeMake gatk tabix From line 1174 of rules/gatk-bwa-pipeline.smk

shell:
    """
    touch {output}
    gatk VariantEval \
        -R {params.ref} \
        --eval:{wildcards.sample} {input} \
        -O {output}
    """

SnakeMake gatk From line 1204 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}"  SelectVariants \
        -R {params.ref} \
        -V {input} \
        --select-type-to-include INDEL \
        -O {output}
    """

SnakeMake gatk From line 1227 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk VariantFiltration \
        --variant {input} \
        --filter-expression "QD < 2.0"                  --filter-name "QD2" \
        --filter-expression "SOR > 10.0"                  --filter-name "SOR10" \
        --filter-expression "FS > 200.0"                --filter-name "FS200" \
        --filter-expression "ReadPosRankSum < -20.0"    --filter-name "ReadPosRankSum-20" \
        --filter-expression "InbreedingCoeff < -0.8"    --filter-name "InbreedingCoeff-0.8" \
        --create-output-variant-index true \
        --output {output}
    """

SnakeMake gatk From line 1252 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}"  SelectVariants \
        -R {params.ref} \
        -V {input} \
        --select-type-to-include SNP \
        -O {output}
    """

SnakeMake gatk From line 1280 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk VariantFiltration \
        --variant {input} \
        --filter-expression "QD < 2.0"              --filter-name "QD2" \
        --filter-expression "SOR > 3.0"             --filter-name "SOR3" \
        --filter-expression "FS > 60.0"             --filter-name "FS60" \
        --filter-expression "MQ < 40.0"             --filter-name "MQ40" \
        --filter-expression "MQRankSum < -12.5"     --filter-name "MQRankSum-12.5" \
        --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
        --create-output-variant-index true \
        --output {output}
    """

SnakeMake gatk From line 1303 of rules/gatk-bwa-pipeline.smk

shell:
    """
    bcftools csq -f {params.ref} -g {params.gff} {input} -Ov -o {output}
    """

SnakeMake BCFtools From line 1338 of rules/gatk-bwa-pipeline.smk

shell:
    """
        bcftools stats {input} > {output}
    """

SnakeMake BCFtools From line 1355 of rules/gatk-bwa-pipeline.smk

shell:
    """
    plot-vcfstats -p {output} {input}
    """

SnakeMake vcfstats From line 1371 of rules/gatk-bwa-pipeline.smk

shell:
    """
    eval "$(conda shell.bash hook)"
    set +u; conda activate dotnet; set -u
    dotnet {params.Nirvana_cmd} \
        -i {input} \
        -o {params.file_name} \
        -r {params.Nirvana_ref} \
        --sd {params.Nirvana_supplementray} \
        -c {params.Nirvana_cache} {params.extra}
    """

SnakeMake From line 1399 of rules/gatk-bwa-pipeline.smk

shell:
    """
    snpEff -Xmx{resources.mem_gb}G \
        -v -csvStats {output.stats} \
        -stats {output.html} \
        {params.genome} {input} > {output.vcf}
    """

SnakeMake snpEff From line 1433 of rules/gatk-bwa-pipeline.smk

shell:
    """
    mkdir -p {output}
    perl {params.annovar_dir}/table_annovar.pl {input} {params.annovar_dir}/humandb/ \
        -buildver hg38 \
        -out {params.output} -remove \
        -protocol {params.protocol} \
        -operation {params.operation} \
        -nastring . \
        -vcfinput \
        --thread {threads} {params.extra}
    """

SnakeMake From line 1461 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" 
        CollectQualityYieldMetrics \
        --INPUT {input} \
        --OQ true \
        --OUTPUT {output}
    """

SnakeMake gatk From line 1493 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CollectMultipleMetrics \
        --INPUT {input} \
        --REFERENCE_SEQUENCE {params.ref} \
        --OUTPUT {params.prefix} \
        --ASSUME_SORTED true \
        --PROGRAM null \
        --PROGRAM CollectAlignmentSummaryMetrics \
        --PROGRAM CollectSequencingArtifactMetrics \
        --PROGRAM QualityScoreDistribution \
        --PROGRAM CollectGcBiasMetrics \
        --METRIC_ACCUMULATION_LEVEL null \
        --METRIC_ACCUMULATION_LEVEL SAMPLE \
        --METRIC_ACCUMULATION_LEVEL LIBRARY \
        --METRIC_ACCUMULATION_LEVEL ALL_READS 

    """

SnakeMake gatk From line 1520 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        ConvertSequencingArtifactToOxoG \
        --INPUT_BASE {params.inPasename} \
        --OUTPUT_BASE {params.prefix} \
        --REFERENCE_SEQUENCE {params.ref} \
    """

SnakeMake gatk From line 1560 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CrosscheckFingerprints \
        --OUTPUT {output} \
        --HAPLOTYPE_MAP {params.htdbf} \
        --EXPECT_ALL_GROUPS_TO_MATCH true \
        {params.files} \
        --LOD_THRESHOLD -10.0 \
    """

SnakeMake gatk From line 1588 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CollectMultipleMetrics \
        --INPUT {input} \
        --REFERENCE_SEQUENCE {params.ref} \
        --OUTPUT {params.prefix} \
        --ASSUME_SORTED true \
        --PROGRAM null \
        --PROGRAM CollectBaseDistributionByCycle \
        --PROGRAM CollectInsertSizeMetrics \
        --PROGRAM MeanQualityByCycle \
        --PROGRAM QualityScoreDistribution \
        --METRIC_ACCUMULATION_LEVEL null \
        --METRIC_ACCUMULATION_LEVEL ALL_READS

    """

SnakeMake gatk From line 1616 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CollectMultipleMetrics \
        --INPUT {input} \
        --REFERENCE_SEQUENCE {params.ref} \
        --OUTPUT {params.prefix} \
        --ASSUME_SORTED true \
        --PROGRAM null \
        --PROGRAM CollectAlignmentSummaryMetrics \
        --PROGRAM CollectGcBiasMetrics \
        --METRIC_ACCUMULATION_LEVEL null \
        --METRIC_ACCUMULATION_LEVEL READ_GROUP
    """

SnakeMake gatk From line 1650 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        ValidateSamFile \
        --INPUT {input} \
        --OUTPUT {output} \
        --REFERENCE_SEQUENCE {params.ref} \
        --MODE VERBOSE \
        --IS_BISULFITE_SEQUENCED false
    """

SnakeMake gatk From line 1680 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        ValidateSamFile \
        -V {input} \
        -R {params.ref} \
        --validation-type-to-exclude ALLELES
    """ 

SnakeMake gatk From line 1707 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        BedToIntervalList \
        -I {input.bed} \
        -O {output} \
        -SD {input.seq_dict}
    """

SnakeMake gatk From line 1729 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CollectVariantCallingMetrics \
        -I {input.vcf} \
        --OUTPUT {params.prefix} \
        --DBSNP {params.dbsnp} \
        --SEQUENCE_DICTIONARY {params.refdict} \
        --TARGET_INTERVALS {input.intervals} \
        --GVCF_INPUT true
    """ 

SnakeMake gatk From line 1758 of rules/gatk-bwa-pipeline.smk

shell:
    """
    java \
        -jar {params.path_to_tool}/DISCVRSeq-1.3.49.jar VariantQC \
        -R {params.ref} \
        -V {input} \
        -O {output}
    """

SnakeMake From line 1787 of rules/gatk-bwa-pipeline.smk

shell:
    """
    mkdir -p 000_CNV_Intervalsprep
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        PreprocessIntervals \
        -R {input.ref} \
        -L {params.bed} \
        --padding 250 \
        --bin-length 0 \
        -imr OVERLAPPING_ONLY \
        -O {output}
    """

SnakeMake gatk From line 1907 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        CollectReadCounts \
        -R {params.ref} \
        -L {input.intervals} \
        -imr OVERLAPPING_ONLY \
        -I {input.sample} \
        -O {output}
    """

SnakeMake gatk From line 1938 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        AnnotateIntervals \
        -L {input.intervals} \
        -R {params.ref} \
        -imr OVERLAPPING_ONLY \
        -O {output}
    """

SnakeMake gatk From line 1965 of rules/gatk-bwa-pipeline.smk

shell:
    """
    echo {params.samples} 
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        FilterIntervals \
        -L {input.preprocessedIntervals}\
        --annotated-intervals {input.annotatedIntervals} \
        {params.samples} \
        -imr OVERLAPPING_ONLY \
        --minimum-gc-content 0.1 \
        --maximum-gc-content 0.9 \
        --minimum-mappability 0.9 \
        --maximum-mappability 1.0 \
        --minimum-segmental-duplication-content 0.0 \
        --maximum-segmental-duplication-content 0.5 \
        --low-count-filter-count-threshold 5 \
        --low-count-filter-percentage-of-samples 90.0 \
        --extreme-count-filter-minimum-percentile 1.0 \
        --extreme-count-filter-maximum-percentile 99.0 \
        --extreme-count-filter-percentage-of-samples 90.0 \
        -O {output}
    """

SnakeMake gatk From line 1996 of rules/gatk-bwa-pipeline.smk

shell:
    """
    # eval "$(conda shell.bash hook)"
    # set +u; conda activate gatk ; set -u

    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        DetermineGermlineContigPloidy  \
        -L {input.filteredIntervals}\
        {params.samples}  \
        -imr OVERLAPPING_ONLY \
        --contig-ploidy-priors {params.contigPloidyPriors} \
        -O {output.dir} \
        --output-prefix ploidy \
        --verbosity DEBUG
    """

SnakeMake gatk From line 2042 of rules/gatk-bwa-pipeline.smk

shell:
    """
    # eval "$(conda shell.bash hook)"
    # set +u; conda activate gatk; set -u

    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        GermlineCNVCaller  \
        --run-mode COHORT \
        -L {input.interval} \
        -I {params.samples} \
        --contig-ploidy-calls {input.ploidy}/dogs-calls \
        --annotated-intervals {input.annotated} \
        --output-prefix {params.outpref} \
        --interval-merging-rule OVERLAPPING_ONLY \
        -O {params.outdir}
    """

SnakeMake gatk From line 2082 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        PostprocessGermlineCNVCalls \
        --model-shard-path {input.model} \
        --calls-shard-path {input.call} \
        --allosomal-contig chrX --allosomal-contig chrY \
        --contig-ploidy-calls ploidy-calls \
        --sample-index {params.index} \
        --output-genotyped-intervals  {output.intervals} \
        --output-genotyped-segments  {output.segments} \
        --sequence-dictionary {input.seq_dict}

    """

SnakeMake gatk From line 2125 of rules/gatk-bwa-pipeline.smk

shell:
    """
    gatk --java-options "-Xmx{resources.mem_gb}G -XX:+UseParallelGC -XX:ParallelGCThreads={threads}" \
        VariantsToTable \
        -V {input} \
        -F CHROM -F POS -F END -GF NP -GF CN \
        -O {output}

    """

SnakeMake gatk From line 2161 of rules/gatk-bwa-pipeline.smk

shell:
    """

    sampleName=$(gzcat {input.vcf} | grep -v '##' | head -n1 | cut -f10)
    awk -v sampleName=$sampleName 'BEGIN {FS=OFS="\t"} {print sampleName, $0}' {input.txt} &gt; ${i}.seg; head ${i}.seg

    """