A Snakemake workflow for variant calling using GATK4 best practices



Motivation

  • This repository contains a Snakemake pipeline for variant calling from Illumina-generated sequences. It is based on the GATK best practices and supports either hard or soft filtering of variants.

  • Additionally, this pipeline aims to reproduce a recently published workflow that optimized GATK4 variant calling for Plasmodium falciparum (preprint). It is, however, not limited to P. falciparum and can be used for any organism of interest.

Pipeline sections

  • The pipeline handles paired-end reads; the analysis sections in the Snakefile are listed below:

Step 1 - Compile List of Output Files

  • rule all - gather all output files

Step 2 - Gather Genome Data

  • gather_genome_data : aggregate genome data from the snpeff folder

  • gatk_genome_dict : create genome dictionary for gatk tools

  • samtools_index : index the genome fasta file

  • bedops_gff2bed : convert the genome annotation .gff to .bed file

Step 3 - Perform Fastq Quality Control

  • trim_reads : trim adapters and low quality bases using trimmomatic or fastp

Step 4 - Map Reads to Genome

  • bwa_index : generate bwa genome-index files for mapping reads

  • bwa_mem : map reads to the genome, fix mate information, convert .sam to .bam and remove alignment artifacts

  • mark_duplicates : mark duplicate reads using gatk MarkDuplicatesSpark or Samblaster

Step 5 - Generate Mapping Quality Statistics

  • samtools_idxstats : report mapped and unmapped read counts per reference sequence

  • samtools_flagstats : calculate and summarize various alignment statistics

  • samtools_depth : calculate the depth of coverage for each position in the genome

  • gatk_insert_size_metrics : collect insert size metrics

  • gatk_alignment_summary_metrics : generate a summary of alignment metrics from the BAM file

Step 6 - Perform Variant Calling

  • gatk_haplotypecaller : call snps and indels via local re-assembly of haplotypes and generate gVCFs

  • generate_sample_name_map : generate a map of sample names and their respective gVCF files (see the example map format at the end of this step)

  • gatk_genomics_db_import : merge gVCFs into one genomic database

  • gatk_genotype_gvcfs : perform joint genotyping and generate the final VCF in which all samples have been jointly genotyped
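
  • For reference, the sample-name map consumed by GenomicsDBImport is a two-column, tab-separated file: the sample name, then the path to its gVCF. A hypothetical example (the paths shown are illustrative, not the pipeline's actual output layout):

    sample_A	output/sample_A.g.vcf.gz
    sample_B	output/sample_B.g.vcf.gz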

Step 7 - Perform Variant Filtering (Hard or Soft)

  • bcftools_normalize : normalize indels, left-align variants, split multiallelic sites into multiple rows and recover multiallelics from multiple rows

  • hard_filter_variants:

    • gatk_split_variants : separate snps and indels into separate vcf files

    • gatk_filter_hard : apply hard filters to snps and indels (reference thresholds are listed at the end of this step)

    • gatk_merge_vcfs : merge snps and indels into one vcf file

  • soft_filter_variants:

    • gatk_vqsr_indels : perform variant quality score recalibration on indels

    • gatk_apply_vqsr_indels : apply variant quality score recalibration to indels

    • gatk_vqsr_snps : perform variant quality score recalibration on snps

    • gatk_apply_vqsr_snps : apply variant quality score recalibration to snps

  • gatk_filter_pass : filter out variants that do not pass the hard or soft filters
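
  • For reference, GATK's documented generic hard-filtering recommendations are listed below; the expressions this pipeline actually applies are set in config/config.yaml and may differ:

    • snps : QD < 2.0, FS > 60.0, MQ < 40.0, MQRankSum < -12.5, ReadPosRankSum < -8.0

    • indels : QD < 2.0, FS > 200.0, ReadPosRankSum < -20.0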

Step 8 - Annotate Variants and Calculate Allele Frequencies

  • snpeff_annotate_variants : variant annotation and functional effect prediction

  • gatk_variants_to_table : extract variant information into a table

  • Rscript - calculate allele frequencies and transform the summary table from wide to long format (see the worked example below)
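
  • For intuition, the allele frequency is computed from the AD (allelic depths) field as ALT / (REF + ALT). A worked example with hypothetical depths:

    AD = "10,40"  (10 reads support REF, 40 support ALT)
    alt_allele_freq = 40 / (10 + 40) = 0.8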

Project dependencies:

  • Conda - an open-source package and environment management system that runs on various platforms, including Windows, macOS, and Linux

  • Snakemake - a workflow management system that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern specification language in Python style.

Where to start

  • Install Conda for your operating system (the pipeline is currently tested on Linux and macOS).

  • Clone this project using the following command in your terminal:

    • git clone https://github.com/kevin-wamae/variant-calling-with-Snakemake-and-GATK.git
  • Navigate into the cloned directory using the command below. This will be the root directory of the project:

    • cd variant-calling-with-Snakemake-and-GATK
  • Note: All subsequent commands should be run from the root directory of this project. However, users are free to modify the scripts to suit their needs.

Directory structure

  • Below is the default directory structure:

    • config/ - contains the Snakemake-configuration files

    • input/ - contains input files

      • bed/ - contains the bed files for specifying the intervals of interest

      • fastq/ - contains the FastQ files

      • known_sites/ - contains the positive-training dataset for variant filtering

    • output/ - contains numbered-output directories from the analysis

    • workflow/ - contains the Snakemake workflow files

      • envs/ - contains the Conda environment-configuration files

      • scripts/ - contains the scripts used in the pipeline

.
|-- config
|-- input
|   |-- bed
|   |-- fastq
|   `-- known_sites
|-- output
`-- workflow
    |-- envs
    `-- scripts
  • This pipeline uses glob_wildcards() to match FastQ sample names and mate files in the input/fastq/ directory, using the naming convention below (a minimal matching sketch follows this list):

    • reads_R1.fastq.gz = first mate

    • reads_R2.fastq.gz = second mate

    • If your files follow a different naming convention, you can rename them by executing the Python script in the workflow/scripts/ directory:

      • python workflow/scripts/fastq_rename.py
    • Therefore, the user can deposit their FastQ files in the input/fastq/ directory or edit the config/config.yaml file to point to a directory containing FastQ files, and the pipeline will automatically match the sample names and mate files.
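
    • As an illustration, below is a minimal sketch of how glob_wildcards() performs this matching inside the Snakefile; the pattern assumes the default input/fastq/ location:

      (SAMPLES,) = glob_wildcards("input/fastq/{sample}_R1.fastq.gz")
      # e.g. reads_R1.fastq.gz / reads_R2.fastq.gz yield SAMPLES == ["reads"];
      # the second mate is matched by substituting R2 into the same pattern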

  • The configuration file ( config/config.yaml ) specifies additional resources and can be modified to suit one's needs (an illustrative sketch follows this list), such as:

    • Input files

    • Output directories

    • The option to choose between tools and methods, e.g.:

      • fastp or trimmomatic for read trimming

      • gatk MarkDuplicatesSpark or samblaster for marking duplicates

      • hard or soft filtering of variants

    • Other parameters, such as the number of threads to use
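
    • For illustration only, a hedged sketch of what such a configuration might look like; the keys below are illustrative assumptions, not necessarily the pipeline's actual key names, so consult the shipped config/config.yaml for the real structure:

      threads: 4                           # threads per step (illustrative)
      fastq_dir: input/fastq/              # location of paired-end FastQ files (illustrative)
      trimming_tool: fastp                 # or: trimmomatic
      duplicate_tool: MarkDuplicatesSpark  # or: Samblaster
      filtering: hard                      # or: soft (VQSR)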

Running the analysis

After navigating into the root directory of the project, run the analysis by executing the following commands in your terminal:

  1. Create a conda analysis environment by running the command below in your terminal. This will create a conda environment named variant-calling-gatk and install Snakemake, SnpEff, and Graphviz (for visualizing the workflow) in the environment. Note: This only needs to be done once.

    • conda env create --file workflow/envs/environment.yaml
  2. Activate the conda environment by running the command below in your terminal. Note: This needs to be done every time you exit and restart your terminal and want to re-run this pipeline.

    • conda activate variant-calling-gatk
  3. Execute the shell script below to create the SnpEff database for variant annotation. This will download the P. falciparum genome data from PlasmoDB and create a database in the output/ directory. Note: This is an important step because the genome FASTA and GFF files are required for read mapping and variant calling. The script can also be modified to suit one's needs, such as downloading genome files for your organism of interest:

    • bash workflow/scripts/create_snpeff_db.sh
  4. Finally, execute the whole Snakemake pipeline by running the following command in your terminal:

    • snakemake --use-conda --cores 2 --jobs 1

    • This will run the whole pipeline using at most two CPU cores and one job at a time. The --cores flag sets the total number of cores Snakemake may use, and the --jobs flag caps the number of jobs that run in parallel.

    • If you want to run the pipeline using more resources, you can increase the number of cores and jobs. For example, to run the pipeline using 4 cores and 2 jobs in parallel, run the following command:

      • snakemake --use-conda --cores 4 --jobs 2
    • Additionally, you can change the threads entry in line 3 of the configuration file ( config/config.yaml ) to specify the number of threads each step in the pipeline uses.

  5. Once the analysis is complete, look through the output/ directory to view the results of the analysis.

  6. Summary statistics can be generated with standalone scripts in the workflow/scripts/ directory:

    • To do this, create a conda environment with the following command:

      • conda env create --file workflow/envs/variant-calling-stats.yaml

      • Activate the conda environment by running the following command: conda activate variant-calling-stats

    • To generate a summary of the raw reads, run the following command and look through the stats_1_raw_fastq.tsv file in the project directory:

      • python workflow/scripts/get_raw_fastq_stats.py
    • To generate a summary of the trimmed reads, run the following command and look through the stats_2_trimmed_fastq.tsv file:

      • python workflow/scripts/get_trimmed_fastq_stats.py
    • To generate a summary of the mapped reads, run the following command and look through the stats_3_mapped_reads.tsv file:

      • python workflow/scripts/get_mapped_reads_stats.py
    • To generate a summary of the variants called, run the following command and look through the stats_4_variant_calling.tsv file:

      • bash workflow/scripts/get_variant_calling_stats.sh
    • Exit this conda environment by running the following command:

      • conda deactivate
  7. Finally, you can deactivate the variant calling conda environment by running the following command:

    • conda deactivate

Feedback and Issues

Report any issues or bugs by opening an issue on the GitHub repository or contact me via email at wamaekevin[at]gmail.com.

Code Snippets

# workflow/scripts/sample_vcf_map.py
import os
import sys

# Usage: python sample_vcf_map.py <vcf_dir> <output_tsv>
vcf_dir = sys.argv[1]
tsv_path = sys.argv[2]

# Collect all gzipped VCF files in the directory
vcf_files = [f for f in os.listdir(vcf_dir) if f.endswith('.vcf.gz')]

# Pair each file path with its sample name (file name minus the .vcf.gz suffix)
file_list = []
for vcf_file in vcf_files:
    sample_name = vcf_file[:-len('.vcf.gz')]
    file_path = os.path.join(vcf_dir, vcf_file)
    file_list.append((file_path, sample_name))

# Sort by sample name so the map is deterministic
file_list.sort(key=lambda x: x[1])

# Write the sample-name/VCF-path map as a TSV file
with open(tsv_path, 'w') as out:
    for file_path, sample_name in file_list:
        out.write(sample_name + '\t' + file_path + '\n')

# workflow/scripts/split_annot_column.sh
#!/usr/bin/env bash
# Split the SnpEff ANN column (field 6) of a VariantsToTable TSV into its
# sixteen sub-fields. Reads from stdin and writes to stdout, e.g.:
#   gatk VariantsToTable ... -O /dev/stdout | bash split_annot_column.sh > out.tsv

awk -F '\t' 'BEGIN {
    OFS = FS
}
NR == 1 {
    # Print the header with new column names for the ANN sub-fields
    for (i = 1; i <= 5; ++i) {
        printf("%s%s", $i, OFS)
    }
    printf("Allele%sAnnotation%sAnnotation_Impact%sGene_Name%sGene_ID%sFeature_Type%sFeature_ID%sTranscript_BioType%sRank%sHGVS.c%sHGVS.p%scDNA.pos%sCDS.pos%sAA.pos%sDistance%sErrors", OFS, OFS, OFS, OFS, OFS, OFS, OFS, OFS, OFS, OFS, OFS, OFS, OFS, OFS, OFS)
    for (i = 7; i <= NF; ++i) {
        printf("%s%s", OFS, $i)
    }
    printf("\n")
    next
}
{
    # Print each record with the ANN field split on "|" and the original ANN field dropped
    for (i = 1; i <= 5; ++i) {
        printf("%s%s", $i, OFS)
    }
    split($6, ann, "|")
    for (i = 1; i <= 16; ++i) {
        printf("%s%s", ann[i], OFS)
    }
    for (i = 7; i <= NF; ++i) {
        printf("%s%s", $i, (i == NF) ? "\n" : OFS)
    }
}'

# workflow/scripts/vcf_allele_frequency.R
library(argparse)    # For command line argument parsing
library(data.table)  # For fast data manipulation
library(tidyverse)   # For data wrangling and visualization

# Define command line arguments
parser <- argparse::ArgumentParser()

parser$add_argument("--input",
                    type = "character",
                    help = "Input file path")

parser$add_argument("--output",
                    type = "character",
                    help = "Output file path")

args <- parser$parse_args(commandArgs(TRUE))

# Read in the data using data.table::fread for fast reading of large files
vcf <- data.table::fread(args$input, nThread = 4)

# Reshape the data from wide (one column set per sample) to long (one row per
# sample per variant) and compute the alternate-allele frequency from the
# AD ("REF,ALT" allelic depths) field
vcf <- vcf %>%
  mutate(across(23:ncol(.), as.character)) %>% # convert columns to be reshaped to character type
  mutate(row_id = row_number()) %>%            # add a unique row identifier
  pivot_longer(
               cols = -c(CHROM:Errors, row_id),
               names_to = c("sample_name", "variable"),
               names_sep = "\\.") %>%
  pivot_wider(names_from = "variable", values_from = "value") %>%
  select(-row_id) %>% # drop the row_id column
  mutate(
    # alt allele frequency = ALT / (REF + ALT); zero when no ALT-supporting reads
    alt_allele_freq = if_else(str_detect(AD, ",0$"), 0,
                          as.numeric(str_extract(AD, "\\d+$")) / (as.numeric(str_extract(AD, "^[^,]+")) + as.numeric(str_extract(AD, "\\d+$")))),
    alt_allele_freq = round(alt_allele_freq, 3)
  ) %>%
  filter(alt_allele_freq != 0) %>%
  arrange(sample_name, POS)

# Write the reshaped table once, using data.table::fwrite for speed
data.table::fwrite(vcf, args$output, nThread = 4)

# rule gather_genome_data
run:
    shell(  # cp - copy genome fasta file from snpeff database location
        """
        cp -f {input.genome} {output.genome}
        """
    )
    shell(  # cp - copy annotation file from snpeff database location
        """
        cp -f {input.gff} {output.gff}
        """
    )

# rule gatk_genome_dict
shell:
    """
    gatk --java-options "{params.java_opts}" CreateSequenceDictionary \
        --REFERENCE {input.genome} \
        --OUTPUT {output.genome_dict} \
        2> {log}
    """

# rule samtools_index
shell:
    """
    samtools faidx {input.genome}
    """

# rule bedops_gff2bed
shell:
    """
    convert2bed \
        --input=gff \
        --output=bed < {input} |\
    grep -e {params.feature} > {output}
    """

# rule trim_reads
shell:
    """
    echo "##############################################"
    echo "-------    Running {params.trimmer}    -------"
    echo "##############################################"

    if [[ "{params.trimmer}" == "Fastp" ]]; then
        fastp \
            --thread {threads} \
            {params.opts_fastp} \
            --in1 {input.r1} \
            --in2 {input.r2} \
            --out1 {output.r1} \
            --out2 {output.r2} \
            --unpaired1 {output.r1_unpaired} \
            --unpaired2 {output.r2_unpaired} \
            --json {log.json} \
            --html {log.html} \
            2> {log.log}
    else
        trimmomatic PE \
            -threads {threads} \
            {input.r1} {input.r2} \
            {output.r1} {output.r1_unpaired} \
            {output.r2} {output.r2_unpaired} \
            {params.opts_trimmomatic} \
            2> {log.log}
    fi
    """

# rule bwa_index
shell:
    """
    echo "##############################################"
    echo "-----------    Running BWA Index    ----------"
    echo "##############################################"

    bwa index -p {output.index} {input.genome} 2> {log}
    """

# rule bwa_mem
shell:
    """
    echo "#####################################################"
    echo "Running BWA-> Fixmate-> SamFormatConverter-> CleanSam"
    echo "#####################################################"

    bwa mem \
        -t {threads} \
        {params.extra_bwa} \
        {params.read_groups} \
        {input.idx} \
        {input.reads} \
        2> {log.bwa} |\
    samtools fixmate \
        --threads {threads} \
        {params.extra_fixmate} \
        --output-fmt sam \
        /dev/stdin \
        /dev/stdout \
        2> {log.fixmate} |\
    gatk --java-options "{params.java_opts}" SamFormatConverter \
        --INPUT /dev/stdin \
        --OUTPUT /dev/stdout \
        2> {log.sam2bam} |\
    gatk --java-options "{params.java_opts}" CleanSam \
        -R {input.genome} \
        -I /dev/stdin \
        -O {output.bam} \
        2> {log.cleansam}
    """

# rule mark_duplicates
shell:
    """
    echo "##############################################"
    echo "---  Running {params.mark_duplicate_tool} ---"
    echo "##############################################"

    if [ "{params.mark_duplicate_tool}" == "MarkDuplicatesSpark" ]; then
        gatk --java-options "{params.java_opts}" MarkDuplicatesSpark \
            --spark-master local[{threads}] \
            -I {input.bam} \
            -O {output.bam} \
            {params.extra_gatk} \
            2> {log}
    elif [ "{params.mark_duplicate_tool}" == "Samblaster" ]; then
        samtools view \
            -h {input.bam} |\
        samblaster \
            {params.extra_samblaster} \
            2> {log} |\
        samtools sort \
            -@ {threads} \
            -o {output.bam} \
            2>> {log}
        samtools index {output.bam}
    else
        echo "Unsupported mark duplicate tool selected: {params.mark_duplicate_tool}"
        exit 1
    fi
    """

# rule samtools_idxstats
shell:
    """
    echo "##############################################"
    echo "------    Running Samtools IdxStats    ------"
    echo "##############################################"

    samtools idxstats {input.bam} > {output.idxstats}
    """

# rule samtools_flagstats
shell:
    """
    echo "##############################################"
    echo "------    Running Samtools Flagstat    ------"
    echo "##############################################"

    samtools flagstat {input.bam} > {output.flagstat}
    """

# rule samtools_depth
shell:
    """
    echo "##############################################"
    echo "------    Running Samtools Depth    ------"
    echo "##############################################"

    samtools depth {input.bam} > {output.depth}
    """

# rule gatk_insert_size_metrics
shell:
    """
    echo "##############################################"
    echo "--  Running GATK CollectInsertSizeMetrics  --"
    echo "##############################################"

    gatk --java-options "{params.java_opts}" CollectInsertSizeMetrics \
        {params.extra} \
        -R {input.genome} \
        -I {input.bam} \
        -O {output.metrics} \
        -H {output.histogram} \
        2> {log}
    """

# rule gatk_alignment_summary_metrics
shell:
    """
    echo "##############################################"
    echo " Running GATK CollectAlignmentSummaryMetrics "
    echo "##############################################"

    gatk --java-options "{params.java_opts}" CollectAlignmentSummaryMetrics \
        {params.extra} \
        -R {input.genome} \
        -I {input.bam} \
        -O {output.metrics} \
        2> {log}
    """

# rule gatk_haplotypecaller
shell:
    """
    echo "##############################################"
    echo "-----    Running GATK HaplotypeCaller    -----"
    echo "##############################################"

    gatk --java-options "{params.java_opts}" HaplotypeCaller \
        --native-pair-hmm-threads {threads} \
        {params.extra} \
        -R {input.genome} \
        -L {input.intervals} \
        -I {input.bam} \
        -O {output.vcf} \
        2> {log}
    """

# rule generate_sample_name_map
shell:
    """
    echo "##############################################"
    echo "------    Generating VCF-Sample Map    ------"
    echo "##############################################"

    python workflow/scripts/sample_vcf_map.py \
        {params.directory} \
        {output}
    """

# rule gatk_genomics_db_import
shell:
    """
    echo "##############################################"
    echo "----    Running GATK GenomicsDBImport    ----"
    echo "##############################################"

    gatk --java-options "{params.java_opts}" GenomicsDBImport \
        {params.extra} \
        --reader-threads {threads} \
        --genomicsdb-workspace-path {output.dir} \
        --sample-name-map {input.sample_map} \
        -L {input.intervals} \
        2> {log}
    """

# rule gatk_genotype_gvcfs
shell:
    """
    echo "##############################################"
    echo "-----     Running GATK GenotypeGVCFs     -----"
    echo "##############################################"

    gatk --java-options "{params.java_opts}" GenotypeGVCFs \
        {params.extra} \
        -R {input.genome} \
        -V gendb://{input.db} \
        -L {input.intervals} \
        -O {output.vcf} \
        2> {log}
    """

# rule bcftools_normalize
shell:
    """
    echo "##############################################"
    echo "--- Running BCFtools Normalize & Index VCF ---"
    echo "##############################################"

    bcftools norm \
        {params.normalise} \
        --fasta-ref {input.genome} \
        {input.vcf} \
        2> {log} |\
    bcftools annotate \
        {params.annotate} \
        2>> {log} |\
    bcftools view \
        {params.view} \
        --output-type z \
        --output-file {output.vcf} \
        2>> {log}

    tabix -p vcf {output.vcf}
    """

# rule gatk_split_variants
shell:
    """
    echo "##############################################"
    echo "-----    Running GATK SplitVariants     -----"
    echo "##############################################"

    gatk --java-options "{params.java_opts}" SelectVariants \
        -R {input.genome} \
        -V {input.vcf} \
        -O {output.snps} \
        --select-type-to-include SNP \
        2> {log.snps}

    gatk --java-options "{params.java_opts}" SelectVariants \
        -R {input.genome} \
        -V {input.vcf} \
        -O {output.indels} \
        --select-type-to-include INDEL \
        2> {log.indels}
    """

# rule gatk_filter_hard
shell:
    """
    echo "##############################################"
    echo "---  Running GATK VariantFiltration (Hard) ---"
    echo "##############################################"

    gatk --java-options "{params.java_opts}" VariantFiltration \
        {params.extra_snps} \
        -R {input.genome} \
        -V {input.snps} \
        -O {output.snps} \
        2> {log.snps}

    gatk --java-options "{params.java_opts}" VariantFiltration \
        {params.extra_indels} \
        -R {input.genome} \
        -V {input.indels} \
        -O {output.indels} \
        2> {log.indels}
    """

# rule gatk_merge_vcfs
shell:
    """
    echo "##############################################"
    echo "---------   Running GATK MergeVcfs   ---------"
    echo "##############################################"

    gatk --java-options "{params.java_opts}" MergeVcfs \
        -I {input.snps} \
        -I {input.indels} \
        -O {output.vcf} \
        2> {log}
    """

# rule gatk_vqsr_indels
shell:
    """
    echo "##############################################"
    echo "- Running GATK VariantRecalibrator (INDELS) -"
    echo "##############################################"

    gatk --java-options "{params.java_opts}" VariantRecalibrator \
        {params.extra} \
        -R {input.genome} \
        -V {input.vcf} \
        -O {output.recal} \
        -resource:{params.resource} {params.known_sites} \
        --tranches-file {output.tranches} \
        --rscript-file {output.plots} \
        2> {log}
    """

# rule gatk_apply_vqsr_indels
shell:
    """
    echo "##############################################"
    echo "----   Running GATK ApplyVQSR (INDELS)   -----"
    echo "##############################################"

    gatk --java-options "{params.java_opts}" ApplyVQSR \
        {params.extra} \
        -R {input.genome} \
        -V {input.vcf} \
        -O {output.vcf} \
        --tranches-file {input.tranches} \
        --recal-file {input.recal} \
        2> {log}
    """

# rule gatk_vqsr_snps
shell:
    """
    echo "##############################################"
    echo "--  Running GATK VariantRecalibrator (SNPs) --"
    echo "##############################################"

    gatk --java-options "{params.java_opts}" VariantRecalibrator \
        {params.extra} \
        -R {input.genome} \
        -V {input.vcf} \
        -O {output.recal} \
        -resource:{params.resource} {params.known_sites} \
        --tranches-file {output.tranches} \
        --rscript-file {output.plots} \
        2> {log}
    """

# rule gatk_apply_vqsr_snps
shell:
    """
    echo "##############################################"
    echo "------  Running GATK ApplyVQSR (SNPs)  -------"
    echo "##############################################"

    gatk --java-options "{params.java_opts}" ApplyVQSR \
        {params.extra} \
        -R {input.genome} \
        -V {input.vcf} \
        -O {output.vcf} \
        --tranches-file {input.tranches} \
        --recal-file {input.recal} \
        2> {log}
    """

# rule gatk_filter_pass
shell:
    """
    echo "##############################################"
    echo "-------   Running GATK FilterVcfPass   -------"
    echo "##############################################"

    gatk --java-options "{params.java_opts}" SelectVariants \
        -V {input.vcf} \
        -O {output.vcf} \
        --exclude-filtered \
        2> {log}
    """

# rule snpeff_annotate_variants
shell:
    """
    echo "##############################################"
    echo "--------   Running SnpEff Annotate   --------"
    echo "##############################################"

    snpEff ann \
        {params.extra} \
        -config {params.config} \
        {params.database} \
        {input.vcf} | bgzip -c > {output.vcf}

    tabix -p vcf {output.vcf}
    """

# rule gatk_variants_to_table
shell:
    """
    echo "##############################################"
    echo "------   Running GATK VariantsToTable   ------"
    echo "##############################################"  

    gatk --java-options "{params.java_opts}" VariantsToTable \
        -V {input.vcf} \
        --fields CHROM \
        --fields POS \
        --fields REF \
        --fields ALT \
        --fields TYPE \
        --fields ANN \
        --genotype-fields GT \
        --genotype-fields AD \
        --genotype-fields DP \
        --genotype-fields GQ \
        --genotype-fields PGT \
        --genotype-fields PID \
        --genotype-fields PL \
        --genotype-fields PS \
        -O /dev/stdout \
        2> {log} |\
    bash workflow/scripts/split_annot_column.sh > {output.variants}
    """

# allele-frequency calculation (runs workflow/scripts/vcf_allele_frequency.R)
shell:
    """
    echo "##############################################"
    echo "--------   Running R AlleleFrequency  --------"
    echo "##############################################"

    Rscript workflow/scripts/vcf_allele_frequency.R \
        --input {input.variants} \
        --output {output.variants} \
        2> {log}
    """