DeepVariant as a Nextflow pipeline
A Nextflow pipeline for running the Google DeepVariant variant caller.
What is DeepVariant and why in Nextflow?
In December 2017, the Google Brain team released DeepVariant, a variant caller based on deep learning.
In practice, DeepVariant first builds images from the BAM file, then applies a deep-learning image-recognition approach to call the variants, and finally converts the prediction output into the standard VCF format.
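The three stages above correspond to the three wrapper scripts this pipeline invokes (`dv_make_examples.py`, `dv_call_variants.py`, `dv_postprocess_variants.py`, shown in the code snippets further down). A minimal, runnable sketch of the chain, using an assumed sample name and `echo` stand-ins in place of the real tool invocations:

```shell
#!/usr/bin/env bash
# Sketch of the DeepVariant three-stage flow. Stage-script names are taken
# from this pipeline; the echo stand-ins keep the sketch runnable without
# DeepVariant installed. "sample.bam" is an assumed input name.
bam="sample.bam"
base="${bam%.bam}"
echo "stage 1: dv_make_examples.py --reads ${bam}"                      # BAM -> pileup images
echo "stage 2: dv_call_variants.py --examples ${base}_shardedExamples"  # images -> tfrecord
echo "stage 3: dv_postprocess_variants.py --outfile ${bam}.vcf"         # tfrecord -> VCF
```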
Running DeepVariant as a Nextflow pipeline offers users several advantages. Preprocessing steps automatically create the extra indexed and compressed files that DeepVariant requires as input, which users would otherwise have to produce manually. Variant calling can be performed on multiple BAM files at the same time, and thanks to Nextflow's internal parallelization no resources are wasted. Nextflow's support for Docker makes the results computationally reproducible and clean, since every step runs inside a Docker container.
For more detailed information about Google's DeepVariant, please refer to google/deepvariant or this blog post.
For more information about DeepVariant in Nextflow, please refer to this blog post.
Quick Start
Warning: DeepVariant can be very computationally intensive to run.
To test the pipeline, you can run:

```bash
nextflow run nf-core/deepvariant -profile test,docker
```
A typical run on whole-genome data looks like this:

```bash
nextflow run nf-core/deepvariant --genome hg19 --bam yourBamFile --bed yourBedFile -profile standard,docker
```

In this case, variants are called on the supplied BAM file, using the hg19 version of the reference genome. One VCF file is produced and can be found in the "results" folder.
A typical run on whole-exome data looks like this:

```bash
nextflow run nf-core/deepvariant --exome --genome hg19 --bam_folder myBamFolder --bed myBedFile -profile standard,docker
```
Documentation
The nf-core/deepvariant documentation is split into the following files:
- Pipeline configuration
More about the pipeline
As shown in the following picture, the workflow contains both preprocessing steps (light blue) and the actual variant-calling steps (darker blue).
Some input files are optional; if they are not given, they will be created automatically for the user during the preprocessing steps. If they are given, the preprocessing steps are skipped. For more information about preprocessing, please refer to the "INPUT PARAMETERS" section.
The workflow accepts one reference genome and multiple BAM files as input. The variant calling for the several input BAM files proceeds completely independently and produces independent VCF result files. The advantage of this approach is that the variant calling of the different BAM files can be parallelized internally by Nextflow, taking advantage of all the cores of the machine to obtain the results as fast as possible.
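As a rough illustration of the "create only what is missing" behaviour of the preprocessing steps, the sketch below checks for the auxiliary reference files the pipeline generates (the `.fai`, `.gz`, `.gz.fai`, and `.gz.gzi` companions of the reference FASTA; the file name `genome.fa` is an assumption for illustration):

```shell
#!/usr/bin/env bash
# Hedged sketch: each auxiliary file would be generated (via samtools faidx
# / bgzip in the real pipeline) only when the user did not supply it.
fasta="genome.fa"                # assumed reference name
created=0
for aux in "${fasta}.fai" "${fasta}.gz" "${fasta}.gz.fai" "${fasta}.gz.gzi"; do
  if [ ! -e "$aux" ]; then
    echo "would create $aux"              # preprocessing step runs
    created=$((created + 1))
  else
    echo "reusing user-supplied $aux"     # preprocessing step skipped
  fi
done
echo "$created file(s) to create"
```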
Credits
This pipeline was originally developed at Lifebit by @luisas to simplify and reduce the cost of variant-calling analyses.
Many thanks to nf-core and those who have helped out along the way too, including (but not limited to): @ewels, @MaxUlysse, @apeltzer, @sven1103 & @pditommaso.
Code Snippets
```bash
"""
samtools faidx $fasta
"""
```

```bash
"""
bgzip -c ${fasta} > ${fasta}.gz
"""
```

```bash
"""
samtools faidx $fastagz
"""
```

```bash
"""
bgzip -c -i ${fasta} > ${fasta}.gz
"""
```
```bash
"""
mkdir ready
[[ `samtools view -H ${bam} | grep '@RG' | wc -l` > 0 ]] && { mv $bam ready;} || { picard AddOrReplaceReadGroups \
    I=${bam} \
    O=ready/${bam} \
    RGID=${params.rgid} \
    RGLB=${params.rglb} \
    RGPL=${params.rgpl} \
    RGPU=${params.rgpu} \
    RGSM=${params.rgsm};}
cd ready; samtools index ${bam};
"""
```
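The snippet above branches on whether the BAM header already contains a read-group (`@RG`) line. A small runnable sketch of that check, with an inline header standing in for the output of `samtools view -H`:

```shell
#!/usr/bin/env bash
# Stand-in for the header that `samtools view -H ${bam}` would print;
# the tab-separated fields mimic a minimal SAM header with one @RG line.
header='@HD	VN:1.6
@RG	ID:1	SM:sample'
count=$(printf '%s\n' "$header" | grep -c '@RG')   # number of read-group lines
if [ "$count" -gt 0 ]; then                        # numeric test on the count
  echo "read group present: BAM used as-is"
else
  echo "no read group: picard AddOrReplaceReadGroups would run"
fi
```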
```bash
"""
mkdir logs
mkdir ${bam.baseName}_shardedExamples
dv_make_examples.py \
  --cores ${task.cpus} \
  --sample ${bam} \
  --ref ${fastagz} \
  --reads ${bam} \
  --regions ${bed} \
  --logdir logs \
  --examples ${bam.baseName}_shardedExamples
"""
```

```bash
"""
dv_call_variants.py \
  --cores ${task.cpus} \
  --sample ${bam} \
  --outfile ${bam.baseName}_call_variants_output.tfrecord \
  --examples $shardedExamples \
  --model ${model}
"""
```

```bash
"""
dv_postprocess_variants.py \
  --ref ${fastagz} \
  --infile call_variants_output.tfrecord \
  --outfile "${bam}.vcf"
"""
```

```bash
"""
echo $workflow.manifest.version &> v_nf_deepvariant.txt
echo $workflow.nextflow.version &> v_nextflow.txt
ls /opt/conda/pkgs/ &> v_deepvariant.txt
python --version &> v_python.txt
pip --version &> v_pip.txt
samtools --version &> v_samtools.txt
lbzip2 --version &> v_lbzip2.txt
bzip2 --version &> v_bzip2.txt
scrape_software_versions.py &> software_versions_mqc.yaml
"""
```