DeepVariant as a Nextflow pipeline
A Nextflow pipeline for running the Google DeepVariant variant caller.
What is DeepVariant and why in Nextflow?
In December 2017, the Google Brain team released DeepVariant, a variant caller based on deep learning.
In practice, DeepVariant first builds images from the BAM file, then applies a deep-learning image-recognition approach to call the variants, and finally converts the prediction output into the standard VCF format.
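The three stages above correspond to the three wrapper scripts this pipeline invokes (`dv_make_examples.py`, `dv_call_variants.py`, `dv_postprocess_variants.py`, shown in the code snippets further down). A minimal, runnable sketch of the chain, using an assumed sample name and `echo` stand-ins in place of the real tool invocations:

```shell
#!/usr/bin/env bash
# Sketch of the DeepVariant three-stage flow. Stage-script names are taken
# from this pipeline; the echo stand-ins keep the sketch runnable without
# DeepVariant installed. "sample.bam" is an assumed input name.
bam="sample.bam"
base="${bam%.bam}"
echo "stage 1: dv_make_examples.py --reads ${bam}"                      # BAM -> pileup images
echo "stage 2: dv_call_variants.py --examples ${base}_shardedExamples"  # images -> tfrecord
echo "stage 3: dv_postprocess_variants.py --outfile ${bam}.vcf"         # tfrecord -> VCF
```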
Running DeepVariant as a Nextflow pipeline offers users several advantages. Preprocessing steps automatically create the extra indexed and compressed files that DeepVariant requires as input, which users would otherwise have to produce manually. Variant calling can be performed on multiple BAM files at the same time, and thanks to Nextflow's internal parallelization no resources are wasted. Nextflow's support for Docker makes the results computationally reproducible and clean, since every step runs inside a Docker container.
For more detailed information about Google's DeepVariant, please refer to google/deepvariant or this blog post.
For more information about DeepVariant in Nextflow, please refer to this blog post.
Quick Start
Warning: DeepVariant can be very computationally intensive to run.
To test the pipeline, you can run:

```bash
nextflow run nf-core/deepvariant -profile test,docker
```
A typical run on whole-genome data looks like this:

```bash
nextflow run nf-core/deepvariant --genome hg19 --bam yourBamFile --bed yourBedFile -profile standard,docker
```

In this case, variants are called on the supplied BAM file, using the hg19 version of the reference genome. One VCF file is produced and can be found in the "results" folder.
A typical run on whole-exome data looks like this:

```bash
nextflow run nf-core/deepvariant --exome --genome hg19 --bam_folder myBamFolder --bed myBedFile -profile standard,docker
```
Documentation
The nf-core/deepvariant documentation is split into the following files:
- Pipeline configuration
More about the pipeline
As shown in the following picture, the workflow contains both preprocessing steps (light blue) and the actual variant-calling steps (darker blue).
Some input files are optional; if they are not given, they will be created automatically for the user during the preprocessing steps. If they are given, the preprocessing steps are skipped. For more information about preprocessing, please refer to the "INPUT PARAMETERS" section.
The workflow accepts one reference genome and multiple BAM files as input. The variant calling for the several input BAM files proceeds completely independently and produces independent VCF result files. The advantage of this approach is that the variant calling of the different BAM files can be parallelized internally by Nextflow, taking advantage of all the cores of the machine to obtain the results as fast as possible.
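As a rough illustration of the "create only what is missing" behaviour of the preprocessing steps, the sketch below checks for the auxiliary reference files the pipeline generates (the `.fai`, `.gz`, `.gz.fai`, and `.gz.gzi` companions of the reference FASTA; the file name `genome.fa` is an assumption for illustration):

```shell
#!/usr/bin/env bash
# Hedged sketch: each auxiliary file would be generated (via samtools faidx
# / bgzip in the real pipeline) only when the user did not supply it.
fasta="genome.fa"                # assumed reference name
created=0
for aux in "${fasta}.fai" "${fasta}.gz" "${fasta}.gz.fai" "${fasta}.gz.gzi"; do
  if [ ! -e "$aux" ]; then
    echo "would create $aux"              # preprocessing step runs
    created=$((created + 1))
  else
    echo "reusing user-supplied $aux"     # preprocessing step skipped
  fi
done
echo "$created file(s) to create"
```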
Credits
This pipeline was originally developed at Lifebit by @luisas to simplify and reduce the cost of variant-calling analyses.
Many thanks to nf-core and those who have helped out along the way too, including (but not limited to): @ewels, @MaxUlysse, @apeltzer, @sven1103 & @pditommaso.
Code Snippets
```bash
"""
samtools faidx $fasta
"""
```

```bash
"""
bgzip -c ${fasta} > ${fasta}.gz
"""
```

```bash
"""
samtools faidx $fastagz
"""
```

```bash
"""
bgzip -c -i ${fasta} > ${fasta}.gz
"""
```
```bash
"""
mkdir ready
[[ `samtools view -H ${bam} | grep '@RG' | wc -l` > 0 ]] && { mv $bam ready;} || { picard AddOrReplaceReadGroups \
    I=${bam} \
    O=ready/${bam} \
    RGID=${params.rgid} \
    RGLB=${params.rglb} \
    RGPL=${params.rgpl} \
    RGPU=${params.rgpu} \
    RGSM=${params.rgsm};}
cd ready; samtools index ${bam};
"""
```
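The snippet above branches on whether the BAM header already contains a read-group (`@RG`) line. A small runnable sketch of that check, with an inline header standing in for the output of `samtools view -H`:

```shell
#!/usr/bin/env bash
# Stand-in for the header that `samtools view -H ${bam}` would print;
# the tab-separated fields mimic a minimal SAM header with one @RG line.
header='@HD	VN:1.6
@RG	ID:1	SM:sample'
count=$(printf '%s\n' "$header" | grep -c '@RG')   # number of read-group lines
if [ "$count" -gt 0 ]; then                        # numeric test on the count
  echo "read group present: BAM used as-is"
else
  echo "no read group: picard AddOrReplaceReadGroups would run"
fi
```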
```bash
"""
mkdir logs
mkdir ${bam.baseName}_shardedExamples
dv_make_examples.py \
  --cores ${task.cpus} \
  --sample ${bam} \
  --ref ${fastagz} \
  --reads ${bam} \
  --regions ${bed} \
  --logdir logs \
  --examples ${bam.baseName}_shardedExamples
"""
```

```bash
"""
dv_call_variants.py \
  --cores ${task.cpus} \
  --sample ${bam} \
  --outfile ${bam.baseName}_call_variants_output.tfrecord \
  --examples $shardedExamples \
  --model ${model}
"""
```

```bash
"""
dv_postprocess_variants.py \
  --ref ${fastagz} \
  --infile call_variants_output.tfrecord \
  --outfile "${bam}.vcf"
"""
```

```bash
"""
echo $workflow.manifest.version &> v_nf_deepvariant.txt
echo $workflow.nextflow.version &> v_nextflow.txt
ls /opt/conda/pkgs/ &> v_deepvariant.txt
python --version &> v_python.txt
pip --version &> v_pip.txt
samtools --version &> v_samtools.txt
lbzip2 --version &> v_lbzip2.txt
bzip2 --version &> v_bzip2.txt
scrape_software_versions.py &> software_versions_mqc.yaml
"""
```