Assembly and intrahost / low-frequency variant calling for viral samples

public public 1yr ago Version: dev 0 bookmarks

https://www.singularity-hub.org/static/img/hosted-singularity--hub-%23e32929.svg

Introduction

nf-core/vipr is a bioinformatics best-practice analysis pipeline for assembly and intrahost / low-frequency variant calling for viral samples.

The pipeline is built using Nextflow , a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker / singularity containers making installation trivial and results highly reproducible.

Pipeline Steps

Step Main program/s
Trimming, combining of read-pairs per sample and QC Skewer, FastQC
Decontamination decont
Metagenomics classification / Sample purity Kraken
Assembly to contigs BBtools' Tadpole
Assembly polishing ViPR Tools
Mapping to assembly BWA, LoFreq
Low frequency variant calling LoFreq
Coverage and variant AF plots (two processes) Bedtools, ViPR Tools

Documentation

Documentation about the pipeline can be found in the docs/ directory:

  1. Installation and configuration

  2. Running the pipeline

  3. Output and how to interpret the results

Credits

This pipeline was originally developed by Andreas Wilm ( andreas-wilm ) at Genome Institute of Singapore . It started out as an ecosystem around LoFreq and went through a couple of iterations. The current version had three predecessors ViPR 1 , ViPR 2 and ViPR 3

An incomplete list of publications using (previous versions of) ViPR:

Plenty of people provided essential feedback, including:

Code Snippets

143
144
145
146
147
148
149
150
151
152
153
154
155
"""
# loop over readunits in pairs per sample
pairno=0
echo ${reads.join(" ")} | xargs -n2 | while read fq1 fq2; do
    let pairno=pairno+1
    # note: don't make reads smaller than assembler kmer length
    skewer --quiet -t ${task.cpus} -m pe -q 3 -n -l 31 -z -o pair\${pairno}-skewer-out \$fq1 \$fq2;
    cat *-trimmed-pair1.fastq.gz >> ${sample_id}_R1-trimmed.fastq.gz;
    cat *-trimmed-pair2.fastq.gz >> ${sample_id}_R2-trimmed.fastq.gz;
    rm *-trimmed-pair[12].fastq.gz;
done
fastqc -t {task.cpus} ${sample_id}_R1-trimmed.fastq.gz ${sample_id}_R2-trimmed.fastq.gz;
"""
176
177
178
179
180
"""
decont.py -i ${fq1} ${fq2} -t ${task.cpus} -c 0.5 -r ${cont_fasta} -o ${sample_id}_trimmed_decont;
# since this is the last fastqc processing step, let's run fastqc here
fastqc -t {task.cpus} ${sample_id}_trimmed_decont_1.fastq.gz ${sample_id}_trimmed_decont_2.fastq.gz;
"""
196
197
198
199
200
201
"""
kraken --threads ${task.cpus} --preload --db ${kraken_db} \
  -paired ${fq1} ${fq2} > kraken.out;
# do not gzip! otherwise kraken-report happily runs (with some warnings) and produces rubbish results
kraken-report --db ${kraken_db} kraken.out > ${sample_id}_kraken.report
"""
219
220
221
"""
tadpole.sh -Xmx10g threads=${task.cpus} in=${fq1} in2=${fq2} out=${sample_id}_contigs.fa,
"""
241
242
243
244
245
246
247
248
249
250
251
252
253
254
"""
set +e;
log=${sample_id}-gap-filled-assembly.log;
simple_contig_joiner.py -c ${contigs_fa} -r ${input_ref_fasta} \
  -s "${sample_id}-gap-filled-assembly" -o ${sample_id}-gap-filled-assembly.fa \
  -b "${sample_id}-gap-filled-assembly.gaps.bed" >& \$log;

rc=\$?
if [ \$rc -ne 0 ]; then
    # nothing to join, means we cannot continue. so stop here.
    grep 'Nothing to join' \$log && exit 3;
    exit \$rc;
fi
"""
NextFlow From line 241 of master/main.nf
270
271
272
273
274
275
276
"""
# downsample to 1M reads to increase runtime
seqtk sample -s 666 ${fq1} 1000000 | gzip > R1_ds.R1.fastq.gz;
seqtk sample -s 666 ${fq2} 1000000 | gzip > R2_ds.R2.fastq.gz;
polish_viral_ref.sh -t ${task.cpus} -1 R1_ds.R1.fastq.gz -2 R2_ds.R2.fastq.gz \
    -r ${assembly_fa} -o ${sample_id}_polished_assembly.fa
"""
294
295
296
297
298
299
300
301
302
303
304
"""
bwa index ${ref_fa};
samtools faidx ${ref_fa};
bwa mem -t ${task.cpus} ${ref_fa} ${fq1} ${fq2} | \
    lofreq viterbi -f ${ref_fa} - | \
    lofreq alnqual -u - ${ref_fa} | \
    lofreq indelqual --dindel -f ${ref_fa} - | \
    samtools sort -o ${sample_id}.bam -T ${sample_id}.final.tmp -;
samtools index ${sample_id}.bam;
samtools stats ${sample_id}.bam > ${sample_id}.bam.stats
"""
319
320
321
322
323
"""
samtools faidx ${ref_fa};
lofreq call-parallel --pp-threads ${task.cpus} -f ${ref_fa} \
   -d 1000000 --call-indels -o ${sample_id}.vcf.gz ${bam}
"""
340
341
342
343
"""
# note: -d is one-based. -dz is zero-based but only non-zero values, so less explicit.
bedtools genomecov -d -ibam ${bam} | gzip > ${sample_id}.cov.gz;
"""
358
359
360
361
"""
vipr_af_vs_cov_html.py --vcf ${vcf} --cov ${cov} --plot ${sample_id}_af-vs-cov.html;
vipr_gaps_to_n.py -i ${ref_fa} -c ${cov} > ${sample_id}_0cov2N.fa;
"""
ShowHide 3 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public hafizmtalha
URL: https://nf-co.re/vipr
Name: vipr
Version: dev
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...