Assembly and intrahost / low-frequency variant calling for viral samples

Introduction

nf-core/vipr is a bioinformatics best-practice analysis pipeline for assembly and intrahost / low-frequency variant calling for viral samples.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker / Singularity containers, making installation trivial and results highly reproducible.

Pipeline Steps

Step	Main program/s
Trimming, combining of read-pairs per sample and QC	Skewer, FastQC
Decontamination	decont
Metagenomics classification / Sample purity	Kraken
Assembly to contigs	BBtools' Tadpole
Assembly polishing	ViPR Tools
Mapping to assembly	BWA, LoFreq
Low frequency variant calling	LoFreq
Coverage and variant AF plots (two processes)	Bedtools, ViPR Tools

Documentation

Documentation about the pipeline can be found in the docs/ directory.

Credits

This pipeline was originally developed by Andreas Wilm (andreas-wilm) at the Genome Institute of Singapore. It started out as an ecosystem around LoFreq and went through a couple of iterations. The current version had three predecessors: ViPR 1, ViPR 2 and ViPR 3.

An incomplete list of publications using (previous versions of) ViPR:

Plenty of people provided essential feedback, including:

October SESSIONS
Paola Florez DE SESSIONS
ZHU Yuan
Shuzhen SIM
CHU Wenhan Collins

Code Snippets

"""
# loop over readunits in pairs per sample
pairno=0
echo ${reads.join(" ")} | xargs -n2 | while read fq1 fq2; do
    let pairno=pairno+1
    # note: don't make reads smaller than assembler kmer length
    skewer --quiet -t ${task.cpus} -m pe -q 3 -n -l 31 -z -o pair\${pairno}-skewer-out \$fq1 \$fq2;
    cat *-trimmed-pair1.fastq.gz >> ${sample_id}_R1-trimmed.fastq.gz;
    cat *-trimmed-pair2.fastq.gz >> ${sample_id}_R2-trimmed.fastq.gz;
    rm *-trimmed-pair[12].fastq.gz;
done
fastqc -t ${task.cpus} ${sample_id}_R1-trimmed.fastq.gz ${sample_id}_R2-trimmed.fastq.gz;
"""

NextFlow FastQC From line 143 of master/main.nf
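
The pairing idiom in the snippet above can be tried outside the pipeline. The following is a minimal sketch, not part of nf-core/vipr: `pair_reads` is a hypothetical helper name, the file names are made up, and it assumes (as the pipeline snippet does) that file names contain no whitespace.

```shell
# Hypothetical helper: print one "pairN<TAB>fq1<TAB>fq2" line per read pair.
# Assumes an even number of arguments and no whitespace in file names.
pair_reads() {
    pairno=0
    # xargs -n2 echoes two arguments per line; read splits each line into fq1/fq2
    echo "$@" | xargs -n2 | while read -r fq1 fq2; do
        pairno=$((pairno + 1))
        printf 'pair%s\t%s\t%s\n' "$pairno" "$fq1" "$fq2"
    done
}

# Made-up file names: two sequencing lanes of one sample
pair_reads s1_L1_R1.fq.gz s1_L1_R2.fq.gz s1_L2_R1.fq.gz s1_L2_R2.fq.gz
```

In bash the `while` loop sits on the receiving end of a pipe and therefore runs in a subshell, so the incremented `pairno` is only meaningful inside the loop body, which is how the pipeline snippet uses it.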

"""
decont.py -i ${fq1} ${fq2} -t ${task.cpus} -c 0.5 -r ${cont_fasta} -o ${sample_id}_trimmed_decont;
# since this is the last fastqc processing step, let's run fastqc here
fastqc -t ${task.cpus} ${sample_id}_trimmed_decont_1.fastq.gz ${sample_id}_trimmed_decont_2.fastq.gz;
"""

NextFlow FastQC From line 176 of master/main.nf

"""
kraken --threads ${task.cpus} --preload --db ${kraken_db} \
  --paired ${fq1} ${fq2} > kraken.out;
# do not gzip! otherwise kraken-report happily runs (with some warnings) and produces rubbish results
kraken-report --db ${kraken_db} kraken.out > ${sample_id}_kraken.report
"""

NextFlow Kraken From line 196 of master/main.nf

"""
tadpole.sh -Xmx10g threads=${task.cpus} in=${fq1} in2=${fq2} out=${sample_id}_contigs.fa
"""

NextFlow TADpole From line 219 of master/main.nf

"""
set +e;
log=${sample_id}-gap-filled-assembly.log;
simple_contig_joiner.py -c ${contigs_fa} -r ${input_ref_fasta} \
  -s "${sample_id}-gap-filled-assembly" -o ${sample_id}-gap-filled-assembly.fa \
  -b "${sample_id}-gap-filled-assembly.gaps.bed" >& \$log;

rc=\$?
if [ \$rc -ne 0 ]; then
    # nothing to join, means we cannot continue. so stop here.
    grep 'Nothing to join' \$log && exit 3;
    exit \$rc;
fi
"""

NextFlow From line 241 of master/main.nf
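
The exit-code handling in the snippet above separates an expected dead end ("Nothing to join", reported as status 3) from any other failure, whose status is passed through unchanged. A minimal sketch of that decision in isolation follows; `classify_join_status` is a hypothetical name used only for illustration, and the log content is made up.

```shell
# Hypothetical helper mirroring the exit-code logic of the snippet above.
# $1: exit status of the joiner, $2: path to its log file.
classify_join_status() {
    rc=$1
    log=$2
    if [ "$rc" -ne 0 ]; then
        # A "Nothing to join" line in the log marks the expected dead end
        if grep -q 'Nothing to join' "$log"; then
            return 3
        fi
        # Any other failure keeps its original status
        return "$rc"
    fi
    return 0
}

# Made-up log content for demonstration
demo_log=$(mktemp)
echo 'Nothing to join' > "$demo_log"
status=0
classify_join_status 1 "$demo_log" || status=$?
echo "status: $status"
rm -f "$demo_log"
```

A dedicated status lets the calling workflow distinguish "this sample cannot be assembled further" from a genuine crash; how the caller reacts to status 3 is not shown in the snippet above.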

"""
# downsample to 1M reads to reduce runtime
seqtk sample -s 666 ${fq1} 1000000 | gzip > R1_ds.R1.fastq.gz;
seqtk sample -s 666 ${fq2} 1000000 | gzip > R2_ds.R2.fastq.gz;
polish_viral_ref.sh -t ${task.cpus} -1 R1_ds.R1.fastq.gz -2 R2_ds.R2.fastq.gz \
    -r ${assembly_fa} -o ${sample_id}_polished_assembly.fa
"""

NextFlow seqtk From line 270 of master/main.nf

"""
bwa index ${ref_fa};
samtools faidx ${ref_fa};
bwa mem -t ${task.cpus} ${ref_fa} ${fq1} ${fq2} | \
    lofreq viterbi -f ${ref_fa} - | \
    lofreq alnqual -u - ${ref_fa} | \
    lofreq indelqual --dindel -f ${ref_fa} - | \
    samtools sort -o ${sample_id}.bam -T ${sample_id}.final.tmp -;
samtools index ${sample_id}.bam;
samtools stats ${sample_id}.bam > ${sample_id}.bam.stats
"""

NextFlow SAMtools BWA lofreq From line 294 of master/main.nf

"""
samtools faidx ${ref_fa};
lofreq call-parallel --pp-threads ${task.cpus} -f ${ref_fa} \
   -d 1000000 --call-indels -o ${sample_id}.vcf.gz ${bam}
"""

NextFlow SAMtools lofreq From line 319 of master/main.nf

"""
# note: -d is one-based. -dz is zero-based but only non-zero values, so less explicit.
bedtools genomecov -d -ibam ${bam} | gzip > ${sample_id}.cov.gz;
"""

NextFlow BEDTools From line 340 of master/main.nf
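
The note on `-d` versus `-dz` can be made concrete with a small conversion. This is a sketch under assumptions: `d_to_dz` is a hypothetical helper, the three input lines are made-up data in the `chrom`, `position`, `depth` layout that `bedtools genomecov -d` prints, and the output approximates what `-dz` would report for the same positions.

```shell
# Hypothetical helper: convert one-based per-position depth lines
# (chrom, pos, depth) into zero-based lines, dropping zero-depth positions.
d_to_dz() {
    awk 'BEGIN { OFS = "\t" } $3 > 0 { print $1, $2 - 1, $3 }'
}

# Made-up input: position 1 has no coverage and disappears from the output
printf 'ref\t1\t0\nref\t2\t5\nref\t3\t7\n' | d_to_dz
```

Keeping the `-d` form, as the pipeline does, means every position appears in the file, so zero-coverage stretches can be read off directly rather than inferred from missing rows.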

"""
vipr_af_vs_cov_html.py --vcf ${vcf} --cov ${cov} --plot ${sample_id}_af-vs-cov.html;
vipr_gaps_to_n.py -i ${ref_fa} -c ${cov} > ${sample_id}_0cov2N.fa;
"""