CoVigator pipeline: variant detection pipeline for SARS-CoV-2 (and other viruses...)

public · 1yr ago · Version: v0.17.0 · 0 bookmarks

The Covigator pipeline transforms SARS-CoV-2 FASTQ or FASTA files into annotated and normalized VCF files for analysis. It also uses pangolin to classify samples into lineages. The pipeline is built on the Nextflow architecture (Di Tommaso, 2017), and it may be utilized independently of the CoVigator dashboard and knowledge base. Although it is set up by default to analyze SARS-CoV-2, it can also be used to analyze other microbiological organisms if the necessary references are provided. The process produces one or more annotated VCFs with a list of SNVs and indels ready for analysis.

Code Snippets

24
25
26
27
28
29
30
"""
# Prepare the reference genome bundle: copy the FASTA into a local folder and
# build every index required downstream — the bwa-mem2 alignment index, the
# samtools FASTA index (.fai) and the GATK/Picard sequence dictionary (.dict).
mkdir -p reference
cp ${reference} reference/sequences.fa
bwa-mem2 index reference/sequences.fa
samtools faidx reference/sequences.fa
gatk CreateSequenceDictionary --REFERENCE reference/sequences.fa 
"""
52
53
54
55
56
57
58
59
"""
# Build a custom snpEff annotation database from the provided FASTA and GFF.
# snpEff expects a data-dir layout of <dataDir>/<genome>/{sequences.fa,genes.gff}
# plus a config line mapping "<genome>.genome" to a display name.
mkdir -p snpeff/${snpeff_organism}
# NOTE(review): the echo arguments are unquoted; assumes snpeff_organism
# contains no whitespace or shell metacharacters — confirm upstream.
echo ${snpeff_organism}.genome : ${snpeff_organism} > snpeff/snpEff.config
cp ${reference} snpeff/${snpeff_organism}/sequences.fa
cp ${gff} snpeff/${snpeff_organism}/genes.gff
cd snpeff
snpEff build -gff3 -v ${snpeff_organism} -dataDir .
"""
22
23
24
25
26
27
28
29
30
31
"""
# Adapter/quality trimming of paired-end reads with fastp; writes the trimmed
# FASTQ pair plus JSON and HTML QC reports.
# --input_files needs to be forced, otherwise it is inherited from profile in tests
fastp \
--in1 ${fastq1} \
--in2 ${fastq2} \
--out1 ${fastq1.baseName}.trimmed.fq.gz \
--out2 ${fastq2.baseName}.trimmed.fq.gz \
--json ${name}.fastp_stats.json \
--html ${name}.fastp_stats.html
"""
50
51
52
53
54
55
56
57
"""
# Single-end variant of the fastp trimming step; same outputs as the
# paired-end step minus the second read file.
# --input_files needs to be forced, otherwise it is inherited from profile in tests
fastp \
--in1 ${fastq1} \
--out1 ${fastq1.baseName}.trimmed.fq.gz \
--json ${name}.fastp_stats.json \
--html ${name}.fastp_stats.html
"""
19
20
21
22
23
"""
# Align paired-end reads with bwa-mem2, convert SAM to uncompressed BAM
# (-u avoids a wasted compression round-trip) and coordinate-sort in one stream.
bwa-mem2 mem -t ${task.cpus} ${reference} ${fastq1} ${fastq2} | \
samtools view -uS - | \
samtools sort - > ${name}.bam
"""
40
41
42
43
44
"""
# Single-end variant of the bwa-mem2 alignment step; identical streaming
# SAM -> uncompressed BAM -> coordinate sort.
bwa-mem2 mem -t ${task.cpus} ${reference} ${fastq1} | \
samtools view -uS - | \
samtools sort - > ${name}.bam
"""
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
"""
# BAM preprocessing: clean alignments, attach a read group (required by GATK
# tools downstream), remove duplicates and index the final BAM.
mkdir tmp

# CleanSam output is streamed straight into AddOrReplaceReadGroups via
# /dev/stdout -> /dev/stdin; fixed dummy read-group values are sufficient for
# single-sample viral data.
gatk CleanSam \
--java-options '-Xmx${params.memory} -Djava.io.tmpdir=./tmp' \
--INPUT ${bam} \
--OUTPUT /dev/stdout | \
gatk AddOrReplaceReadGroups \
--java-options '-Xmx${params.memory} -Djava.io.tmpdir=./tmp' \
--VALIDATION_STRINGENCY SILENT \
--INPUT /dev/stdin \
--OUTPUT ${name}.prepared.bam \
--REFERENCE_SEQUENCE ${reference} \
--RGPU 1 \
--RGID 1 \
--RGSM ${name} \
--RGLB 1 \
--RGPL ILLUMINA

# removes duplicates (sorted from the alignment process)
sambamba markdup \
    -r \
    -t ${task.cpus} \
    --tmpdir=./tmp \
    ${name}.prepared.bam ${name}.preprocessed.bam

# removes intermediate BAM files
rm -f ${name}.prepared.bam

# indexes the output BAM file
sambamba index \
    -t ${task.cpus} \
    ${name}.preprocessed.bam ${name}.preprocessed.bai
"""
78
79
80
81
82
83
84
85
86
87
88
89
90
91
"""
# Soft-clip amplicon primer sequences from the BAM with ivar trim, then
# coordinate-sort and index the trimmed BAM for the downstream callers.
ivar trim \
-i ${bam} \
-b ${primers} \
-p ${bam.baseName}.trimmed

# SortSam is pointed at ./tmp for its temporary files; create it first —
# unlike the other GATK steps in this pipeline, this snippet previously
# relied on the directory already existing.
mkdir -p tmp

gatk SortSam \
--java-options '-Xmx${params.memory}  -Djava.io.tmpdir=./tmp' \
--INPUT ${bam.baseName}.trimmed.bam \
--OUTPUT ${bam.baseName}.trimmed.sorted.bam \
--SORT_ORDER coordinate

gatk BuildBamIndex --INPUT ${bam.baseName}.trimmed.sorted.bam
"""
111
112
113
114
"""
# Coverage QC: per-contig horizontal coverage summary plus per-base depth
# (-d 0 removes the depth cap; -H adds a column header to the TSV).
samtools coverage ${bam} > ${name}.coverage.tsv
samtools depth -s -d 0 -H ${bam} > ${name}.depth.tsv
"""
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
"""
# Variant calling with bcftools: pileup with BAQ recomputation, no depth cap
# and orphan reads counted, then haploid calling with the multiallelic
# caller, keeping variant sites only. Extra arguments can be injected via
# the params.args_bcftools_mpileup / params.args_bcftools_call options.
bcftools mpileup ${params.args_bcftools_mpileup} \
--redo-BAQ \
--max-depth 0 \
--min-BQ ${params.min_base_quality} \
--min-MQ ${params.min_mapping_quality} \
--count-orphans \
--fasta-ref ${reference} \
--annotate AD ${bam} | \
bcftools call ${params.args_bcftools_call} \
--multiallelic-caller \
--variants-only \
 --ploidy 1 \
 --output-type b - > ${name}.bcftools.bcf
"""
71
72
73
74
75
76
77
78
79
80
81
82
83
"""
# Low-frequency variant calling with LoFreq. Indel qualities are injected on
# the fly through a process substitution (lofreq indelqual --dindel) so that
# --call-indels can be used; the output is bgzipped and converted to BCF.
lofreq call ${params.args_lofreq} \
--min-bq ${params.min_base_quality} \
--min-alt-bq ${params.min_base_quality} \
--min-mq ${params.min_mapping_quality} \
--ref ${reference} \
--call-indels \
<( lofreq indelqual --dindel --ref ${reference} ${bam} ) | bgzip > ${name}.lofreq.vcf.gz

# NOTE: adding the tabix index is a dirty fix to deal with LoFreq VCF missing the chromosome in the header
bcftools index ${name}.lofreq.vcf.gz
bcftools view --output-type b ${name}.lofreq.vcf.gz > ${name}.lofreq.bcf
"""
103
104
105
106
107
108
109
110
111
112
113
114
115
116
"""
# Variant calling with GATK HaplotypeCaller on a haploid genome; the
# AlleleFraction annotation records per-site allele fractions. The VCF is
# also converted to BCF for downstream processing.
mkdir tmp
gatk HaplotypeCaller ${params.args_gatk} \
--java-options '-Xmx${params.memory} -Djava.io.tmpdir=tmp' \
--input $bam \
--output ${name}.gatk.vcf \
--reference ${reference} \
--ploidy 1 \
--min-base-quality-score ${params.min_base_quality} \
--minimum-mapping-quality ${params.min_mapping_quality} \
--annotation AlleleFraction

bcftools view --output-type b ${name}.gatk.vcf > ${name}.gatk.bcf
"""
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
"""
# Variant calling with iVar from a samtools pileup (-aa outputs absolutely
# all positions, --max-depth 0 removes the depth cap).
# NOTE(review): the minimum allele frequency is hard-coded to 0.03 here while
# the quality thresholds are parameterised — confirm this is intentional.
samtools mpileup ${params.args_ivar_samtools} \
-aa \
--count-orphans \
--max-depth 0 \
--redo-BAQ \
--min-BQ ${params.min_base_quality} \
--min-MQ ${params.min_mapping_quality} \
--fasta-ref ${reference} \
${bam} | \
ivar variants ${params.args_ivar} \
-p ${name}.ivar \
-q ${params.min_base_quality} \
-t 0.03 \
-r ${reference} \
-g ${gff}
"""
172
173
174
175
176
177
178
179
"""
# Convert the iVar TSV output into a VCF against the reference with the
# project script ivar2vcf.py, then convert to BCF.
ivar2vcf.py \
--fasta ${reference} \
--ivar ${tsv} \
--output-vcf ${name}.ivar.vcf

bcftools view --output-type b ${name}.ivar.vcf > ${name}.ivar.bcf
"""
200
201
202
203
204
205
206
207
208
209
"""
# Call variants from an assembled FASTA by pairwise alignment against the
# reference (project script); the alignment scoring scheme is fully
# parameterised through the match/mismatch/gap score options.
assembly_variant_caller.py \
--fasta ${fasta} \
--reference ${reference} \
--output-vcf ${name}.${caller}.vcf \
--match-score $params.match_score \
--mismatch-score $params.mismatch_score \
--open-gap-score $params.open_gap_score \
--extend-gap-score $params.extend_gap_score
"""
25
26
27
28
29
30
31
32
33
34
35
"""
# VCF normalisation: sort, split multiallelics, left-align and trim indels
# (erroring on reference mismatches), then drop exact duplicates. The
# comment lines between stages are legal — bash skips newlines and comments
# that follow a trailing pipe.
# initial sort of the VCF
bcftools sort ${vcf} | \

# checks reference genome, decompose multiallelics, trim and left align indels
bcftools norm --multiallelics -any --check-ref e --fasta-ref ${reference} \
--old-rec-tag OLD_CLUMPED - | \

# remove duplicates after normalisation
bcftools norm --rm-dup exact -o ${name}.${caller}.normalized.vcf -
"""
57
58
59
60
61
62
63
"""
# Phase variants using the reference FASTA and gene annotations (project
# script phasing.py).
# NOTE(review): presumably merges proximal variants affecting the same
# codon/protein — confirm the exact semantics in phasing.py.
phasing.py \
--fasta ${fasta} \
--gtf ${gtf} \
--input-vcf ${vcf} \
--output-vcf ${name}.${caller}.phased.vcf
"""
39
40
41
42
43
44
45
46
47
48
"""
# Functional annotation with snpEff, restricted to protein-level effects
# (upstream/downstream/intergenic/intronic effects suppressed, HGVS with
# one-letter amino acids); output is bgzip-compressed and tabix-indexed.
# for some reason the snpEff.config file needs to be in the folder where snpeff runs...
cp ${snpeff_config} .

snpEff eff -Xmx${memory} -dataDir ${snpeff_data} \
-noStats -no-downstream -no-upstream -no-intergenic -no-intron -onlyProtein -hgvs1LetterAa -noShiftHgvs \
${snpeff_organism}  ${vcf} | bgzip -c > ${name}.${caller}.vcf.gz

tabix -p vcf ${name}.${caller}.vcf.gz
"""
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
"""
# Classify variants into frequency tiers from the VAFator allele frequency:
# soft-filter tags LOW_FREQUENCY, SUBCLONAL and LOW_QUALITY_CLONAL are added
# to FILTER for records matching each exclusion expression; all records are
# kept in the output.
bgzip -c ${vcf} > ${name}.vcf.gz

tabix -p vcf ${name}.vcf.gz

# annotates low frequency and subclonal variants
bcftools view -Ob ${name}.vcf.gz | \
bcftools filter \
--exclude 'INFO/vafator_af < ${params.low_frequency_variant_threshold}' \
--soft-filter LOW_FREQUENCY - | \
bcftools filter \
--exclude 'INFO/vafator_af >= ${params.low_frequency_variant_threshold} && INFO/vafator_af < ${params.subclonal_variant_threshold}' \
--soft-filter SUBCLONAL \
--output-type v - | \
bcftools filter \
--exclude 'INFO/vafator_af >= ${params.subclonal_variant_threshold} && INFO/vafator_af < ${params.lq_clonal_variant_threshold}' \
--soft-filter LOW_QUALITY_CLONAL \
--output-type v - > ${name}.${caller}.vcf
"""
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
"""
# SARS-CoV-2 specific annotations: three ConsHMM conservation tracks, Pfam
# domain names and descriptions, removal of positions in the unreliable
# genome termini (POS <= 55 or POS >= 29804), and finally the community list
# of problematic sites copied into the INFO field.
bcftools annotate \
--annotations ${params.conservation_sarscov2} \
--header-lines ${params.conservation_sarscov2_header} \
-c CHROM,FROM,TO,CONS_HMM_SARS_COV_2 \
--output-type z ${vcf} | \
bcftools annotate \
--annotations ${params.conservation_sarbecovirus} \
--header-lines ${params.conservation_sarbecovirus_header} \
-c CHROM,FROM,TO,CONS_HMM_SARBECOVIRUS \
--output-type z - | \
bcftools annotate \
--annotations ${params.conservation_vertebrate} \
--header-lines ${params.conservation_vertebrate_header} \
-c CHROM,FROM,TO,CONS_HMM_VERTEBRATE_COV \
--output-type z - | \
bcftools annotate \
--annotations ${params.pfam_names} \
--header-lines ${params.pfam_names_header} \
-c CHROM,FROM,TO,PFAM_NAME \
--output-type z - | \
bcftools annotate \
--annotations ${params.pfam_descriptions} \
--header-lines ${params.pfam_descriptions_header} \
-c CHROM,FROM,TO,PFAM_DESCRIPTION - | \
bcftools filter \
--exclude 'POS <= 55 | POS >= 29804' \
--output-type z - > annotated_sarscov2.vcf.gz

tabix -p vcf annotated_sarscov2.vcf.gz

# Copy the FILTER column of the problematic-sites VCF into INFO/problematic.
bcftools annotate \
--annotations ${params.problematic_sites} \
--columns INFO/problematic:=FILTER annotated_sarscov2.vcf.gz > ${name}.${caller}.annotated_sarscov2.vcf
"""
165
166
167
168
169
170
"""
# Annotate technical metrics from the BAM onto the VCF with VAFator.
# NOTE(review): --bam takes a <sample-name> <bam> pair; "vafator" here is the
# sample-name label, not a file — confirm this matches the expected INFO keys
# (e.g. vafator_af used by the downstream filtering step).
vafator \
--input-vcf ${vcf} \
--output-vcf ${name}.${caller}.vaf.vcf \
--bam vafator ${bam} ${mq_param} ${bq_param}
"""
25
26
27
28
29
30
31
32
33
34
"""
# Lineage assignment with pangolin on the (consensus) FASTA.
# NOTE(review): this step uses params.cpus while the alignment steps use
# task.cpus — confirm which one is intended.
mkdir tmp

#--decompress-model
pangolin \
${fasta} \
--outfile ${name}.${caller}.pangolin.csv \
--tempdir ./tmp \
--threads ${params.cpus}
"""
53
54
55
56
57
58
59
60
61
62
"""
# Build a consensus FASTA by applying the called variants onto the
# reference; only PASS or unfiltered (FILTER=".") records are applied.
bcftools view -O b -o ${name}.bcf ${vcf}
bcftools index ${name}.bcf

# GATK results have all FILTER="."
bcftools consensus --fasta-ref ${reference} \
--include 'FILTER="PASS" | FILTER="."' \
--output ${name}.${caller}.fasta \
${name}.bcf
"""
24
25
26
27
"""
# Compress the final VCF with bgzip and build a tabix index for random access.
bgzip -c ${vcf} > ${name}.${caller}.vcf.gz
tabix -p vcf ${name}.${caller}.vcf.gz
"""
ShowHide 15 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/TRON-Bioinformatics/covigator-ngs-pipeline
Name: covigator
Version: v0.17.0
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...