Shotgun metagenomic sequencing processing pipeline


VDB Shotgun Pipeline

Prerequisites

Important Notes:

  • Set the location of your profile to the environment variable $SNAKEMAKE_PROFILE (eg export SNAKEMAKE_PROFILE=/path/to/your/profile/ )

  • For the purposes of the examples, we added the --dry-run flag for the user to preview the rules to be executed. Remove this step to execute the commands.

  • All database paths are configured in config/config.yaml. Change the paths to reflect where the databases can be found on your machine. For a uniform way to fetch and build all the databases, see https://github.com/vdblab/resources
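
The database section of config/config.yaml looks roughly like the following. This is an illustrative sketch only: the key names besides metaphlan_db (which appears in the StrainPhlAn usage below) are hypothetical, so check the shipped config/config.yaml for the real ones.

```yaml
# Hypothetical sketch of the database paths in config/config.yaml.
# Point each entry at the corresponding database fetched/built via
# https://github.com/vdblab/resources. Key names other than metaphlan_db
# are placeholders, not the pipeline's actual keys.
metaphlan_db: /data/brinkvd/resources/dbs/metaphlan/mpa_vJan21_CHOCOPhlAnSGB_202103/
kraken2_db: /path/to/dbs/kraken2/        # placeholder key and path
humann_db: /path/to/dbs/humann/          # placeholder key and path
card_db: /path/to/dbs/card/              # placeholder key and path (RGI)
```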

Main Pipeline

Usage

snakemake \
 --directory tmpout/ \
 --config \
 sample=473 \
 R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
 R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
 nshards=4 \
 stage=all \
 --dry-run

Outputs

  • MultiQC-ready reports

  • Microbe relative abundances (MetaPhlAn3, Kraken2)

  • Metabolic pathway relative abundances (HUMAnN3)

  • Metagenome assembled genomes (MetaSPAdes)

  • AMR profiles with Abricate and RGI

  • MAGs with MetaWRAP (Metabat2, CONCOCT, Maxbin2)

  • Gene prediction and annotation (MetaErg)

  • Secondary metabolite gene clusters (antiSMASH)

  • Antimicrobial resistance and virulence genes (ABRicate, AMRFinderPlus)

  • Carbohydrate active enzyme (CAZyme) annotation (dbCAN3)

Workflow

The rule DAG for a single sample looks like this:

Main Shotgun Pipeline DAG

Different modules of the workflow can be run independently using the stage config entry.
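
For example, the stages shown in the sections below could be previewed back-to-back, each in its own working directory, with a loop like this (the commands are echoed rather than executed here; remove the echo and add the R1/R2 config entries from the sections below to run them):

```shell
# Sketch: run each module independently by varying the `stage` config entry.
# `echo` prints each snakemake invocation instead of running it.
for stage in preprocess biobakery kraken assembly annotate binning rgi; do
  echo snakemake \
    --directory "tmp${stage}/" \
    --config \
    sample=473 \
    stage="${stage}" \
    --dry-run
done
```

Note that some stages need extra config entries, as shown in their sections below (e.g. dedup_platform for preprocess and kraken, and assembly= for annotate and binning).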

MultiQC

Just run MultiQC on a directory; there is no need to use Snakemake:

cp -r tmppre/reports tmpreports
cp tmpassembly/quast/quast_473/report.tsv ./tmpreports/
ver="v1.12"
docker run -v "$PWD":"$PWD" -w "$PWD" ewels/multiqc:${ver} multiqc \
 --config vdb_shotgun/multiqc_config.yaml --force \
 --title "a multiqc report for some test data" \
 -b "generated by ${ver}" --filename multiqc_report.html \
 tmpreports/ --interactive

Preprocessing

Shotgun Preprocessing Pipeline DAG

snakemake \
 --directory tmppreprocess/ \
 --config \
 sample=473 \
 R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
 R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
 nshards=4 \
 dedup_platform=NovaSeq \
 stage=preprocess \
 --dry-run

Tools used

Biobakery

Shotgun Biobakery Profiling Pipeline DAG

snakemake \
 --directory tmpbiobakery/ \
 --config \
 sample=473 \
 R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
 R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
 stage=biobakery \
 --dry-run

Tools used

Kraken2/Bracken

Shotgun Kraken/Bracken Pipeline DAG

snakemake \
 --directory tmpkraken/ \
 --config \
 sample=473 \
 R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
 R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
 dedup_platform=NovaSeq \
 stage=kraken \
 --dry-run

Tools used

Assembly

Shotgun Assembly Pipeline DAG

snakemake \
 --directory tmpassembly/ \
 --config \
 sample=473 \
 R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
 R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
 stage=assembly \
 --dry-run

Tools used

Annotation

Shotgun Assembly Annotation DAG

snakemake \
 --directory tmpannotate/ \
 --config \
 sample=473 \
 R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
 R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
 assembly=tmpassembly/473.contigs.fasta \
 stage=annotate \
 --dry-run

Tools used

Binning

Shotgun Assembly Binning Pipeline DAG

snakemake \
 --directory tmpbinning/ \
 --config \
 sample=473 \
 R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
 R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
 assembly=tmpassembly/473.contigs.fasta \
 stage=binning \
 --dry-run

RGI

Shotgun RGI Pipeline DAG

snakemake \
 --directory tmprgi/ \
 --config \
 sample=473 \
 R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
 R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
 stage=rgi \
 --dry-run

Tools used

Strainphlan Pipeline

This pipeline runs StrainPhlAn for each specified species. StrainPhlAn requires two inputs: sample-level marker pickle files, and species-level markers extracted from the main database. These are stored in a central subdirectory of the MetaPhlAn database directory to aid re-running. If you provide the .sam.bz2 file for a sample that has already been processed into a .pkl file, the pregenerated result is reused.

This workflow accepts as input a list of samples' MetaPhlAn sam.bz2 alignment files and a list of species of interest. The config argument strainphlan_markers_dir serves as a central place for storing both the species- and the sample-level marker files; these are specific to a version of the MetaPhlAn database, so we recommend placing that directory within the MetaPhlAn database directory.
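
The reuse check described above amounts to something like the following sketch. The file layout under strainphlan_markers_dir and the sample2markers.py step are illustrative assumptions; the actual rule logic lives in workflow/strainphlan.smk.

```shell
# Illustrative sketch of the sample-level marker reuse described above.
# The .pkl location under strainphlan_markers_dir is a hypothetical layout.
markers_dir="/data/brinkvd/resources/dbs/metaphlan/mpa_vJan21_CHOCOPhlAnSGB_202103/marker_outputs"
sam="path/to/sample1.sam.bz2"
pkl="${markers_dir}/$(basename "${sam%.sam.bz2}").pkl"
if [ -f "$pkl" ]; then
  echo "reusing pregenerated ${pkl}"
else
  echo "generating ${pkl} from ${sam}"  # e.g. via StrainPhlAn's sample2markers.py
fi
```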

Usage

snakemake \
 --snakefile workflow/strainphlan.smk \
 --directory tmpstrain/ \
 --config \
 sams=[path/to/sample1.sam.bz2,path/to/sample2.sam.bz2] \
 strainphlan_markers_dir=/data/brinkvd/resources/dbs/metaphlan/mpa_vJan21_CHOCOPhlAnSGB_202103/marker_outputs/ \
 metaphlan_db=/data/brinkvd/resources/dbs/metaphlan/mpa_vJan21_CHOCOPhlAnSGB_202103/ \
 marker_in_n_samples=2 \
 --dry-run

Outputs

For each input species:

  • Multiple sequence alignment of strains detected in samples

  • Phylogenetic tree of strains detected in samples

Workflow

The rule DAG for two example input species looks like this:

StrainPhlAn Shotgun Pipeline DAG

Testing and Development

Please see development.md .

Code Snippets

  • script: "../scripts/parse_antismash_gbk.py" (lines 147-148)

  • script: "../scripts/merge_logs.py" (lines 466-467)

  • script: "../scripts/plot_RGI_heatmap.R" (lines 82-83 of rules/RGI.smk)


Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/vdblab/vdblab-shotgun
Name: vdblab-shotgun
Version: 0.4.3
Copyright: Public Domain
License: MIT License
