Pipeline for metagenomic community analysis using DNA isolated from virus-like particles


Metagenomic Community Analysis of Virus-like Particles

This repo can be used to generate all of the desired output files for community analysis (shared OTUs, taxonomies, alpha/beta diversity, ordinations, etc.). The workflow is designed to work with Snakemake and Conda with minimal intervention by the user. It is designed to be run on a high-performance computing (HPC) cluster because some of the steps are computationally intensive and require significant resources (>100 GB of RAM and >24 cores). If you do not have access to a cluster, you should be able to adjust the settings to execute it locally, though it will admittedly be difficult. An overview of the workflow is provided in the rulegraph.

This workflow was designed to analyze viral community composition using DNA isolated from virus-like particles in mouse feces. It has been used to successfully analyze data from 380 samples with 10 Gb of sequencing per sample on an Illumina NovaSeq, but it should scale up or down easily depending on the amount of data being used. Samples from other environments should also work, though you will need to update the contaminant indices used during execution of metagenomeBowtie2Dependencies.sh and metagenomeDecontaminateReads.sh.

Overview

The steps below are meant to serve as a brief overview of the pipeline. For more information, please see the comments throughout the various scripts and the documentation for the tools used.

1. Read pre-processing

  • Purpose: Remove low-quality sequences and any potential host contaminants (mouse and human in this case).

  • Trim read pairs and remove low-quality regions (Trimmomatic).

  • Remove potential contaminant sequences (Bowtie2). See the sketch after this list.

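For a single sample, these two steps boil down to commands roughly like the following (the file names, adapter file, and combined mouse/human Bowtie2 index are hypothetical placeholders; the pipeline's own scripts set the exact parameters):

# Trim adapters, crop the 5' end (HEADCROP), and drop low-quality or short reads.
trimmomatic PE sample_R1.fastq.gz sample_R2.fastq.gz \
	sample_R1_paired.fastq.gz sample_R1_unpaired.fastq.gz \
	sample_R2_paired.fastq.gz sample_R2_unpaired.fastq.gz \
	ILLUMINACLIP:adapters.fa:2:30:10 HEADCROP:10 SLIDINGWINDOW:4:20 MINLEN:50

# Keep only the read pairs that fail to align to the contaminant index.
bowtie2 -x mouseHumanIndex -1 sample_R1_paired.fastq.gz -2 sample_R2_paired.fastq.gz \
	--un-conc-gz sample_decon_R%.fastq.gz -S /dev/null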

2. Assembly

  • Purpose: Combine quality-controlled reads into contigs.

  • Independently assemble reads from each sample into contigs (metaSPAdes).

  • Combine all generated contigs into a single contig library for the entire dataset (Bash).

  • Remove duplicate contigs and containments to reduce redundancy of sequences (BBMap).

  • Remove contigs below a specific size threshold to improve downstream processing (SeqKit). See the sketch after this list.

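A minimal sketch of the assembly stage, assuming hypothetical file names and a purely illustrative 1,000 bp length cutoff:

# Assemble each sample's quality-controlled reads independently.
metaspades.py -1 sample_decon_R1.fastq.gz -2 sample_decon_R2.fastq.gz -o assembly/sample/

# Pool contigs from all samples into a single library.
cat assembly/*/contigs.fasta > all_contigs.fasta

# Remove exact duplicates and contained contigs (BBMap's dedupe.sh).
dedupe.sh in=all_contigs.fasta out=deduped_contigs.fasta

# Drop contigs below the minimum length threshold.
seqkit seq -m 1000 deduped_contigs.fasta > contig_library.fasta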

3. Viral contig curation

  • Purpose: Remove low-quality contigs and any contigs predicted as being non-viral.

  • Predict which contigs are viral (VirSorter and VirFinder).

  • Remove any non-viral contigs from the contig library (Bash). See the sketch after this list.

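As one example of the prediction step, VirFinder is an R package that scores each contig and reports a p-value. Below is a minimal sketch of scoring and filtering; the cutoffs are illustrative stand-ins for the metagenomeVirfinderFpr/metagenomeVirfinderFdr settings, and the file names are hypothetical:

Rscript -e '
	library(VirFinder)
	pred <- VF.pred("contig_library.fasta")    # per-contig score and p-value
	pred$qvalue <- VF.qvalue(pred$pvalue)      # estimated false-discovery rate
	keep <- pred[pred$pvalue <= 0.01 & pred$qvalue <= 0.05, ]
	write.csv(keep, "virfinder_viral_contigs.csv", row.names = FALSE)
'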

4. Binning

  • Purpose: Group contigs together into discrete units based on similarity to each other and distance from others.

  • Map reads from each sample to the viral contig library to generate coverage and abundance statistics (BWA-MEM).

  • Combine contigs into metagenomic bins based on tetranucleotide frequencies, coverage, and abundance correlations (MetaBat2).

  • Remove any bins that contain non-viral marker genes (CheckM).

  • Combine all bins together into a single viral bin library (Bash). See the sketch after this list.

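Condensed to its core commands, the binning stage looks roughly like this (file names are hypothetical; the real scripts loop over every sample):

# Index the viral contig library and map one sample's reads (BWA-MEM).
bwa index viral_contigs.fasta
bwa mem viral_contigs.fasta sample_decon_R1.fastq.gz sample_decon_R2.fastq.gz | \
	samtools sort -o sample.bam

# Summarize per-contig coverage across all samples, then bin with MetaBat2.
jgi_summarize_bam_contig_depths --outputDepth depth.txt *.bam
metabat2 -i viral_contigs.fasta -a depth.txt -o bins/bin

# Screen the resulting bins for marker genes with CheckM.
checkm lineage_wf -x fa bins/ checkm_output/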

5. Shared OTU file creation

  • Purpose: Summarize read counts for each viral bin across all samples in the dataset. Bins are treated as operational taxonomic units (OTUs).

  • Map reads from each sample to the viral bin library (BWA-MEM).

  • Calculate the number of reads per bin for each sample (Samtools). See the sketch after this list.

  • Combine all read counts into a single shared OTU file (R - tidyverse).

  • Rarefy the shared OTU files by subsampling to a specified threshold (R - tidyverse).

  • Normalize shared OTU counts based on the total length of each bin (R - tidyverse).

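The per-sample counting relies on samtools idxstats, which reports one row per reference sequence; a sketch with hypothetical file names:

# idxstats columns: reference name, length, mapped reads, unmapped reads.
samtools index sample.bam
samtools idxstats sample.bam > sample_idxstats.tsv

The pipeline's R scripts then join these tables across samples, subsample to metagenomeSubthresh, and normalize by bin length.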

6. Taxonomic classification

  • Purpose: Provide basic taxonomic information for each of the viral OTUs (bins).

  • Classify metagenome-assembled bins (MAGs; CAT/BAT). See the sketch after this list.

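A minimal sketch assuming CAT/BAT's bin-classification interface and a pre-built database (all paths and the output prefix are hypothetical):

# Classify each bin in the bin library (BAT mode), then add readable lineage names.
CAT bins -b bins/ -d CAT_database/ -t CAT_taxonomy/ -s .fa -o bat_output
CAT add_names -i bat_output.bin2classification.txt -o bin_taxonomy.txt -t CAT_taxonomy/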

7. Calculate diversity metrics

  • Purpose: Calculate various metrics needed for comparing community compositions/properties.

  • Calculate alpha (within sample) and beta (between sample) diversity metrics (R - tidyverse, furrr, vegan).

  • Generate principal coordinates analysis (PCoA) and non-metric multidimensional scaling (NMDS) ordinations using beta diversity (R - tidyverse, vegan, ecodist). See the sketch after this list.

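Under the hood these metrics come from vegan; the self-contained sketch below uses toy data to show the kinds of calls involved (the pipeline's own R scripts differ in detail):

Rscript -e '
	library(vegan)
	set.seed(1)
	otu <- matrix(rpois(60, 10), nrow = 6)    # toy counts: 6 samples x 10 OTUs
	print(diversity(otu, index = "shannon"))  # alpha diversity per sample
	bray <- vegdist(otu, method = "bray")     # Bray-Curtis beta diversity
	print(cmdscale(bray, k = 2))              # PCoA ordination axes
	print(metaMDS(bray, k = 2, trace = 0)$points)  # NMDS ordination axes
'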

8. Quality control

  • Purpose: Identify any stages where read or contig qualities are inadequate.

  • Map reads to contaminant sequences to determine the degree of contamination (FastQ Screen).

  • Summarize read quality scores, adapter content, etc. (FastQC).

  • Summarize contig assembly quality scores, lengths, etc. (Quast).

  • Combine quality control diagnostics together into a single interactive report (MultiQC). See the sketch after this list.

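These tools run directly on the intermediate files; a sketch of the underlying commands with hypothetical paths and configuration:

# Screen reads against contaminant genomes, profile read quality, assess contigs,
# and roll everything into one interactive report.
fastq_screen --conf fastq_screen.conf sample_decon_R1.fastq.gz
fastqc -o data/fastqc/ data/raw/*.fastq.gz
quast.py -o data/quast/ contig_library.fasta
multiqc -o data/multiqc/ data/
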
Usage

Dependencies

  • MacOSX or Linux operating system.

  • Have access to a high-performance computing (HPC) cluster.

  • Install Miniconda and configure for use on a cluster. Instructions for installing conda on a cluster are provided by me HERE.

  • Have paired-end sequencing data (fastq.gz).

NOTE: The workflow assumes you are using the Swift Accel-NGS 1S Plus DNA Library Kit (Swift Biosciences; cat. no. 10024) for library preparation and sequencing on an Illumina machine. If using a different library kit, you may need to adjust the metagenomeHeadcrop setting in config.yaml. If using a different sequencer, you will need to adjust the adapter types/location during execution of metagenomeTrimmomatic.sh in the Snakefile.


Running analysis

1. Clone this repository and move into the project directory.

git clone https://github.com/wclose/viralMetagenomicsPipeline.git
cd viralMetagenomicsPipeline

2. Transfer all of your raw paired-end sequencing data into data/raw/.

NOTE: Because of the way sample names are automatically parsed, files are assumed to be in the form 140142_TATGCCAG-TTCCATTG_S154_R1_001.fastq.gz. Everything before the first underscore is assumed to be the sample name (e.g., 140142). Files are also assumed to be gzipped fastq files ending in fastq.gz. You will need to adjust accordingly if this is not the case.

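The parsing rule is equivalent to the following bash parameter expansion (illustrative only):

# Everything before the first underscore becomes the sample name.
file=140142_TATGCCAG-TTCCATTG_S154_R1_001.fastq.gz
sample=${file%%_*}
echo "$sample"   # prints: 140142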

Copy sequences to the raw data directory.

cp PATH/TO/SEQUENCEDIR/* data/raw/

3. Create the master snakemake environment.

NOTE: If you already have a conda environment with snakemake installed, you can skip this step.

conda env create -f envs/yaml/snakemake.yaml

4. Activate the environment that contains snakemake.

conda activate snakemake

5. Edit the config.yaml file to set downstream analysis options.

nano config/config.yaml

Things to change (everything else can/should be left as is; a hypothetical sketch of these entries follows after this list):

  • metagenomeControls : Put the names (just the names of the samples, not the full filename with all of the sequencer information) of all of your controls here. Make sure names don't have underscores as these will not be parsed correctly.

  • metagenomeHeadcrop : The number of bases to trim from each read with Trimmomatic. This setting depends on the library preparation kit used and your sequencing conditions. You may need to increase/decrease this depending on the FastQC results in the MultiQC reports generated as part of the pipeline.

  • metagenomeVirfinderFpr and metagenomeVirfinderFdr : The false-positive and false-discovery rates to be used for filtering VirFinder results when predicting viral contigs.

  • metagenomeSubthresh : The number of reads to subsample to for each sample when generating the shared OTU count file.

  • metagenomeAlpha : The desired alpha (within sample) diversity metrics to calculate.

  • metagenomeBeta : The desired beta (between sample) diversity metrics to calculate.


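A hypothetical sketch of what these entries might look like in config/config.yaml. Every value below is a placeholder, and whether an entry takes a single value or a list should be checked against the shipped file:

metagenomeControls: ["mock1", "water1"]  # control sample names, no underscores
metagenomeHeadcrop: 10                   # bases trimmed from the 5' end of each read
metagenomeVirfinderFpr: 0.05             # VirFinder false-positive rate cutoff
metagenomeVirfinderFdr: 0.05             # VirFinder false-discovery rate cutoff
metagenomeSubthresh: 100000              # reads to subsample to per sample
metagenomeAlpha: ["shannon"]             # alpha diversity metrics to calculate
metagenomeBeta: ["bray"]                 # beta diversity metrics to calculate
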
6. Test the workflow to make sure everything looks good. This will print a list of the commands to be executed without running them and will error if something isn't set properly.

snakemake -np

7. If you want to see how everything fits together, you can run the following to generate a flowchart of the various steps. Because of the number of parallel processes, I highly recommend using the rulegraph as opposed to the directed acyclic graph (DAG) output from snakemake, as the DAG will be extremely cluttered. I have included the pre-generated rulegraph for the pipeline within this repository. You may need to download the resulting image locally to view it properly.

NOTE: If you are using MacOSX, you will need to install dot by installing graphviz with homebrew or some alternative process before running the following command.

snakemake --rulegraph | dot -Tsvg > rulegraph.svg

8. Before running any jobs on the cluster, change the ACCOUNT and EMAIL fields in the cluster profile configuration files. If you need to change resource requests (increase/decrease time, memory, etc.), you can find those settings in the same files.


9. Run the Snakemake workflow. Snakemake will submit each task as an individual job on the cluster, monitor the jobs, and submit new jobs as needed without needing any intervention from the user. Should there be an error, Snakemake will tell you where the error occurred and halt all jobs. It is recommended to submit the workflow using the cluster submission script as this will create a job that manages the workflow for you.

sbatch code/snakemake.sh

10. Assuming that no errors occur, the workflow will create multiple directories within the data/ directory. Outputs of interest will be as follows:

  • data/shared/ : This directory will contain output files containing normalized counts of each metagenomic bin (viral OTU) across all samples.

  • data/diversity/ : This directory contains all metrics for measuring diversity including the alpha diversity table and any beta diversity distance files.

  • data/catbat/ : This directory contains the taxonomic assignments of the various metagenomic bins (viral OTUs) using CAT/BAT and the RefSeq viral database.

  • data/multiqc/ : This directory contains all of the quality control reports with interactive graphs. Reports are generated at different checkpoints throughout the pipeline, and I highly encourage you to look at them, adjust the pipeline settings as necessary, and rerun the workflow to generate the best quality data.

Code Snippets

The shell directive for each rule is reproduced below with its line numbers in master/Snakefile.

Lines 55-56:
shell:
	"bash {input.script} {input.raw} {params.headcrop}"

Lines 72-73:
shell:
	"bash {input.script}"

Lines 88-89:
shell:
	"bash {input.script} {input.trimmed} {input.combinedIndex[0]}"

Lines 110-111:
shell:
	"bash {input.script} {input.decon}"

Lines 122-123:
shell:
	"bash {input.script} {input.contigs}"

Lines 135-136:
shell:
	"bash {input.script} {input.library}"

Lines 150-151:
shell:
	"bash {input.script} {input.deduped} {params.length}"

Lines 169-170:
shell:
	"bash {input.script}"

Lines 183-184:
shell:
	"bash {input.script} {input.fasta} {input.readme}"

Lines 193-194:
shell:
	"bash {input.script}"

Lines 210-211:
shell:
	"Rscript {input.script} {input.fasta} {input.model} {params.fpr} {params.fdr}"

Lines 223-224:
shell:
	"bash {input.script} {input.fasta} {input.virsorter} {input.virfinder}"

Lines 246-247:
shell:
	"bash {input.script} {input.curated}"

Lines 261-262:
shell:
	"bash {input.script} {input.decon} {input.index[0]}"

Lines 279-280:
shell:
	"bash {input.script} {input.curated} {params.seed} {input.bam}"

Lines 298-299:
shell:
	"bash {input.script}"

Lines 312-313:
shell:
	"bash {input.script} {input.bins[0]}"

Lines 325-326:
shell:
	"bash {input.script} {input.metrics} {input.bins[0]}"

Lines 348-349:
shell:
	"bash {input.script} {input.decon}"

Lines 363-364:
shell:
	"bash {input.script} {input.deconPaired} {input.index[0]}"

Lines 376-377:
shell:
	"bash {input.script} {input.bam}"

Lines 390-391:
shell:
	"bash {input.script} {input.bam}"

Lines 404-405:
shell:
	"Rscript {input.script} {input.idxstats}"

Lines 417-418:
shell:
	"Rscript {input.script} {input.idxstats}"

Lines 433-434:
shell:
	"Rscript {input.script} {input.shared} {params.controls}"

Lines 447-448:
shell:
	"Rscript {input.script} {input.shared} {input.idxstats}"

Lines 464-465:
shell:
	"Rscript {input.script} {input.shared} {input.idxstats} {params.subthresh}"

Lines 486-487:
shell:
	"bash {input.script}"

Lines 501-502:
shell:
	"bash {input.script} {input.bins[0]} {input.dmnd} {input.nodes}"

Lines 528-529:
shell:
	"Rscript {input.script} {input.shared} {input.idxstats} {params.subthresh} {params.alpha}"

Lines 546-547:
shell:
	"Rscript {input.script} {input.shared} {input.idxstats} {params.subthresh} {params.beta}"

Lines 560-561:
shell:
	"Rscript {input.script} {input.beta}"

Lines 574-575:
shell:
	"Rscript {input.script} {input.beta}"

Lines 597-598:
shell:
	"bash {input.script} {input.mouseIndex[0]} {input.humanIndex[0]}"

Lines 613-614:
shell:
	"bash {input.script} {input.config} {input.raw}"

Lines 628-629:
shell:
	"bash {input.script} {input.raw}"

Lines 644-645:
shell:
	"bash {input.script} {input.rawMap} {input.rawStats}"

Lines 659-660:
shell:
	"bash {input.script} {input.trimmed}"

Lines 672-673:
shell:
	"bash {input.script} {input.trimmedStats}"

Lines 688-689:
shell:
	"bash {input.script} {input.config} {input.decon}"

Lines 703-704:
shell:
	"bash {input.script} {input.decon}"

Lines 719-720:
shell:
	"bash {input.script} {input.deconMap} {input.deconStats}"

Lines 733-734:
shell:
	"bash {input.script} {input.contigs}"

Lines 746-747:
shell:
	"bash {input.script} {input.assemblyStats}"

Lines 761-766:
shell:
	"""
	echo PROGRESS: Removing all workflow output.
	rm -rf envs/share/*
	rm -rf $(find data/ -mindepth 1 -maxdepth 1 -type d | grep -v "raw")
	"""
