vRAPID: Virus Reference-based Assembly and Identification Pipeline


vRAPID

The Virus Reference-based Assembly Pipeline and IDentification (vRAPID) tool runs the assembly, consensus calling and annotation of viral pathogen genomes.

vRAPID relies on data structures and naming conventions used by the Center for Research on Influenza Pathogenesis and Transmission (CRIPT) and the Mount Sinai Pathogen Surveillance Program (MS-PSP) at the Icahn School of Medicine at Mount Sinai. vRAPID is an expansion of the COVID_pipe pipeline that was written by Mitchell Sullivan in the Bakel Lab.

Prepare the snakemake conda environment

Installation of the required external software packages is largely handled by the pipeline itself; however, a conda environment named snakemake must be present in your environment. We recommend miniconda, a free minimal installer for conda. Follow the instructions below to run the miniconda installer on Linux. When asked whether the conda environment should automatically be initialized, select 'yes'. Note that Snakemake requires channel_priority to be set to strict; the post-installation commands that apply this setting are included below.

# Start miniconda installation
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
# Post-installation commands to enforce strict channel_priority (required for Snakemake)
conda config --set auto_activate_base false
conda config --set channel_priority strict

On the Mount Sinai Minerva high-performance computing cluster we additionally recommend pointing the conda pkgs and envs directories to your work folder, which has a much larger storage allocation (100-200 GB) than the home folders (5 GB). The conda configuration commands to set these directories are as follows:

mkdir -p /sc/arion/work/$USER/conda/pkgs
mkdir -p /sc/arion/work/$USER/conda/envs
conda config --add pkgs_dirs "/sc/arion/work/$USER/conda/pkgs"
conda config --add envs_dirs "/sc/arion/work/$USER/conda/envs"

After installation and configuration of conda/miniconda, the following 'conda create' command can be used to set up the required 'snakemake' environment.

conda create -c conda-forge -c bioconda -n snakemake 'snakemake=6.8.0' 'mamba=0.24' 'tabulate=0.8'

Running the vRAPID pipeline

The following folder structure should exist in the analysis directory to run the pipeline:

<Run-ID>
├── <virus1>
│   └── <expect_subtype>
│       ├── <batch_ID1>
│       │   ├── <sample_ID1>
│       │   │   └── <sample_ID1>
│       │   │       ├── <sample_ID1>_1.fastq.gz
│       │   │       └── <sample_ID1>_2.fastq.gz
│       │   └── <sample_ID2>
│       │       └── <sample_ID2>
│       │           ├── <sample_ID2>_1.fastq.gz
│       │           └── <sample_ID2>_2.fastq.gz
│       └── <batch_ID2>
│           ├── <sample_ID1>
│           │   └── <sample_ID1>
│           │       ├── <sample_ID1>_1.fastq.gz
│           │       └── <sample_ID1>_2.fastq.gz
│           └── <sample_ID2>
│               └── <sample_ID2>
│                   ├── <sample_ID2>_1.fastq.gz
│                   └── <sample_ID2>_2.fastq.gz
└── <virus2>
    └── <expect_subtype>
        ├── <batch_ID1>
        │   ├── <sample_ID1>
        │   │   └── <sample_ID1>
        │   │       ├── <sample_ID1>_1.fastq.gz
        │   │       └── <sample_ID1>_2.fastq.gz
        │   └── <sample_ID2>
        │       └── <sample_ID2>
        │           ├── <sample_ID2>_1.fastq.gz
        │           └── <sample_ID2>_2.fastq.gz
        └── <batch_ID2>
            ├── <sample_ID1>
            │   └── <sample_ID1>
            │       ├── <sample_ID1>_1.fastq.gz
            │       └── <sample_ID1>_2.fastq.gz
            └── <sample_ID2>
                └── <sample_ID2>
                    ├── <sample_ID2>_1.fastq.gz
                    └── <sample_ID2>_2.fastq.gz

Then, within each virus-batch subdirectory, create the following two files:

  1. samples.csv: a file with the single column Sample_ID, listing the samples within the virus-batch subdirectory (the new working directory).

  2. multibamqc_input.txt: a tab-separated file that lists the samples from samples.csv in the first column and the path to each sample's bam file in the second column. Note that the bam files are always written to <sample_ID>/02_assembly/<sample_ID>_ref.bam. The multibamqc_input.txt file does not require column names.

An example of both file structures can be seen below:

samples.csv

Sample_ID
sample_ID1
sample_ID2
sample_ID3

multibamqc_input.txt

sample_ID1 sample_ID1/02_assembly/sample_ID1_ref.bam
sample_ID2 sample_ID2/02_assembly/sample_ID2_ref.bam
sample_ID3 sample_ID3/02_assembly/sample_ID3_ref.bam
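Because both files must stay in sync with the sample subdirectories, it can help to generate them programmatically. The sketch below is a hypothetical helper (not part of vRAPID) that builds both files for one virus-batch directory, assuming every immediate subdirectory is a sample:

```python
import os

def write_batch_inputs(batch_dir):
    """Create samples.csv and multibamqc_input.txt inside batch_dir.

    Assumes each immediate subdirectory of batch_dir is a sample directory
    named after its sample ID, per the layout shown above.
    """
    samples = sorted(
        d for d in os.listdir(batch_dir)
        if os.path.isdir(os.path.join(batch_dir, d))
    )
    # samples.csv: single Sample_ID column, one sample per line
    with open(os.path.join(batch_dir, "samples.csv"), "w") as fh:
        fh.write("Sample_ID\n")
        for s in samples:
            fh.write(s + "\n")
    # multibamqc_input.txt: tab-separated, no header; bam paths follow the
    # <sample_ID>/02_assembly/<sample_ID>_ref.bam convention
    with open(os.path.join(batch_dir, "multibamqc_input.txt"), "w") as fh:
        for s in samples:
            fh.write(f"{s}\t{s}/02_assembly/{s}_ref.bam\n")
    return samples
```

Running the helper from (or on) each virus-batch directory yields files matching the examples above.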

The final structure should be as follows:

<Run-ID>
├── <virus1>
│   └── <expect_subtype>
│       ├── <batch_ID1>
│       │   ├── samples.csv
│       │   ├── multibamqc_input.txt
│       │   ├── <sample_ID1>
│       │   │   └── <sample_ID1>
│       │   │       ├── <sample_ID1>_1.fastq.gz
│       │   │       └── <sample_ID1>_2.fastq.gz
│       │   └── <sample_ID2>
│       │       └── <sample_ID2>
│       │           ├── <sample_ID2>_1.fastq.gz
│       │           └── <sample_ID2>_2.fastq.gz
│       └── <batch_ID2>
│           ├── samples.csv
│           ├── multibamqc_input.txt
│           ├── <sample_ID1>
│           │   └── <sample_ID1>
│           │       ├── <sample_ID1>_1.fastq.gz
│           │       └── <sample_ID1>_2.fastq.gz
│           └── <sample_ID2>
│               └── <sample_ID2>
│                   ├── <sample_ID2>_1.fastq.gz
│                   └── <sample_ID2>_2.fastq.gz
└── <virus2>
    └── <expect_subtype>
        ├── <batch_ID1>
        │   ├── samples.csv
        │   ├── multibamqc_input.txt
        │   ├── <sample_ID1>
        │   │   └── <sample_ID1>
        │   │       ├── <sample_ID1>_1.fastq.gz
        │   │       └── <sample_ID1>_2.fastq.gz
        │   └── <sample_ID2>
        │       └── <sample_ID2>
        │           ├── <sample_ID2>_1.fastq.gz
        │           └── <sample_ID2>_2.fastq.gz
        └── <batch_ID2>
            ├── samples.csv
            ├── multibamqc_input.txt
            ├── <sample_ID1>
            │   └── <sample_ID1>
            │       ├── <sample_ID1>_1.fastq.gz
            │       └── <sample_ID1>_2.fastq.gz
            └── <sample_ID2>
                └── <sample_ID2>
                    ├── <sample_ID2>_1.fastq.gz
                    └── <sample_ID2>_2.fastq.gz

Lastly, depending on the virus, the config.yaml file within the pipeline repository may need to be edited before the pipeline can run. Note that the current defaults are set for SARS-CoV-2. The following fields are required within the file:

  1. samples: Name of the file containing the list of samples being run. Currently set to samples.csv.

  2. run_id: Run ID name, typically identified as TD###### from the sequencing core's data-release e-mail.

  3. reference: Path to the reference within the db directory of the repository. Currently set to sars-cov-2/COVID.fa.

  4. genbank: Path to the GenBank file within the db directory of the repository. Currently set to sars-cov-2/COVID.gbk.

  5. reverse_primer: Path to the 3-prime primer file within the db directory of the repository. Currently set to sars-cov-2/SARS-CoV-2_primers_3prime_NI.fa.

  6. forward_primer: Path to the 5-prime primer file within the db directory of the repository. Currently set to sars-cov-2/SARS-CoV-2_primers_5prime_NI.fa.

  7. primer_set_2kb: Path to the 2 kb primer set file within the db directory of the repository. Currently set to sars-cov-2/SARS-CoV-2_primers_2kb_set.tsv.

  8. primer_set_1_5kb: Path to the 1.5 kb primer set file within the db directory of the repository. Currently set to sars-cov-2/SARS-CoV-2_primers_1.5kb_set.tsv.

  9. virus: The virus name for the samples being run. Currently set to SARS-CoV-2. Other options include: Influenza-A, Influenza-B, sCoV, MPX.

  10. path: The full path to where the samples being run are located. See above for the proper structure.

  11. length: Length of the reference genome(s). For multi-segmented viruses like influenza, this can be a list of lengths, ordered with respect to the FASTA headers.

  12. ref_fasta_headers: The name of the FASTA header(s) in the reference genome. For multi-segmented viruses like influenza, this can be a list of headers, ordered with respect to the lengths.
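Putting the fields together, a config.yaml for the SARS-CoV-2 defaults described above might look as follows. This is an illustrative sketch, not the file shipped with the repository: the run_id, path, length, and ref_fasta_headers values are placeholders or assumptions and should be taken from your own run and from the actual db/sars-cov-2/COVID.fa.

```yaml
samples: samples.csv
run_id: <run-ID>        # e.g. the TD###### ID from the data-release e-mail
reference: sars-cov-2/COVID.fa
genbank: sars-cov-2/COVID.gbk
reverse_primer: sars-cov-2/SARS-CoV-2_primers_3prime_NI.fa
forward_primer: sars-cov-2/SARS-CoV-2_primers_5prime_NI.fa
primer_set_2kb: sars-cov-2/SARS-CoV-2_primers_2kb_set.tsv
primer_set_1_5kb: sars-cov-2/SARS-CoV-2_primers_1.5kb_set.tsv
virus: SARS-CoV-2
path: /path/to/<Run-ID>/<virus1>/<expect_subtype>/<batch_ID1>
length: 29903           # assumed; confirm against the reference FASTA
ref_fasta_headers: <reference_header>  # header line of COVID.fa, without '>'
```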

Note for users from the Bakel lab only

From a typical data release from the sequencing core the starting directory structure will look as follows:

<Run-ID>
├── <sample_ID1>
│   ├── <sample_ID1>_1.fastq.gz
│   └── <sample_ID1>_2.fastq.gz
└── <sample_ID2>
    ├── <sample_ID2>_1.fastq.gz
    └── <sample_ID2>_2.fastq.gz

In order to create the required directory structure for the pipeline run:

python ~/opt/vRAPID/workflow/scripts/organize_run_samples.py

This script queries PathogenDB and separates the samples into subdirectories based on virus type, expected subtype, and batch ID. If no subtype is expected, the script places the samples in a None subdirectory. This creates the final directory structure within the sequencing run directory as shown before. You can then proceed to generate the samples.csv and multibamqc_input.txt files as described above.
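In outline, the reorganization performed by organize_run_samples.py can be pictured as below. This is an illustrative sketch only: the real script pulls each sample's virus, expected subtype, and batch ID from PathogenDB, which is mocked here with a plain dict, and the function name is hypothetical.

```python
import os
import shutil

def organize(run_dir, metadata):
    """Reorganize a flat run directory into virus/subtype/batch subdirectories.

    metadata maps sample_ID -> (virus, expected_subtype_or_None, batch_ID);
    in the real script these values come from a PathogenDB query.
    """
    for sample, (virus, subtype, batch) in metadata.items():
        # Samples without an expected subtype go into a 'None' subdirectory.
        subtype_dir = subtype if subtype is not None else "None"
        # Nest the sample folder one level deeper, matching the layout above:
        # <batch_ID>/<sample_ID>/<sample_ID>/<fastqs>
        dest = os.path.join(run_dir, virus, subtype_dir, batch, sample)
        os.makedirs(dest, exist_ok=True)
        shutil.move(os.path.join(run_dir, sample), dest)
```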

Running vRAPID

Then once the files are generated, the pipeline can be run using the following command:

submitjob 12 -c 4 -m 64 -q private -P acc_PVI ~/opt/vRAPID/run-vRAPID-analysis -i <run-ID> -p <path> -s <samples-run-csv>

Note that the <run-ID> , <path> , and <samples-run-csv> arguments in the command above are optional. If they are not supplied, then they are pulled from the config.yaml file.

vRAPID output

Once the pipeline has completed, the following output files are created:

<Run-ID>
├── <virus1>
│   └── <expect_subtype>
│       └── <batch_ID1>
│           ├── samples.csv
│           ├── multibamqc_input.txt
│           ├── pipeline.log
│           ├── file_movement_message.txt
│           ├── <Run-ID>_cleaned.txt
│           ├── <Run-ID>_data_release.txt
│           ├── <Run-ID>_software_version.txt
│           ├── <Run-ID>_<batch-ID>_assemblies.csv
│           ├── <Run-ID>_<batch-ID>.tar.gz
│           ├── logs
│           ├── multi_bamqc
│           ├── multiqc_report.html
│           └── <sample_ID>
│               ├── 01_fastqs
│               │   ├── <sample_ID>_1.fastq.gz
│               │   └── <sample_ID>_2.fastq.gz
│               ├── 02_assembly
│               ├── 03_qualityControl
│               ├── 04_variants
│               └── 05_status

Accessory vRAPID pipeline scripts

The following is a brief description of the scripts that are called within the Snakefile . For more information, please see the function overview below:

  • run_pipeline.py : Takes in Illumina paired-end FASTQ files and assembles them against a reference genome, outputting a final FASTA consensus genome sequence and aligned read bam file, and other intermediary output files.

  • variant_analysis.py : Takes in the consensus FASTA file generated by run_pipeline.py and performs an intra-host variant analysis on the virus library.

  • run_QC.py : Takes in the bam file generated as an output from the run_pipeline.py script, and performs a quality control analysis on the virus library.

  • run-genome-annotation.py : Takes in the FASTA file generated from run_pipeline.py and annotates the consensus according to the virus type. For Influenza samples, this is done using NCBI's Influenza Annotation Tool. For SARS-CoV-2, this is done using VADR.

  • all-virus_assembly_push.py : Used to upload viral genome assembly data to the Mount Sinai Pathogen Surveillance Program database (PathogenDB).

  • move_QC_PDB.py : Uploads the output obtained by running multiqc and qualimap to the Mount Sinai Pathogen Surveillance Database (PathogenDB).

  • cleanup.py : Removes large intermediary output files once the pipeline has completed to conserve space before archival.

  • generate_run_report.R : Creates a final assembly run status and quality report for the samples that were processed within the run, outputting a CSV file.

  • data-release.py : Prepares vRAPID pipeline outputs for data release.

Code Snippets

The following shell directives are excerpted from rules in the pipeline's Snakefile:

shell:
   """
   mv {input.sample_folder} {output.renamed_file} > {log.rename} 2>&1
   """
shell:
   """
   python {params.pipeline} -rd {params.repo_dir} -i {input.sample_folder} -r1 _R1_001.fastq.gz -r2 _R2_001.fastq.gz -r {params.reference} -pf {params.forward_primer} -pr {params.reverse_primer} -g {params.genbankfile} -l {params.length} -headers {params.ref_headers} > {log.assembly} 2>&1
   """
shell:
   """
   python {params.var_analysis} -rd {params.repo_dir} -i {input.sample_folder} -r {params.reference} -ps_2 {params.ps_2kb} -ps_1_5 {params.ps_1_5kb} > {log.varanalysis} 2>&1
   """
shell:
   """
   python {params.QC} -rd {params.repo_dir} -i {input.sample_folder} -kdb /sc/arion/projects/PVI/db/minikraken_8GB_20200312 -r {params.reference} -pf {params.forward_primer} -pr {params.reverse_primer} -v {params.virus} -pc {params.plotcoverage} -tb {params.breakdown} > {log.QC} 2>&1
   """
shell:
   """
   python {params.annotation_script} -i {input.consensus_fasta_file} -v {params.virus} > {log.annotation} 2>&1
   """
shell:
   """
   python {params.pdbpush} -s {params.sample_name} -r {params.run_ID} -c {params.config} -p {params.path} > {log.PDBPush} 2>&1
   touch {output.pdb_upload_check}
   """
shell:
   """
   module load qualimap
   qualimap multi-bamqc -d multibamqc_input.txt -r > {log.qualimap} 2>&1
   """
shell:
   """
   multiqc . > {log.multiqc} 2>&1
   """ 
shell:
   """
   python {params.push_script} -r {params.run_ID} > {log.pushQC} 2>&1
   """
shell:
   """
   echo "{params.run_ID} was run using the following versions \n" > {output.version_report}
   echo "\n Name                    Version                   Build  Channel \n SAMTOOLS: \n" &>> {output.version_report}
   conda list | grep -w "^samtools" &>> {output.version_report}

   echo "\n CUTADAPT: \n" &>> {output.version_report}
   conda list | grep -w "^cutadapt" &>> {output.version_report}

   echo "\n MINIMAP2: \n" &>> {output.version_report}
   conda list | grep -w "minimap2" &>> {output.version_report}

   echo "\n PILON: \n" &>> {output.version_report}
   conda list | grep "pilon" &>> {output.version_report}

   echo "\n QUALIMAP: \n" &>> {output.version_report}
   conda list | grep "qualimap" &>> {output.version_report}

   echo "\n SHOVILL: \n" &>> {output.version_report}
   conda list | grep "shovill" &>> {output.version_report}

   echo "\n PROKKA: \n" &>> {output.version_report}
   conda list | grep "prokka" &>> {output.version_report}

   echo "\n KRAKEN2: \n" &>> {output.version_report}
   conda list | grep "kraken2" &>> {output.version_report}

   echo "\n PYTHON: \n" &>> {output.version_report}
   conda list | grep "^python" &>> {output.version_report}

   echo "\n MULTIQC: \n" &>> {output.version_report}
   conda list | grep "multiqc" &>> {output.version_report}

   echo "\n FASTQC: \n" &>> {output.version_report}
   conda list | grep "fastqc" &>> {output.version_report}

   echo "\n PICARD Alignment Summary Metrics: \n" &>> {output.version_report}
   conda list | grep "picard" &>> {output.version_report}

   echo "\n {params.run_ID} was aligned against the following reference: \n {params.reference} \n Using the following GenBank file: \n {params.genbankfile} \n " &>> {output.version_report}
   echo "\n primer sets used were: \n (1) {params.ps1} \n (2) {params.ps2}" &>> {output.version_report} > {log.versioncheck} 2>&1
   """
shell:
   """
   python {params.cleanup} -p {params.samples_run} > {log.cleanup} 2>&1
   touch {output.cleanup_report}
   """
shell:
   """
   Rscript {params.report_script} {params.run_ID} > {log.reporting} 2>&1
   """
shell:
   """
   python {params.status_script} -p {params.sample_mapping} -r {params.run_ID} -v {params.virus} -f {params.ref_header} > {log.datarelease} 2>&1
   touch {output.data_release_report}
   """

URL: https://github.com/BakelLab/vRAPID