Term Project for BIOF 501 UBC


Table of Contents

  1. Background and hypothesis

  2. Dependencies

  3. Workflow

  4. Usage

  5. Input

  6. Expected output

  7. References

Background and hypothesis

Recent advances in high-throughput sequencing have provided increasing evidence of associations between the microbiome and human cancers [1]. While many standalone platforms and R packages such as QIIME2, DADA2, phyloseq, and vegan have been published and are widely used for microbiome analysis, few workflows allow integration of these tools. This pipeline uses Snakemake, QIIME2, and phyloseq to build a flexible, automated workflow that performs reproducible, in-depth analysis of any paired-end V4 16S rRNA microbiome dataset with two groups of interest and an additional variable that could account for variance in the dataset.

The dataset used in this pipeline was first described by Tsementzi et al. (2020) [2]. The objective of that study was to compare the vaginal microbiota of gynecologic cancer (endometrial/cervical) patients pre- and post-radiation therapy with that of healthy women. For this project, the dataset was randomly subsampled to include five pre- and five post-radiation therapy samples (the two groups of interest) from cervical cancer patients of different ethnicities (the additional variable). The hypothesis to test is that the microbiome composition differs between the two groups of interest, and that a proportion of the variance is explained by the additional variable (ethnicity). The workflow produces a set of plots describing rarefaction curves, alpha and beta diversity, the relative abundance of organisms in the two groups, hierarchical clustering, and phylogenetic trees. A complete list of accession IDs and relevant metadata for the dataset is available in 00-helperfiles/metadata.txt.

Dependencies

The pipeline is built using Snakemake and conda, and it is assumed that users have git, conda, and Snakemake installed. The main dependencies of this workflow include:

  • sra-toolkit=2.11.0

  • qiime2=2021.8.0

  • fastqc=0.11.9

  • multiqc=1.11

  • r-base=4.0.5

  • phyloseq=1.34.0

  • r-vegan=2.5_7

Other minor dependencies are listed in spec-file.txt.
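Once the conda environment described under Usage has been created and activated, a quick sanity check that the main tools are available might look like the following sketch:

# Confirm the main tools are on PATH inside the activated environment
qiime info
fastqc --version
multiqc --version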

Workflow

An overview of the workflow can be represented as a directed acyclic graph (DAG), as shown below:
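If needed, the DAG can be regenerated directly from the Snakefile with Snakemake itself (a minimal sketch; it assumes Graphviz's dot command is installed, which is not among the listed dependencies):

# Render the workflow DAG to a PNG using Graphviz
snakemake --dag | dot -Tpng > dag.png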

The entire workflow with the described dataset takes ~1.5 hours to run. The main steps of the workflow include:

  1. Download raw data: This step needs a list of accession numbers to download raw paired-end data from the Sequence Read Archive (SRA) using the prefetch and fasterq-dump commands from the sra-toolkit (see the sketch after this list).

  2. Initial quality control: FastQC and MultiQC analyze the raw fastq files for quality checks.

  3. Import data into QIIME2: The fastq files are then imported into a QIIME2 artifact (qza). After importing the data, the pipeline removes all raw data files downloaded from the SRA to save storage space. Therefore, these raw fastq files will not be accessible after QIIME2 imports the data.

  4. Quality control using QIIME2: The paired-end data artifact is summarized as a QIIME2 visualization (qzv).

  5. Generation of ASVs: Using the DADA2 plugin in QIIME2, the paired-end data are denoised to generate an amplicon sequence variant (ASV) table.

  6. Taxonomic classification: The representative sequences from DADA2 are passed to a pre-built QIIME2 Naive Bayes classifier to assign taxonomic labels to the generated ASVs. This pre-built classifier is trained on Greengenes 13_8 99% OTUs full-length sequences.

  7. Phylogenetics: The representative sequences for the ASVs then undergo sequence alignment using the MAFFT plugin in QIIME2 to generate aligned sequences and both unrooted and rooted trees in Newick format.

  8. Visual analysis: R packages such as phyloseq and vegan use the ASV table, representative sequences, rooted tree, and taxonomic classification table to demonstrate differences/similarities in microbiome composition between samples.
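For reference (step 1 above), a minimal sketch of the raw-data download for a single run accession, using prefetch and fasterq-dump from the sra-toolkit; the accession and the 01-data/ output directory follow the examples used elsewhere in this README:

# Fetch the .sra object for one run accession
prefetch SRR6920043
# Convert it to paired-end fastq files in 01-data/
fasterq-dump SRR6920043 --split-files -O 01-data/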

Usage

  1. Clone this repository and move into the cloned directory.
git clone https://github.com/dollinad/BIOF501TermProject.git
cd BIOF501TermProject
  2. Create and activate a conda environment named microbiome using a specification file. This environment contains the software and packages required for the pipeline to run.
conda create --name microbiome --file spec-file.txt
conda activate microbiome
  3. Finally, run the pipeline using the following command, which allows tasks to run in parallel across four cores.
snakemake --use-conda --cores 4

In one chunk, this is:

git clone https://github.com/dollinad/BIOF501TermProject.git
cd BIOF501TermProject
conda create --name microbiome --file spec-file.txt
conda activate microbiome
snakemake --use-conda --cores 4
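Optionally, a dry run previews the jobs Snakemake would schedule without executing them:

# Dry run: list the planned jobs only
snakemake -n --use-conda --cores 4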

Input

The workflow needs the following files in 00-helperfiles/ as input:

  • accessionList.txt: A list of run accessions to download from the SRA.

    SRR6920043
    SRR6920044
    ...
  • metadata.txt: A tab-delimited text file containing three columns: the SRA run ID, the group of interest, and the additional variable.

    id gynecologic_disord ethnicity
    SRR6920043 post-radiotherapy African-american
    SRR6920044 pre-radiotherapy African-american
    ... ... ...
  • manifest.txt: A tab-delimited text file containing three columns: the SRA run ID and the relative paths to the forward and reverse reads.

    sample-id forward-absolute-filepath reverse-absolute-filepath
    SRR6920043 01-data/SRR6920043_1.fastq 01-data/SRR6920043_2.fastq
    SRR6920044 01-data/SRR6920044_1.fastq 01-data/SRR6920044_2.fastq
    ... ... ...
  • gg-13-8-99-nb-classifier.qza: A pre-built classifier from QIIME2.

Note: Even though QIIME2 (and the column names in manifest.txt) require absolute paths to the forward and reverse reads, it is sufficient to provide relative paths, as the pipeline automatically rewrites this file to use absolute paths.
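For illustration only (the pipeline performs this rewrite itself), expanding the relative 01-data/ paths to absolute paths could look like the following sketch; the output file name manifest-abs.txt is hypothetical:

# Prefix every relative 01-data/ path with the current working directory
sed "s|01-data/|$(pwd)/01-data/|g" 00-helperfiles/manifest.txt > manifest-abs.txt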

Expected output

The visual results from the pipeline can be found in the folder 05-visuals/ and are generated using the R_analysis.R script under the scripts/ folder. The pipeline will generate the following results:

  1. Quality control: Quality checks are performed using FastQC and MultiQC, and the results from these tools can be found in 02-fastqc/ and 03-multiqc/, respectively. Both FastQC and MultiQC generate an HTML report that can be viewed in a browser. Additional quality control is performed using QIIME2, and the corresponding QIIME2 visualization can be found in 05-visuals/. To view a QIIME2 visual, use the following command:
qiime tools view paired-end-data.qzv

The visualization should open in the default browser. This visual helps users decide how to trim reads, keeping only high-quality bases by removing primers and other sequencing artifacts.


  2. 05-visuals/01-rarefaction.png: Shows the number of species observed at various sequencing depths.


  3. 05-visuals/02-alpha-diversity.png: Generated using the raw (not rarefied) ASV table and displays differences between the two groups (pre- and post-radiation). The figure shows that species richness differs between the pre- and post-radiation groups.


  4. 05-visuals/03-relative_abundance.png: Generated using data rarefied at 90% of the minimum sample depth and shows the relative abundance of phyla in the two groups. The figure shows an enrichment of Tenericutes and Fusobacteria in pre-radiation samples and an increase of Spirochaetes in post-radiation samples.


  5. 05-visuals/04-beta_diversity.png: Multidimensional scaling (MDS) on various distance metrics to measure the dissimilarity between pre- and post-radiation samples.


  6. 05-visuals/05-heirarchal_clustering.png: Hierarchical cluster analysis generated using the Bray-Curtis distance and Ward's linkage method. The figure is generated using the vegan package in R.


  7. 05-visuals/06-phylogenetics_tree.png: A phylogenetic tree of the fifty most abundant taxa.

Other intermediate output files that could be used for additional downstream analysis include 04-qiime2/table.qza (ASV table), 04-qiime2/rep-seqs.qza (representative sequences for ASVs), 04-qiime2/aligned-rep-seqs.qza, 04-qiime2/rooted-tree.qza, and 04-qiime2/unrooted-tree.qza.

Since the above-mentioned files are QIIME2 artifacts, they need to be exported from the compressed artifact before further analysis. For example, to export the ASV table, use the following command:

qiime tools export --input-path 04-qiime2/table.qza --output-path outputDir/
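The exported ASV table is written as feature-table.biom. As in the pipeline's export rule, it can then be converted to a tab-separated table with biom convert:

biom convert -i outputDir/feature-table.biom -o outputDir/otu_table.tsv --to-tsv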

References

  1. Yang, Jiqiao, et al. "Gastrointestinal microbiome and breast cancer: correlations, mechanisms and potential clinical implications." Breast Cancer 24.2 (2017): 220-228.

  2. Tsementzi, Despina, et al. "Comparison of vaginal microbiota in gynecologic cancer patients pre‐and post‐radiation therapy and healthy women." Cancer medicine 9.11 (2020): 3714-3724.

Code Snippets

# Initial quality control with FastQC
shell:
    "fastqc {input} -o {fastqcOut} -q -t 4"
# Aggregate the FastQC results with the MultiQC Snakemake wrapper
wrapper:
    "v0.80.1/bio/multiqc"
# Import the paired-end fastq files into a QIIME2 artifact, then remove the raw data
shell:
    """
    qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path {input.manFile} --output-path {output} \
    --input-format PairedEndFastqManifestPhred33V2
    rm -r '01-data/' && touch rawDataRemoved.txt
    """
# Summarize the imported data as a QIIME2 visualization
shell:
    """
    qiime demux summarize --i-data {input} \
    --o-visualization {output}
    """
# Denoise paired-end reads with DADA2 to generate the ASV table
shell:
    "qiime dada2 denoise-paired --i-demultiplexed-seqs {input.data} \
    --p-n-threads 8 \
    --p-trim-left-f 20 --p-trim-left-r 20 \
    --p-trunc-len-f 275 --p-trunc-len-r 275 \
    --o-table {output.table} \
    --o-representative-sequences {output.rep_seqs} \
    --o-denoising-stats {output.denoise_stats}"
# Align representative sequences with MAFFT and build unrooted/rooted trees
shell:
    "qiime phylogeny align-to-tree-mafft-fasttree \
    --p-n-threads 8 \
    --i-sequences {input} \
    --o-alignment {output.aligned_rep_seqs} \
    --o-masked-alignment {output.masked_aligned_rep_seqs} \
    --o-tree {output.unrooted_tree} \
    --o-rooted-tree {output.rooted_tree}"
# Assign taxonomy with the pre-built Naive Bayes classifier
shell:
    "qiime feature-classifier classify-sklearn \
    --i-classifier {input.inClassifier} \
    --i-reads {input.reads} \
    --o-classification {output}"
# Export the QIIME2 artifacts and convert the feature table to TSV
shell:
    """
    qiime tools export --input-path {input.table} --output-path {output}
    biom convert -i {output}/feature-table.biom -o {output}/otu_table.tsv --to-tsv && cd {output}
    sed -i .bak '1d' otu_table.tsv
    sed -i .bak 's/#OTU ID//' otu_table.tsv
    cd ../
    qiime tools export --input-path {input.rep_seqs} --output-path {output}
    qiime tools export --input-path {input.tax} --output-path {output}
    qiime tools export --input-path {input.tree} --output-path {output}
    """
# Run the downstream R analysis (phyloseq and vegan plots)
shell:
    "Rscript {input.script} > Rlog.txt"
The MultiQC wrapper script (v0.80.1/bio/multiqc) referenced above:

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
