Term Project for BIOF 501 UBC


Table of Contents

  1. Background and hypothesis

  2. Dependencies

  3. Workflow

  4. Usage

  5. Input

  6. Expected output

  7. References

Background and hypothesis

Recent advances in high-throughput sequencing have provided increasing evidence of associations between the microbiome and human cancers [1]. While many standalone platforms and R packages such as QIIME2, DADA2, phyloseq, and vegan have been published and are widely used for microbiome analysis, few workflows allow integration of these tools. This pipeline uses Snakemake, QIIME2, and phyloseq to build a flexible, automated workflow that performs reproducible, in-depth analysis of any paired-end V4 16S rRNA microbiome dataset with two groups of interest and an additional variable that could account for variance in the dataset.

The dataset used in this pipeline was first described by Tsementzi et al. (2020) [2]. The objective of that study was to compare the vaginal microbiota of gynecologic cancer (endometrial/cervical) patients pre- and post-radiation therapy with that of healthy women. For this project, the dataset was randomly subsampled to include five pre- and five post-radiation therapy samples (the two groups of interest) from cervical cancer patients of different ethnicities (the additional variable). The hypothesis to test is that the microbiome composition differs between the two groups of interest, and that a proportion of the variance is explained by the additional variable (ethnicity). The workflow produces a set of plots describing rarefaction curves, alpha and beta diversity, the relative abundance of organisms in the two groups, hierarchical clustering, and phylogenetic trees. A complete list of accession IDs and relevant metadata for the dataset is available in 00-helperfiles/metadata.txt.

Dependencies

The pipeline is built using Snakemake and conda, and it is assumed that users have git, conda, and Snakemake installed. The main dependencies of this workflow include:

  • sra-toolkit=2.11.0

  • qiime2=2021.8.0

  • fastqc=0.11.9

  • multiqc=1.11

  • r-base=4.0.5

  • phyloseq=1.34.0

  • r-vegan=2.5_7

Other minor dependencies are listed in spec-file.txt.
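Once the conda environment described under Usage has been created and activated, a quick sanity check that the main tools are available might look like the following sketch:

# Confirm the main tools are on PATH inside the activated environment
qiime info
fastqc --version
multiqc --version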

Workflow

An overview of the workflow can be represented as a directed acyclic graph (DAG), as shown below:
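If needed, the DAG can be regenerated directly from the Snakefile with Snakemake itself (a minimal sketch; it assumes Graphviz's dot command is installed, which is not among the listed dependencies):

# Render the workflow DAG to a PNG using Graphviz
snakemake --dag | dot -Tpng > dag.png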

The entire workflow with the described dataset takes ~1.5 hours to run. The main steps of the workflow include:

  1. Download raw data: This step needs a list of accession numbers to download raw paired-end data from the Sequence Read Archive (SRA) using the prefetch and fasterq-dump commands from the sra-toolkit (see the sketch after this list).

  2. Initial quality control: FastQC and MultiQC analyze the raw fastq files for quality checks.

  3. Import data into QIIME2: The fastq files are then imported into a QIIME2 artifact (qza). After importing the data, the pipeline removes all raw data files downloaded from the SRA to save storage space. Therefore, these raw fastq files will not be accessible after QIIME2 imports the data.

  4. Quality control using QIIME2: The paired-end data artifact is summarized as a QIIME2 visualization (qzv).

  5. Generation of ASVs: Using the DADA2 plugin in QIIME2, the paired-end data are denoised to generate an amplicon sequence variant (ASV) table.

  6. Taxonomic classification: The representative sequences from DADA2 are passed to a pre-built QIIME2 Naive Bayes classifier to assign taxonomic labels to the generated ASVs. This pre-built classifier is trained on Greengenes 13_8 99% OTUs full-length sequences.

  7. Phylogenetics: The representative sequences for the ASVs then undergo sequence alignment using the MAFFT plugin in QIIME2 to generate aligned sequences and both unrooted and rooted trees in Newick format.

  8. Visual analysis: R packages such as phyloseq and vegan use the ASV table, representative sequences, rooted tree, and taxonomic classification table to demonstrate differences/similarities in microbiome composition between samples.
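For reference (step 1 above), a minimal sketch of the raw-data download for a single run accession, using prefetch and fasterq-dump from the sra-toolkit; the accession and the 01-data/ output directory follow the examples used elsewhere in this README:

# Fetch the .sra object for one run accession
prefetch SRR6920043
# Convert it to paired-end fastq files in 01-data/
fasterq-dump SRR6920043 --split-files -O 01-data/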

Usage

  1. Clone this repository and move into the cloned directory.
git clone https://github.com/dollinad/BIOF501TermProject.git
cd BIOF501TermProject
  2. Create and activate a conda environment named microbiome using a specification file. This environment contains the software and packages required for the pipeline to run.
conda create --name microbiome --file spec-file.txt
conda activate microbiome
  3. Finally, run the pipeline using the following command, which allows tasks to run in parallel across four cores.
snakemake --use-conda --cores 4

In one chunk, this is:

git clone https://github.com/dollinad/BIOF501TermProject.git
cd BIOF501TermProject
conda create --name microbiome --file spec-file.txt
conda activate microbiome
snakemake --use-conda --cores 4
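Optionally, a dry run previews the jobs Snakemake would schedule without executing them:

# Dry run: list the planned jobs only
snakemake -n --use-conda --cores 4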

Input

The workflow needs the following files in 00-helperfiles/ as input:

  • accessionList.txt: A list of run accessions to download from the SRA.

    SRR6920043
    SRR6920044
    ...
  • metadata.txt: A tab-delimited text file containing three columns: the SRA run ID, the group of interest, and the additional variable.

    id gynecologic_disord ethnicity
    SRR6920043 post-radiotherapy African-american
    SRR6920044 pre-radiotherapy African-american
    ... ... ...
  • manifest.txt: A tab-delimited text file containing three columns: the SRA run ID and the relative paths to the forward and reverse reads.

    sample-id forward-absolute-filepath reverse-absolute-filepath
    SRR6920043 01-data/SRR6920043_1.fastq 01-data/SRR6920043_2.fastq
    SRR6920044 01-data/SRR6920044_1.fastq 01-data/SRR6920044_2.fastq
    ... ... ...
  • gg-13-8-99-nb-classifier.qza: A pre-built classifier from QIIME2.

Note: Even though QIIME2 (and the column names in manifest.txt) require absolute paths to the forward and reverse reads, it is sufficient to provide relative paths, as the pipeline automatically rewrites this file to use absolute paths.
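For illustration only (the pipeline performs this rewrite itself), expanding the relative 01-data/ paths to absolute paths could look like the following sketch; the output file name manifest-abs.txt is hypothetical:

# Prefix every relative 01-data/ path with the current working directory
sed "s|01-data/|$(pwd)/01-data/|g" 00-helperfiles/manifest.txt > manifest-abs.txt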

Expected output

The visual results from the pipeline can be found in the folder 05-visuals/ and are generated using the R_analysis.R script under the scripts/ folder. The pipeline will generate the following results:

  1. Quality control: Quality checks are performed using FastQC and MultiQC, and the results from these tools can be found in 02-fastqc/ and 03-multiqc/, respectively. Both FastQC and MultiQC generate an HTML report that can be viewed in a browser. Additional quality control is performed using QIIME2, and the corresponding QIIME2 visualization can be found in 05-visuals/. To view a QIIME2 visual, use the following command:
qiime tools view paired-end-data.qzv

The visualization should open in the default browser. This visual helps users decide how to trim reads, keeping only high-quality bases by removing primers and other sequencing artifacts.


  2. 05-visuals/01-rarefaction.png: Shows the number of species observed at various sequencing depths.


  3. 05-visuals/02-alpha-diversity.png: Generated using the raw (not rarefied) ASV table and displays differences between the two groups (pre- and post-radiation). The figure shows that species richness differs between the pre- and post-radiation groups.


  4. 05-visuals/03-relative_abundance.png: Generated using data rarefied at 90% of the minimum sample depth and shows the relative abundance of phyla in the two groups. The figure shows an enrichment of Tenericutes and Fusobacteria in pre-radiation samples and an increase of Spirochaetes in post-radiation samples.


  5. 05-visuals/04-beta_diversity.png: Multidimensional scaling (MDS) on various distance metrics to measure the dissimilarity between pre- and post-radiation samples.


  6. 05-visuals/05-heirarchal_clustering.png: Hierarchical cluster analysis generated using the Bray-Curtis distance and Ward's linkage method. The figure is generated using the vegan package in R.


  7. 05-visuals/06-phylogenetics_tree.png: A phylogenetic tree of the fifty most abundant taxa.

Other intermediate output files that could be used for additional downstream analysis include 04-qiime2/table.qza (ASV table), 04-qiime2/rep-seqs.qza (representative sequences for ASVs), 04-qiime2/aligned-rep-seqs.qza, 04-qiime2/rooted-tree.qza, and 04-qiime2/unrooted-tree.qza.

Since the above-mentioned files are QIIME2 artifacts, they need to be exported from the compressed artifact before further analysis. For example, to export the ASV table, use the following command:

qiime tools export --input-path 04-qiime2/table.qza --output-path outputDir/
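The exported ASV table is written as feature-table.biom. As in the pipeline's export rule, it can then be converted to a tab-separated table with biom convert:

biom convert -i outputDir/feature-table.biom -o outputDir/otu_table.tsv --to-tsv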

References

  1. Yang, Jiqiao, et al. "Gastrointestinal microbiome and breast cancer: correlations, mechanisms and potential clinical implications." Breast Cancer 24.2 (2017): 220-228.

  2. Tsementzi, Despina, et al. "Comparison of vaginal microbiota in gynecologic cancer patients pre‐and post‐radiation therapy and healthy women." Cancer medicine 9.11 (2020): 3714-3724.

Code Snippets

# Initial quality control with FastQC
shell:
    "fastqc {input} -o {fastqcOut} -q -t 4"
# Aggregate the FastQC results with the MultiQC Snakemake wrapper
wrapper:
    "v0.80.1/bio/multiqc"
# Import the paired-end fastq files into a QIIME2 artifact, then remove the raw data
shell:
    """
    qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path {input.manFile} --output-path {output} \
    --input-format PairedEndFastqManifestPhred33V2
    rm -r '01-data/' && touch rawDataRemoved.txt
    """
# Summarize the imported data as a QIIME2 visualization
shell:
    """
    qiime demux summarize --i-data {input} \
    --o-visualization {output}
    """
# Denoise paired-end reads with DADA2 to generate the ASV table
shell:
    "qiime dada2 denoise-paired --i-demultiplexed-seqs {input.data} \
    --p-n-threads 8 \
    --p-trim-left-f 20 --p-trim-left-r 20 \
    --p-trunc-len-f 275 --p-trunc-len-r 275 \
    --o-table {output.table} \
    --o-representative-sequences {output.rep_seqs} \
    --o-denoising-stats {output.denoise_stats}"
# Align representative sequences with MAFFT and build unrooted/rooted trees
shell:
    "qiime phylogeny align-to-tree-mafft-fasttree \
    --p-n-threads 8 \
    --i-sequences {input} \
    --o-alignment {output.aligned_rep_seqs} \
    --o-masked-alignment {output.masked_aligned_rep_seqs} \
    --o-tree {output.unrooted_tree} \
    --o-rooted-tree {output.rooted_tree}"
# Assign taxonomy with the pre-built Naive Bayes classifier
shell:
    "qiime feature-classifier classify-sklearn \
    --i-classifier {input.inClassifier} \
    --i-reads {input.reads} \
    --o-classification {output}"
# Export the QIIME2 artifacts and convert the feature table to TSV
shell:
    """
    qiime tools export --input-path {input.table} --output-path {output}
    biom convert -i {output}/feature-table.biom -o {output}/otu_table.tsv --to-tsv && cd {output}
    sed -i .bak '1d' otu_table.tsv
    sed -i .bak 's/#OTU ID//' otu_table.tsv
    cd ../
    qiime tools export --input-path {input.rep_seqs} --output-path {output}
    qiime tools export --input-path {input.tax} --output-path {output}
    qiime tools export --input-path {input.tree} --output-path {output}
    """
# Run the downstream R analysis (phyloseq and vegan plots)
shell:
    "Rscript {input.script} > Rlog.txt"
The MultiQC wrapper script (v0.80.1/bio/multiqc) referenced above:

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
