CGR's QIIME microbiome pipeline


This is the Cancer Genomics Research Laboratory's (CGR) microbiome analysis pipeline. It uses QIIME2 to classify sequence data, calculate relative abundance, and perform alpha- and beta-diversity analyses.

How to run

Input requirements

  • Manifest file

    • For external (non-CGR-produced data) runs, the following columns are required:
    #SampleID Run-ID Project-ID fq1 fq2
    
    • For internal runs, the following columns are required:
    #SampleID Run-ID Project-ID
    
    • See the template manifest files in this repo for examples; a minimal sketch also follows this list
  • config.yaml

  • (for production runs) run_pipeline.sh
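For illustration, a minimal external-run manifest might look like the following (tab-separated; the sample IDs, run ID, project ID, and fastq paths below are hypothetical — the template manifest files in this repo are the authoritative format):

#SampleID  Run-ID                               Project-ID  fq1                             fq2
SC000001   190101_M01234_0001_000000000-ABCDE   NP0001      /path/to/SC000001_R1.fastq.gz   /path/to/SC000001_R2.fastq.gz
SC000002   190101_M01234_0001_000000000-ABCDE   NP0001      /path/to/SC000002_R1.fastq.gz   /path/to/SC000002_R2.fastq.gz

For internal runs, omit the fq1/fq2 columns.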

Options to run the pipeline (choose one)

A. Production run: copy run_pipeline.sh and config.yaml to your directory, edit as needed, then execute the script.

B. Dev/testing only: copy and edit config.yaml, then run the Snakefile directly, e.g.:

module load slurm singularity perl/5.36 bbmap python3/3.10.2 
conf=${PWD}/config.yaml snakemake -s /path/to/pipeline/Snakefile
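For option A, the production sequence looks roughly like this (paths are placeholders, and the exact submission step depends on how you edit run_pipeline.sh):

cp /path/to/pipeline/run_pipeline.sh /path/to/pipeline/config.yaml .
# edit config.yaml (and run_pipeline.sh if needed) for your project
bash run_pipeline.sh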

Configuration details

  • metadata_manifest: full path to manifest file

  • out_dir: full path to desired output directory (note that CGR production runs are stored at /DCEG/Projects/Microbiome/Analysis/ )

  • exec_dir: full path to the pipeline directory (i.e. the location of the Snakefile)

  • fastq_abs_path: full path to fastqs

  • temp_dir: full path to temp/scratch space

  • qiime2_sif: singularity image for qiime2_2019.1

  • reference_db: one or more trained classifiers to be used for taxonomic classification; be sure each trained classifier matches the QIIME2 version in use

  • cluster_mode: options are a cluster submission command string (e.g. 'sbatch ...'), 'local', 'dryrun', or 'unlock'

    • Example for ccad2: 'sbatch -p bigmemq --mem=100g --time=24:00:00 --output=/path/to/project/directory/logs/slurm-%j.out --cpus-per-task={threads}'
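Putting the options above together, a config.yaml might look roughly like this (every value is a placeholder; the classifier filenames mirror the example output below but are illustrative — consult the template config.yaml in this repo for the authoritative schema):

metadata_manifest: /path/to/manifest.txt
out_dir: /path/to/output/
exec_dir: /path/to/pipeline/
fastq_abs_path: /path/to/fastqs/
temp_dir: /path/to/scratch/
qiime2_sif: /path/to/qiime2_2019.1.sif
reference_db:
  - /path/to/gg-13-8-99-515-806-nb-classifier.qza
  - /path/to/silva-132-99-515-806-nb-classifier.qza
cluster_mode: 'sbatch -p bigmemq --mem=100g --time=24:00:00 --output=/path/to/project/directory/logs/slurm-%j.out --cpus-per-task={threads}'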

Workflow summary

  1. Manifest management:
  • Manifest provided in config.yaml is checked for compliance with QIIME2 specifications

  • Per-flow-cell manifests are created

  2. Symlink fastqs to be viewable by DCEG PIs

  3. Import and demultiplex fastqs

  4. For external data only, fix any unmatched paired reads (these can be generated by the fastq QC process)

  5. Denoise

  6. Merge feature and sequence tables across flow cells; drop samples with zero reads

  7. Build a multiple sequence alignment, then build rooted and unrooted phylogenetic trees

  8. Perform alpha- and beta-diversity analysis, rarefaction, and taxonomic classification

Example output directory structure

  • Within the parent directory <out_dir>/ defined in config.yaml (TODO: needs updating!):
.
├── LICENSE.md
├── README.md
├── bacteria_only
│   ├── feature_tables
│   └── sequence_tables
├── config
│   └── config.yaml
├── config.yaml
├── denoising
│   ├── feature_tables
│   ├── sequence_tables
│   └── stats
├── diversity_core_metrics
├── fastqs
├── import_and_demultiplex
├── logs
│   ├── slurm-136008.out
│   ├── ...
│   └── slurm-136610.out
├── manifests
│   └── manifest_qiime2.tsv
├── phylogenetics
│   ├── masked_msa.qza
│   ├── msa.qza
│   ├── rooted_tree.qza
│   └── unrooted_tree.qza
├── read_feature_and_sample_filtering
│   ├── feature_tables
│   └── sequence_tables
├── report
├── run_pipeline.sh
├── run_times
├── snakemake-136007.err
├── snakemake-136007.out
├── taxonomic_classification
│   ├── gg-13-8-99-515-806-nb-classifier
│   │   ├── barplots.qzv
│   │   ├── barplots_data_files
│   │   │   └── taxonomy.tsv
│   │   ├── taxa.qza
│   │   └── taxa.qzv
│   └── silva-132-99-515-806-nb-classifier
│       ├── barplots.qzv
│       ├── barplots_data_files
│       │   └── taxonomy.tsv
│       ├── taxa.qza
│       └── taxa.qzv
├── taxonomic_classification_bacteria_only
│   ├── gg-13-8-99-515-806-nb-classifier
│   │   ├── barplots.qzv
│   │   ├── barplots_data_files
│   │   │   └── taxonomy.tsv
│   │   └── taxa.qza
│   └── silva-132-99-515-806-nb-classifier
│       ├── barplots.qzv
│       ├── barplots_data_files
│       │   └── taxonomy.tsv
│       └── taxa.qza
└── workflow
    ├── Snakefile
    ├── rules
    └── scripts
        ├── Q2Manifest.pl
        ├── Q2Manifest.t
        └── qQ2_wrapper.sh

QC report

Input requirements

The manifest requirements for the report are the same as for the pipeline itself; however, including the following in your manifest will result in a more informative QC report:

  • To be detected as blanks, water blanks and no-template control sample IDs must contain the string "water" or "ntc" in the "#SampleID" column; this is not case sensitive.

  • "Source PCR Plate" column: header is case insensitive, can have spaces or not. For the report, only the first characters preceding an underscore in this column will be preserved; this strips CGR's well ID from the string.

  • "Sample Type" column: header is case insensitive, can have spaces or not. Populate with any string values that define useful categories, e.g. water, NTC_control, blank, study sample, qc, etc.

  • "External ID" column: header is case insensitive, can have spaces or not. This column is used at CGR to map IDs of received samples to the LIMS-generated IDs we use internally. For technical replicates, the sample ID will be unique (as required by QIIME), but the external ID can be the same to link the samples for comparison in the report.

  • Note that the report assumes that the sequencer ID is the second underscore-delimited field in the run ID; this may not be meaningful if your run ID does not conform.
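As a sketch, a manifest carrying these optional columns might look like this (tab-separated; all IDs are hypothetical — note the "water" string in the second sample ID, which flags it as a blank, and the well ID after the underscore in Plate1_A01, which the report strips):

#SampleID   Run-ID                               Project-ID  Source PCR Plate  Sample Type   External ID
SC000001    190101_M01234_0001_000000000-ABCDE   NP0001      Plate1_A01        study sample  EXT001
Water_0001  190101_M01234_0001_000000000-ABCDE   NP0001      Plate1_H12        water         Water_0001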

Running the report

For CGR production runs, after the pipeline completes, generate a QC report using the Jupyter notebook in the report/ directory.

  • Open the notebook (see below for details) and change the proj_dir variable in section 1.1

  • Run the complete notebook

  • Save the report as html with the code hidden (see below for details)

Running jupyter notebooks at CGR

To run jupyter notebooks on the CGR HPC, log in to the cluster, navigate to the notebook, then run the following. You can set the port to anything above 8000.

module load python3
jupyter notebook --no-browser --port=8080

Then, on your local machine, run the following to open up an ssh tunnel:

ssh -N -L 8080:localhost:8080 <username@cluster.domain>

Finally, open your browser to the URL given by the jupyter notebook command above (http://localhost:8080/?token=<token>).

To save the report

  • Clone the following repo: https://github.com/ihuston/jupyter-hide-code-html

  • Run in terminal: jupyter nbconvert --to html --template jupyter-hide-code-html/clean_output.tpl path/to/CGR_16S_Microbiome_QC_Report.ipynb

  • Name the above file NP###_pipeline_run_folder_QC_report.html and place it in the directory with the pipeline output
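In terminal form, the save steps are roughly as follows (the NP### placeholder is left as-is, per the naming convention above):

git clone https://github.com/ihuston/jupyter-hide-code-html
jupyter nbconvert --to html --template jupyter-hide-code-html/clean_output.tpl path/to/CGR_16S_Microbiome_QC_Report.ipynb
# rename the resulting .html to NP###_pipeline_run_folder_QC_report.html
# and move it into the directory with the pipeline output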

For development

Jupyter notebooks are stored as JSON with metadata, which can result in very messy diffs/merge requests in version control. Please do the following to create a clean Python script version of the notebook to be used in diffs.

  • After making and testing changes, Kernel > Restart & Clear Output

  • Run in terminal: jupyter nbconvert --to script CGR_16S_Microbiome_QC_Report.ipynb

  • Add and commit both CGR_16S_Microbiome_QC_Report.ipynb AND CGR_16S_Microbiome_QC_Report.py
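In terminal form (the commit message is illustrative):

jupyter nbconvert --to script CGR_16S_Microbiome_QC_Report.ipynb
git add CGR_16S_Microbiome_QC_Report.ipynb CGR_16S_Microbiome_QC_Report.py
git commit -m "Update QC report notebook and regenerated script"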

Notes

  • Samples are processed at the flow-cell level due to DADA2's requirements: the DADA2 algorithm includes an error model that assumes a single sequencing run. The pitfall of merging runs before DADA2 is that samples from a lower-quality (but still passing) run may be thrown out because their quality is significantly lower than that of a high-performing run.

  • After creating the QIIME2 manifest file, Keemei (https://keemei.qiime2.org) can be used from Google Chrome to verify the manifest is in the correct format.

Code Snippets
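The following shell directives are excerpted from the workflow's Snakemake rules; the line numbers indicate where each snippet sits in the source.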

Lines 237-239 (normalize manifest line endings, then validate with Q2Manifest.pl):
shell:
    'dos2unix -n {input} {output};'
    'perl {params.e}workflow/scripts/Q2Manifest.pl {output}'
Lines 254-256:
shell:
    'ln -s {input.fq1} {output.sym1};'
    'ln -s {input.fq2} {output.sym2}'
Lines 273-278 (external data header fix: if the first field of the fastq header contains a space, rewrite the headers, preserving the original as orig_header=...; otherwise just symlink the input):
shell:
    'if [[ $(zcat {input} | head -n1 | cut -f1 -d\":\") =~ " " ]]; then \
        zcat {input} | awk \'{{if (NR % 4 == 1) {{n=split($0, arr, " "); split(arr[2],tag,":"); printf "@%s ", arr[2]; for (i=3; i<=n; i++) printf "%s ",arr[i]; printf "orig_header=@%s %s\\n",substr(arr[1],2,length(arr[1])-1),tag[1]}} else {{print $0}}}}\' | gzip -c > {output}; \
    else \
        ln -s {input} {output}; \
    fi'
Lines 294-299 (the same header fix as the previous rule):
shell:
    'if [[ $(zcat {input} | head -n1 | cut -f1 -d\":\") =~ " " ]]; then \
        zcat {input} | awk \'{{if (NR % 4 == 1) {{n=split($0, arr, " "); split(arr[2],tag,":"); printf "@%s ", arr[2]; for (i=3; i<=n; i++) printf "%s ",arr[i]; printf "orig_header=@%s %s\\n",substr(arr[1],2,length(arr[1])-1),tag[1]}} else {{print $0}}}}\' | gzip -c > {output}; \
    else \
        ln -s {input} {output}; \
    fi'
Lines 315-319 (re-pair unmatched reads with bbmap's repair.sh, then compress the outputs):
shell:
    'repair.sh in1={input.fq1} in2={input.fq2} out1={params.fq1} out2={params.fq2} outs={params.single} repair;'
    'gzip {params.fq1};'
    'gzip {params.fq2};'
    'gzip {params.single}'
Lines 345-347:
shell:
    'echo "{wildcards.sample},{input.fq1},forward,{params.runID}" > {output};' 
    'echo "{wildcards.sample},{input.fq2},reverse,{params.runID}" >> {output}'
Lines 360-361:
shell:
    'find {params} -maxdepth 1 -name \'*Q2_manifest_by_sample.txt\' | xargs cat > {output}'
Lines 372-373:
shell:
    'awk \'BEGIN{{FS=OFS=","; print "sample-id,absolute-filepath,direction"}}$4=="{wildcards.runID}"{{print $1,$2,$3}}\' {input} > {output}'
Lines 398-403:
shell:
    'qiime tools import \
        --type {params.in_type} \
        --input-path {input} \
        --output-path {output} \
        --input-format {params.in_format}'
Lines 421-424:
shell:
    'qiime demux summarize \
        --i-data {input} \
        --o-visualization {output}'
Lines 450-462:
shell:
    'qiime dada2 denoise-paired \
        --verbose \
        --p-n-threads {threads} \
        --i-demultiplexed-seqs {input.qza} \
        --o-table {output.features} \
        --o-representative-sequences {output.seqs} \
        --o-denoising-stats {output.stats} \
        --p-trim-left-f {params.trim_l_f} \
        --p-trim-left-r {params.trim_l_r} \
        --p-trunc-len-f {params.trun_len_f} \
        --p-trunc-len-r {params.trun_len_r} \
        --p-min-fold-parent-over-abundance {params.min_fold}'
Lines 479-482:
shell:
    'qiime metadata tabulate \
        --m-input-file {input} \
        --o-visualization {output}'
Lines 519-530 (merge per-flow-cell feature tables, or copy the single table when the manifest lists only one Run-ID):
shell:
    """
    total_runs=`awk -F "\\t" -v col=Run-ID 'NR==1{{for(i=1;i<=NF;i++){{if($i==col){{c=i;break}}}} print $c}} NR>1{{print $c}}' {input.q2_manifest} \
                |sed 1d |sort|uniq|wc -l`; 
    if [ $total_runs -eq 1 ]; then
        cp {input.feature_tables} {output}
    else  
        tables=`awk -F "\\t" -v col=Run-ID 'NR==1{{for(i=1;i<=NF;i++){{if($i==col){{c=i;break}}}} print $c}} \
                NR>1{{print "--i-tables denoising/feature_tables/"$c".qza"}}' {input.q2_manifest} |sed 1d |sort|uniq`; 
        qiime feature-table merge $tables --o-merged-table {output}
    fi
    """
Lines 555-568 (the same merge logic for sequence tables, using merge-seqs):
shell:
    """
    total_runs=`awk -F "\\t" -v col=Run-ID 'NR==1{{for(i=1;i<=NF;i++){{if($i==col){{c=i;break}}}} print $c}} NR>1{{print $c}}' {input.q2_manifest} \
                |sed 1d |sort|uniq|wc -l`;
    echo $total_runs;
    if [ $total_runs -eq 1 ]; then
       cp {input.seqs} {output}
    else
       tables=`awk -F "\\t" -v col=Run-ID 'NR==1{{for(i=1;i<=NF;i++){{if($i==col){{c=i;break}}}} print $c}} \
               NR>1{{print "--i-data denoising/sequence_tables/"$c".qza"}}' {input.q2_manifest} |sed 1d |sort|uniq`; 
       echo $tables
       qiime feature-table merge-seqs $tables --o-merged-data {output}
    fi
    """
Lines 586-590:
shell:
    'qiime feature-table filter-samples \
        --i-table {input} \
        --p-min-frequency {params.f} \
        --o-filtered-table {output}'
Lines 605-609:
shell:
    'qiime feature-table filter-features \
        --i-table {input} \
        --p-min-frequency {params.f} \
        --o-filtered-table {output}'
Lines 624-628:
shell:
    'qiime feature-table filter-features \
        --i-table {input} \
        --p-min-samples {params.f} \
        --o-filtered-table {output}'
Lines 646-650:
shell:
    'qiime feature-table filter-samples \
        --i-table {input} \
        --p-min-features {params.f} \
        --o-filtered-table {output}'
Lines 673-689:
shell:
    'qiime feature-table summarize \
        --i-table {input.qza1} \
        --o-visualization {output.qzv1} \
        --m-sample-metadata-file {input.q2_manifest} && \
    qiime feature-table summarize \
        --i-table {input.qza2} \
        --o-visualization {output.qzv2} \
        --m-sample-metadata-file {input.q2_manifest} && \
    qiime feature-table summarize \
        --i-table {input.qza3} \
        --o-visualization {output.qzv3} \
        --m-sample-metadata-file {input.q2_manifest} && \
    qiime feature-table summarize \
        --i-table {input.qza4} \
        --o-visualization {output.qzv4} \
        --m-sample-metadata-file {input.q2_manifest}'
Lines 707-711:
shell:
    'qiime feature-table filter-seqs --i-data {input.seq_table} --i-table {input.feat1} --o-filtered-data {output.seq1} && \
        qiime feature-table filter-seqs --i-data {input.seq_table} --i-table {input.feat2} --o-filtered-data {output.seq2} && \
        qiime feature-table filter-seqs --i-data {input.seq_table} --i-table {input.feat3} --o-filtered-data {output.seq3} && \
        qiime feature-table filter-seqs --i-data {input.seq_table} --i-table {input.feat4} --o-filtered-data {output.seq4}'
Lines 732-744:
shell:
    'qiime feature-table tabulate-seqs \
        --i-data {input.qza1} \
        --o-visualization {output.qzv1} && \
    qiime feature-table tabulate-seqs \
        --i-data {input.qza2} \
        --o-visualization {output.qzv2} && \
    qiime feature-table tabulate-seqs \
        --i-data {input.qza3} \
        --o-visualization {output.qzv3} && \
    qiime feature-table tabulate-seqs \
        --i-data {input.qza4} \
        --o-visualization {output.qzv4}'
Lines 755-758:
shell:
    'qiime feature-table tabulate-seqs \
        --i-data {input} \
        --o-visualization {output}'
Lines 770-774:
shell:
    'qiime feature-table summarize \
        --i-table {input.qza} \
        --o-visualization {output} \
        --m-sample-metadata-file {input.q2_manifest}'
Lines 810-815:
shell:
    'qiime feature-classifier {params.c_method} \
        --p-n-jobs {threads} \
        --i-classifier {input.ref} \
        --i-reads {input.seqs} \
        --o-classification {output}'
Lines 851-856:
shell:
    'qiime feature-classifier {params.c_method} \
        --p-n-jobs {threads} \
        --i-classifier {input.ref} \
        --i-reads {input.seqs} \
        --o-classification {output}'
Lines 871-874:
shell:
    "qiime tools export --input-path {input} --output-path {params} && \
        sed 's/ \t/\t/' {output.o1} > {output.o2} && \
        qiime tools import --type 'FeatureData[Taxonomy]' --input-path {output.o2} --output-path {output.o3}"
Lines 892-895:
shell:
    'qiime metadata tabulate \
        --m-input-file {input} \
        --o-visualization {output}'
Lines 915-920:
shell:
    'qiime taxa barplot \
        --i-table {input.seqs} \
        --i-taxonomy {input.tax} \
        --m-metadata-file {input.manifest} \
        --o-visualization {output}'
Lines 940-945:
shell:
    'qiime taxa barplot \
        --i-table {input.seqs} \
        --i-taxonomy {input.tax} \
        --m-metadata-file {input.manifest} \
        --o-visualization {output}'
Lines 967-969:
shell:
    'qiime tools export --input-path {input.taxonomy_qza} --output-path {params.d}; \
    qiime tools export --input-path {input.taxonomy_bar_plots} --output-path {params.d}'
Lines 991-996:
shell:
    'qiime taxa filter-table \
        --i-table {input.seqs} \
        --i-taxonomy {input.tax} \
        --p-include "D_0__Bacteria;D_1,k__Bacteria; p__" \
        --o-filtered-table {output}'
Lines 1022-1028:
shell:
    'qiime taxa filter-table \
        --i-table {input.features} \
        --i-taxonomy {input.tax} \
        --p-mode exact \
        --p-exclude "k__Bacteria; p__" \
        --o-filtered-table {output}'
Lines 1046-1050:
shell:
    'qiime feature-table filter-seqs \
        --i-data {input.seqs} \
        --i-table {input.bacterial_features} \
        --o-filtered-data {output}'
Lines 1062-1069:
shell:
    'qiime feature-table summarize \
        --i-table {input.features} \
        --o-visualization {output.features} \
        --m-sample-metadata-file {input.q2_manifest} && \
    qiime feature-table tabulate-seqs \
        --i-data {input.seqs} \
        --o-visualization {output.seqs}'
Lines 1093-1099:
shell:
    'qiime phylogeny align-to-tree-mafft-fasttree \
        --i-sequences {input} \
        --o-alignment {output.msa} \
        --o-masked-alignment {output.masked_msa} \
        --o-tree {output.unrooted_tree} \
        --o-rooted-tree {output.rooted_tree}'
Lines 1156-1178 (core phylogenetic diversity metrics):
shell:
    'qiime diversity core-metrics-phylogenetic \
        --i-phylogeny {input.rooted_tree} \
        --i-table {input.features} \
        --p-sampling-depth {params.samp_depth} \
        --m-metadata-file {input.q2_manifest} \
        --o-rarefied-table {output.rare} \
        --o-faith-pd-vector {output.faith} \
        --o-observed-otus-vector {output.obs} \
        --o-shannon-vector {output.shan} \
        --o-evenness-vector {output.even} \
        --o-unweighted-unifrac-distance-matrix {output.unw_dist} \
        --o-unweighted-unifrac-pcoa-results {output.unw_pcoa} \
        --o-unweighted-unifrac-emperor {output.unw_emp} \
        --o-weighted-unifrac-distance-matrix {output.w_dist} \
        --o-weighted-unifrac-pcoa-results {output.w_pcoa} \
        --o-weighted-unifrac-emperor {output.w_emp} \
        --o-jaccard-distance-matrix {output.jac_dist} \
        --o-jaccard-pcoa-results {output.jac_pcoa} \
        --o-jaccard-emperor {output.jac_emp} \
        --o-bray-curtis-distance-matrix {output.bc_dist} \
        --o-bray-curtis-pcoa-results {output.bc_pcoa} \
        --o-bray-curtis-emperor {output.bc_emp}'
Lines 1196-1202:
shell:
    'qiime metadata tabulate \
        --m-input-file {input.obs} \
        --m-input-file {input.shan} \
        --m-input-file {input.even} \
        --m-input-file {input.faith} \
        --o-visualization {output}'
Lines 1227-1233:
shell:
    'qiime diversity alpha-rarefaction \
        --i-table {input.features} \
        --i-phylogeny {input.rooted} \
        --p-max-depth {params.m_depth} \
        --m-metadata-file {input.q2_manifest} \
        --o-visualization {output}'
Lines 1251-1254:
shell:
    'qiime tools export --input-path {input.table_dada2_qza} --output-path {params.out1}; \
    biom convert -i {output.table_dada2_biom} -o {output.table_dada2_biom_tsv} --to-tsv; \
    qiime tools export --input-path {input.repseq_dada2_qza} --output-path {params.out2}'
Lines 1272-1275:
shell:
    'qiime tools export --input-path {input.table_dada2_qza} --output-path {params.out1}; \
    biom convert -i {output.table_dada2_biom} -o {output.table_dada2_biom_tsv} --to-tsv; \
    qiime tools export --input-path {input.repseq_dada2_qza} --output-path {params.out2}'

Repository: https://github.com/NCI-CGR/QIIME_pipeline
Name: qiime_pipeline
Version: v2.3.0
License: MIT License
