CGR's QIIME microbiome pipeline


This is the Cancer Genomics Research Laboratory's (CGR) microbiome analysis pipeline. It uses QIIME2 to classify sequence data, calculate relative abundance, and perform alpha- and beta-diversity analyses.

How to run

Input requirements

  • Manifest file

    • For external (non-CGR-produced data) runs, the following columns are required:
    #SampleID Run-ID Project-ID fq1 fq2
    
    • For internal runs, the following columns are required:
    #SampleID Run-ID Project-ID
    
    • See the template manifest files in this repo for examples; a minimal sketch also follows this list
  • config.yaml

  • (for production runs) run_pipeline.sh
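For illustration, a minimal external-run manifest might look like the following (tab-separated; the sample IDs, run ID, project ID, and fastq paths below are hypothetical — the template manifest files in this repo are the authoritative format):

#SampleID  Run-ID                               Project-ID  fq1                             fq2
SC000001   190101_M01234_0001_000000000-ABCDE   NP0001      /path/to/SC000001_R1.fastq.gz   /path/to/SC000001_R2.fastq.gz
SC000002   190101_M01234_0001_000000000-ABCDE   NP0001      /path/to/SC000002_R1.fastq.gz   /path/to/SC000002_R2.fastq.gz

For internal runs, omit the fq1/fq2 columns.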

Options to run the pipeline (choose one)

A. Production run: copy run_pipeline.sh and config.yaml to your directory, edit as needed, then execute the script.

B. Dev/testing only: copy and edit config.yaml, then run the Snakefile directly, e.g.:

module load slurm singularity perl/5.36 bbmap python3/3.10.2 
conf=${PWD}/config.yaml snakemake -s /path/to/pipeline/Snakefile
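For option A, the production sequence looks roughly like this (paths are placeholders, and the exact submission step depends on how you edit run_pipeline.sh):

cp /path/to/pipeline/run_pipeline.sh /path/to/pipeline/config.yaml .
# edit config.yaml (and run_pipeline.sh if needed) for your project
bash run_pipeline.sh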

Configuration details

  • metadata_manifest: full path to manifest file

  • out_dir: full path to desired output directory (note that CGR production runs are stored at /DCEG/Projects/Microbiome/Analysis/ )

  • exec_dir: full path to the pipeline directory (i.e. the location of the Snakefile)

  • fastq_abs_path: full path to fastqs

  • temp_dir: full path to temp/scratch space

  • qiime2_sif: singularity image for qiime2_2019.1

  • reference_db: one or more trained classifiers to be used for taxonomic classification; be sure each trained classifier matches the QIIME2 version in use

  • cluster_mode: options are a cluster submission command string (e.g. 'sbatch ...'), 'local', 'dryrun', or 'unlock'

    • Example for ccad2: 'sbatch -p bigmemq --mem=100g --time=24:00:00 --output=/path/to/project/directory/logs/slurm-%j.out --cpus-per-task={threads}'
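Putting the options above together, a config.yaml might look roughly like this (every value is a placeholder; the classifier filenames mirror the example output below but are illustrative — consult the template config.yaml in this repo for the authoritative schema):

metadata_manifest: /path/to/manifest.txt
out_dir: /path/to/output/
exec_dir: /path/to/pipeline/
fastq_abs_path: /path/to/fastqs/
temp_dir: /path/to/scratch/
qiime2_sif: /path/to/qiime2_2019.1.sif
reference_db:
  - /path/to/gg-13-8-99-515-806-nb-classifier.qza
  - /path/to/silva-132-99-515-806-nb-classifier.qza
cluster_mode: 'sbatch -p bigmemq --mem=100g --time=24:00:00 --output=/path/to/project/directory/logs/slurm-%j.out --cpus-per-task={threads}'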

Workflow summary

  1. Manifest management:
  • Manifest provided in config.yaml is checked for compliance with QIIME2 specifications

  • Per-flow-cell manifests are created

  2. Symlink fastqs to be viewable by DCEG PIs

  3. Import and demultiplex fastqs

  4. For external data only, fix any unmatched paired reads (these can be generated by the fastq QC process)

  5. Denoise

  6. Merge feature and sequence tables across flow cells; drop samples with zero reads

  7. Build a multiple sequence alignment, then build rooted and unrooted phylogenetic trees

  8. Perform alpha- and beta-diversity analysis, rarefaction, and taxonomic classification

Example output directory structure

  • Within the parent directory <out_dir>/ defined in config.yaml (TODO: needs updating!):
.
├── LICENSE.md
├── README.md
├── bacteria_only
│   ├── feature_tables
│   └── sequence_tables
├── config
│   └── config.yaml
├── config.yaml
├── denoising
│   ├── feature_tables
│   ├── sequence_tables
│   └── stats
├── diversity_core_metrics
├── fastqs
├── import_and_demultiplex
├── logs
│   ├── slurm-136008.out
│   ├── ...
│   └── slurm-136610.out
├── manifests
│   └── manifest_qiime2.tsv
├── phylogenetics
│   ├── masked_msa.qza
│   ├── msa.qza
│   ├── rooted_tree.qza
│   └── unrooted_tree.qza
├── read_feature_and_sample_filtering
│   ├── feature_tables
│   └── sequence_tables
├── report
├── run_pipeline.sh
├── run_times
├── snakemake-136007.err
├── snakemake-136007.out
├── taxonomic_classification
│   ├── gg-13-8-99-515-806-nb-classifier
│   │   ├── barplots.qzv
│   │   ├── barplots_data_files
│   │   │   └── taxonomy.tsv
│   │   ├── taxa.qza
│   │   └── taxa.qzv
│   └── silva-132-99-515-806-nb-classifier
│       ├── barplots.qzv
│       ├── barplots_data_files
│       │   └── taxonomy.tsv
│       ├── taxa.qza
│       └── taxa.qzv
├── taxonomic_classification_bacteria_only
│   ├── gg-13-8-99-515-806-nb-classifier
│   │   ├── barplots.qzv
│   │   ├── barplots_data_files
│   │   │   └── taxonomy.tsv
│   │   └── taxa.qza
│   └── silva-132-99-515-806-nb-classifier
│       ├── barplots.qzv
│       ├── barplots_data_files
│       │   └── taxonomy.tsv
│       └── taxa.qza
└── workflow
    ├── Snakefile
    ├── rules
    └── scripts
        ├── Q2Manifest.pl
        ├── Q2Manifest.t
        └── qQ2_wrapper.sh

QC report

Input requirements

The manifest requirements for the report are the same as for the pipeline itself; however, including the following in your manifest will result in a more informative QC report:

  • To be detected as blanks, water blanks and no-template control sample IDs must contain the string "water" or "ntc" in the "#SampleID" column; this is not case sensitive.

  • "Source PCR Plate" column: header is case insensitive, can have spaces or not. For the report, only the first characters preceding an underscore in this column will be preserved; this strips CGR's well ID from the string.

  • "Sample Type" column: header is case insensitive, can have spaces or not. Populate with any string values that define useful categories, e.g. water, NTC_control, blank, study sample, qc, etc.

  • "External ID" column: header is case insensitive, can have spaces or not. This column is used at CGR to map IDs of received samples to the LIMS-generated IDs we use internally. For technical replicates, the sample ID will be unique (as required by QIIME), but the external ID can be the same to link the samples for comparison in the report.

  • Note that the report assumes that the sequencer ID is the second underscore-delimited field in the run ID; this may not be meaningful if your run ID does not conform.
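As a sketch, a manifest carrying these optional columns might look like this (tab-separated; all IDs are hypothetical — note the "water" string in the second sample ID, which flags it as a blank, and the well ID after the underscore in Plate1_A01, which the report strips):

#SampleID   Run-ID                               Project-ID  Source PCR Plate  Sample Type   External ID
SC000001    190101_M01234_0001_000000000-ABCDE   NP0001      Plate1_A01        study sample  EXT001
Water_0001  190101_M01234_0001_000000000-ABCDE   NP0001      Plate1_H12        water         Water_0001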

Running the report

For CGR production runs, after the pipeline completes, generate a QC report using the Jupyter notebook in the report/ directory.

  • Open the notebook (see below for details) and change the proj_dir variable in section 1.1

  • Run the complete notebook

  • Save the report as html with the code hidden (see below for details)

Running jupyter notebooks at CGR

To run jupyter notebooks on the CGR HPC, log in to the cluster, navigate to the notebook, then run the following. You can set the port to anything above 8000.

module load python3
jupyter notebook --no-browser --port=8080

Then, on your local machine, run the following to open up an ssh tunnel:

ssh -N -L 8080:localhost:8080 <username@cluster.domain>

Finally, open your browser to the URL given by the jupyter notebook command above (http://localhost:8080/?token=<token>).

To save the report

  • Clone the following repo: https://github.com/ihuston/jupyter-hide-code-html

  • Run in terminal: jupyter nbconvert --to html --template jupyter-hide-code-html/clean_output.tpl path/to/CGR_16S_Microbiome_QC_Report.ipynb

  • Name the above file NP###_pipeline_run_folder_QC_report.html and place it in the directory with the pipeline output
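In terminal form, the save steps are roughly as follows (the NP### placeholder is left as-is, per the naming convention above):

git clone https://github.com/ihuston/jupyter-hide-code-html
jupyter nbconvert --to html --template jupyter-hide-code-html/clean_output.tpl path/to/CGR_16S_Microbiome_QC_Report.ipynb
# rename the resulting .html to NP###_pipeline_run_folder_QC_report.html
# and move it into the directory with the pipeline output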

For development

Jupyter notebooks are stored as JSON with metadata, which can result in very messy diffs/merge requests in version control. Please do the following to create a clean Python script version of the notebook to be used in diffs.

  • After making and testing changes, Kernel > Restart & Clear Output

  • Run in terminal: jupyter nbconvert --to script CGR_16S_Microbiome_QC_Report.ipynb

  • Add and commit both CGR_16S_Microbiome_QC_Report.ipynb AND CGR_16S_Microbiome_QC_Report.py
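In terminal form (the commit message is illustrative):

jupyter nbconvert --to script CGR_16S_Microbiome_QC_Report.ipynb
git add CGR_16S_Microbiome_QC_Report.ipynb CGR_16S_Microbiome_QC_Report.py
git commit -m "Update QC report notebook and regenerated script"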

Notes

  • Samples are processed at the flow-cell level due to DADA2's requirements: the DADA2 algorithm includes an error model that assumes a single sequencing run. The pitfall of merging runs before DADA2 is that samples from a lower-quality (but still passing) run may be thrown out because their quality is significantly lower than that of a high-performing run.

  • After creating the QIIME2 manifest file, Keemei (https://keemei.qiime2.org) can be used from Google Chrome to verify the manifest is in the correct format.

Code Snippets
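The following shell directives are excerpted from the workflow's Snakemake rules; the line numbers indicate where each snippet sits in the source.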

Lines 237-239 (normalize manifest line endings, then validate with Q2Manifest.pl):
shell:
    'dos2unix -n {input} {output};'
    'perl {params.e}workflow/scripts/Q2Manifest.pl {output}'
Lines 254-256:
shell:
    'ln -s {input.fq1} {output.sym1};'
    'ln -s {input.fq2} {output.sym2}'
Lines 273-278 (external data header fix: if the first field of the fastq header contains a space, rewrite the headers, preserving the original as orig_header=...; otherwise just symlink the input):
shell:
    'if [[ $(zcat {input} | head -n1 | cut -f1 -d\":\") =~ " " ]]; then \
        zcat {input} | awk \'{{if (NR % 4 == 1) {{n=split($0, arr, " "); split(arr[2],tag,":"); printf "@%s ", arr[2]; for (i=3; i<=n; i++) printf "%s ",arr[i]; printf "orig_header=@%s %s\\n",substr(arr[1],2,length(arr[1])-1),tag[1]}} else {{print $0}}}}\' | gzip -c > {output}; \
    else \
        ln -s {input} {output}; \
    fi'
Lines 294-299 (the same header fix as the previous rule):
shell:
    'if [[ $(zcat {input} | head -n1 | cut -f1 -d\":\") =~ " " ]]; then \
        zcat {input} | awk \'{{if (NR % 4 == 1) {{n=split($0, arr, " "); split(arr[2],tag,":"); printf "@%s ", arr[2]; for (i=3; i<=n; i++) printf "%s ",arr[i]; printf "orig_header=@%s %s\\n",substr(arr[1],2,length(arr[1])-1),tag[1]}} else {{print $0}}}}\' | gzip -c > {output}; \
    else \
        ln -s {input} {output}; \
    fi'
Lines 315-319 (re-pair unmatched reads with bbmap's repair.sh, then compress the outputs):
shell:
    'repair.sh in1={input.fq1} in2={input.fq2} out1={params.fq1} out2={params.fq2} outs={params.single} repair;'
    'gzip {params.fq1};'
    'gzip {params.fq2};'
    'gzip {params.single}'
Lines 345-347:
shell:
    'echo "{wildcards.sample},{input.fq1},forward,{params.runID}" > {output};' 
    'echo "{wildcards.sample},{input.fq2},reverse,{params.runID}" >> {output}'
Lines 360-361:
shell:
    'find {params} -maxdepth 1 -name \'*Q2_manifest_by_sample.txt\' | xargs cat > {output}'
Lines 372-373:
shell:
    'awk \'BEGIN{{FS=OFS=","; print "sample-id,absolute-filepath,direction"}}$4=="{wildcards.runID}"{{print $1,$2,$3}}\' {input} > {output}'
Lines 398-403:
shell:
    'qiime tools import \
        --type {params.in_type} \
        --input-path {input} \
        --output-path {output} \
        --input-format {params.in_format}'
Lines 421-424:
shell:
    'qiime demux summarize \
        --i-data {input} \
        --o-visualization {output}'
Lines 450-462:
shell:
    'qiime dada2 denoise-paired \
        --verbose \
        --p-n-threads {threads} \
        --i-demultiplexed-seqs {input.qza} \
        --o-table {output.features} \
        --o-representative-sequences {output.seqs} \
        --o-denoising-stats {output.stats} \
        --p-trim-left-f {params.trim_l_f} \
        --p-trim-left-r {params.trim_l_r} \
        --p-trunc-len-f {params.trun_len_f} \
        --p-trunc-len-r {params.trun_len_r} \
        --p-min-fold-parent-over-abundance {params.min_fold}'
Lines 479-482:
shell:
    'qiime metadata tabulate \
        --m-input-file {input} \
        --o-visualization {output}'
Lines 519-530 (merge per-flow-cell feature tables, or copy the single table when the manifest lists only one Run-ID):
shell:
    """
    total_runs=`awk -F "\\t" -v col=Run-ID 'NR==1{{for(i=1;i<=NF;i++){{if($i==col){{c=i;break}}}} print $c}} NR>1{{print $c}}' {input.q2_manifest} \
                |sed 1d |sort|uniq|wc -l`; 
    if [ $total_runs -eq 1 ]; then
        cp {input.feature_tables} {output}
    else  
        tables=`awk -F "\\t" -v col=Run-ID 'NR==1{{for(i=1;i<=NF;i++){{if($i==col){{c=i;break}}}} print $c}} \
                NR>1{{print "--i-tables denoising/feature_tables/"$c".qza"}}' {input.q2_manifest} |sed 1d |sort|uniq`; 
        qiime feature-table merge $tables --o-merged-table {output}
    fi
    """
Lines 555-568 (the same merge logic for sequence tables, using merge-seqs):
shell:
    """
    total_runs=`awk -F "\\t" -v col=Run-ID 'NR==1{{for(i=1;i<=NF;i++){{if($i==col){{c=i;break}}}} print $c}} NR>1{{print $c}}' {input.q2_manifest} \
                |sed 1d |sort|uniq|wc -l`;
    echo $total_runs;
    if [ $total_runs -eq 1 ]; then
       cp {input.seqs} {output}
    else
       tables=`awk -F "\\t" -v col=Run-ID 'NR==1{{for(i=1;i<=NF;i++){{if($i==col){{c=i;break}}}} print $c}} \
               NR>1{{print "--i-data denoising/sequence_tables/"$c".qza"}}' {input.q2_manifest} |sed 1d |sort|uniq`; 
       echo $tables
       qiime feature-table merge-seqs $tables --o-merged-data {output}
    fi
    """
Lines 586-590:
shell:
    'qiime feature-table filter-samples \
        --i-table {input} \
        --p-min-frequency {params.f} \
        --o-filtered-table {output}'
Lines 605-609:
shell:
    'qiime feature-table filter-features \
        --i-table {input} \
        --p-min-frequency {params.f} \
        --o-filtered-table {output}'
Lines 624-628:
shell:
    'qiime feature-table filter-features \
        --i-table {input} \
        --p-min-samples {params.f} \
        --o-filtered-table {output}'
Lines 646-650:
shell:
    'qiime feature-table filter-samples \
        --i-table {input} \
        --p-min-features {params.f} \
        --o-filtered-table {output}'
Lines 673-689:
shell:
    'qiime feature-table summarize \
        --i-table {input.qza1} \
        --o-visualization {output.qzv1} \
        --m-sample-metadata-file {input.q2_manifest} && \
    qiime feature-table summarize \
        --i-table {input.qza2} \
        --o-visualization {output.qzv2} \
        --m-sample-metadata-file {input.q2_manifest} && \
    qiime feature-table summarize \
        --i-table {input.qza3} \
        --o-visualization {output.qzv3} \
        --m-sample-metadata-file {input.q2_manifest} && \
    qiime feature-table summarize \
        --i-table {input.qza4} \
        --o-visualization {output.qzv4} \
        --m-sample-metadata-file {input.q2_manifest}'
Lines 707-711:
shell:
    'qiime feature-table filter-seqs --i-data {input.seq_table} --i-table {input.feat1} --o-filtered-data {output.seq1} && \
        qiime feature-table filter-seqs --i-data {input.seq_table} --i-table {input.feat2} --o-filtered-data {output.seq2} && \
        qiime feature-table filter-seqs --i-data {input.seq_table} --i-table {input.feat3} --o-filtered-data {output.seq3} && \
        qiime feature-table filter-seqs --i-data {input.seq_table} --i-table {input.feat4} --o-filtered-data {output.seq4}'
Lines 732-744:
shell:
    'qiime feature-table tabulate-seqs \
        --i-data {input.qza1} \
        --o-visualization {output.qzv1} && \
    qiime feature-table tabulate-seqs \
        --i-data {input.qza2} \
        --o-visualization {output.qzv2} && \
    qiime feature-table tabulate-seqs \
        --i-data {input.qza3} \
        --o-visualization {output.qzv3} && \
    qiime feature-table tabulate-seqs \
        --i-data {input.qza4} \
        --o-visualization {output.qzv4}'
Lines 755-758:
shell:
    'qiime feature-table tabulate-seqs \
        --i-data {input} \
        --o-visualization {output}'
Lines 770-774:
shell:
    'qiime feature-table summarize \
        --i-table {input.qza} \
        --o-visualization {output} \
        --m-sample-metadata-file {input.q2_manifest}'
Lines 810-815:
shell:
    'qiime feature-classifier {params.c_method} \
        --p-n-jobs {threads} \
        --i-classifier {input.ref} \
        --i-reads {input.seqs} \
        --o-classification {output}'
Lines 851-856:
shell:
    'qiime feature-classifier {params.c_method} \
        --p-n-jobs {threads} \
        --i-classifier {input.ref} \
        --i-reads {input.seqs} \
        --o-classification {output}'
Lines 871-874:
shell:
    "qiime tools export --input-path {input} --output-path {params} && \
        sed 's/ \t/\t/' {output.o1} > {output.o2} && \
        qiime tools import --type 'FeatureData[Taxonomy]' --input-path {output.o2} --output-path {output.o3}"
Lines 892-895:
shell:
    'qiime metadata tabulate \
        --m-input-file {input} \
        --o-visualization {output}'
Lines 915-920:
shell:
    'qiime taxa barplot \
        --i-table {input.seqs} \
        --i-taxonomy {input.tax} \
        --m-metadata-file {input.manifest} \
        --o-visualization {output}'
Lines 940-945:
shell:
    'qiime taxa barplot \
        --i-table {input.seqs} \
        --i-taxonomy {input.tax} \
        --m-metadata-file {input.manifest} \
        --o-visualization {output}'
Lines 967-969:
shell:
    'qiime tools export --input-path {input.taxonomy_qza} --output-path {params.d}; \
    qiime tools export --input-path {input.taxonomy_bar_plots} --output-path {params.d}'
Lines 991-996:
shell:
    'qiime taxa filter-table \
        --i-table {input.seqs} \
        --i-taxonomy {input.tax} \
        --p-include "D_0__Bacteria;D_1,k__Bacteria; p__" \
        --o-filtered-table {output}'
Lines 1022-1028:
shell:
    'qiime taxa filter-table \
        --i-table {input.features} \
        --i-taxonomy {input.tax} \
        --p-mode exact \
        --p-exclude "k__Bacteria; p__" \
        --o-filtered-table {output}'
Lines 1046-1050:
shell:
    'qiime feature-table filter-seqs \
        --i-data {input.seqs} \
        --i-table {input.bacterial_features} \
        --o-filtered-data {output}'
Lines 1062-1069:
shell:
    'qiime feature-table summarize \
        --i-table {input.features} \
        --o-visualization {output.features} \
        --m-sample-metadata-file {input.q2_manifest} && \
    qiime feature-table tabulate-seqs \
        --i-data {input.seqs} \
        --o-visualization {output.seqs}'
Lines 1093-1099:
shell:
    'qiime phylogeny align-to-tree-mafft-fasttree \
        --i-sequences {input} \
        --o-alignment {output.msa} \
        --o-masked-alignment {output.masked_msa} \
        --o-tree {output.unrooted_tree} \
        --o-rooted-tree {output.rooted_tree}'
Lines 1156-1178 (core phylogenetic diversity metrics):
shell:
    'qiime diversity core-metrics-phylogenetic \
        --i-phylogeny {input.rooted_tree} \
        --i-table {input.features} \
        --p-sampling-depth {params.samp_depth} \
        --m-metadata-file {input.q2_manifest} \
        --o-rarefied-table {output.rare} \
        --o-faith-pd-vector {output.faith} \
        --o-observed-otus-vector {output.obs} \
        --o-shannon-vector {output.shan} \
        --o-evenness-vector {output.even} \
        --o-unweighted-unifrac-distance-matrix {output.unw_dist} \
        --o-unweighted-unifrac-pcoa-results {output.unw_pcoa} \
        --o-unweighted-unifrac-emperor {output.unw_emp} \
        --o-weighted-unifrac-distance-matrix {output.w_dist} \
        --o-weighted-unifrac-pcoa-results {output.w_pcoa} \
        --o-weighted-unifrac-emperor {output.w_emp} \
        --o-jaccard-distance-matrix {output.jac_dist} \
        --o-jaccard-pcoa-results {output.jac_pcoa} \
        --o-jaccard-emperor {output.jac_emp} \
        --o-bray-curtis-distance-matrix {output.bc_dist} \
        --o-bray-curtis-pcoa-results {output.bc_pcoa} \
        --o-bray-curtis-emperor {output.bc_emp}'
Lines 1196-1202:
shell:
    'qiime metadata tabulate \
        --m-input-file {input.obs} \
        --m-input-file {input.shan} \
        --m-input-file {input.even} \
        --m-input-file {input.faith} \
        --o-visualization {output}'
Lines 1227-1233:
shell:
    'qiime diversity alpha-rarefaction \
        --i-table {input.features} \
        --i-phylogeny {input.rooted} \
        --p-max-depth {params.m_depth} \
        --m-metadata-file {input.q2_manifest} \
        --o-visualization {output}'
Lines 1251-1254:
shell:
    'qiime tools export --input-path {input.table_dada2_qza} --output-path {params.out1}; \
    biom convert -i {output.table_dada2_biom} -o {output.table_dada2_biom_tsv} --to-tsv; \
    qiime tools export --input-path {input.repseq_dada2_qza} --output-path {params.out2}'
Lines 1272-1275:
shell:
    'qiime tools export --input-path {input.table_dada2_qza} --output-path {params.out1}; \
    biom convert -i {output.table_dada2_biom} -o {output.table_dada2_biom_tsv} --to-tsv; \
    qiime tools export --input-path {input.repseq_dada2_qza} --output-path {params.out2}'

Repository: https://github.com/NCI-CGR/QIIME_pipeline
Name: qiime_pipeline
Version: v2.3.0
License: MIT License
