Amplicon sequence processing workflow using QIIME 2 and Snakemake


Tourmaline

Tourmaline is an amplicon sequence processing workflow for Illumina sequence data that uses QIIME 2 and the software packages it wraps. Tourmaline manages commands, inputs, and outputs using the Snakemake workflow management system.

The current version of Tourmaline supports qiime2-2023.5. To use previous versions of QIIME 2, check out previous Tourmaline versions under Releases.

Why should I use Tourmaline?

Tourmaline has several features that enhance usability and interoperability:

  • Portability. Native support for Linux and macOS in addition to Docker containers.

  • QIIME 2. The core commands of Tourmaline, including the DADA2 and Deblur packages, are all commands of QIIME 2, one of the most popular amplicon sequence analysis software tools available. You can print all of the QIIME 2 and other shell commands of your workflow before or while running the workflow.

  • Snakemake. Managing the workflow with Snakemake provides several benefits:

    • Configuration file contains all parameters in one file, so you can see what your workflow is doing and make changes for a subsequent run.

    • Directory structure is the same for every Tourmaline run, so you always know where your outputs are.

    • On-demand commands mean that only the commands required for output files not yet generated are run, saving time and computation when re-running part of a workflow.

  • Parameter optimization. The configuration file and standard directory structure make it simple to test and compare different parameter sets to optimize your workflow. Included code helps choose read truncation parameters and identify outliers in representative sequences (ASVs).

  • Visualizations and reports. Every Tourmaline run produces an HTML report containing a summary of your metadata and outputs, with links to web-viewable QIIME 2 visualization files.

  • Downstream analysis. Analyze the output of single or multiple Tourmaline runs programmatically, with qiime2R in R or the QIIME 2 Artifact API in Python, using the provided R and Python notebooks or your own code.

What QIIME 2 options does Tourmaline support?

If you have used QIIME 2 before, you might be wondering which QIIME 2 commands Tourmaline uses and supports. All commands are specified as rules in Snakefile, and typical workflows without and with sequence filtering are shown as directed acyclic graphs in the folder dags. The main analysis features and options supported by Tourmaline and specified by the Snakefile are as follows (a command for listing the rules yourself appears after this list):

  • FASTQ sequence import using a manifest file, or use your pre-imported FASTQ .qza file

  • Denoising with DADA2 (paired-end and single-end) and Deblur (single-end)

  • Feature classification (taxonomic assignment) with options of naive Bayes, consensus BLAST, and consensus VSEARCH

  • Feature filtering by taxonomy, sequence length, feature ID, and abundance/prevalence

  • De novo multiple sequence alignment with MUSCLE5, Clustal Omega, or MAFFT (with masking) and tree building with FastTree

  • Outlier detection with odseq

  • Interactive taxonomy barplot

  • Tree visualization using Empress

  • Alpha diversity, alpha rarefaction, and alpha group significance with four metrics: Faith's phylogenetic diversity, observed features, Shannon diversity, and Pielou's evenness

  • Beta diversity distances, principal coordinates, Emperor plots, and beta group significance (one metadata column) with four metrics: unweighted and weighted UniFrac, Jaccard distance, and Bray–Curtis distance

  • Robust Aitchison PCA and biplot ordination using DEICODE
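
For example, once you have cloned the Tourmaline repository (see Setup below), you can list every rule defined in the Snakefile from the tourmaline directory with Snakemake's rule-listing option:

snakemake --list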

How do I cite Tourmaline?

Please cite our paper in GigaScience:

  • Thompson, L. R., Anderson, S. R., Den Uyl, P. A., Patin, N. V., Lim, S. J., Sanderson, G. & Goodwin, K. D. Tourmaline: A containerized workflow for rapid and iterable amplicon sequence analysis using QIIME 2 and Snakemake. GigaScience, Volume 11, 2022, giac066, https://doi.org/10.1093/gigascience/giac066.

How do I get started?

If this is your first time using Tourmaline or Snakemake, you may want to browse through the Wiki for a detailed walkthrough. If you want to get started right away, check out the Quick Start below and follow along with the video tutorial on YouTube.

Contact us

  • Questions? Join the conversation on Gitter.

  • Have a feature request? Raise an issue on GitHub.

Quick Start

Tourmaline provides Snakemake rules for DADA2 (single-end and paired-end) and Deblur (single-end). For each type of processing, there are four steps:

  1. the denoise rule imports FASTQ data and runs denoising, generating a feature table and representative sequences;

  2. the taxonomy rule assigns taxonomy to representative sequences;

  3. the diversity rule does representative sequence curation, core diversity analyses, and alpha and beta group significance; and

  4. the report rule generates an HTML report of the outputs plus metadata, inputs, and parameters. Also, the report rule can be run immediately to run the entire workflow.

Steps 2–4 have unfiltered and filtered modes, the difference being that in the taxonomy step of filtered mode, undesired taxonomic groups or individual sequences from the representative sequences and feature table are removed. The diversity and report rules are the same for unfiltered and filtered modes, except the output goes into separate subdirectories.

Install

The current version of Tourmaline supports qiime2-2023.5. To use previous versions of QIIME 2, check out previous Tourmaline versions under Releases.

Before you download the Tourmaline commands and directory structure from GitHub, you first need to install QIIME 2, Snakemake, and the other dependencies of Tourmaline. Two options are provided: a native installation on a Mac or Linux system and a Docker image/container. If you have an Apple Silicon Mac (M1 or M2 chip), the instructions to install QIIME 2 vary slightly.

Option 1: Native installation

To run Tourmaline natively on a Mac (Intel) or Linux system, start with a Conda installation of Snakemake.

conda create -c conda-forge -c bioconda -n snakemake snakemake-minimal

Then install QIIME 2 with Conda (for Linux, change "osx" to "linux"):

wget https://data.qiime2.org/distro/core/qiime2-2023.5-py38-osx-conda.yml
conda env create -n qiime2-2023.5 --file qiime2-2023.5-py38-osx-conda.yml

Activate the qiime2-2023.5 environment and install the other conda- or pip-installable dependencies:

conda activate qiime2-2023.5
conda install -c conda-forge -c bioconda biopython muscle clustalo tabulate
conda install -c conda-forge deicode
pip install empress
qiime dev refresh-cache
conda install -c bioconda bioconductor-msa bioconductor-odseq
Apple Silicon Macs

Follow these instructions for Macs with M1/M2 chips.

First, set your Terminal application to run in Rosetta mode.

wget https://data.qiime2.org/distro/core/qiime2-2023.5-py38-osx-conda.yml
CONDA_SUBDIR=osx-64 conda env create -n qiime2-2023.5 --file qiime2-2023.5-py38-osx-conda.yml
conda activate qiime2-2023.5
conda config --env --set subdir osx-64

Then continue to install the other conda- or pip-installable dependencies as shown above.

Option 2: Docker container

To run Tourmaline inside a Docker container:

  1. Install Docker Desktop (Mac, Windows, or Linux) from Docker.com.

  2. Open the Docker app.

  3. Increase the memory to 8 GB or more (Preferences -> Resources -> Advanced -> Memory).

  4. Download the Docker image from DockerHub (command below).

  5. Run the Docker image (command below).

docker pull aomlomics/tourmaline
docker run -v $HOME:/data -it aomlomics/tourmaline

If installing on a Mac with an Apple Silicon chip, run the Docker image with the --platform linux/amd64 option. It will take a few minutes for the image to load the first time it is run.

docker run --platform linux/amd64 -v $HOME:/data -it aomlomics/tourmaline

The -v (volume) flag above allows you to mount a local file system volume (in this case your home directory) to read/write from your container. Note that symbolic links in a mounted volume will not work.

Use mounted volumes to do the following (example commands are shown after this list):

  • copy metadata and manifest files to your container;

  • create symbolic links from your container to your FASTQ files and reference database;

  • copy your whole Tourmaline directory out of the container when the run is completed (alternatively, you can clone the Tourmaline directory inside the mounted volume).
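
For example, a minimal sketch of those three operations, assuming tourmaline was cloned into the container's home directory and your (hypothetical) project files live on the mounted volume:

cp /data/my-project/metadata.tsv ~/tourmaline/00-data/

ln -s /data/my-project/silva-138-99-seqs-515-806.qza ~/tourmaline/01-imported/refseqs.qza

cp -r ~/tourmaline /data/tourmaline-finished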

See the Install page for more details on installing and running Docker.

Setup

If this is your first time running Tourmaline, you'll need to set up your directory. Simplified instructions are below, but see the Wiki's Setup page for complete instructions.

Start by cloning the Tourmaline directory and files:

git clone https://github.com/aomlomics/tourmaline.git

If using the Docker container, it's recommended you run the above command from inside /data.

Setup for the test data

The test data (16 samples of paired-end 16S rRNA data with 1000 sequences per sample) comes with your cloned copy of Tourmaline. It's fast to run and will verify that you can run the workflow.

Download the reference database sequence and taxonomy files into 01-imported and link them as refseqs.qza and reftax.qza (QIIME 2 archives):

cd tourmaline/01-imported
wget https://data.qiime2.org/2023.2/common/silva-138-99-seqs-515-806.qza
wget https://data.qiime2.org/2023.2/common/silva-138-99-tax-515-806.qza
ln -s silva-138-99-seqs-515-806.qza refseqs.qza
ln -s silva-138-99-tax-515-806.qza reftax.qza

Edit FASTQ manifests manifest_se.csv and manifest_pe.csv in 00-data so file paths match the location of your tourmaline directory. In the command below, replace /path/to with the location of your tourmaline directory—or skip this step if you are using the Docker container and you cloned tourmaline into /data :

cd ../00-data
cat manifest_pe.csv | sed 's|/data/tourmaline|/path/to/tourmaline|' > temp && mv temp manifest_pe.csv 
cat manifest_pe.csv | grep -v "reverse" > manifest_se.csv

Go to Run Snakemake.

Setup for your data

Before setting up to run your own data, please note:

  • Symbolic links can be used for any of the input files, which may be useful for large files (e.g., the FASTQ and reference database .qza files).

  • If you plan on using Deblur, sample names must not contain underscores (only alphanumerics, dashes, and/or periods). A quick check is shown after this list.
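
For example, a quick check for underscores in sample names (assuming sample names are in the first column of 00-data/metadata.tsv; no output means no underscores):

tail -n +2 00-data/metadata.tsv | cut -f1 | grep '_'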

Now edit, replace, or store the required input files as described here:

  1. Edit or replace the metadata file 00-data/metadata.tsv . The first column header should be "sample_name", with sample names matching the FASTQ manifest(s), and additional columns containing any relevant metadata for your samples. You can use a spreadsheet editor like Microsoft Excel or LibreOffice, but make sure to export the output in tab-delimited text format. (Minimal examples of the metadata and manifest formats appear after this list.)

  2. Prepare FASTQ data:

    • Option 1: Edit or replace the FASTQ manifests 00-data/manifest_pe.csv (paired-end) and/or 00-data/manifest_se.csv (single-end). Ensure that (1) file paths in the column "absolute-filepath" point to your .fastq.gz files (they can be anywhere on your computer) and (2) sample names match the metadata file. You can use a text editor such as Sublime Text, nano, vim, etc.

    • Option 2: Store your pre-imported FASTQ .qza files as 01-imported/fastq_pe.qza (paired-end) and/or 01-imported/fastq_se.qza (single-end).

  3. Prepare reference database:

    • Option 1: Store the reference FASTA and taxonomy files as 00-data/refseqs.fna and 00-data/reftax.tsv .

    • Option 2: Store the pre-imported reference FASTA and taxonomy .qza files as 01-imported/refseqs.qza and 01-imported/reftax.qza .

  4. Edit the configuration file config.yaml to set DADA2 and/or Deblur parameters (sequence truncation/trimming, sample pooling, chimera removal, etc.), rarefaction depth, taxonomic classification method, and other parameters. This YAML ("YAML Ain't Markup Language") file is a regular text file that can be edited in Sublime Text, nano, vim, etc.

  5. Go to Run Snakemake.
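
To illustrate the file formats in steps 1 and 2 above, here is a minimal, hypothetical metadata file (tab-delimited; sample names and values are made up):

sample_name	site	depth_m
sample-1	reef-a	5
sample-2	reef-b	10

and a matching, hypothetical paired-end FASTQ manifest (the header row is the standard QIIME 2 Phred33 CSV manifest header):

sample-id,absolute-filepath,direction
sample-1,/path/to/fastq/sample-1_R1.fastq.gz,forward
sample-1,/path/to/fastq/sample-1_R2.fastq.gz,reverse
sample-2,/path/to/fastq/sample-2_R1.fastq.gz,forward
sample-2,/path/to/fastq/sample-2_R2.fastq.gz,reverse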

Run Snakemake

Tourmaline is now run within the snakemake conda environment, not the qiime2-2023.5 environment.

conda activate snakemake

Shown here is the DADA2 paired-end workflow. See the Wiki's Run page for complete instructions on all steps, denoising methods, and filtering modes.

Note that any of the commands below can be run with various options, including --printshellcmds to see the shell commands being executed and --dryrun to display which rules would be run but not execute them. To generate a graph of the rules that will be run from any Snakemake command, see the section "Directed acyclic graph (DAG)" on the Run page. Always include the --use-conda option.
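
For example, a dry run of the denoise step that also prints its shell commands, and a sketch of rendering the rule graph to an image (dot requires Graphviz):

snakemake --use-conda dada2_pe_denoise --cores 4 --dryrun --printshellcmds

snakemake --use-conda dada2_pe_denoise --cores 4 --dag | dot -Tpng > dag.png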

From the tourmaline directory (which you may rename), run Snakemake with the denoise rule as the target, changing the number of cores to match your machine:

snakemake --use-conda dada2_pe_denoise --cores 4

Pausing after the denoise step allows you to make changes before proceeding:

  • Check the table summaries and representative sequence lengths to determine if DADA2 or Deblur parameters need to be modified. If so, you can rename or delete the output directories and then rerun the denoise rule.

  • View the table visualization to decide on an appropriate subsampling (rarefaction) depth. Then modify the parameters "alpha_max_depth" and "core_sampling_depth" in config.yaml.

  • Decide whether to filter your feature table and representative sequences by taxonomy or feature ID. After the taxonomy step, you can examine the taxonomy summary and bar plot to aid your decision. If you do filter your data, all output from that point on will go in a separate folder so you can compare output with and without filtering.

Unfiltered mode

Continue the workflow without filtering (for now). If you are satisfied with your parameters and files, run the taxonomy rule (for unfiltered data):

snakemake --use-conda dada2_pe_taxonomy_unfiltered --cores 4

Next, run the diversity rule (for unfiltered data):

snakemake --use-conda dada2_pe_diversity_unfiltered --cores 4

Finally, run the report rule (for unfiltered data):

snakemake --use-conda dada2_pe_report_unfiltered --cores 4

Filtered mode

After viewing the unfiltered results (the taxonomy summary and taxa barplot, the representative sequence summary plot and table, or the list of unassigned and potential outlier representative sequences), you may wish to filter out (remove) certain taxonomic groups or representative sequences. If so, first check the following parameters and/or files:

  • copy 02-output-dada2-pe-unfiltered/02-alignment-tree/repseqs_to_filter_outliers.tsv to 00-data/repseqs_to_filter_dada2-pe.tsv to filter outliers (see the command after this list), or manually include feature IDs in 00-data/repseqs_to_filter_dada2-pe.tsv to filter those feature IDs (change "dada2-pe" to "dada2-se" or "deblur-se" as appropriate);

  • exclude_terms in config.yaml – add taxa to exclude from representative sequences, if desired;

  • repseq_min_length and repseq_max_length in config.yaml – set minimum and/or maximum lengths for filtering representative sequences, if desired;

  • repseq_min_abundance and repseq_min_prevalence in config.yaml – set minimum abundance and/or prevalence values for filtering representative sequences, if desired.
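
For example, to filter on the detected outliers in the DADA2 paired-end workflow, copy the outliers file into place as described above:

cp 02-output-dada2-pe-unfiltered/02-alignment-tree/repseqs_to_filter_outliers.tsv 00-data/repseqs_to_filter_dada2-pe.tsv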

Now we are ready to filter the representative sequences and feature table, generate new summaries, and generate a new taxonomy bar plot, by running the taxonomy rule (for filtered data):

snakemake --use-conda dada2_pe_taxonomy_filtered --cores 4

Next, run the diversity rule (for filtered data):

snakemake --use-conda dada2_pe_diversity_filtered --cores 4

Finally, run the report rule (for filtered data):

snakemake --use-conda dada2_pe_report_filtered --cores 1

View output

View report and output files

Open your HTML report (e.g., 03-reports/report_dada2-pe_unfiltered.html) in Chrome or Firefox. To view the linked files:

  • QZV (QIIME 2 visualization): click to download, then drag and drop in https://view.qiime2.org. Empress trees (e.g., rooted_tree.qzv) may take more than 10 minutes to load.

  • TSV (tab-separated values): click to download, then open in Microsoft Excel or Tabview (command line tool that comes with Tourmaline).

  • PDF (portable document format): click to open and view in new tab.

Downloaded files can be deleted after viewing because they are already stored in your Tourmaline directory.

More tips

Troubleshooting

  • The whole workflow with test data should take ~3–5 minutes to complete. A normal dataset may take several hours to complete.

  • If any of the above commands don't work, read the error messages carefully, try to figure out what went wrong, and attempt to fix the offending file. A common issue is that the file paths in your FASTQ manifest need to be updated.

  • If you are running in a Docker container and you get an error like "Signals.SIGKILL: 9", you probably need to give Docker more memory. See the Wiki section on Installation options .

Power tips

  • The whole workflow can be run with just the command snakemake --use-conda dada2_pe_report_unfiltered --cores 4 (without filtering representative sequences) or snakemake --use-conda dada2_pe_report_filtered --cores 4 (after filtering representative sequences). Warning: If your parameters are not optimized, the results will be suboptimal (garbage in, garbage out).

  • If you want to make a fresh run and not save the previous output, simply delete the output directories (e.g., 02-output-{method}-{filter} and 03-reports ) generated in the previous run. If you want to save these outputs and rerun with different parameters, you can change the name of the output directories and report files to something informative and leave them in the Tourmaline directory.

  • You can always delete any file you want to regenerate. Then there are several ways to regenerate it: run snakemake FILE and Snakemake will determine which rules (commands) need to be run to generate that file; or, run snakemake RULE where the rule generates the desired file as output.
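
For example, either of the following would regenerate a deleted taxonomy visualization (the file path is illustrative; check your own output directory for exact paths):

snakemake --use-conda 02-output-dada2-pe-unfiltered/02-taxonomy/taxonomy.qzv --cores 4

snakemake --use-conda dada2_pe_taxonomy_unfiltered --cores 4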

  • If you've run Tourmaline on your dataset before, you can speed up the setup process and initialize a new Tourmaline directory (e.g., tourmaline-new ) with some of the files and symlinks of the existing one (e.g., tourmaline-existing ) using the command below:

    cd /path/to/tourmaline-new
    scripts/initialize_dir_from_existing_tourmaline_dir.sh /path/to/tourmaline-existing
    

    You may see error messages if some files don't exist, but the script will copy all the listed files that are present. The files that will be copied from the existing directory to the new directory are:

    config.yaml
    00-data/manifest_pe.csv
    00-data/manifest_se.csv
    00-data/metadata.tsv
    00-data/repseqs_to_filter_dada2-pe.tsv
    00-data/repseqs_to_filter_dada2-se.tsv
    00-data/repseqs_to_filter_deblur-se.tsv
    01-imported/refseqs.qza
    01-imported/reftax.qza
    01-imported/classifier.qza
    

    Ensure you make any changes to your configuration file and (if necessary) delete any files you want to be regenerated before you run Snakemake. If you manually copy output files from a previous Tourmaline run that you do not want to be regenerated (e.g., 02-output-{method}-unfiltered ), use the cp -p flag to preserve timestamps.

    cp -rp tourmaline-old/02-output-dada2-pe-unfiltered/ tourmaline-new/
    

Alternatives

Some alternative pipelines for amplicon sequence analysis include the following:

  • Anacapa Toolkit from UCLA: https://github.com/limey-bean/Anacapa

  • Banzai from MBON: https://github.com/jimmyodonnell/banzai

  • Tagseq QIIME 2 Snakemake workflow: https://github.com/shu251/tagseq-qiime2-snakemake

Code Snippets

The snippets below are drawn from Tourmaline's helper scripts and its Snakefile.

scripts/alignment_count_gaps.sh, lines 2–11:
# Print a tab-separated line per FASTA record: sequence ID, then the number
# of gap characters ('-') in that sequence of the alignment.
while read line
do
    if [[ $line =~ ^\>.* ]]
    then
        # Header line: print the sequence ID (without '>') followed by a tab
        echo $line | sed 's/>//' | tr -d '\n'
        echo -e -n '\t'
    else
        # Sequence line: delete all non-gap characters and print the count
        echo $line | sed 's/[^-]//g' | awk '{ print length }'
    fi
done < "${1:-/dev/stdin}"
scripts/describe_fastq_counts.py, lines 3–29:
import pandas as pd
import sys
from tabulate import tabulate

# usage
usage = '''
describe_fastq_counts.py fastq_counts_tsv output1_md

'''

if len(sys.argv) < 3:
    print(usage)
    sys.exit()

# input paths
IN = sys.argv[1]

# output paths
OUT = sys.argv[2] # 'fastq_counts_describe.md'


s = pd.read_csv(IN, index_col=0, sep='\t')
t = s.describe()
outstr = tabulate(pd.DataFrame(t.iloc[1:,0]), tablefmt="pipe", headers=['Statistic (n=%s)' % t.iloc[0,0].astype(int), 'Fastq sequences per sample'])
with open(OUT, 'w') as target:
    target.write(outstr)
    target.write('\n')
scripts/detect_amplicon_locus.py, lines 3–98:
import click
import numpy as np
import pandas as pd

def count_starting_kmers(input_file, file_type, num_seqs, seed):
    """Generate value_counts dataframe of 5' tetramers for random subsample
    of a fasta file"""
    kmer_length = 4
    if file_type == 'fastq':
        interval = 4
    else:
        interval = 2
    if seed:
        np.random.seed(seed)
    starting_kmers = []
    with open(input_file) as handle:
        lines = pd.Series(handle.readlines())
        num_lines = len(lines)
        # Sample with replacement if the file has fewer sequences than requested
        if num_lines/interval < num_seqs:
            rand_line_nos = np.random.choice(np.arange(1,num_lines,interval), 
                                             size=num_seqs, replace=True)
        else:
            rand_line_nos = np.random.choice(np.arange(1,num_lines,interval), 
                                             size=num_seqs, replace=False)
        rand_lines = lines[rand_line_nos]
    for sequence in rand_lines:
        starting_kmers.append(sequence[:kmer_length])
    starting_kmer_value_counts = pd.Series(starting_kmers).value_counts()
    if (starting_kmer_value_counts.shape[0] == 2):
        starting_kmer_value_counts.loc['NNNX'] = 0
    elif (starting_kmer_value_counts.shape[0] == 1):
        starting_kmer_value_counts.loc['NNNX'] = 0
        starting_kmer_value_counts.loc['NNNY'] = 0
    return(starting_kmer_value_counts)

@click.command()
@click.option('--input_file', '-i', required=True,
              type=click.Path(resolve_path=True, readable=True, exists=True,
              file_okay=True), 
              help="Input sequence file (.fa, .fna, .fasta, .fq, .fastq)")
@click.option('--file_type', '-t', required=False, type=str, default='fasta',
              help="Sequence file type (fasta, fastq) [default: fasta]")
@click.option('--num_seqs', '-n', required=False, type=int, default=10000,
              help="Number of sequences to randomly subsample [default: 10000]")
@click.option('--cutoff', '-c', required=False, type=float, default=0.5,
              help="Minimum fraction of sequences required to match "
                   "a diagnostic 5' tetramer [default: 0.5]")
@click.option('--seed', '-s', required=False, type=int, 
              help="Random number seed [default: None]")

def detect_amplicon_locus(input_file, file_type, num_seqs, cutoff, seed):
    """Determine the most likely amplicon locus of a fasta file based on the
    first four nucleotides.

    The most frequent 5' tetramer in a random subsample of sequences must 
    match, above a given cutoff fraction of sequences, one of the following  
    diagnostic tetramers:

      Tetramer\tAmplicon locus\tForward primer                                  
      TACG\t16S rRNA prok\t515f (EMP)                                           
      GTAG\tITS rRNA euk\tITS1f (EMP)                                           
      GCT[AC]\t18S rRNA euk\tEuk1391f (EMP)                                     
      GTCG\t12S rRNA mt\tMiFish-U-F                                             
    """
    starting_kmer_value_counts = count_starting_kmers(input_file, file_type, 
      num_seqs, seed)
    top_kmer = starting_kmer_value_counts.index[0]
    top_kmer_count = starting_kmer_value_counts.iloc[0]
    second_kmer = starting_kmer_value_counts.index[1]
    second_kmer_count = starting_kmer_value_counts.iloc[1]
    third_kmer = starting_kmer_value_counts.index[2]
    third_kmer_count = starting_kmer_value_counts.iloc[2]
    top_kmer_frac = top_kmer_count/num_seqs
    top2_kmer_frac = (top_kmer_count+second_kmer_count)/num_seqs
    top3_kmer_frac = (top_kmer_count+second_kmer_count+third_kmer_count)/num_seqs
    if (top_kmer == 'TACG') & (top_kmer_frac > cutoff):
        print('Amplicon locus: 16S/515f (%s%% of sequences start with %s)' %
              (round(top_kmer_frac*100, 1), top_kmer))
    elif (top_kmer == 'GTAG') & (top_kmer_frac > cutoff):
        print('Amplicon locus: ITS/ITS1f (%s%% of sequences start with %s)' %
              (round(top_kmer_frac*100, 1), top_kmer))
    elif (top_kmer in ['GCTA', 'GCTC', 'ACAC']) & (second_kmer in ['GCTA', 'GCTC', 
        'ACAC']) & (third_kmer in ['GCTA', 'GCTC', 'ACAC']) & (
        top3_kmer_frac > cutoff):
        print('Amplicon locus: 18S/Euk1391f (%s%% of sequences start with %s, %s, or %s)' %
              (round(top3_kmer_frac*100, 1), top_kmer, second_kmer, third_kmer))
    elif (top_kmer == 'GTCG') & (top_kmer_frac > cutoff):
        print('Amplicon locus: 12S/MiFish-U-F (%s%% of sequences start with %s)' %
              (round(top_kmer_frac*100, 1), top_kmer))
    else:
        print('Could not determine amplicon locus.')
        print('Most frequent starting tetramer was %s (%s%% of sequences).' %
              (top_kmer, round(top_kmer_frac*100, 1)))

if __name__ == '__main__':
    detect_amplicon_locus()
scripts/filter_taxonomy.py, lines 3–30:
import pandas as pd
import sys
from qiime2 import Artifact

# usage
usage = '''
filter_taxonomy.py taxonomy repseqs taxonomytsv taxonomyqza
'''

if len(sys.argv) < 5:
    print(usage)
    sys.exit()

# input paths
taxonomy = sys.argv[1]
repseqs = sys.argv[2]

# output paths
taxonomytsv = sys.argv[3] #
taxonomyqza = sys.argv[4]

df_taxonomy = pd.read_csv(taxonomy, index_col=0, sep='\t')
df_repseqs = pd.read_csv(repseqs, header=None, index_col=0, sep='\t')
keep_ids = df_repseqs.index
df_taxonomy_filtered = df_taxonomy.loc[list(keep_ids)]
df_taxonomy_filtered.to_csv(taxonomytsv, sep='\t')
artifact_taxonomy_filtered = Artifact.import_data('FeatureData[Taxonomy]', df_taxonomy_filtered)
artifact_taxonomy_filtered.save(taxonomyqza)
scripts/plot_repseq_properties.py, lines 3–65:
import pandas as pd
import numpy as np
import sys
from tabulate import tabulate
from qiime2 import Artifact
import seaborn as sns
import matplotlib.pyplot as plt

# usage
usage = '''
plot_repseq_properties.py repseqs_lengths_tsv aligned_repseqs_gaps aligned_repseqs_outliers taxonomy_qza table_qza repseqs_properties repseqs_properties_describe repseqs_properties_pdf outliers_tsv

'''

if len(sys.argv) < 10:
    print(usage)
    sys.exit()

# input paths
lengths = sys.argv[1]
gaps = sys.argv[2]
outliers = sys.argv[3]
taxonomy = sys.argv[4]
table = sys.argv[5]

# output paths
proptsv = sys.argv[6] #
propdescribe = sys.argv[7]
proppdf = sys.argv[8]
outliersforqza = sys.argv[9] 

lengths = pd.read_csv(lengths, names=['length'], index_col=0, sep='\t')
gaps = pd.read_csv(gaps, names=['gaps'], index_col=0, sep='\t')
outliers = pd.read_csv(outliers, names=['outlier'], index_col=0, sep='\t')
taxonomy = Artifact.load(taxonomy)
taxonomydf = taxonomy.view(view_type=pd.DataFrame)
taxonomydf['level_1'] = [x.split(';')[0] for x in taxonomydf['Taxon']]
table = Artifact.load(table)
tabledf = table.view(view_type=pd.DataFrame)
merged = pd.merge(lengths, gaps, left_index=True, right_index=True, how='outer')
merged = pd.merge(merged, outliers, left_index=True, right_index=True, how='outer')
merged = pd.merge(merged, taxonomydf['Taxon'], left_index=True, right_index=True, how='outer')
merged = pd.merge(merged, taxonomydf['level_1'], left_index=True, right_index=True, how='outer')
merged = pd.merge(merged, tabledf.sum().rename('observations'), left_index=True, right_index=True, how='outer')
merged.columns = ['length', 'gaps', 'outlier', 'taxonomy', 'taxonomy_level_1', 'observations']
merged.index.name = 'featureid'
merged['log10(observations)'] = [np.log10(x) for x in merged['observations']]
merged.sort_values('log10(observations)', ascending=False, inplace=True)
merged.to_csv(proptsv, index=True, sep='\t')
t = merged.describe()
tcolumns = t.columns
tcolumns = tcolumns.insert(0, 'Statistic (n=%s)' % t.iloc[0].values[0].astype(int))
outstr = tabulate(t.iloc[1:], tablefmt="pipe", headers=tcolumns)
with open(propdescribe, 'w') as target:
    target.write(outstr)
    target.write('\n')
g = sns.relplot(data=merged, x='length', y='gaps', col='outlier', hue='taxonomy_level_1', size='log10(observations)', sizes=(1,500), edgecolor = 'none', alpha=0.7)
g.set_axis_labels('length (bp) not including gaps', 'gaps (bp) in multiple sequence alignment')
plt.savefig(proppdf, bbox_inches='tight')
outliers.columns = ['Outlier']
outliers.index.name = 'Feature ID'
outliers = outliers*1
outliers.to_csv(outliersforqza, index=True, sep='\t')
scripts/repseqs_lengths_describe.py, lines 3–28:
import pandas as pd
import sys
from tabulate import tabulate

# usage
usage = '''
repseqs_lengths_describe.py repseqs_lengths_tsv repseqs_lengths_md

'''

if len(sys.argv) < 3:
    print(usage)
    sys.exit()

# input paths
IN = sys.argv[1]

# output paths
OUT = sys.argv[2] # 'repseqs_lengths_describe.md'

s = pd.read_csv(IN, header=None, index_col=0, sep='\t')
t = s.describe()
outstr = tabulate(t.iloc[1:], tablefmt="pipe", headers=['Statistic (n=%s)' % t.iloc[0].values[0].astype(int), 'Sequence length'])
with open(OUT, 'w') as target:
    target.write(outstr)
    target.write('\n')
scripts/summarize_metadata.py, lines 3–46:
import pandas as pd
from tabulate import tabulate
import sys

# usage
usage = '''
summarize_metadata.py metadata output1_md output2_txt

'''

if len(sys.argv) < 4:
    print(usage)
    sys.exit()

# input paths
metadata = sys.argv[1]

# output paths
output1 = sys.argv[2] # metadata summary table (markdown)
output2 = sys.argv[3] # list of metadata column names (text)

df = pd.read_csv(metadata, sep='\t')
cols = df.columns
df2 = pd.DataFrame(columns =[0,1], index=cols)
for col in cols:
    if col in df.columns:
        vc = df[col].value_counts()
        if vc.index.shape == (0,):
            df2.loc[col, 0] = '(no values in column)'
            df2.loc[col, 1] = '--'
        else:
            df2.loc[col, 0] = vc.index[0]
            df2.loc[col, 1] = vc.values[0]
    else:
        df2.loc[col, 0] = '(column not provided)'
        df2.loc[col, 1] = '--'
df2.columns = ['Most common value', 'Count']
df2.index.name = 'Column name'
outstr = tabulate(df2, tablefmt="pipe", headers="keys")
with open(output1, 'w') as target:
    target.write(outstr)
    target.write('\n')
with open(output2, 'w') as target:
    [target.write('%s\n' % i) for i in cols]
From lines 305–309 of master/Snakefile:
shell:
    "if [ -r '00-data/metadata.tsv' ]; then "
    "    echo 'OK: Metadata file 00-data/metadata.tsv found.'; "
    "else echo 'Error: Metadata file 00-data/metadata.tsv not found.' && exit 1; "
    "fi; "
From lines 321–322 of master/Snakefile:
shell:
    "python scripts/summarize_metadata.py {input.metadata} {output[0]} {output[1]}"
From lines 336–369 of master/Snakefile:
shell:
    "if [ -r '01-imported/fastq_pe.qza' ]; then "
    "    echo 'OK: FASTQ archive 01-imported/fastq_pe.qza found; FASTQ manifest file 00-data/manifest_pe.csv not required.'; "
    "elif [ -r '00-data/manifest_pe.csv' ]; then "
    "    echo 'OK: FASTQ manifest file 00-data/manifest_pe.csv found; it will be used to create FASTQ archive 01-imported/fastq_pe.qza.'; "
    "else echo 'Error: FASTQ sequence data not found; either 00-data/manifest_pe.csv or 01-imported/fastq_pe.qza is required.' && exit 1; "
    "fi; "
    "if [ {params.classifymethod} = naive-bayes ]; then "
    "    if [ -r '01-imported/classifier.qza' ]; then "
    "        echo 'OK: Reference sequences classifier 01-imported/classifier.qza found; reference sequences FASTA file 00-data/refseqs.fna not required.'; "
    "    elif [ -r '01-imported/refseqs.qza' ]; then "
    "        echo 'OK: Reference sequences archive 01-imported/refseqs.qza found; reference sequences FASTA file 00-data/refseqs.fna not required.'; "
    "    elif [ -r '00-data/refseqs.fna' ]; then "
    "        echo 'OK: Reference sequences FASTA file 00-data/refseqs.fna found; it will be used to create reference sequences archive 01-imported/refseqs.qza.'; "
    "    else echo 'Error: Reference sequences not found; either 01-imported/classifier.qza or 00-data/refseqs.fna or 01-imported/refseqs.qza is required.' && exit 1; "
    "    fi; "
    "elif [ -r '01-imported/refseqs.qza' ]; then "
    "    echo 'OK: Reference sequences archive 01-imported/refseqs.qza found; reference sequences FASTA file 00-data/refseqs.fna not required.'; "
    "elif [ -r '00-data/refseqs.fna' ]; then "
    "    echo 'OK: Reference sequences FASTA file 00-data/refseqs.fna found; it will be used to create reference sequences archive 01-imported/refseqs.qza.'; "
    "else echo 'Error: Reference sequences not found; either 00-data/refseqs.fna or 01-imported/refseqs.qza is required.' && exit 1; "
    "fi; "
    "if [ {params.classifymethod} != naive-bayes ]; then "
    "    if [ -r '01-imported/reftax.qza' ]; then "
    "        echo 'OK: Reference taxonomy archive 01-imported/reftax.qza found; reference taxonomy file 00-data/reftax.tsv not required.'; "
    "    elif [ -r '00-data/reftax.tsv' ]; then "
    "        echo 'OK: Reference taxonomy file 00-data/reftax.tsv found; it will be used to create reference taxonomy archive 01-imported/reftax.qza.'; "
    "    else echo 'Error: Reference taxonomy not found; either 00-data/reftax.tsv or 01-imported/reftax.qza is required.' && exit 1; "
    "    fi; "
    "fi; "
    "if grep -q ^{params.column}$ {input}; then "
    "    echo 'OK: Metadata contains the column \"{params.column}\" that is specified as beta_group_column in config.yaml.'; "
    "else echo 'Error: Metadata does not contain the column \"{params.column}\" that is specified as beta_group_column in config.yaml.' && exit 1; "
    "fi"
From lines 381–414 of master/Snakefile:
shell:
    "if [ -r '01-imported/fastq_se.qza' ]; then "
    "    echo 'OK: FASTQ archive 01-imported/fastq_se.qza found; FASTQ manifest file 00-data/manifest_se.csv not required.'; "
    "elif [ -r '00-data/manifest_se.csv' ]; then "
    "    echo 'OK: FASTQ manifest file 00-data/manifest_se.csv found; it will be used to create FASTQ archive 01-imported/fastq_se.qza.'; "
    "else echo 'Error: FASTQ sequence data not found; either 00-data/manifest_se.csv or 01-imported/fastq_se.qza is required.' && exit 1; "
    "fi; "
    "if [ {params.classifymethod} = naive-bayes ]; then "
    "    if [ -r '01-imported/classifier.qza' ]; then "
    "        echo 'OK: Reference sequences classifier 01-imported/classifier.qza found; reference sequences FASTA file 00-data/refseqs.fna not required.'; "
    "    elif [ -r '01-imported/refseqs.qza' ]; then "
    "        echo 'OK: Reference sequences archive 01-imported/refseqs.qza found; reference sequences FASTA file 00-data/refseqs.fna not required.'; "
    "    elif [ -r '00-data/refseqs.fna' ]; then "
    "        echo 'OK: Reference sequences FASTA file 00-data/refseqs.fna found; it will be used to create reference sequences archive 01-imported/refseqs.qza.'; "
    "    else echo 'Error: Reference sequences not found; either 01-imported/classifier.qza or 00-data/refseqs.fna or 01-imported/refseqs.qza is required.' && exit 1; "
    "    fi; "
    "elif [ -r '01-imported/refseqs.qza' ]; then "
    "    echo 'OK: Reference sequences archive 01-imported/refseqs.qza found; reference sequences FASTA file 00-data/refseqs.fna not required.'; "
    "elif [ -r '00-data/refseqs.fna' ]; then "
    "    echo 'OK: Reference sequences FASTA file 00-data/refseqs.fna found; it will be used to create reference sequences archive 01-imported/refseqs.qza.'; "
    "else echo 'Error: Reference sequences not found; either 00-data/refseqs.fna or 01-imported/refseqs.qza is required.' && exit 1; "
    "fi; "
    "if [ {params.classifymethod} != naive-bayes ]; then "
    "    if [ -r '01-imported/reftax.qza' ]; then "
    "        echo 'OK: Reference taxonomy archive 01-imported/reftax.qza found; reference taxonomy file 00-data/reftax.tsv not required.'; "
    "    elif [ -r '00-data/reftax.tsv' ]; then "
    "        echo 'OK: Reference taxonomy file 00-data/reftax.tsv found; it will be used to create reference taxonomy archive 01-imported/reftax.qza.'; "
    "    else echo 'Error: Reference taxonomy not found; either 00-data/reftax.tsv or 01-imported/reftax.qza is required.' && exit 1; "
    "    fi; "
    "fi; "
    "if grep -q ^{params.column}$ {input}; then "
    "    echo 'OK: Metadata contains the column \"{params.column}\" that is specified as beta_group_column in config.yaml.'; "
    "else echo 'Error: Metadata does not contain the column \"{params.column}\" that is specified as beta_group_column in config.yaml.' && exit 1; "
    "fi"
From lines 426–430 of master/Snakefile:
shell:
    "qiime tools import "
    "--type 'FeatureData[Sequence]' "
    "--input-path {input} "
    "--output-path {output}"
From lines 440–445 of master/Snakefile:
shell:
    "qiime tools import "
    "--type 'FeatureData[Taxonomy]' "
    "--input-format HeaderlessTSVTaxonomyFormat "
    "--input-path {input} "
    "--output-path {output}"
From lines 456–461 of master/Snakefile:
shell:
    "qiime tools import "
    "--type 'SampleData[PairedEndSequencesWithQuality]' "
    "--input-path {input[0]} "
    "--output-path {output} "
    "--input-format PairedEndFastqManifestPhred33"
From lines 472–477 of master/Snakefile:
shell:
    "qiime tools import "
    "--type 'SampleData[SequencesWithQuality]' "
    "--input-path {input[0]} "
    "--output-path {output} "
    "--input-format SingleEndFastqManifestPhred33"
From lines 490–493 of master/Snakefile:
shell:
    "qiime demux summarize "
    "--i-data {input[0]} "
    "--o-visualization {output}"
From lines 504–507 of master/Snakefile:
shell:
    "qiime demux summarize "
    "--i-data {input[0]} "
    "--o-visualization {output}"
From lines 517–520 of master/Snakefile:
shell:
    "unzip -qq -o {input} -d temp0; "
    "mv temp0/*/data/per-sample-fastq-counts.tsv {output}; "
    "/bin/rm -r temp0"
From lines 530–531 of master/Snakefile:
shell:
    "python scripts/describe_fastq_counts.py {input} {output}"
From lines 559–578 of master/Snakefile:
shell:
    "qiime dada2 denoise-paired "
    "--i-demultiplexed-seqs {input[0]} "
    "--p-trunc-len-f {params.trunclenf} "
    "--p-trunc-len-r {params.trunclenr} "
    "--p-trim-left-f {params.trimleftf} "
    "--p-trim-left-r {params.trimleftr} "
    "--p-max-ee-f {params.maxeef} "
    "--p-max-ee-r {params.maxeer} "
    "--p-trunc-q {params.truncq} "
    "--p-pooling-method {params.poolingmethod} "
    "--p-chimera-method {params.chimeramethod} "
    "--p-min-fold-parent-over-abundance {params.minfoldparentoverabundance} "
    "--p-n-reads-learn {params.nreadslearn} "
    "--p-n-threads {threads} "
    "{params.hashedfeatureids} "
    "--o-table {output.table} "
    "--o-representative-sequences {output.repseqs} "
    "--o-denoising-stats {output.stats} "
    "--verbose"
From lines 603–619 of master/Snakefile:
shell:
    "qiime dada2 denoise-single "
    "--i-demultiplexed-seqs {input[0]} "
    "--p-trunc-len {params.trunclen} "
    "--p-trim-left {params.trimleft} "
    "--p-max-ee {params.maxee} "
    "--p-trunc-q {params.truncq} "
    "--p-pooling-method {params.poolingmethod} "
    "--p-chimera-method {params.chimeramethod} "
    "--p-min-fold-parent-over-abundance {params.minfoldparentoverabundance} "
    "--p-n-reads-learn {params.nreadslearn} "
    "--p-n-threads {threads} "
    "{params.hashedfeatureids} "
    "--o-table {output.table} "
    "--o-representative-sequences {output.repseqs} "
    "--o-denoising-stats {output.stats} "
    "--verbose"
From lines 644–660 of master/Snakefile:
shell:
    "qiime deblur denoise-other "
    "--i-demultiplexed-seqs {input.seqs} "
    "--i-reference-seqs {input.refseqs} "
    "--p-trim-length {params.trimlength} "
    "{params.samplestats} "
    "--p-mean-error {params.meanerror} "
    "--p-indel-prob {params.indelprob} "
    "--p-indel-max {params.indelmax} "
    "--p-min-reads {params.minreads} "
    "--p-min-size {params.minsize} "
    "--p-jobs-to-start {threads} "
    "{params.hashedfeatureids} "
    "--o-table {output.table} "
    "--o-representative-sequences {output.repseqs} "
    "--o-stats {output.stats} "
    "--verbose"
From lines 675–679 of master/Snakefile:
shell:
    "qiime feature-table summarize "
    "--i-table {input.table} "
    "--m-sample-metadata-file {input.metadata} "
    "--o-visualization {output}"
From lines 691–694 of master/Snakefile:
shell:
    "qiime metadata tabulate "
    "--m-input-file {input.stats} "
    "--o-visualization {output}"
From lines 705–709 of master/Snakefile:
shell:
    "qiime tools export "
    "--input-path {input} "
    "--output-path {output} "
    "--output-format BIOMV210Format"
From lines 719–724 of master/Snakefile:
shell:
    "biom summarize-table "
    "--input-fp {input} "
    "--output-fp {output}; "
    "cat {output} | sed 's/observation/feature/g' | sed 's/.000$//' > temp; "
    "mv temp {output}"
From lines 734–740 of master/Snakefile:
shell:
    "biom summarize-table "
    "--observations "
    "--input-fp {input} "
    "--output-fp {output}; "
    "cat {output} | sed 's/observation/feature/g' | sed 's/Counts\/sample/Counts\/feature/g' | sed 's/.000$//' > temp; "
    "mv temp {output}"
From lines 750–753 of master/Snakefile:
shell:
    "qiime feature-table tabulate-seqs "
    "--i-data {input} "
    "--o-visualization {output}"
From lines 763–767 of master/Snakefile:
shell:
    "qiime tools export "
    "--input-path {input} "
    "--output-path {output} "
    "--output-format DNAFASTAFormat"
From lines 777–778 of master/Snakefile:
shell:
    "python scripts/detect_amplicon_locus.py -i {input} > {output}"
From lines 788–789 of master/Snakefile:
shell:
    "perl scripts/fastaLengths.pl {input} > {output}"
From lines 799–800 of master/Snakefile:
shell:
    "python scripts/repseqs_lengths_describe.py {input} {output}"
From lines 818–850 of master/Snakefile:
shell:
    "echo classify_method: {params.classifymethod}; "
    "if [ {params.classifymethod} = naive-bayes ]; then "
    "    if [ ! -r 01-imported/classifier.qza ]; then "
    "        qiime feature-classifier fit-classifier-naive-bayes "
    "        --i-reference-reads {params.refseqs} "
    "        --i-reference-taxonomy {params.reftax} "
    "        --o-classifier 01-imported/classifier.qza; "
    "    fi; "
    "    qiime feature-classifier classify-sklearn "
    "    --i-classifier 01-imported/classifier.qza "
    "    --i-reads {input.repseqs} "
    "    --o-classification {output} "
    "    --p-n-jobs {threads} "
    "    {params.classifyparams}; "
    "elif [ {params.classifymethod} = consensus-blast ]; then "
    "    qiime feature-classifier classify-consensus-blast "
    "    --i-reference-reads {params.refseqs} "
    "    --i-reference-taxonomy {params.reftax} "
    "    --i-query {input.repseqs} "
    "    --o-classification {output} "
    "    --o-search-results {params.searchout} "
    "    {params.classifyparams}; "
    "elif [ {params.classifymethod} = consensus-vsearch ]; then "
    "    qiime feature-classifier classify-consensus-vsearch "
    "    --i-reference-reads {params.refseqs} "
    "    --i-reference-taxonomy {params.reftax} "
    "    --i-query {input.repseqs} "
    "    --o-classification {output} "
    "    --o-search-results {params.searchout} "
    "    --p-threads {threads} "
    "    {params.classifyparams}; "
    "fi"
From lines 860–863 of master/Snakefile:
shell:
    "qiime metadata tabulate "
    "--m-input-file {input} "
    "--o-visualization {output}"
From lines 875–880 of master/Snakefile:
shell:
    "qiime taxa barplot "
    "--i-table {input.table} "
    "--i-taxonomy {input.taxonomy} "
    "--m-metadata-file {input.metadata} "
    "--o-visualization {output}"
From lines 893–906 of master/Snakefile:
shell:
    "qiime taxa collapse "
    "--i-table {input.table} "
    "--i-taxonomy {input.taxonomy} "
    "--p-level {params.taxalevel} "
    "--o-collapsed-table tempfile_collapsed.qza;"
    "qiime tools export "
    "--input-path tempfile_collapsed.qza "
    "--output-path temp_export;"
    "biom convert "
    "-i temp_export/feature-table.biom "
    "-o {output} "
    "--to-tsv;"
    "/bin/rm -r tempfile_collapsed.qza temp_export/"
From lines 918–933 of master/Snakefile:
shell:
    "qiime feature-table transpose "
    "--i-table {input.table} "
    "--o-transposed-feature-table transposed-table.qza; "
    "qiime metadata tabulate "
    "--m-input-file {input.repseqs} "
    "--m-input-file {input.taxonomy} "
    "--m-input-file transposed-table.qza "
    "--o-visualization merged-data.qzv; "
    "qiime tools export "
    "--input-path merged-data.qzv "
    "--output-path {output}; "
    "mv {output}/metadata.tsv temp; "
    "rm -r {output}; "
    "sed -e '2d' temp | sed '1 s|id\\t|featureid\\t|' | sed '1 s|Taxon|taxonomy|' | sed '1 s|Sequence|sequence|' > {output}; "
    "/bin/rm -r temp transposed-table.qza merged-data.qzv"
From lines 943–947 of master/Snakefile:
shell:
    "qiime tools export "
    "--input-path {input} "
    "--output-path {output} "
    "--output-format TSVTaxonomyFormat"
From lines 957–962 of master/Snakefile:
shell:
    "qiime tools import "
    "--type 'FeatureData[Taxonomy]' "
    "--input-format TSVTaxonomyFormat "
    "--input-path {input} "
    "--output-path {output}"
From lines 979–1023 of master/Snakefile:
shell:
    "if [ {params.method} = muscle ]; then "
    "    echo 'Multiple sequence alignment method: MUSCLE ...'; "
    "    muscle "
    "    -super5 {input.repseqsfasta} "
    "    -threads {threads} "
    "    -refineiters {params.muscle_iters} "
    "    -output temp_aligned_repseqs.fasta; "
    "    perl scripts/cleanupMultiFastaNoBreaks.pl temp_aligned_repseqs.fasta > {output.alnfasta}; "
    "    echo 'Line breaks removed to generate {output.alnfasta}'; "
    "    /bin/rm temp_aligned_repseqs.fasta; "
    "    qiime tools import "
    "    --type 'FeatureData[AlignedSequence]' "
    "    --input-path {output.alnfasta} "
    "    --output-path {output.alnqza}; "
    "elif [ {params.method} = clustalo ]; then "
    "    echo 'Multiple sequence alignment method: Clustal Omega ...'; "
    "    clustalo --verbose --force "
    "    --in {input.repseqsfasta} "
    "    --out temp_aligned_repseqs.fasta "
    "    --threads={threads}; "
    "    perl scripts/cleanupMultiFastaNoBreaks.pl temp_aligned_repseqs.fasta > {output.alnfasta}; "
    "    echo 'Line breaks removed to generate {output.alnfasta}'; "
    "    /bin/rm temp_aligned_repseqs.fasta; "
    "    qiime tools import "
    "    --type 'FeatureData[AlignedSequence]' "
    "    --input-path {output.alnfasta} "
    "    --output-path {output.alnqza}; "
    "elif [ {params.method} = mafft ]; then "
    "    echo 'Multiple sequence alignment method: MAFFT ...'; "
    "    qiime alignment mafft "
    "    --i-sequences {input.repseqsqza} "
    "    --o-alignment tempfile_unmasked_aligned_repseqs.qza "
    "    --p-n-threads {threads}; "
    "    qiime alignment mask "
    "    --i-alignment tempfile_unmasked_aligned_repseqs.qza "
    "    --o-masked-alignment {output.alnqza}; "
    "    /bin/rm tempfile_unmasked_aligned_repseqs.qza; "
    "    qiime tools export "
    "    --input-path {output.alnqza} "
    "    --output-path {output.alnfasta} "
    "    --output-format AlignedDNAFASTAFormat; "
    "else "
    "    echo 'Multiple sequence alignment method: MUSCLE ...'; "
    "fi"
From lines 1033–1037 of master/Snakefile:
shell:
    "qiime phylogeny fasttree "
    "--i-alignment {input} "
    "--o-tree {output} "
    "--p-n-threads {threads}"
From lines 1047–1050 of master/Snakefile:
shell:
    "qiime phylogeny midpoint-root "
    "--i-tree {input} "
    "--o-rooted-tree {output}"
From lines 1064–1071 of master/Snakefile:
shell:
    "qiime empress community-plot "
    "--i-tree {input.tree} "
    "--i-feature-table {input.table} "
    "--m-sample-metadata-file {input.metadata} "
    "--m-feature-metadata-file {input.taxonomy} "
    "--m-feature-metadata-file {input.outliers} "
    "--o-visualization {output}"
From lines 1081–1082 of master/Snakefile:
shell:
    "bash scripts/alignment_count_gaps.sh < {input} > {output}"
From lines 1109–1112 of master/Snakefile:
shell:
    "Rscript --vanilla scripts/run_odseq.R {input} {params.metric} {params.replicates} {params.threshold} temp_odseq; "
    "cat temp_odseq | sed 's/^X//' > {output}; "
    "/bin/rm temp_odseq"
From lines 1129–1131 of master/Snakefile:
shell:
    "python scripts/plot_repseq_properties.py {input.lengths} {input.gaps} {input.outliers} {input.taxonomy} "
    "{input.table} {output.proptsv} {output.propdescribe} {output.proppdf} {output.outliersforqza}"
From lines 1141–1145 of master/Snakefile:
shell:
    "qiime tools import "
    "--type 'FeatureData[Importance]' "
    "--input-path {input} "
    "--output-path {output}"
From lines 1156–1158 of master/Snakefile:
shell:
    "cat {input.proptsv} | grep -i 'outlier\|true' | cut -f1,4 > {output.outliers}; "
    "cat {input.proptsv} | grep -i 'taxonomy\|unassigned' | cut -f1,5 > {output.unassigned}"
From lines 1255–1256 of master/Snakefile:
shell:
    "python scripts/filter_taxonomy.py {input.taxonomy} {input.repseqs} {output.taxonomytsv} {output.taxonomyqza}"
From lines 1273–1283 of master/Snakefile:
shell:
    "qiime diversity alpha-rarefaction "
    "--i-table {input.table} "
    "--i-phylogeny {input.phylogeny} "
    "--p-max-depth {params.maxdepth} "
    "--p-metrics faith_pd "
    "--p-metrics observed_features "
    "--p-metrics shannon "
    "--p-metrics pielou_e "
    "--m-metadata-file {input.metadata} "
    "--o-visualization {output}"
From lines 1313–1336 of master/Snakefile:
shell:
    "qiime diversity core-metrics-phylogenetic "
    "--i-table {input.table} "
    "--i-phylogeny {input.phylogeny} "
    "--p-sampling-depth {params.samplingdepth} "
    "--m-metadata-file {input.metadata} "
    "--o-rarefied-table {output.rarefiedtable} "
    "--o-faith-pd-vector {output.faithpdvector} "
    "--o-observed-features-vector {output.observedfeaturesvector} "
    "--o-shannon-vector {output.shannonvector} "
    "--o-evenness-vector {output.evennessvector} "
    "--o-unweighted-unifrac-distance-matrix {output.unweightedunifracdistancematrix} "
    "--o-weighted-unifrac-distance-matrix {output.weightedunifracdistancematrix} "
    "--o-jaccard-distance-matrix {output.jaccarddistancematrix} "
    "--o-bray-curtis-distance-matrix {output.braycurtisdistancematrix} "
    "--o-unweighted-unifrac-pcoa-results {output.unweightedunifracpcoaresults} "
    "--o-weighted-unifrac-pcoa-results {output.weightedunifracpcoaresults} "
    "--o-jaccard-pcoa-results {output.jaccardpcoaresults} "
    "--o-bray-curtis-pcoa-results {output.braycurtispcoaresults} "
    "--o-unweighted-unifrac-emperor {output.unweightedunifracemperor} "
    "--o-weighted-unifrac-emperor {output.weightedunifracemperor} "
    "--o-jaccard-emperor {output.jaccardemperor} "
    "--o-bray-curtis-emperor {output.braycurtisemperor} "
    "--p-n-jobs-or-threads {threads}"
From lines 1347–1351 of master/Snakefile:
shell:
    "qiime diversity alpha-group-significance "
    "--i-alpha-diversity {input.alphadiversity} "
    "--m-metadata-file {input.metadata} "
    "--o-visualization {output}"
From lines 1366–1373 of master/Snakefile:
shell:
    "qiime diversity beta-group-significance "
    "--i-distance-matrix {input.distancematrix} "
    "--m-metadata-file {input.metadata} "
    "--m-metadata-column {params.column} "
    "--p-method {params.method} "
    "{params.pairwise} "
    "--o-visualization {output}"
From lines 1389–1397 of master/Snakefile:
shell:
    "qiime deicode auto-rpca "
    "--i-table {input.table} "
    "--p-min-sample-count {params.minsamplecount} "
    "--p-min-feature-count {params.minfeaturecount} "
    "--p-min-feature-frequency {params.minfeaturefrequency} "
    "--p-max-iterations {params.maxiterations} "
    "--o-biplot {output.biplot} "
    "--o-distance-matrix {output.distancematrix}"
From lines 1410–1415 of master/Snakefile:
shell:
    "qiime emperor biplot "
    "--i-biplot {input.biplot} "
    "--m-sample-metadata-file {input.metadata} "
    "--o-visualization {output.emperor} "
    "--p-number-of-features {params.numfeatures}"
From lines 1460–1526 of master/Snakefile:
shell:
    "echo '# Tourmaline Report' > {output};"
    "echo '' >> {output};"
    "echo 'View this HTML report with [Chrome](https://www.google.com/chrome/){{target=\"_blank\"}} or [Firefox](https://www.mozilla.org/en-US/firefox/new/){{target=\"_blank\"}} for best results.' >> {output};"
    "echo '' >> {output};"
    "echo 'To view the linked files below: ' >> {output};"
    "echo '' >> {output};"
    "echo '* QZV (QIIME 2 visualization): click to download, then drag and drop in [https://view.qiime2.org](https://view.qiime2.org){{target=\"_blank\"}}.' >> {output};"
    "echo '* TSV (tab-separated values): click to download, then open in Microsoft Excel or Tabview (command line tool).' >> {output};"
    "echo '* PDF (portable document format): click to open and view in new tab.' >> {output};"
    "echo '* Markdown and text: click to open and view in new tab.' >> {output};"
    "echo '' >> {output};"
    "echo 'Note: Downloaded files can be deleted after viewing, as they are already stored in your Tourmaline directory.' >> {output};"
    "echo '' >> {output};"
    "echo 'For information on Tourmaline outputs, visit [https://github.com/aomlomics/tourmaline](https://github.com/aomlomics/tourmaline){{target=\"_blank\"}}.' >> {output};"
    "echo '' >> {output};"
    "echo '---' >> {output};"
    "echo '' >> {output};"
    "echo '## Metadata Summary' >> {output};"
    "echo '' >> {output};"
    "echo Markdown: \[{input.mdsummary}\]\(../{input.mdsummary}\){{target=\"_blank\"}} >> {output};"
    "echo '' >> {output};"
    "cat {input.mdsummary} >> {output};"
    "echo '' >> {output};"
    "echo '' >> {output};"
    "echo '---' >> {output};"
    "echo '' >> {output};"
    "echo '## Fastq Sequences Information' >> {output};"
    "echo '' >> {output};"
    "echo '### Fastq Sequences per Sample' >> {output};"
    "echo '' >> {output};"
    "echo Markdown: \[{input.fastq}\]\(../{input.fastq}\){{target=\"_blank\"}} >> {output};"
    "echo '' >> {output};"
    "cat {input.fastq} >> {output};"
    "echo '' >> {output};"
    "echo '### Visualization of Fastq Sequences' >> {output};"
    "echo '' >> {output};"
    "echo QZV: \[{input.visfastq}\]\(../{input.visfastq}\){{target=\"_blank\"}} >> {output};"
    "echo '' >> {output};"
    "echo '---' >> {output};"
    "echo '' >> {output};"
    "echo '## Representative Sequences Information' >> {output};"
    "echo '' >> {output};"
    "echo '### Representative Sequences Properties Table' >> {output};"
    "echo '' >> {output};"
    "echo TSV: \[{input.repseqstsv}\]\(../{input.repseqstsv}\){{target=\"_blank\"}} >> {output};"
    "echo '' >> {output};"
    "echo 'Columns:' >> {output};"
    "echo '' >> {output};"
    "echo '* featureid' >> {output};"
    "echo '* length - length (bp) not including gaps' >> {output};"
    "echo '* gaps - gaps (bp) in multiple sequence alignment' >> {output};"
    "echo '* outlier - outlier (True/False) determined by OD-seq' >> {output};"
    "echo '* taxonomy - domain level' >> {output};"
    "echo '* observations - total observations summed across all samples (unrarefied)' >> {output};"
    "echo '* log10(observations) - log base-10 of total observations' >> {output};"
    "echo '' >> {output};"
    "echo '### Representative Sequences Properties Summary' >> {output};"
    "echo '' >> {output};"
    "echo Markdown: \[{input.repseqsdescribe}\]\(../{input.repseqsdescribe}\){{target=\"_blank\"}} >> {output};"
    "echo '' >> {output};"
    "cat {input.repseqsdescribe} >> {output};"
    "echo '' >> {output};"
    "echo '### Representative Sequences Properties Plot' >> {output};"
    "echo '' >> {output};"
    "echo PDF: \[{input.repseqspdf}\]\(../{input.repseqspdf}\){{target=\"_blank\"}} >> {output};"
    "echo '' >> {output};"
From lines 1669–1683 of master/Snakefile:
shell:
    "pandoc -i {input} -o {output};"
    "echo '<!DOCTYPE html>' > header.html;"
    "echo '<html>' >> header.html;"
    "echo '<head>' >> header.html;"
    "echo '<link rel=\"stylesheet\" type=\"text/css\" href=\"../css/{params.theme}.css\">' >> header.html;"
    "echo '</head>' >> header.html;"
    "echo '<body>' >> header.html;"
    "echo '' >> header.html;"
    "cat header.html {output} > temp.html;"
    "echo '' >> temp.html;"
    "echo '</body>' >> temp.html;"
    "echo '</html>' >> temp.html;"
    "mv temp.html {output};"
    "/bin/rm header.html"
