Microbiome Analysis Workflow based on QIIME 2


This repository contains the analysis workflow for Debelius et al. (doi: 10.1101/2021.03.23.436606).

Installation

The analysis was done in a qiime2-2020.11 environment with RESCRIPt and q2-sidle installed.

To build this environment, install qiime2-2020.11 according to your system requirements. Then, follow the q2-sidle installation instructions. You will also need snakemake 5.3 or later:

conda install --channel bioconda snakemake=5.3

Finally, you will need to clone or download the repository (https://github.com/jwdebelius/avengers-assemble) in whatever way works best for you.
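
As a minimal sketch of the setup, assuming the QIIME 2 environment was named qiime2-2020.11 when it was created:

conda activate qiime2-2020.11   # use whatever name you gave the environment
git clone https://github.com/jwdebelius/avengers-assemble.git
cd avengers-assemble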

Input Files

Simulation

To run the simulation, you will need to download the 100nt OTU table (ID 45113, CRC32: 7066cf83) and metadata from Qiita Study 850. The OTU table should be imported into QIIME 2 using the command:

qiime tools import \
 --type 'FeatureTable[Frequency]' \
 --input-format BIOMV210Format \
 --input-path 45113_otu_table.biom \
 --output-path start-table.qza

The imported table should then be placed in the folder specified in the config file (input_dir under simulation); by default, this folder is ipynb/data/inputs/simulation.

The metadata should also be downloaded from the Qiita study (found under the sample information tab), renamed to start-metadata.txt, and moved to the simulation input folder (ipynb/data/inputs/simulation).
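
For example, a hedged sketch of staging the simulation inputs (the name of the Qiita sample-information download is hypothetical; substitute whatever the export is actually called):

mkdir -p ipynb/data/inputs/simulation
mv start-table.qza ipynb/data/inputs/simulation/
mv 850_sample_information.txt ipynb/data/inputs/simulation/start-metadata.txt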

Real Data

Real data can be downloaded from ENA study PRJEB37382, accessions ERR4704801-ERR4704848. These should be placed in the location specified in the config file; by default, they are expected in the ipynb/data/inputs/real/seqs directory. Metadata is inferred from the filenames.
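
A sketch of a bulk download is shown below. It assumes the standard ENA FTP layout for 10-character run accessions, so verify the paths against the ENA file report for PRJEB37382; because the workflow infers metadata from the filenames, the downloaded files may also need to be renamed to match the expected naming pattern.

mkdir -p ipynb/data/inputs/real/seqs
for i in $(seq 4704801 4704848); do
    acc="ERR${i}"
    # Assumed ENA convention: prefix directory (ERR470), then a 00<last digit> subdirectory
    dir="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${acc:0:6}/00${acc: -1}/${acc}"
    wget -P ipynb/data/inputs/real/seqs "${dir}/${acc}_1.fastq.gz" "${dir}/${acc}_2.fastq.gz"
done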

Database Files

The analysis is currently based on the Silva 128 QIIME release, available here. The files should be imported into QIIME 2 and saved to the ipynb/data/reference/silva-ori/ folder. The SEPP reference tree can be downloaded from the QIIME 2 resources page. The formatted Optivag and Greengenes databases are already in the repository.
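
A hedged sketch of the Silva import is shown below; the paths inside the Silva 128 QIIME release archive and the output artifact names are assumptions, so match them to whatever the Snakefile and config expect.

mkdir -p ipynb/data/reference/silva-ori
qiime tools import \
 --type 'FeatureData[Sequence]' \
 --input-path rep_set/rep_set_all/99/99_otus.fasta \
 --output-path ipynb/data/reference/silva-ori/silva-128-seqs.qza
qiime tools import \
 --type 'FeatureData[Taxonomy]' \
 --input-format HeaderlessTSVTaxonomyFormat \
 --input-path taxonomy/taxonomy_all/99/taxonomy_7_levels.txt \
 --output-path ipynb/data/reference/silva-ori/silva-128-taxonomy.qza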

Running the workflow

You can run the workflow easily through snakemake based on the current repository setup:

snakemake 
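
A dry run first can be helpful, and the number of cores can be set explicitly (standard snakemake flags; adjust the core count to your machine):

snakemake -n
snakemake --cores 4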

Workflow steps

Simulation

  1. Reference Simulation. The Greengenes database is used to filter the reference OTU table to exclude sequences with more than 5 degenerate nucleotides, and then a reference dataset is simulated from the original data by averaging abundance across multiple individuals of the same age group.

    • filter_degenerete_reference

    • filter_ref_table

    • simulate_reference_samples

  2. Regional Simulation. Regions of the 16S gene are extracted and combined with the full-length reference table to build a regional "ASV" table and representative sequence set. These are the starting points for our three test methods.

    • extract_regions

    • generate_asvs_and_derep_tables

  3. Database Preparation. The database is prepared by filtering out reference sequences with a large number of degenerate nucleotides and database sequences which do not have class-level annotations. The filtered sequences are then used to generate per-region kmers and kmer maps.

    • filter_silva_reference

    • extract_sidle_regions

    • prepare_extracted_region

  4. Reference Reconstruction. The simulated reference table, the Greengenes 13_8 taxonomy, and the Greengenes 13_8 tree are copied from their respective locations.

    • copy_sim_reference
  5. OTU Clustering. The table is generated by merging the ASV representative sequences and feature tables from all the individual regions and then clustering them closed-reference at 99% identity against the Silva 128 reference database.

    • concatenate_single_asv_counts

    • concatenate_single_rep_seqs

    • cluster_otus

    • get_otu_tree

    • get_otu_taxonomy

  6. ASV Denoising. The table is generated by merging all ASV counts. Taxonomy comes from naive Bayesian classification of all sequences; sequences which don't have a phylum-level annotation are excluded. The tree is built by fragment insertion of the merged sequences.

    • concatenate_single_asv_counts

    • concatenate_single_rep_seqs

    • classify_taxonomy

    • get_asv_rep_seq

    • get_asv_table

    • build_insertion_tree

  7. Sidle. For Sidle, the regional sequences are aligned to the regional kmers. The alignments, regional database maps, and feature counts are used to reconstruct the table. Taxonomy is reconstructed using the database map to identify unresolved sequences. To build the phylogenetic tree, we generate consensus fragments and then insert them into the Silva 128 backbone.

    • align_sidle_regions

    • reconstruct_table_single

    • reconstruct_taxonomy

    • reconstruct_fragments_single

    • reconstructed_fragment_insertion

  8. Alpha Diversity. Within-sample (alpha) diversity is calculated over multiple rarefaction iterations. We look at observed features, Shannon diversity, Pielou's evenness, and Faith's phylogenetic diversity. The results are summarized using the Table1-CompareAlpha.ipynb notebook. This generates Table 1.

    • rarefy_iteration

    • alpha_iter

    • alpha_table

  9. Beta Diversity. Between-sample (beta) diversity is also calculated from rarefied tables. We look at three feature-based metrics (unweighted UniFrac, weighted UniFrac, and Bray-Curtis) as well as a collapsed metric (genus-level Bray-Curtis). The results are summarized in the ipynb/Table2-CompareBeta.ipynb notebook. This generates Table 2.

    • rarefy_iteration

    • genus_braycurtis

    • beta_phylogenetic

    • braycurtis

    • adonis

    • beta_table

  10. Taxonomic Comparison. We compared the taxonomy at class level using the merged, unnormalized tables through the ipynb/Figure1-Taxonomy.ipynb notebook. This generates Figure 2 and Table S3.

    • compare_taxonomy

Real Data Analysis

  1. Data Preparation. We import the sequences into QIIME 2, trim the primers for each region, denoise them using DADA2, and then trim the denoised sequences post hoc.

    • build_manifest

    • import_seqs

    • trim_data_v13

    • denoise_v13

    • trim_data_v34

    • denoise_v34

    • trim_posthoc

  2. Database Preparation. For benchmarking, the Greengenes database needs to be prepared (filter_greengenes_reference); otherwise, the Optivag database is assumed to be tidied. Then, we extract the regions and prepare them for Sidle alignment.

    • filter_greengenes_reference

    • extract_sidle_regions

    • prepare_extracted_region

  3. Reference Reconstruction. The V34 region alone was used as the reference for the simulation. We used the paired-end table, classified the taxonomy using a naive Bayesian classifier for the full-length sequences, and then filtered the table to remove anything missing a phylum-level or kingdom-level designation.

    • classify_regional_taxonomy

    • get_real_reference_data

    • get_real_representative_seqs

  4. OTU Clustering. The table is generated by merging the ASV representative sequences and feature tables from all the individual regions and then clustering them closed-reference at 99% identity.

    • concatenate_single_asv_counts

    • concatenate_single_rep_seqs

    • cluster_otus

    • get_otu_taxonomy

  5. ASV Denoising. The table is generated by merging all ASV counts. Taxonomy comes from naive Bayesian classification of all sequences; sequences which don't have a phylum-level annotation are excluded. The tree is built by fragment insertion of the merged sequences.

    • concatenate_single_asv_counts

    • concatenate_single_rep_seqs

    • classify_taxonomy

    • get_asv_rep_seq

    • get_asv_table

  6. Sidle. For Sidle, the regional sequences are aligned to the regional kmers. The alignments, regional database maps, and feature counts are used to reconstruct the table. Taxonomy is reconstructed using the database map to identify unresolved sequences.

    • align_sidle_regions

    • reconstruct_table_single

    • reconstruct_taxonomy

  7. Data Synthesis. The taxonomic annotation was checked and the tables evaluated using the ipynb/Figure2-RealData.ipynb notebook.

    • real_data_figure

Benchmarking

The Snakefile contains a section dedicated to preparing the Sidle reads and producing an output table; this generates a series of benchmarking files in the benchmark folder. The Matlab folder contains the modified .m scripts to run with SMURF.

Code Snippets

bin/adonis_single.R

suppressMessages(library("argparser"))
suppressMessages(library("vegan"))


# Sets up an argparser to make this tractable
parser <- arg_parser(
  description="Basic Adonis Functionality for age and set"
)
parser <- add_argument(parser,
  '--distance-matrix',
  help='the distance matrix'
)
parser <- add_argument(parser,
  '--metadata',
  help='Path to the metadata file'
)
parser <- add_argument(parser,
  '--output',
  help='The location for the summary file'
)

args <- parse_args(parser)

# Reads the tab-separated distance matrix ("." decimals) with the sample ids as
# the row names, and the metadata keyed on the sample id column
df <- read.delim(args$distance_matrix, row.names='X')
map <- read.delim(args$metadata, row.names='sample.id')
map <- map[row.names(df),]

# Convert the square distance matrix into a dist object for adonis2
dm  <- as.dist(df[row.names(df), row.names(df)])

age <- adonis2(dm ~ age, data=map)

res <- data.frame(
  age = c(age$R2[1], age$`Pr(>F)`[1], age$F[1]),
  row.names = c('R2', 'p_999', 'F')
  )

write.table(res, file=args$output, quote=FALSE, sep='\t', col.names = NA)

bin/simulate_samples.py

description = """
This script will build a set of synthetic samples based on resampling of the
American Gut data. We'll do this by reading in the biom table and then
randomly selecting a subset of non-ambiguous samples.

The samples will then be split into two approximately equal sized groups and
used to generate a set of normalized, simulated samples.
"""

import argparse
import os

import biom
import numpy as np
import pandas as pd

from qiime2 import Artifact


def subsample_data(metadata, table, age_split_fun, id_prefixes,
                   colors=None,
                   group_size=10, 
                   n_samples=20, seed=1234):
    """
    Generates a set of synthetic samples based on the American Gut using
    separation between age groups

    Parameters
    ----------
    metadata : DataFrame
        A map of the samples with the age encoded in `age` and the
        index as the sample identifier
    table : biom.Table
        The samples used for simulation
    age_split_fun : function
        Maps an age to an age category (or NaN, to drop the sample)
    id_prefixes : list of str
        The age category labels used to split the table and label the
        simulated samples
    colors : dict, optional
        A map of composite (age-set) labels to plotting colors
    group_size : int
        The number of samples to be combined to make the simulated sample
    n_samples : int
        The number of samples to simulate for each age group
    seed : int
        A seed for the random number generator

    Returns
    -------
    biom.Table
        All of the synthesized samples
    DataFrame
        Metadata describing the synthesized samples

    """
    # Sets a seed for consistency
    np.random.seed(seed)

    # Gets the age categories
    metadata['age_cat'] = metadata['age'].apply(age_split_fun)
    metadata.dropna(subset=['age_cat'], inplace=True)

    table.filter(metadata.index, axis='sample', inplace=True)
    table.add_metadata(metadata[['age_cat']].to_dict(orient='index'), axis='sample')

    print(id_prefixes)

    tables = [
        table.copy().filter(lambda v, id_, md: md['age_cat'] == prefix,
                            axis='sample', inplace=False)
        for prefix in id_prefixes
    ]

    synthetic = [
        tidy_names(pd.concat(axis=1, objs=[
            build_subsample(table, group_size, prefix) 
            for i in range(n_samples)]), 
        prefix)
        for prefix, table in zip(*(id_prefixes, tables))
    ]
    print(synthetic[0].head())

    all_samples = pd.concat(axis=1, sort=False, objs=synthetic)
    all_samples.fillna(0, inplace=True)

    all_biom = biom.Table(data=all_samples.values, 
                          sample_ids=all_samples.columns,
                          observation_ids=all_samples.index
                          )
    metadata = pd.DataFrame.from_dict(orient='index', data={
        id_: {'set': id_.split('.')[1],
              'age': id_.split('.')[2],
              }
        for id_ in all_biom.ids(axis='sample')
        })
    metadata.index.set_names('sample-id', inplace=True)
    metadata['composite'] = metadata.apply(lambda x: "{age}-{set}".format(**x),
                                            axis=1)
    if colors is not None:
        metadata['color'] = metadata['composite'].replace(colors)

    return all_biom, metadata

def tidy_names(table, prefix):
    table.rename(inplace=True, columns={
        int(i): f"sample.1.{prefix}.{i}" for i in table.columns
        })
    return table

def build_subsample(table, sample_size, group):
    """
    Builds a subsampled table from a biom object
    """
    sub_t = table.copy().subsample(sample_size, axis='sample', by_id=True)
    sub_t.filter(lambda v, id_, md: (v > 0).sum() > np.round(0.4 * sample_size), 
                 axis='observation', 
                 inplace=True)
    single = sub_t.copy().collapse(lambda id_, md: md['age_cat'], 
                                   axis='sample')
    single.norm(axis='sample', inplace=True)
    single.filter(lambda v, id_, md: (v > 2e-5).all(), 
                  axis='observation', 
                  inplace=True)
    single.norm(axis='sample', inplace=True)
    return pd.Series(np.round(single.data(group) * 5e5).round(0),
                     index=single.ids(axis='observation'))


parser = argparse.ArgumentParser(description=description)
parser.add_argument(
    '--keep-map',
    help=("The map of the samples to be used for simulation. This must "
          "contain  a sample identifier labeled `sample_name` and the age"
          "labeled age_cat"
          ),
    )
parser.add_argument(
    '--sample-table',
    help=("A biom table containing sample counts. The sample list should "
          "match those present in `--keep-map`"),
    )
parser.add_argument(
    '--simulated-table',
    help=('Filepath to the qiime artifact containing all the samples'),
    )
parser.add_argument(
    '--simulated-metadata',
    help=('Filepath to save the metadata describing the simulated samples'),
    )
parser.add_argument(
    '--samples-into-simulation',
    help=("The number of samples to combine into each simulated sample"),
    default=10,
    type=int,
    )
parser.add_argument(
    '--group-size',
    help=("The number of samples in each group to simulate for the set."),
    default=20,
    type=int,
    )
parser.add_argument(
    '--seed',
    help=('a random seed'),
    default=1234,
    type=int,
    )

if __name__ == '__main__':
    args = parser.parse_args()

    # Loads group
    keep_map = pd.read_csv(args.keep_map, sep='\t', dtype=str)   
    keep_map.set_index('sample_name', inplace=True)
    keep_map['age'] = keep_map['age'].astype(float)

    # Loads table
    sample_table = Artifact.load(args.sample_table).view(biom.Table)

    def cat_age(x):
        if pd.isnull(x):
            return x
        if x <= 3:
            return 'infant'
        elif (3 < x) & (x <= 12):
            return 'child'
        elif (20 < x) & (x <= 60):
            return 'adult'
        else:
            return np.nan

    colors = {'infant-1': '#1f78b4', 'infant-2': '#a6cee3', 
              'adult-1': '#e31a1c', 'adult-2': '#fb9a99'}


    all_biom, metadata = \
        subsample_data(metadata=keep_map.copy(), 
                       table=sample_table, 
                       age_split_fun=cat_age, 
                       id_prefixes=['infant', 'adult'],
                       group_size=args.samples_into_simulation,
                       n_samples=args.group_size,
                       seed=args.seed,
                       )
    all_table = Artifact.import_data('FeatureTable[Frequency]', all_biom)
    all_table.save(args.simulated_table)

    metadata.to_csv(args.simulated_metadata, sep='\t')

bin/simulate_tables_exact.py

description = """
Simulates an ASV table and dereplicated sequences for OTU clustering
and meta-analysis exploration
"""


import argparse
import hashlib

import biom
import numpy as np
import pandas as pd
import regex
import skbio
from skbio import DNA

from qiime2 import Artifact


def build_regional_table(samples, seqs, read_length):
    """
    Amplifies a regional table from the original table

    Parameters
    ----------
    samples : DataFrame
        A biom-style dataframe where the index is sequences (corresponding
        to the sequences in `seqs`) and the columns are samples

    """
    # Filters the sequences and table down to those present in the sequences
    degen_count = seqs.astype(str).apply(
        lambda x: len(regex.findall('[RYSWKMBDHVN]', str(x)))
        )
    len_check = seqs.astype(str).apply(lambda x: (len(x) >= np.absolute(read_length)))
    print(len_check.sum())
    degen_keep = degen_count.index[(degen_count <= 10) & len_check].values
    var_ids = [id_ for id_ in samples.index if (id_ in degen_keep)]
    samples = samples.loc[var_ids]

    table_seqs = seqs[var_ids].copy()

    # Expands the degenerate sequences
    exp_ = {id_: np.array([str(s) for s in seq_.expand_degenerates()])
            for id_, seq_ in table_seqs.iteritems()}

    # Generates the table by expanding the sequences, randomly sampling the
    # degenerate positions, and then combining the sequences into a table
    forward_reads = dict([])
    reverse_reads = dict([])
    joined_reads = dict([])
    for sample in samples.columns:
        expanded = pd.Series(np.hstack([
                np.random.choice(exp_[id_], size=count, replace=True)
                for id_, count in 
                samples.loc[samples[sample] > 0, sample].iteritems()])
        ).apply(lambda x: x[:read_length])
        joined_reads[sample] = expanded.value_counts()

    joined_table = pd.DataFrame.from_dict(orient='index', data=joined_reads).fillna(0)
    joined_table = joined_table.loc[joined_table.sum(axis=1) > 10]
    joined_reads = pd.Series({
        hash_seq(seq): DNA(seq) 
        for seq in joined_table.columns
    })
    joined_table.rename(columns={seq: hash_seq(seq) for seq in joined_table.columns},
                        inplace=True)
    jointable = Artifact.import_data('FeatureTable[Frequency]', joined_table, pd.DataFrame)
    joinseqs = Artifact.import_data('FeatureData[Sequence]', joined_reads, pd.Series)

    return jointable, joinseqs


def hash_seq(x):
    """
    A lazy function to hash things
    """
    return hashlib.md5(x.encode()).hexdigest()




parser = argparse.ArgumentParser(description=description)
parser.add_argument(
    '--sequences',
    help=('The extracted regional sequences'),
    )
parser.add_argument(
    '--sample-table',
    help=("A table of the samples to be simulated."),
    )
parser.add_argument(
    '--read-length',
    help=("length of the final simulated sequences. We treat these as paired"
          "end sequences"),
    type=int,
    )
parser.add_argument(
    '--table',
    help=('Path to save the joined-sequence table')
    )
parser.add_argument(
    '--reads',
    help=('Path to save the representative sequences for the joined table'),
    )
parser.add_argument(
    '--random',
    help=("A seed for randomn number generation"),
    default=None,
    )

if __name__ == '__main__':
    args = parser.parse_args()

    # Gets the regional sequences
    seqs =  Artifact.load(args.sequences).view(pd.Series)

    # Gets the table
    table = Artifact.load(args.sample_table).view(pd.DataFrame).fillna(0).astype(int).T

    # Simulates the samples
    if args.random is not None:
        np.random.seed(int(args.random))
    # fwd_table, rev_table, join_table, fwd_seqs, rev_seqs, join_seqs = \
    join_table, join_seqs = \
        build_regional_table(table, seqs, args.read_length)

    # fwd_table.save(args.fwd_table)
    # fwd_seqs.save(args.fwd_reads)
    # rev_table.save(args.rev_table)
    # rev_seqs.save(args.rev_reads)
    join_table.save(args.table)
    join_seqs.save(args.reads)
shell:
	"""
	qiime sidle filter-degenerate-sequences \
	 --i-sequences {input} \
	 --p-n-workers 2 \
	 --p-max-degen 5 \
	 --o-filtered-sequences {output}
	"""
shell:
	"""
	qiime feature-table filter-features \
	 --i-table {input.table} \
	 --m-metadata-file {input.sequences} \
	 --o-filtered-table {output}
	"""
shell:
	"""
	python bin/simulate_samples.py \
	--keep-map {input.metadata} \
	--sample-table {input.table} \
	--group-size {params.group_size} \
	--simulated-table {output.table} \
	--simulated-metadata {output.metadata}
	"""
(From line 67 of main/Snakefile)
shell:
	"""
	qiime feature-classifier extract-reads \
	 --i-sequences {input} \
	 --p-f-primer {params.fwd} \
	 --p-r-primer {params.rev} \
	 --p-min-length 50 \
	 --p-max-length 700 \
	 --p-n-jobs 5 \
	 --o-reads {output}
	"""
shell:
	"""
	python bin/simulate_tables_exact.py \
	 --sequences {input.seqs} \
	 --sample-table {input.table} \
	 --read-length {params.length} \
	 --random {params.seed} \
	 --table {output.joined_table} \
	 --reads {output.joined_repset}
	"""
(From line 113 of main/Snakefile)
run:
	import os
	import numpy as np
	import pandas as pd

	samples = input    # the full list of input fastq filepaths
	region = params[0]
	joined = params[1]

	def get_sample_descriptor(fp):
	    name = fp.split('.')[0]
	    abs_fp = os.path.abspath(fp)
	    position, study_info, sample_info = name.split("__")
	    [pool_id, rep, region, pcr, dir_] = sample_info.split('_')

	    dir_label = {'R1': 'forward', 'R2': 'reverse'}[dir_]

	    summary = pd.Series(
	        data={'%s-absolute-filepath' % dir_label: abs_fp,
	              'sample-id': '{}_{}'.format(pool_id, rep),
	              'pool': pool_id,
	              'replicate': rep,
	              'region': region,
	              'pcr_type': pcr},
	        name='_'.join(sample_info.split('_')[:-1]))
	    return summary

	manifest = pd.DataFrame([get_sample_descriptor(fp) 
							 for fp in samples if ('R1.' in fp)])
	rev_summary = pd.DataFrame([get_sample_descriptor(fp) 
								for fp in samples if ('R2.' in fp)])

	if joined:
		manifest['reverse-absolute-filepath'] = rev_summary['reverse-absolute-filepath']
	else:
		manifest.rename(columns={'forward-absolute-filepath': 'absolute-filepath'}, 
						inplace=True)
	manifest.set_index('sample-id', inplace=True)
	manifest.to_csv(str(output), sep='\t')
shell:
	"""
	qiime tools import \
	 --type {params.type} \
	 --input-path {input} \
	 --input-format {params.format} \
	 --output-path {output}
	"""
shell:
	"""
	qiime cutadapt trim-single \
	 --i-demultiplexed-sequences {input} \
	 --p-front {params.f_primer} \
	 --p-no-discard-untrimmed \
	 --o-trimmed-sequences {output}
	"""
shell:
	"""
	qiime dada2 denoise-single \
	 --i-demultiplexed-seqs {input} \
	 --p-trunc-len 280 \
	 --p-n-threads 4 \
	 --o-table {output.table} \
	 --o-representative-sequences {output.seqs} \
	 --o-denoising-stats {output.stats}
	"""
shell:
	"""
	qiime cutadapt trim-paired \
	 --i-demultiplexed-sequences {input} \
	 --p-front-f {params.f_primer} \
	 --p-front-r {params.r_primer} \
	 --p-no-discard-untrimmed \
	 --o-trimmed-sequences {output}
	"""
shell:
	"""
	qiime dada2 denoise-paired \
	 --i-demultiplexed-seqs {input} \
	 --p-trunc-len-f 280 \
	 --p-trunc-len-r 280 \
	 --p-n-threads 4 \
	 --o-table {output.table} \
	 --o-representative-sequences {output.seqs} \
	 --o-denoising-stats {output.stats}
	"""
shell:
	"""
	qiime sidle trim-dada2-posthoc \
	 --i-table {input.table}  \
	 --i-representative-sequences {input.seqs} \
	 --p-trim-length {params.length} \
	 --p-hashed-feature-ids \
	 --o-trimmed-table {output.table} \
	 --o-trimmed-representative-sequences {output.seqs}
	"""
shell:
	"""
	qiime cutadapt trim-paired \
	 --i-demultiplexed-sequences {input} \
	 --p-front-f {params.fwd} \
	 --p-front-r {params.rev} \
	 --p-discard-untrimmed \
	 --o-trimmed-sequences {output}
	"""
shell:
	"""
	qiime vsearch join-pairs \
	--i-demultiplexed-seqs {input} \
	 --o-joined-sequences {output}
	"""
shell:
	"""
	qiime quality-filter q-score \
	 --i-demux {input} \
	 --o-filtered-sequences {output.seqs}\
	 --o-filter-stats {output.stats}
	"""
shell:
	"""
	qiime deblur denoise-16S \
	  --i-demultiplexed-seqs {input} \
	  --p-trim-length {params} \
	  --o-representative-sequences {output.seqs} \
	  --o-table {output.table} \
	  --p-sample-stats \
	  --o-stats {output.stats}
	"""
shell:
	"""
	qiime sidle filter-degenerate-sequences \
	 --i-sequences {input.seqs} \
	 --p-max-degen {params.degen} \
	 --p-n-workers 4 \
	 --o-filtered-sequences {params.tmp}

	qiime taxa filter-seqs \
	 --i-sequences {params.tmp} \
	 --i-taxonomy {input.taxa} \
	 --p-include {params.include} \
	 --p-exclude {params.exclude} \
	 --p-mode "contains" \
	 --o-filtered-sequences {output.seqs} 

	rm {params.tmp}
	"""
shell:
	"""
	qiime feature-classifier extract-reads \
	 --i-sequences {input} \
	 --p-f-primer {params.fwd} \
	 --p-r-primer {params.rev} \
	 --o-reads {output}
	"""
shell:
	"""
	qiime sidle prepare-extracted-region \
	 --i-sequences {input} \
	 --p-fwd-primer {params.fwd} \
	 --p-rev-primer {params.rev} \
	 --p-region {params.region} \
	 --p-trim-length {params.trim} \
	 --p-n-workers 2 \
	 --o-collapsed-kmers {output.kmers} \
	 --o-kmer-map {output.map_}
	"""
shell:
	"""
	cp {input.table} {output.table}
	cp {input.taxa} {output.taxa}
	cp {input.tree} {output.tree}
	"""
(From line 487 of main/Snakefile)
shell:
	"""
	qiime feature-classifier classify-sklearn \
	 --i-reads {input.seqs} \
	 --i-classifier {input.classifier} \
	 --o-classification {output}
	"""
shell:
	"""
	qiime taxa filter-table \
	 --i-table {input.table} \
	 --i-taxonomy {input.taxonomy} \
	 --p-include ';c__' \
	 --p-mode contains \
	 --o-filtered-table {output}
	"""
shell:
	"""
	qiime feature-table filter-seqs \
	 --i-table {input.table} \
	 --i-data {input.seqs} \
	 --o-filtered-data {output}
	"""
run:
	from qiime2 import Artifact
	import qiime2.plugins.feature_table.actions as q2_table

	tables = [Artifact.load(fp) for fp in input]

	merged = q2_table.merge(tables, overlap_method='sum').merged_table
	merged.save(str(output))
run:
	from qiime2 import Artifact
	import qiime2.plugins.feature_table.actions as q2_table

	tables = [Artifact.load(fp) for fp in input]

	merged = q2_table.merge_seqs(tables).merged_data
	merged.save(str(output))
shell:
	"""
	qiime vsearch cluster-features-closed-reference \
	 --i-sequences {input.seqs} \
	 --i-table {input.table} \
	 --i-reference-sequences {input.ref} \
	 --p-perc-identity {params.perc_identity} \
	 --p-threads {params.threads} \
	 --o-clustered-table {output.table} \
	 --o-clustered-sequences {output.centroids} \
	 --o-unmatched-sequences {output.discard}
	"""
shell:
	"""
	cp {input} {output}
	"""
(From line 610 of main/Snakefile)
shell:
	"""
	cp {input} {output}
	"""
(From line 620 of main/Snakefile)
shell:
	"""
	qiime feature-classifier classify-sklearn \
	 --i-reads {input.seqs} \
	 --i-classifier {input.classifier} \
	 --o-classification {output}
	"""
shell:
	"""
	qiime taxa filter-table \
	 --i-table {input.table} \
	 --i-taxonomy {input.taxonomy} \
	 --p-include ";D_1__" \
	 --p-mode contains \
	 --o-filtered-table {output}
	"""
shell:
	"""
	qiime feature-table filter-seqs \
	 --i-table {input.table} \
	 --i-data {input.seqs} \
	 --o-filtered-data {output}
	"""
shell:
	"""
	qiime fragment-insertion sepp \
		--i-representative-sequences {input.seqs} \
		--i-reference-database {input.reference} \
		--o-tree {output.tree} \
		--p-threads {params.threads} \
		--o-placements {output.placements}
	"""
shell:
	"""
	qiime sidle align-regional-kmers \
	 --i-kmers {input.kmers} \
	 --i-rep-seq {input.rep_seq} \
	 --p-region {params.region} \
	 --p-max-mismatch {params.mismatch} \
	 --p-n-workers 2 \
	 --p-chunk-size 75 \
	 --o-regional-alignment {output.alignment} \
	 --verbose
	"""
run:
	from qiime2 import Artifact
	import qiime2.plugins.sidle.actions as q2_sidle

	num_regions = len(params[0])
	regions = params[0]
	nworkers = params[1]

	kmers = [Artifact.load(str(fp_)) 
			 for fp_ in input[:num_regions]]
	align = [Artifact.load(str(fp_)) 
			 for fp_ in input[num_regions:2*num_regions]]
	tables = [Artifact.load(str(fp_)) 
				for fp_ in input[2*num_regions:]]

	table, summary, db_map = q2_sidle.reconstruct_counts(
		kmer_map=kmers,
		regional_alignment=align,
		regional_table=tables,
		region=regions,
		region_normalize='average',
		n_workers=3,
		# n_workers=nworkers,
		# verbose=True,
		)

	table.save(str(output[0]))
	summary.save(str(output[1]))
	db_map.save(str(output[2]))
shell:
	"""
	qiime sidle reconstruct-taxonomy \
	 --i-reconstruction-map {input.db_map} \
	 --i-taxonomy {input.taxa} \
	 --p-database {params.database} \
	 --p-define-missing merge \
	 --o-reconstructed-taxonomy {output}
	"""
run:
	from qiime2 import Artifact
	import qiime2.plugins.sidle.actions as q2_sidle

	summary = Artifact.load(str(input[0]))
	db_map = Artifact.load(str(input[1]))
	aligned_seqs = Artifact.load(str(input[2]))
	kmer_maps = [Artifact.load(str(fp)) for fp in input[3:]]

	regions = params[0]

	representative_fragments = q2_sidle.reconstruct_fragment_rep_seqs(
		reconstruction_map=db_map,
		reconstruction_summary=summary,
		aligned_sequences=aligned_seqs,
		kmer_map=kmer_maps,
		region=regions
		).representative_fragments
	representative_fragments.save(str(output))
shell:
	"""
	qiime fragment-insertion sepp \
		--i-representative-sequences {input.fragments} \
		--i-reference-database {input.reference} \
		--p-threads 6 \
		--o-tree {output.tree} \
		--o-placements {output.placements}
	"""
shell:
	"""
	qiime feature-table rarefy \
	 --i-table {input.table} \
	 --p-sampling-depth {params.depth} \
	 --o-rarefied-table {output.table}
	"""
run:
	from qiime2 import Artifact
	import qiime2.plugins.diversity.actions as q2_diversity

	table = Artifact.load(str(input[0]))
	tree = Artifact.load(str(input[1]))
	metric = params[0]

	if metric == 'faith_pd':
		alpha = q2_diversity.alpha_phylogenetic(
			table=table,
			phylogeny=tree,
			metric=metric,
			).alpha_diversity
	else:
		alpha = q2_diversity.alpha(
			table=table,
			metric=metric,
			).alpha_diversity
	alpha.save(output[0])
shell:
	"""
	qiime diversity beta-phylogenetic \
	 --i-table {input.table} \
	 --i-phylogeny {input.tree} \
	 --p-metric {params.metric} \
	 --o-distance-matrix {output}
	"""
shell:
	"""
	qiime taxa collapse \
	 --i-table {input.table} \
	 --i-taxonomy {input.taxonomy} \
	 --p-level 6 \
	 --o-collapsed-table {output.table}

	qiime diversity beta \
	--i-table {output.table} \
	--p-metric braycurtis \
	--o-distance-matrix {output.dm}
	"""
shell:
	"""
	qiime diversity beta \
	--i-table {input.table} \
	--p-metric braycurtis \
	--o-distance-matrix {output}
	"""
shell:
	"""
	qiime tools export \
	 --input-path {input.dm} \
	 --output-path {params.dir_}

	Rscript bin/adonis_single.R \
	 -d {params.dir_}/distance-matrix.tsv \
	 -m {input.meta} \
	 -o {output}

	rm -r {params.dir_}
	"""
(41 additional code snippets are not shown here.)
