Short description
This workflow is designed to study large arrays of multiomics data by analyzing patterns of co-occurrence of pairs of metagenomics and metabolomics features across a large number of samples, in order to find statistically significant functional associations, e.g. a bacterium producing an antibiotic. The workflow is also meant to assess various metabolomics feature extraction tools and/or correlation methods by generating a decoy database and calculating the False Discovery Rate (FDR). The workflow is built and run with Snakemake, which makes it easy to change and expand the pipeline when necessary.
Directory structure
- `input`: preprocessed initial data
- `log`: workflow step logs
- `output`: results
- `preprocessing`: subworkflow for converting the raw data to the unified format
- `scripts`: workflow scripts
- `tmp`: temporary workflow files
Main steps
Data preprocessing
Feature data comes in a variety of formats, depending on the source and the methods used to obtain it, so a preprocessing step is needed to convert this data to a unified format (described below). There are no restrictions on how this step is performed as long as it follows the output requirements; however, the Snakemake-based subworkflow located in the `preprocessing` directory can be used (it needs to be implemented explicitly).
As a result of this step the following files should be created in the `input` directory:
- `mbx.feat.<type>`: metabolomics features
- `mgx.feat.<type>`: metagenomics features
- `samples.tsv`: sample names and additional information
Decoy database creation
To be able to evaluate the quality of the feature extraction or association methods that are used, we generate a decoy database of metabolomic features by applying a random permutation to the sample set of every feature. This decoy database is later used to estimate the FDR; a minimal sketch of the permutation idea follows the file list below.
As a result of this step the following files are created in the `input` directory:
- `mbx_decoy.feat.<type>`: decoy metabolomics features
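The sketch below illustrates the permutation idea under the assumption that a feature is stored as the set of samples it occurs in; it is not the workflow's `scripts/decoy_database.py` (the actual code is listed under Code Snippets), and `make_decoy_feature` is a hypothetical helper:

```python
import numpy as np

def make_decoy_feature(feature_samples, all_samples, rng):
    """Replace a feature's sample set by a random subset of the same size,
    preserving the feature's prevalence but destroying any real association."""
    k = len(feature_samples)
    return set(rng.choice(all_samples, size=k, replace=False))

# hypothetical data
rng = np.random.default_rng(42)
all_samples = [f"Sample_{i:03d}" for i in range(1, 101)]
real_feature = {"Sample_001", "Sample_002", "Sample_017"}
decoy_feature = make_decoy_feature(real_feature, all_samples, rng)
print(len(decoy_feature) == len(real_feature))  # True
```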
Pairwise associations search
For every pair of a metagenomic and a metabolomic feature we compare their abundance profiles to find associated features. For now, Fisher's exact test (`scipy.stats.fisher_exact`) is applied to binary abundance data (present or not present) and its p-value is used to measure the statistical significance of the association; other association measures, e.g. Pearson or Spearman correlation, may be added in the future. A minimal example of the test is sketched after the list of alternatives below.
Fisher's exact test can use one of three alternative hypotheses. Depending on which `fisher_alternative` is used, different kinds of associations are searched for:
- `greater`: positive associations (presence of feature A is associated with presence of feature B)
- `less`: negative associations (presence of feature A is associated with absence of feature B)
- `two-sided`: both kinds of associations
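A minimal, self-contained sketch of how one pair could be tested (the sample sets below are hypothetical; the workflow's own implementation is the `associations.py` snippet under Code Snippets):

```python
from scipy.stats import fisher_exact

n_samples = 200                  # total number of samples (hypothetical)
samples_mb = set(range(0, 60))   # samples where the metabolomic feature is present
samples_mg = set(range(40, 90))  # samples where the metagenomic feature is present

shared = len(samples_mb & samples_mg)              # both features present
mg_only = len(samples_mg) - shared                 # metagenomic feature only
mb_only = len(samples_mb) - shared                 # metabolomic feature only
neither = n_samples - shared - mg_only - mb_only   # both features absent

# 'greater' searches for positive associations, 'less' for negative ones,
# 'two-sided' for both (the fisher_alternative config entry)
_, p_value = fisher_exact([[shared, mg_only], [mb_only, neither]],
                          alternative='greater')
print(p_value)
```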
Only the pairs with a p-value under the given `threshold` are reported. If `associations_parts` equals 1, all input pairs are processed as one job; otherwise the input is split into the given number of parts of almost equal size, each part is processed as a separate job, and the results are merged afterwards. As a result of this step the following files are created in the `output` directory:
- `associations.tsv`: the reported pairs of associated features
Decoy associations search
The procedure described above is applied to the decoy database of metabolomic features to obtain associations that are known to be false positives. The same `fisher_alternative`, `threshold` and `associations_parts` are used. As a result of this step the following files are created in the `output` directory:
- `associations_decoy.tsv`: the reported pairs of associated features (using the decoy metabolomics feature list)
False Discovery Rate estimation
After the associations search we assess the quality of the results obtained for a particular p-value threshold by estimating the FDR using two methods (a small numerical sketch follows the output description below):
- Target-Decoy Approach: `FDR = associations_decoy / associations_real`, where `associations_decoy` and `associations_real` are the numbers of associations of metagenomic features with decoy and real metabolomics features, respectively.
- Benjamini-Hochberg method: the upper-bound estimate of the FDR is calculated as the lowest level at which all the reported associations are considered significant when applying the BH step-up procedure.
The FDR is estimated for a list of `fdr_thresholds`. Every value in `fdr_thresholds` should be greater than or equal to the previously used `threshold`. As a result of this step the following files are created in the `output` directory:
- `fdr.tsv`: a table of estimated FDRs for the given p-value thresholds
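A small numerical sketch of the two estimates for a single threshold (all counts are hypothetical; the workflow's own implementation is the `fdr.py` snippet under Code Snippets):

```python
# hypothetical counts for one p-value threshold
n_real_assn = 120       # associations with real metabolomics features
n_decoy_assn = 6        # associations with decoy metabolomics features
n_total_pairs = 50000   # number of (metagenomic, metabolomic) pairs tested
max_p = 1e-4            # largest p-value among the reported real associations

fdr_tda = n_decoy_assn / n_real_assn          # Target-Decoy Approach
fdr_bh = n_total_pairs * max_p / n_real_assn  # Benjamini-Hochberg upper bound
print(fdr_tda, fdr_bh)  # 0.05 0.0416...
```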
Data formats
Samples
The `samples.tsv` file is a TAB-separated table of sample data; the first line is the header and every subsequent line describes a sample. The table has the following columns:
- `name`: name of the sample
Other columns are allowed and will be ignored by the workflow.
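For example, a hypothetical `samples.tsv` (fields are TAB-separated; the `subject` column stands for an extra column that the workflow ignores):

```
name	subject
Sample_001	A
Sample_002	A
Sample_003	B
```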
Features
Every feature file has the extension `.feat.<type>`, where `<type>` denotes the data representation type used in the file and can be one of the following:
- `bin_dense`: dense representation of which samples each feature is present in
- `bin_sparse`: sparse representation of which samples each feature is present in
- `real`: real-valued presence data of the features in the samples, e.g. abundance or intensity
`*.feat.bin_sparse` files have the following structure:
- the first line consists of the strings `name`, `annotation` and `samples`, separated by TAB characters
- every subsequent line describes one feature and consists of the name and the annotation of the feature followed by the list of samples this feature is present in, all fields separated by TAB characters
- each feature is required to be present in at least one sample
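For example, a hypothetical `mgx.feat.bin_sparse` with two features (fields TAB-separated; the annotation values follow the format described further below):

```
name	annotation	samples
OTU_17	MGX/Bacteroides	Sample_001	Sample_003
OTU_42	MGX	Sample_002
```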
`*.feat.bin_dense` files have the following structure:
- the first line consists of the strings `name` and `annotation` followed by the list of all the samples, all fields separated by TAB characters (`'\t'`)
- every subsequent line describes one feature and consists of the name and the annotation of the feature followed by a list of `'0'` and `'1'` characters, marking the absence or presence of the feature in the corresponding sample, all fields separated by TAB characters (`'\t'`)
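The same two hypothetical features in `mgx.feat.bin_dense` form, with one column per sample:

```
name	annotation	Sample_001	Sample_002	Sample_003
OTU_17	MGX/Bacteroides	1	0	1
OTU_42	MGX	0	1	0
```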
`*.feat.real` files have the same structure as `*.feat.bin_dense` files, but contain real-valued data instead of binary `'0'`/`'1'`.
`annotation` fields are either `<type>` or `<type>/<custom_annotation>`, where `<custom_annotation>`, if available, is any relevant annotation of the feature, e.g. the OTU name or the result of a molecular database dereplication, and `<type>` is one of the following:
- `MBX_DECOY`: for metabolomic decoy features
- `MBX`: for metabolomic features
- `MGX`: for metagenomic features
Associations
`associations.tsv` and the other associations files are TAB-separated tables; the first line is the header and every subsequent line describes a reported association pair. The table has the following columns:
- `feature1`: name of the first feature of the pair
- `feature2`: name of the second feature of the pair
- `annotation1`: annotation of the first feature of the pair
- `annotation2`: annotation of the second feature of the pair
- `p-value`: p-value of the association
- `samples1`: number of samples the first feature is present in
- `samples2`: number of samples the second feature is present in
- `samples_shared`: number of samples both features are present in
Other columns are allowed and will be ignored by the workflow.
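For example, a hypothetical `associations.tsv` with a single reported pair (fields TAB-separated):

```
feature1	feature2	annotation1	annotation2	p-value	samples1	samples2	samples_shared
metabolite_305	OTU_17	MBX/surfactin	MGX/Bacillus	3.2e-06	48	61	40
```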
FDR list
The `fdr.tsv` file is a TAB-separated table; the first line is the header and every subsequent line describes the FDR calculated for a p-value threshold from `fdr_thresholds`. The table has the following columns:
- `FDR`: the FDR calculated using the Target-Decoy Approach
- `FDR_BH`: the FDR calculated using the Benjamini-Hochberg method
- `threshold`: the p-value threshold for which the FDR is calculated
- `n_decoy_assn`: the number of associations with decoy metabolomics features having a p-value under the threshold
- `n_real_assn`: the number of associations with real metabolomics features having a p-value under the threshold
Other columns are allowed and will be ignored by the workflow.
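For example, a hypothetical `fdr.tsv` for two thresholds (fields TAB-separated):

```
FDR	FDR_BH	threshold	n_decoy_assn	n_real_assn
0.05	0.042	0.0001	6	120
0.125	0.09	0.001	30	240
```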
Configuration
`config.yaml` is the configuration file, written in the YAML format. The following entries should be specified:
- `fisher_alternative`: the alternative hypothesis used in Fisher's exact test
- `threshold`: the p-value threshold for an association pair to be reported
- `fdr_thresholds`: the list of p-value thresholds for which the FDR values should be calculated
- `associations_parts`: the number of jobs the associations step is split into, allowing for parallel execution
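A minimal `config.yaml` sketch with the entries listed above (the values are only illustrative and should be adjusted to your data; the optional `seed` entry is inferred from the `decoy_database.py` snippet below):

```yaml
fisher_alternative: greater          # 'greater', 'less' or 'two-sided'
threshold: 0.0001                    # p-value cutoff for reporting a pair
fdr_thresholds: ['0.0001', '0.001']  # each value must be >= threshold
associations_parts: 16               # number of parallel association jobs
# seed: 42                           # optional, fixes the decoy permutations
```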
Running the workflow
To run the whole workflow perform the following steps:
- Perform the data preprocessing step using any method you choose.
- Run `snakemake`.

Individual steps can also be executed; for more information visit the Snakemake website.
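For a full run, assuming Snakemake is installed and `config.yaml` is set up, the invocation could look like this (the `--cores` value is only an illustration):

```
snakemake --cores 8
```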
Authors
Egor Shcherbin, Liu Cao, Hosein Mohimani
Mohimani Lab, CMU, 2019
Code Snippets
```python
import csv
import logging
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact
from timeit import default_timer as timer

N_DEBUGSTEPS = 50000

logging.basicConfig(filename=snakemake.log[0],
                    format='%(asctime)s %(message)s',
                    datefmt='%d.%m.%Y %H:%M:%S | ',
                    level=logging.DEBUG)
logging.info('Started')

threshold = np.float64(snakemake.config['threshold'])
fisher_alternative = snakemake.config['fisher_alternative']

samples = pd.read_table(snakemake.input.samples_fn, index_col=False, dtype=str)

with open(snakemake.input.mgx_fn, newline='') as mgx_file:
    mgx_reader = csv.reader(mgx_file, delimiter='\t')
    next(mgx_reader)  # skip the header
    mgx = [(row[0], row[1], set(row[2:])) for row in mgx_reader]

start_time = timer()
with open(snakemake.output.assn_fn, 'w', newline='') as assn_file:
    assn_writer = csv.writer(assn_file, delimiter='\t')
    assn_writer.writerow([
        'feature1', 'feature2',
        'annotation1', 'annotation2',
        'p-value',
        'samples1', 'samples2', 'samples_shared'])

    with open(snakemake.input.mbx_fn, newline='') as mbx_file:
        mbx_reader = csv.reader(mbx_file, delimiter='\t')
        next(mbx_reader)  # skip the header

        steps = 0
        pairs = 0
        for row in mbx_reader:
            mb = row[0], row[1], set(row[2:])
            for mg in mgx:
                a = len(set.intersection(mb[2], mg[2]))
                b = len(mg[2]) - a
                c = len(mb[2]) - a
                d = len(samples) - a - b - c
                _, p = fisher_exact([[a, b], [c, d]],
                                    alternative=fisher_alternative)
                if p < threshold:
                    assn_writer.writerow([
                        mb[0], mg[0],
                        mb[1], mg[1],
                        str(p),
                        str(len(mb[2])), str(len(mg[2])), str(a)])
                    pairs += 1
                steps += 1
                if steps % N_DEBUGSTEPS == 0:
                    logging.debug(('{} p-values calculated, '
                                   + '{} associations reported, '
                                   + '{:.3f} seconds elapsed').format(
                                       steps, pairs, timer() - start_time))

logging.info(('Finished calculating {} p-values '
              + '({} reported) in {:.3f} seconds').format(
                  steps, pairs, timer() - start_time))
```
```python
import csv
import pandas as pd
import numpy as np


# algorithm by Floyd, see doi:10.1145/30401.315746
def random_subset(n, k):
    s = set()
    for i in range(n - k, n):
        a = np.random.randint(i + 1)
        s.add(a if a not in s else i)
    return np.fromiter(s, dtype=np.int)


if 'seed' in snakemake.config:
    np.random.seed(snakemake.config['seed'])

samples = pd.read_table(snakemake.input.samples_fn,
                        index_col=False, dtype=str).name.values

with open(snakemake.input.mbx_fn, newline='') as mbx_file, \
        open(snakemake.output.mbx_decoy_fn, 'w', newline='') as mbx_decoy_file:
    mbx_reader = csv.reader(mbx_file, delimiter='\t')
    mbx_decoy_writer = csv.writer(mbx_decoy_file, delimiter='\t')
    mbx_decoy_writer.writerow(next(mbx_reader))
    for row in mbx_reader:
        p = random_subset(len(samples), len(row) - 2)
        np.random.shuffle(p)
        mbx_decoy_writer.writerow([
            row[0],
            'MBX_DECOY' + row[1][3:],
            *[samples[i] for i in p]])
```
```python
from collections import defaultdict, Counter
import numpy as np
import csv

# unique_mgx = defaultdict(set)
cnt_real = Counter()
max_p = defaultdict(lambda: 0)
with open(snakemake.input.assn_fn, newline='') as assn_file:
    assn_reader = csv.reader(assn_file, delimiter='\t')
    next(assn_reader)  # skip the header
    for row in assn_reader:
        for threshold in snakemake.config['fdr_thresholds']:
            # threshold is a string here
            if np.float64(row[4]) < np.float64(threshold):
                # mg_feat = row[0] if row[2].startswith('MGX') else row[1]
                # unique_mgx[threshold].add(mg_feat)
                cnt_real[threshold] += 1
                max_p[threshold] = max(max_p[threshold], np.float64(row[4]))

# unique_mgx_decoy = defaultdict(set)
cnt_decoy = Counter()
with open(snakemake.input.assn_decoy_fn, newline='') as assn_decoy_file:
    assn_decoy_reader = csv.reader(assn_decoy_file, delimiter='\t')
    next(assn_decoy_reader)  # skip the header
    for row in assn_decoy_reader:
        for threshold in snakemake.config['fdr_thresholds']:
            # threshold is a string here
            if np.float64(row[4]) < np.float64(threshold):
                # mg_feat = row[0] if row[2].startswith('MGX') else row[1]
                # unique_mgx_decoy[threshold].add(mg_feat)
                cnt_decoy[threshold] += 1

with open(snakemake.input.mbx_fn) as mbx_file, \
        open(snakemake.input.mgx_fn) as mgx_file:
    n_total_pairs = (len(mbx_file.readlines()) - 1) * \
        (len(mgx_file.readlines()) - 1)

with open(snakemake.output.fdr_fn, 'w', newline='') as fdr_file:
    fdr_writer = csv.writer(fdr_file, delimiter='\t')
    fdr_writer.writerow([
        'FDR',
        'FDR_BH',
        'threshold',
        # 'n_decoy_mgx',
        # 'n_real_mgx',
        'n_decoy_assn',
        'n_real_assn'])
    for threshold in snakemake.config['fdr_thresholds']:
        # n_decoy_mgx = len(unique_mgx_decoy[threshold])
        # n_real_mgx = len(unique_mgx[threshold])
        n_decoy_assn = cnt_decoy[threshold]
        n_real_assn = cnt_real[threshold]
        fdr_writer.writerow([
            # str(n_decoy_mgx / n_real_mgx) if n_real_mgx else 'nan',
            str(n_decoy_assn / n_real_assn) if n_real_assn else 'nan',
            str(n_total_pairs / n_real_assn * max_p[threshold])
            if n_real_assn else 'nan',
            threshold,
            # str(n_decoy_mgx),
            # str(n_real_mgx),
            str(n_decoy_assn),
            str(n_real_assn)])
```
```python
import csv
import numpy as np
from itertools import product


def multiples(n):
    for i in range(1, n + 1):
        if n % i == 0:
            yield i, n // i


def read_features(feat_fn):
    with open(feat_fn, newline='') as feat_file:
        feat_reader = csv.reader(feat_file, delimiter='\t')
        header = next(feat_reader)
        features = list(feat_reader)
    return features, header


mbx, _ = read_features(snakemake.input.mbx_fn)
mbx_decoy, _ = read_features(snakemake.input.mbx_decoy_fn)
mgx, header = read_features(snakemake.input.mgx_fn)

n = snakemake.config['associations_parts']
min_part_size = n + 1  # find the minimal part size to make tmp files small
for a, b in multiples(n):
    if a <= len(mbx) and b <= len(mgx) and n / min(a, b) < min_part_size:
        mbx_coords = np.linspace(0, len(mbx), num=a + 1, dtype=np.int)
        mgx_coords = np.linspace(0, len(mgx), num=b + 1, dtype=np.int)
        min_part_size = n / min(a, b)
        break
else:
    mbx_coords = np.linspace(0, len(mbx), num=n + 1, dtype=np.int)
    mgx_coords = np.linspace(0, len(mgx), num=2, dtype=np.int)

assert (len(mbx_coords) - 1) * (len(mgx_coords) - 1) == n

for (i, j), mbx_part, mbx_decoy_part, mgx_part in zip(
        product(range(len(mbx_coords) - 1), range(len(mgx_coords) - 1)),
        snakemake.output.mbx_parts,
        snakemake.output.mbx_decoy_parts,
        snakemake.output.mgx_parts):
    with open(mgx_part, 'w', newline='') as mgx_file, \
            open(mbx_part, 'w', newline='') as mbx_file, \
            open(mbx_decoy_part, 'w', newline='') as mbx_decoy_file:
        mbx_writer = csv.writer(mbx_file, delimiter='\t')
        mbx_decoy_writer = csv.writer(mbx_decoy_file, delimiter='\t')
        mgx_writer = csv.writer(mgx_file, delimiter='\t')
        mbx_writer.writerow(header)
        mbx_decoy_writer.writerow(header)
        mgx_writer.writerow(header)
        for k in range(mbx_coords[i], mbx_coords[i + 1]):
            mbx_writer.writerow(mbx[k])
            mbx_decoy_writer.writerow(mbx_decoy[k])
        for k in range(mgx_coords[j], mgx_coords[j + 1]):
            mgx_writer.writerow(mgx[k])
```
```python
import numpy as np
import pandas as pd
from skbio import TreeNode

p_threshold = np.float64(snakemake.config['tree_biclustering']['p_threshold'])
clade_threshold = \
    np.float64(snakemake.config['tree_biclustering']['clade_threshold'])

assn = pd.read_table(snakemake.input.assn_fn,
                     index_col=False, header=0, dtype=str)
assn['p-value'] = assn['p-value'].astype(np.float64)
assn = assn[assn['p-value'] < p_threshold]
assn.reset_index(drop=True, inplace=True)
mbx = assn[['feature1', 'annotation1']].drop_duplicates()
mgx = assn[['feature2', 'annotation2']].drop_duplicates()

tree = TreeNode.read(snakemake.input.tree_fn)
tree = tree.shear(mgx['feature2'].values)
tree.assign_ids()
tree_size = tree.count()

clade_size = np.zeros(tree_size)
for tip in tree.tips(include_self=True):
    clade_size[tip.id] += 1
    for anc in tip.ancestors():
        clade_size[anc.id] += 1

biclust = [set() for _ in range(tree_size)]
for mb in mbx['feature1'].values:
    cnt_assn = np.zeros(tree_size)
    for mg in assn['feature2'][assn['feature1'] == mb].values:
        node = tree.find(mg)
        cnt_assn[node.id] += 1
        for anc in node.ancestors():
            cnt_assn[anc.id] += 1
    taken = set()
    for node in tree.preorder(include_self=True):
        if taken.intersection(set(anc.id for anc in node.ancestors())):
            continue
        if node.is_tip() and cnt_assn[node.id] or \
                cnt_assn[node.id] >= clade_threshold * clade_size[node.id]:
            biclust[node.id].add(mb)
            taken.add(node.id)

with open(snakemake.output.biclust_fn, 'w') as biclust_file:
    for node in tree.traverse(include_self=True):
        biclust_mbx = biclust[node.id]
        if not biclust_mbx:
            continue
        biclust_mgx = set(tip.name for tip in node.tips(include_self=True))
        name = 'Bicluster_{}_{}_{}'.format(node.id,
                                           len(biclust_mgx), len(biclust_mbx))
        biclust_file.write(name + '\n')
        for mg in mgx[mgx['feature2'].isin(biclust_mgx)].values:
            biclust_file.write('\t'.join(mg) + '\n')
        for mb in mbx[mbx['feature1'].isin(biclust_mbx)].values:
            biclust_file.write('\t'.join(mb) + '\n')
        biclust_file.write('\n')

assn_biclust = np.zeros(assn.shape[0])
for i, row in assn.iterrows():
    mb, mg = row['feature1'], row['feature2']
    node = tree.find(mg)
    while mb not in biclust[node.id]:
        node = node.parent
    assn_biclust[i] = node.id
assn.assign(bicluster_id=assn_biclust).to_csv(
    snakemake.output.assn_biclust_fn, sep='\t', index=False)

for node in tree.non_tips(include_self=True):
    node.name = str(node.id) + ' (' + str(len(biclust[node.id])) + ')'
for node in tree.tips():
    node.name = node.name + ' (' + str(len(biclust[node.id])) + ')'
tree.write(snakemake.output.tree_biclust_fn)
```
```python
script: 'scripts/decoy_database.py'
```

```python
script: 'scripts/associations.py'
```

```python
script: 'scripts/associations.py'
```

```python
script: 'scripts/split_input.py'
```

```python
script: 'scripts/associations.py'
```

```python
shell:
    '''
    tail_start=1
    for part in {input.assn_parts}
    do
        tail -n+$tail_start $part >>{output.assn_fn}
        tail_start=2
    done
    '''
```

```python
script: 'scripts/associations.py'
```

```python
shell:
    '''
    tail_start=1
    for part in {input.assn_parts}
    do
        tail -n+$tail_start $part >>{output.assn_fn}
        tail_start=2
    done
    '''
```

```python
script: 'scripts/fdr.py'
```

```python
script: 'scripts/tree_biclustering.py'
```