Metagenomic-metabolomic associations workflow


Short description

This workflow is designed to study large arrays of multiomics data by analyzing patterns of co-occurrence between pairs of metagenomic and metabolomic features across a large number of samples, in order to find statistically significant functional associations, e.g. a bacterium producing an antibiotic. The workflow is also meant to assess metabolomics feature extraction tools and/or correlation methods by generating a decoy database and calculating the False Discovery Rate (FDR). The workflow is built and run with Snakemake, which makes it easy to change and extend the pipeline when necessary.

Directory structure

  • input : preprocessed initial data

  • log : workflow steps logs

  • output : results

  • preprocessing : subworkflow for converting the raw data to the unified format

  • scripts : workflow scripts

  • tmp : temporary workflow files

Main steps

Data preprocessing

Feature data comes in a variety of formats, depending on the source and the methods used to obtain it, so a preprocessing step is needed to convert this data to a unified format (described below). There are no restrictions on how this step is performed as long as it follows the output requirements; alternatively, the Snakemake-based subworkflow located in the preprocessing directory can be used (it needs to be implemented explicitly).

As a result of this step, the following files should be created in the input directory:

  • mbx.feat.<type> : metabolomics features

  • mgx.feat.<type> : metagenomics features

  • samples.tsv : samples' names and additional information

Decoy database creation

To be able to evaluate the quality of the feature extraction or association methods that are used, we generate a decoy database of metabolomic features by applying a random permutation to the sample set for every feature. This decoy database is later used to estimate the FDR (a minimal sketch of the permutation idea follows the file list below).

As a result of this step, the following files are created in the input directory:

  • mbx_decoy.feat.<type> : decoy metabolomics features
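
The permutation idea, as a minimal sketch (the feature and sample names are hypothetical; the actual implementation is scripts/decoy_database.py, shown under Code Snippets):

import numpy as np

samples = np.array(['S001', 'S002', 'S003', 'S004', 'S005'])
# hypothetical sparse feature: present in two of the five samples
feature_presence = {'mbx_feat_1': ['S001', 'S003']}

rng = np.random.default_rng(42)
# a decoy feature keeps the same number of samples,
# but the subset of samples it is "present" in is random
decoy = {name: sorted(rng.choice(samples, size=len(present), replace=False))
         for name, present in feature_presence.items()}
print(decoy)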

Pairwise associations search

For every pair of a metagenomic feature and a metabolomic feature, we compare their abundance profiles to find pairs of associated features. Currently, Fisher's exact test ( scipy.stats.fisher_exact ) is applied to binary abundance data (present or absent) and the p-value is used to measure the statistical significance of the association; other association measures, e.g. Pearson or Spearman correlation, may be added in the future.

Fisher's exact test can use one of three alternative hypotheses. Depending on which fisher_alternative is used, different kinds of associations are searched for (see the sketch after this list):

  • greater : positive associations (presence of feature A is associated with presence of feature B) will be found

  • less : negative associations (presence of feature A is associated with absence of feature B) will be found

  • two-sided : both kinds of associations will be found
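
A minimal illustration of how a single pair is tested, following the contingency-table construction in scripts/associations.py (the presence sets are hypothetical):

from scipy.stats import fisher_exact

samples = {'S001', 'S002', 'S003', 'S004', 'S005', 'S006'}
mgx_present = {'S001', 'S002', 'S003'}  # samples containing the MGX feature
mbx_present = {'S001', 'S002', 'S004'}  # samples containing the MBX feature

a = len(mgx_present & mbx_present)  # both features present
b = len(mgx_present) - a            # MGX feature only
c = len(mbx_present) - a            # MBX feature only
d = len(samples) - a - b - c        # neither feature
_, p = fisher_exact([[a, b], [c, d]], alternative='greater')
print(p)  # a small p-value suggests a positive association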

Only the pairs with a p-value under the given threshold are reported. If associations_parts equals 1, all input pairs are processed as one job; otherwise the input is split into the given number of parts of almost equal size and each part is processed as a separate job, with all results merged afterwards (a sketch of the partitioning idea follows the file list below). As a result of this step, the following files are created in the output directory:

  • associations.tsv : the reported pairs of associated features
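
The partitioning can be pictured as cutting the metabolomic and metagenomic feature lists into near-equal contiguous chunks; the full logic, which lays the chunks out as a grid over all pairs, is in scripts/split_input.py under Code Snippets. A minimal sketch:

import numpy as np

def part_boundaries(n_features, n_parts):
    # boundaries of n_parts contiguous chunks of almost equal size
    return np.linspace(0, n_features, num=n_parts + 1, dtype=int)

print(part_boundaries(10, 3))  # [ 0  3  6 10]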

Decoy associations search

The same procedure as described above is applied to the decoy database of metabolomic features to determine associations that are known to be false positives.

The same fisher_alternative , threshold and associations_parts values are used. As a result of this step, the following files are created in the output directory:

  • associations_decoy.tsv : the reported pairs of associated features (using decoy metabolomics feature list)

False Discovery Rate estimation

After the associations search, we assess the quality of the results obtained for a particular p-value threshold by estimating the FDR using two methods (a minimal sketch follows the list):

  1. Target-Decoy Approach: FDR = associations_decoy / associations_real , where associations_decoy and associations_real are the numbers of associations of metagenomic features with decoy and real metabolomic features, respectively.

  2. Benjamini-Hochberg method: the upper-bound estimate on the FDR is calculated as the lowest level at which all the reported associations are considered significant when applying the BH step-up procedure.
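
A minimal sketch of the two estimators, mirroring the arithmetic in scripts/fdr.py (the function names and example numbers are illustrative):

def fdr_tda(n_decoy_assn, n_real_assn):
    # Target-Decoy Approach: decoy hits approximate the false positives
    return n_decoy_assn / n_real_assn if n_real_assn else float('nan')

def fdr_bh(n_total_pairs, n_real_assn, max_reported_p):
    # BH-style upper bound: largest reported p-value times the number of
    # tested pairs, divided by the number of reported associations
    return n_total_pairs * max_reported_p / n_real_assn if n_real_assn else float('nan')

# e.g. 120 decoy and 1500 real associations out of 10^7 tested pairs
print(fdr_tda(120, 1500))              # 0.08
print(fdr_bh(10_000_000, 1500, 1e-5))  # ~0.0667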

FDR is estimated for a list of fdr_thresholds . Every value in fdr_thresholds should be less than or equal to the previously used threshold , since the association tables only contain pairs with p-values under threshold and larger cutoffs would give incomplete counts. As a result of this step, the following files are created in the output directory:

  • fdr.tsv : a table of estimated FDRs for the given p-value thresholds

Data formats

Samples

The samples.tsv file is a TAB-separated table of the sample data, with the first line being the header and every consecutive line describing a sample (see the example after the column list). The table has the following columns:

  • name : name of the sample

Other columns are allowed and will be ignored by the workflow.
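
An example samples.tsv (the extra columns are hypothetical):

name	subject	timepoint
S001	P01	baseline
S002	P01	week4
S003	P02	baseline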

Features

Every feature file has the extension .feat.<type> , where <type> denotes the data representation type in the file and can be one of the following:

  • bin_dense : dense representation of which samples each feature is present in

  • bin_sparse : sparse representation of which samples each feature is present in

  • real : real-valued presence data of the features in the samples, e.g. abundance or intensity

*.feat.bin_sparse files have the following structure (see the example after this list):

  • first line consists of the strings name , annotation and samples separated by TAB characters

  • every consecutive line describes one feature and consists of the name and the annotation of the feature followed by the list of the samples this feature is present in, all fields being separated by TAB characters

  • it is required that each feature is present in at least one sample
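
An example *.feat.bin_sparse file (the feature and sample names are hypothetical):

name	annotation	samples
mgx_feat_1	MGX/OTU_42	S001	S003	S004
mgx_feat_2	MGX	S002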

*.feat.bin_dense files have the following structure (see the example after this list):

  • first line consists of the strings name and annotation followed by the list of all the samples, all fields being separated by TAB characters ( '\t' )

  • every consecutive line describes one feature and consists of the name and the annotation of the feature followed by the list of '0' and '1' characters, meaning the absence and presence of the feature in the corresponding sample respectively, all fields being separated by TAB characters ( '\t' )
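
An example *.feat.bin_dense file (the names are hypothetical):

name	annotation	S001	S002	S003
mbx_feat_1	MBX	1	0	1
mbx_feat_2	MBX/compound_7	0	1	1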

*.feat.real files have the same structure as *.feat.bin_dense files, but contain real-valued data instead of the binary '0' / '1' values.

annotation fields are either <type> or <type>/<custom_annotation> , where <custom_annotation> , if present, is any relevant annotation of the feature, e.g. the OTU name or the result of molecular database dereplication, and <type> is one of the following:

  • MBX_DECOY : for metabolomic decoy features

  • MBX : for metabolomic features

  • MGX : for metagenomic features

Associations

associations.tsv and the other associations files are TAB-separated tables with the first line being the header and every consecutive line describing a reported association pair (see the example after the column list). The table has the following columns:

  • feature1 : name of the first feature of the pair

  • feature2 : name of the second feature of the pair

  • annotation1 : annotation of the first feature of the pair

  • annotation2 : annotation of the second feature of the pair

  • p-value : p-value of the association

  • samples1 : number of the samples the first feature is present in

  • samples2 : number of the samples the second feature is present in

  • samples_shared : number of the samples both features are present in

Other columns are allowed and will be ignored by the workflow.
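
An example row (values are hypothetical; in the current implementation feature1 is the metabolomic feature and feature2 the metagenomic one):

feature1	feature2	annotation1	annotation2	p-value	samples1	samples2	samples_shared
mbx_feat_1	mgx_feat_1	MBX	MGX/OTU_42	3.2e-08	41	56	37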

FDR list

The fdr.tsv file is a TAB-separated table with the first line being the header and every consecutive line giving the FDR calculated for one p-value threshold from fdr_thresholds (see the example after the column list). The table has the following columns:

  • FDR : the FDR calculated using Target-Decoy Approach

  • FDR_BH : the FDR calculated using Benjamini-Hochberg method

  • threshold : the p-value threshold for which the FDR is calculated

  • n_decoy_assn : the number of associations with decoy metabolomics features having the p-value under the threshold

  • n_real_assn : the number of associations with real metabolomics features having the p-value under the threshold

Other columns are allowed and will be ignored by the workflow.
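
An example fdr.tsv (values are hypothetical):

FDR	FDR_BH	threshold	n_decoy_assn	n_real_assn
0.08	0.0667	1e-05	120	1500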

Configuration

config.yaml is the configuration file, written in the YAML format. The following entries should be specified (see the example after the list):

  • fisher_alternative : the alternative hypothesis used in Fisher's exact test

  • threshold : the p-value threshold for the association pair to be reported

  • fdr_thresholds : the list of p-value thresholds for which the FDR values should be calculated

  • associations_parts : the number of jobs the associations step is split into, allowing for parallel execution
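
A minimal example config.yaml (values are illustrative; note that thresholds written without a decimal point, e.g. 1e-05 , are typically read as strings by PyYAML-based loaders, which matches the "threshold is a string here" comments in scripts/fdr.py):

fisher_alternative: greater
threshold: 1e-05
fdr_thresholds: [1e-07, 1e-06, 1e-05]
associations_parts: 16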

Running the workflow

To run the whole workflow, perform the following steps:

  1. Prepare the input data by performing the preprocessing step using any method you choose.

  2. Run snakemake .
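
For example (the core count is illustrative):

snakemake --cores 8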

Individual steps can also be executed. For more information, visit the Snakemake website.

Authors

Egor Shcherbin, Liu Cao, Hosein Mohimani

Mohimani Lab, CMU, 2019

Code Snippets

scripts/associations.py:
import csv
import logging
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact
from timeit import default_timer as timer


N_DEBUGSTEPS = 50000


logging.basicConfig(filename=snakemake.log[0],
                    format='%(asctime)s %(message)s',
                    datefmt='%d.%m.%Y %H:%M:%S | ',
                    level=logging.DEBUG)
logging.info('Started')

threshold = np.float64(snakemake.config['threshold'])
fisher_alternative = snakemake.config['fisher_alternative']
samples = pd.read_table(snakemake.input.samples_fn, index_col=False, dtype=str)
with open(snakemake.input.mgx_fn, newline='') as mgx_file:
    mgx_reader = csv.reader(mgx_file, delimiter='\t')
    next(mgx_reader)  # skip the header
    mgx = [(row[0], row[1], set(row[2:]))
           for row in mgx_reader]


start_time = timer()
with open(snakemake.output.assn_fn, 'w', newline='') as assn_file:
    assn_writer = csv.writer(assn_file, delimiter='\t')
    assn_writer.writerow([
        'feature1',
        'feature2',
        'annotation1',
        'annotation2',
        'p-value',
        'samples1',
        'samples2',
        'samples_shared'])
    with open(snakemake.input.mbx_fn, newline='') as mbx_file:
        mbx_reader = csv.reader(mbx_file, delimiter='\t')
        next(mbx_reader)  # skip the header
        steps = 0
        pairs = 0
        for row in mbx_reader:
            mb = row[0], row[1], set(row[2:])
            for mg in mgx:
                # 2x2 contingency table for Fisher's exact test
                a = len(set.intersection(mb[2], mg[2]))  # both features present
                b = len(mg[2]) - a                       # MGX feature only
                c = len(mb[2]) - a                       # MBX feature only
                d = len(samples) - a - b - c             # neither feature
                _, p = fisher_exact([[a, b], [c, d]],
                                    alternative=fisher_alternative)
                if p < threshold:
                    assn_writer.writerow([
                        mb[0],
                        mg[0],
                        mb[1],
                        mg[1],
                        str(p),
                        str(len(mb[2])),
                        str(len(mg[2])),
                        str(a)])
                    pairs += 1

                steps += 1
                if steps % N_DEBUGSTEPS == 0:
                    logging.debug(('{} p-values calculated, ' +
                                   '{} associations reported, ' +
                                   '{:.3f} seconds elapsed').format(
                                      steps,
                                      pairs,
                                      timer() - start_time))

logging.info(('Finished calculating {} p-values ' +
              '({} reported) in {:.3f} seconds').format(
    steps,
    pairs,
    timer() - start_time))

scripts/decoy_database.py:
import csv
import pandas as pd
import numpy as np


# algorithm by Floyd, see doi:10.1145/30401.315746
def random_subset(n, k):
    s = set()
    for i in range(n - k, n):
        a = np.random.randint(i + 1)
        s.add(a if a not in s else i)
    return np.fromiter(s, dtype=int)  # np.int was removed from NumPy; use the builtin


if 'seed' in snakemake.config:
    np.random.seed(snakemake.config['seed'])

samples = pd.read_table(snakemake.input.samples_fn,
                        index_col=False,
                        dtype=str).name.values

with open(snakemake.input.mbx_fn, newline='') as mbx_file, \
        open(snakemake.output.mbx_decoy_fn, 'w', newline='') as mbx_decoy_file:
    mbx_reader = csv.reader(mbx_file, delimiter='\t')
    mbx_decoy_writer = csv.writer(mbx_decoy_file, delimiter='\t')
    mbx_decoy_writer.writerow(next(mbx_reader))
    for row in mbx_reader:
        p = random_subset(len(samples), len(row) - 2)
        np.random.shuffle(p)
        mbx_decoy_writer.writerow([
            row[0],
            'MBX_DECOY' + row[1][3:],
            *[samples[i] for i in p]])

scripts/fdr.py:
from collections import defaultdict, Counter
import numpy as np
import csv


# unique_mgx = defaultdict(set)
cnt_real = Counter()
max_p = defaultdict(lambda: 0)
with open(snakemake.input.assn_fn, newline='') as assn_file:
    assn_reader = csv.reader(assn_file, delimiter='\t')
    next(assn_reader)  # skip the header
    for row in assn_reader:
        for threshold in snakemake.config['fdr_thresholds']:
            # threshold is a string here
            if np.float64(row[4]) < np.float64(threshold):
                # mg_feat = row[0] if row[2].startswith('MGX') else row[1]
                # unique_mgx[threshold].add(mg_feat)
                cnt_real[threshold] += 1
                max_p[threshold] = max(max_p[threshold], np.float64(row[4]))

# unique_mgx_decoy = defaultdict(set)
cnt_decoy = Counter()
with open(snakemake.input.assn_decoy_fn, newline='') as assn_decoy_file:
    assn_decoy_reader = csv.reader(assn_decoy_file, delimiter='\t')
    next(assn_decoy_reader)  # skip the header
    for row in assn_decoy_reader:
        for threshold in snakemake.config['fdr_thresholds']:
            # threshold is a string here
            if np.float64(row[4]) < np.float64(threshold):
                # mg_feat = row[0] if row[2].startswith('MGX') else row[1]
                # unique_mgx_decoy[threshold].add(mg_feat)
                cnt_decoy[threshold] += 1

with open(snakemake.input.mbx_fn) as mbx_file, \
         open(snakemake.input.mgx_fn) as mgx_file:
    n_total_pairs = (len(mbx_file.readlines()) - 1) * \
        (len(mgx_file.readlines()) - 1)
with open(snakemake.output.fdr_fn, 'w', newline='') as fdr_file:
    fdr_writer = csv.writer(fdr_file, delimiter='\t')
    fdr_writer.writerow([
        'FDR',
        'FDR_BH',
        'threshold',
        # 'n_decoy_mgx',
        # 'n_real_mgx',
        'n_decoy_assn',
        'n_real_assn'])
    for threshold in snakemake.config['fdr_thresholds']:
        # n_decoy_mgx = len(unique_mgx_decoy[threshold])
        # n_real_mgx = len(unique_mgx[threshold])
        n_decoy_assn = cnt_decoy[threshold]
        n_real_assn = cnt_real[threshold]
        fdr_writer.writerow([
            # str(n_decoy_mgx / n_real_mgx) if n_real_mgx else 'nan',
            str(n_decoy_assn / n_real_assn) if n_real_assn else 'nan',
            str(n_total_pairs / n_real_assn * max_p[threshold])
            if n_real_assn else 'nan',
            threshold,
            # str(n_decoy_mgx),
            # str(n_real_mgx),
            str(n_decoy_assn),
            str(n_real_assn)])

scripts/split_input.py:
import csv
import numpy as np
from itertools import product


def multiples(n):
    for i in range(1, n + 1):
        if n % i == 0:
            yield i, n // i


def read_features(feat_fn):
    with open(feat_fn, newline='') as feat_file:
        feat_reader = csv.reader(feat_file, delimiter='\t')
        header = next(feat_reader)
        features = list(feat_reader)
    return features, header


mbx, _ = read_features(snakemake.input.mbx_fn)
mbx_decoy, _ = read_features(snakemake.input.mbx_decoy_fn)
mgx, header = read_features(snakemake.input.mgx_fn)

n = snakemake.config['associations_parts']
min_part_size = n + 1  # find the minimal part size to make tmp files small
for a, b in multiples(n):
    if a <= len(mbx) and b <= len(mgx) and n / min(a, b) < min_part_size:
        mbx_coords = np.linspace(0, len(mbx), num=a + 1, dtype=int)
        mgx_coords = np.linspace(0, len(mgx), num=b + 1, dtype=int)
        min_part_size = n / min(a, b)
        break
else:
    mbx_coords = np.linspace(0, len(mbx), num=n + 1, dtype=int)
    mgx_coords = np.linspace(0, len(mgx), num=2, dtype=int)
assert (len(mbx_coords) - 1) * (len(mgx_coords) - 1) == n

for (i, j), mbx_part, mbx_decoy_part, mgx_part in zip(
        product(range(len(mbx_coords) - 1),
                range(len(mgx_coords) - 1)),
        snakemake.output.mbx_parts,
        snakemake.output.mbx_decoy_parts,
        snakemake.output.mgx_parts):
    with open(mgx_part, 'w', newline='') as mgx_file, \
         open(mbx_part, 'w', newline='') as mbx_file, \
         open(mbx_decoy_part, 'w', newline='') as mbx_decoy_file:
        mbx_writer = csv.writer(mbx_file, delimiter='\t')
        mbx_decoy_writer = csv.writer(mbx_decoy_file, delimiter='\t')
        mgx_writer = csv.writer(mgx_file, delimiter='\t')
        mbx_writer.writerow(header)
        mbx_decoy_writer.writerow(header)
        mgx_writer.writerow(header)

        for k in range(mbx_coords[i], mbx_coords[i + 1]):
            mbx_writer.writerow(mbx[k])
            mbx_decoy_writer.writerow(mbx_decoy[k])
        for k in range(mgx_coords[j], mgx_coords[j + 1]):
            mgx_writer.writerow(mgx[k])

scripts/tree_biclustering.py:
import numpy as np
import pandas as pd
from skbio import TreeNode

p_threshold = np.float64(snakemake.config['tree_biclustering']['p_threshold'])
clade_threshold = \
    np.float64(snakemake.config['tree_biclustering']['clade_threshold'])

assn = pd.read_table(snakemake.input.assn_fn, index_col=False,
                     header=0,
                     dtype=str)
assn['p-value'] = assn['p-value'].astype(np.float64)
assn = assn[assn['p-value'] < p_threshold]
assn.reset_index(drop=True, inplace=True)
mbx = assn[['feature1', 'annotation1']].drop_duplicates()
mgx = assn[['feature2', 'annotation2']].drop_duplicates()

tree = TreeNode.read(snakemake.input.tree_fn)
tree = tree.shear(mgx['feature2'].values)
tree.assign_ids()
tree_size = tree.count()
clade_size = np.zeros(tree_size)
for tip in tree.tips(include_self=True):
    clade_size[tip.id] += 1
    for anc in tip.ancestors():
        clade_size[anc.id] += 1

biclust = [set() for _ in range(tree_size)]
for mb in mbx['feature1'].values:
    cnt_assn = np.zeros(tree_size)
    for mg in assn['feature2'][assn['feature1'] == mb].values:
        node = tree.find(mg)
        cnt_assn[node.id] += 1
        for anc in node.ancestors():
            cnt_assn[anc.id] += 1
    taken = set()
    for node in tree.preorder(include_self=True):
        if taken.intersection(set(anc.id for anc in node.ancestors())):
            continue
        if node.is_tip() and cnt_assn[node.id] or \
                cnt_assn[node.id] >= clade_threshold * clade_size[node.id]:
            biclust[node.id].add(mb)
            taken.add(node.id)

with open(snakemake.output.biclust_fn, 'w') as biclust_file:
    for node in tree.traverse(include_self=True):
        biclust_mbx = biclust[node.id]
        if not biclust_mbx:
            continue
        biclust_mgx = set(tip.name for tip in node.tips(include_self=True))
        name = 'Bicluster_{}_{}_{}'.format(node.id,
                                           len(biclust_mgx),
                                           len(biclust_mbx))
        biclust_file.write(name + '\n')
        for mg in mgx[mgx['feature2'].isin(biclust_mgx)].values:
            biclust_file.write('\t'.join(mg) + '\n')
        for mb in mbx[mbx['feature1'].isin(biclust_mbx)].values:
            biclust_file.write('\t'.join(mb) + '\n')
        biclust_file.write('\n')

assn_biclust = np.zeros(assn.shape[0])
for i, row in assn.iterrows():
    mb, mg = row['feature1'], row['feature2']
    node = tree.find(mg)
    while mb not in biclust[node.id]:
        node = node.parent
    assn_biclust[i] = node.id
assn.assign(bicluster_id=assn_biclust).to_csv(snakemake.output.assn_biclust_fn,
                                              sep='\t',
                                              index=False)

for node in tree.non_tips(include_self=True):
    node.name = str(node.id) + ' (' + str(len(biclust[node.id])) + ')'
for node in tree.tips():
    node.name = node.name + ' (' + str(len(biclust[node.id])) + ')'
tree.write(snakemake.output.tree_biclust_fn)

Snakefile, lines 16–17:
script:
    'scripts/decoy_database.py'

Snakefile, lines 32–33:
script:
    'scripts/associations.py'

Snakefile, lines 45–46:
script:
    'scripts/associations.py'

Snakefile, lines 61–62:
script:
    'scripts/split_input.py'

Snakefile, lines 74–75:
script:
    'scripts/associations.py'

Snakefile, lines 83–91:
shell:
    '''
    tail_start=1
    for part in {input.assn_parts}
    do
        tail -n+$tail_start $part >>{output.assn_fn}
        tail_start=2
    done
    '''

Snakefile, lines 103–104:
script:
    'scripts/associations.py'

Snakefile, lines 112–120:
shell:
    '''
    tail_start=1
    for part in {input.assn_parts}
    do
        tail -n+$tail_start $part >>{output.assn_fn}
        tail_start=2
    done
    '''

Snakefile, lines 133–134:
script:
    'scripts/fdr.py'

Snakefile, lines 145–146:
script:
    'scripts/tree_biclustering.py'

URL: https://github.com/mohimanilab/AssociationNetworks
Name: associationnetworks
Version: 1
Copyright: Public Domain
License: None