Short description
This workflow is designed to study large arrays of multiomics data by analyzing patterns of co-occurrence of pairs of metagenomics and metabolomics features across a large number of samples, in order to find statistically significant functional associations, e.g. a bacterium producing an antibiotic. The workflow is also meant to assess various metabolomics feature extraction tools and/or correlation methods by generating a decoy database and calculating the False Discovery Rate (FDR). The workflow is built and run with Snakemake, which makes it easy to change and expand the pipeline when necessary.
Directory structure
- `input`: preprocessed initial data
- `log`: workflow step logs
- `output`: results
- `preprocessing`: subworkflow for converting the raw data to the unified format
- `scripts`: workflow scripts
- `tmp`: temporary workflow files
Main steps
Data preprocessing
Feature data comes in a variety of formats, depending on the source and the methods used to obtain it, so a preprocessing step is needed to convert this data to a unified format (described below). There are no restrictions on how this step is performed as long as it follows the output requirements; however, the Snakemake-based subworkflow located in the `preprocessing` directory can be used (it needs to be implemented explicitly).
As a result of this step the following files should be created in the `input` directory:
- `mbx.feat.<type>`: metabolomics features
- `mgx.feat.<type>`: metagenomics features
- `samples.tsv`: sample names and additional information
Decoy database creation
To be able to evaluate the quality of the feature extraction or association methods that are used, we generate a decoy database of metabolomic features by applying a random permutation to the sample set of every feature. This decoy database is later used to estimate the FDR; a minimal sketch of the permutation idea follows the file list below.
As a result of this step the following files are created in the `input` directory:
- `mbx_decoy.feat.<type>`: decoy metabolomics features
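The sketch below illustrates the permutation idea under the assumption that a feature is stored as the set of samples it occurs in; it is not the workflow's `scripts/decoy_database.py` (the actual code is listed under Code Snippets), and `make_decoy_feature` is a hypothetical helper:

```python
import numpy as np

def make_decoy_feature(feature_samples, all_samples, rng):
    """Replace a feature's sample set by a random subset of the same size,
    preserving the feature's prevalence but destroying any real association."""
    k = len(feature_samples)
    return set(rng.choice(all_samples, size=k, replace=False))

# hypothetical data
rng = np.random.default_rng(42)
all_samples = [f"Sample_{i:03d}" for i in range(1, 101)]
real_feature = {"Sample_001", "Sample_002", "Sample_017"}
decoy_feature = make_decoy_feature(real_feature, all_samples, rng)
print(len(decoy_feature) == len(real_feature))  # True
```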
Pairwise associations search
For every pair of a metagenomic and a metabolomic feature we compare their abundance profiles to find associated features. For now, Fisher's exact test (`scipy.stats.fisher_exact`) is applied to binary abundance data (present or not present) and its p-value is used to measure the statistical significance of the association; other association measures, e.g. Pearson or Spearman correlation, may be added in the future. A minimal example of the test is sketched after the list of alternatives below.
Fisher's exact test can use one of three alternative hypotheses. Depending on which `fisher_alternative` is used, different kinds of associations are searched for:
- `greater`: positive associations (presence of feature A is associated with presence of feature B)
- `less`: negative associations (presence of feature A is associated with absence of feature B)
- `two-sided`: both kinds of associations
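A minimal, self-contained sketch of how one pair could be tested (the sample sets below are hypothetical; the workflow's own implementation is the `associations.py` snippet under Code Snippets):

```python
from scipy.stats import fisher_exact

n_samples = 200                  # total number of samples (hypothetical)
samples_mb = set(range(0, 60))   # samples where the metabolomic feature is present
samples_mg = set(range(40, 90))  # samples where the metagenomic feature is present

shared = len(samples_mb & samples_mg)              # both features present
mg_only = len(samples_mg) - shared                 # metagenomic feature only
mb_only = len(samples_mb) - shared                 # metabolomic feature only
neither = n_samples - shared - mg_only - mb_only   # both features absent

# 'greater' searches for positive associations, 'less' for negative ones,
# 'two-sided' for both (the fisher_alternative config entry)
_, p_value = fisher_exact([[shared, mg_only], [mb_only, neither]],
                          alternative='greater')
print(p_value)
```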
Only the pairs with a p-value under the given `threshold` are reported. If `associations_parts` equals 1, all input pairs are processed as one job; otherwise the input is split into the given number of parts of almost equal size, each part is processed as a separate job, and the results are merged afterwards. As a result of this step the following files are created in the `output` directory:
- `associations.tsv`: the reported pairs of associated features
Decoy associations search
The procedure described above is applied to the decoy database of metabolomic features to obtain associations that are known to be false positives. The same `fisher_alternative`, `threshold` and `associations_parts` are used. As a result of this step the following files are created in the `output` directory:
- `associations_decoy.tsv`: the reported pairs of associated features (using the decoy metabolomics feature list)
False Discovery Rate estimation
After the associations search we assess the quality of the results obtained for a particular p-value threshold by estimating the FDR using two methods (a small numerical sketch follows the output description below):
- Target-Decoy Approach: `FDR = associations_decoy / associations_real`, where `associations_decoy` and `associations_real` are the numbers of associations of metagenomic features with decoy and real metabolomics features, respectively.
- Benjamini-Hochberg method: the upper-bound estimate of the FDR is calculated as the lowest level at which all the reported associations are considered significant when applying the BH step-up procedure.
The FDR is estimated for a list of `fdr_thresholds`. Every value in `fdr_thresholds` should be greater than or equal to the previously used `threshold`. As a result of this step the following files are created in the `output` directory:
- `fdr.tsv`: a table of estimated FDRs for the given p-value thresholds
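A small numerical sketch of the two estimates for a single threshold (all counts are hypothetical; the workflow's own implementation is the `fdr.py` snippet under Code Snippets):

```python
# hypothetical counts for one p-value threshold
n_real_assn = 120       # associations with real metabolomics features
n_decoy_assn = 6        # associations with decoy metabolomics features
n_total_pairs = 50000   # number of (metagenomic, metabolomic) pairs tested
max_p = 1e-4            # largest p-value among the reported real associations

fdr_tda = n_decoy_assn / n_real_assn          # Target-Decoy Approach
fdr_bh = n_total_pairs * max_p / n_real_assn  # Benjamini-Hochberg upper bound
print(fdr_tda, fdr_bh)  # 0.05 0.0416...
```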
Data formats
Samples
The `samples.tsv` file is a TAB-separated table of sample data; the first line is the header and every subsequent line describes a sample. The table has the following columns:
- `name`: name of the sample
Other columns are allowed and will be ignored by the workflow.
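For example, a hypothetical `samples.tsv` (fields are TAB-separated; the `subject` column stands for an extra column that the workflow ignores):

```
name	subject
Sample_001	A
Sample_002	A
Sample_003	B
```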
Features
Every feature file has the extension `.feat.<type>`, where `<type>` denotes the data representation type used in the file and can be one of the following:
- `bin_dense`: dense representation of which samples each feature is present in
- `bin_sparse`: sparse representation of which samples each feature is present in
- `real`: real-valued presence data of the features in the samples, e.g. abundance or intensity
`*.feat.bin_sparse` files have the following structure:
- the first line consists of the strings `name`, `annotation` and `samples`, separated by TAB characters
- every subsequent line describes one feature and consists of the name and the annotation of the feature followed by the list of samples this feature is present in, all fields separated by TAB characters
- each feature is required to be present in at least one sample
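For example, a hypothetical `mgx.feat.bin_sparse` with two features (fields TAB-separated; the annotation values follow the format described further below):

```
name	annotation	samples
OTU_17	MGX/Bacteroides	Sample_001	Sample_003
OTU_42	MGX	Sample_002
```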
`*.feat.bin_dense` files have the following structure:
- the first line consists of the strings `name` and `annotation` followed by the list of all the samples, all fields separated by TAB characters (`'\t'`)
- every subsequent line describes one feature and consists of the name and the annotation of the feature followed by a list of `'0'` and `'1'` characters, marking the absence or presence of the feature in the corresponding sample, all fields separated by TAB characters (`'\t'`)
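The same two hypothetical features in `mgx.feat.bin_dense` form, with one column per sample:

```
name	annotation	Sample_001	Sample_002	Sample_003
OTU_17	MGX/Bacteroides	1	0	1
OTU_42	MGX	0	1	0
```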
`*.feat.real` files have the same structure as `*.feat.bin_dense` files, but contain real-valued data instead of binary `'0'`/`'1'`.
`annotation` fields are either `<type>` or `<type>/<custom_annotation>`, where `<custom_annotation>`, if available, is any relevant annotation of the feature, e.g. the OTU name or the result of a molecular database dereplication, and `<type>` is one of the following:
- `MBX_DECOY`: for metabolomic decoy features
- `MBX`: for metabolomic features
- `MGX`: for metagenomic features
Associations
`associations.tsv` and the other associations files are TAB-separated tables; the first line is the header and every subsequent line describes a reported association pair. The table has the following columns:
- `feature1`: name of the first feature of the pair
- `feature2`: name of the second feature of the pair
- `annotation1`: annotation of the first feature of the pair
- `annotation2`: annotation of the second feature of the pair
- `p-value`: p-value of the association
- `samples1`: number of samples the first feature is present in
- `samples2`: number of samples the second feature is present in
- `samples_shared`: number of samples both features are present in
Other columns are allowed and will be ignored by the workflow.
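For example, a hypothetical `associations.tsv` with a single reported pair (fields TAB-separated):

```
feature1	feature2	annotation1	annotation2	p-value	samples1	samples2	samples_shared
metabolite_305	OTU_17	MBX/surfactin	MGX/Bacillus	3.2e-06	48	61	40
```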
FDR list
The `fdr.tsv` file is a TAB-separated table; the first line is the header and every subsequent line describes the FDR calculated for a p-value threshold from `fdr_thresholds`. The table has the following columns:
- `FDR`: the FDR calculated using the Target-Decoy Approach
- `FDR_BH`: the FDR calculated using the Benjamini-Hochberg method
- `threshold`: the p-value threshold for which the FDR is calculated
- `n_decoy_assn`: the number of associations with decoy metabolomics features having a p-value under the threshold
- `n_real_assn`: the number of associations with real metabolomics features having a p-value under the threshold
Other columns are allowed and will be ignored by the workflow.
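For example, a hypothetical `fdr.tsv` for two thresholds (fields TAB-separated):

```
FDR	FDR_BH	threshold	n_decoy_assn	n_real_assn
0.05	0.042	0.0001	6	120
0.125	0.09	0.001	30	240
```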
Configuration
`config.yaml` is the configuration file, written in the YAML format. The following entries should be specified:
- `fisher_alternative`: the alternative hypothesis used in Fisher's exact test
- `threshold`: the p-value threshold for an association pair to be reported
- `fdr_thresholds`: the list of p-value thresholds for which the FDR values should be calculated
- `associations_parts`: the number of jobs the associations step is split into, allowing for parallel execution
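A minimal `config.yaml` sketch with the entries listed above (the values are only illustrative and should be adjusted to your data; the optional `seed` entry is inferred from the `decoy_database.py` snippet below):

```yaml
fisher_alternative: greater          # 'greater', 'less' or 'two-sided'
threshold: 0.0001                    # p-value cutoff for reporting a pair
fdr_thresholds: ['0.0001', '0.001']  # each value must be >= threshold
associations_parts: 16               # number of parallel association jobs
# seed: 42                           # optional, fixes the decoy permutations
```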
Running the workflow
To run the whole workflow perform the following steps:
- Perform the data preprocessing step using any method you choose.
- Run `snakemake`.

Individual steps can also be executed; for more information visit the Snakemake website.
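For a full run, assuming Snakemake is installed and `config.yaml` is set up, the invocation could look like this (the `--cores` value is only an illustration):

```
snakemake --cores 8
```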
Authors
Egor Shcherbin, Liu Cao, Hosein Mohimani
Mohimani Lab, CMU, 2019
Code Snippets
```python
import csv
import logging
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact
from timeit import default_timer as timer

N_DEBUGSTEPS = 50000

logging.basicConfig(filename=snakemake.log[0],
                    format='%(asctime)s %(message)s',
                    datefmt='%d.%m.%Y %H:%M:%S | ',
                    level=logging.DEBUG)
logging.info('Started')

threshold = np.float64(snakemake.config['threshold'])
fisher_alternative = snakemake.config['fisher_alternative']

samples = pd.read_table(snakemake.input.samples_fn, index_col=False, dtype=str)

with open(snakemake.input.mgx_fn, newline='') as mgx_file:
    mgx_reader = csv.reader(mgx_file, delimiter='\t')
    next(mgx_reader)  # skip the header
    mgx = [(row[0], row[1], set(row[2:])) for row in mgx_reader]

start_time = timer()
with open(snakemake.output.assn_fn, 'w', newline='') as assn_file:
    assn_writer = csv.writer(assn_file, delimiter='\t')
    assn_writer.writerow([
        'feature1', 'feature2',
        'annotation1', 'annotation2',
        'p-value',
        'samples1', 'samples2', 'samples_shared'])

    with open(snakemake.input.mbx_fn, newline='') as mbx_file:
        mbx_reader = csv.reader(mbx_file, delimiter='\t')
        next(mbx_reader)  # skip the header

        steps = 0
        pairs = 0
        for row in mbx_reader:
            mb = row[0], row[1], set(row[2:])
            for mg in mgx:
                a = len(set.intersection(mb[2], mg[2]))
                b = len(mg[2]) - a
                c = len(mb[2]) - a
                d = len(samples) - a - b - c
                _, p = fisher_exact([[a, b], [c, d]],
                                    alternative=fisher_alternative)
                if p < threshold:
                    assn_writer.writerow([
                        mb[0], mg[0],
                        mb[1], mg[1],
                        str(p),
                        str(len(mb[2])), str(len(mg[2])), str(a)])
                    pairs += 1
                steps += 1
                if steps % N_DEBUGSTEPS == 0:
                    logging.debug(('{} p-values calculated, '
                                   + '{} associations reported, '
                                   + '{:.3f} seconds elapsed').format(
                                       steps, pairs, timer() - start_time))

logging.info(('Finished calculating {} p-values '
              + '({} reported) in {:.3f} seconds').format(
                  steps, pairs, timer() - start_time))
```
```python
import csv
import pandas as pd
import numpy as np


# algorithm by Floyd, see doi:10.1145/30401.315746
def random_subset(n, k):
    s = set()
    for i in range(n - k, n):
        a = np.random.randint(i + 1)
        s.add(a if a not in s else i)
    return np.fromiter(s, dtype=np.int)


if 'seed' in snakemake.config:
    np.random.seed(snakemake.config['seed'])

samples = pd.read_table(snakemake.input.samples_fn,
                        index_col=False, dtype=str).name.values

with open(snakemake.input.mbx_fn, newline='') as mbx_file, \
        open(snakemake.output.mbx_decoy_fn, 'w', newline='') as mbx_decoy_file:
    mbx_reader = csv.reader(mbx_file, delimiter='\t')
    mbx_decoy_writer = csv.writer(mbx_decoy_file, delimiter='\t')
    mbx_decoy_writer.writerow(next(mbx_reader))
    for row in mbx_reader:
        p = random_subset(len(samples), len(row) - 2)
        np.random.shuffle(p)
        mbx_decoy_writer.writerow([
            row[0],
            'MBX_DECOY' + row[1][3:],
            *[samples[i] for i in p]])
```
```python
from collections import defaultdict, Counter
import numpy as np
import csv

# unique_mgx = defaultdict(set)
cnt_real = Counter()
max_p = defaultdict(lambda: 0)
with open(snakemake.input.assn_fn, newline='') as assn_file:
    assn_reader = csv.reader(assn_file, delimiter='\t')
    next(assn_reader)  # skip the header
    for row in assn_reader:
        for threshold in snakemake.config['fdr_thresholds']:
            # threshold is a string here
            if np.float64(row[4]) < np.float64(threshold):
                # mg_feat = row[0] if row[2].startswith('MGX') else row[1]
                # unique_mgx[threshold].add(mg_feat)
                cnt_real[threshold] += 1
                max_p[threshold] = max(max_p[threshold], np.float64(row[4]))

# unique_mgx_decoy = defaultdict(set)
cnt_decoy = Counter()
with open(snakemake.input.assn_decoy_fn, newline='') as assn_decoy_file:
    assn_decoy_reader = csv.reader(assn_decoy_file, delimiter='\t')
    next(assn_decoy_reader)  # skip the header
    for row in assn_decoy_reader:
        for threshold in snakemake.config['fdr_thresholds']:
            # threshold is a string here
            if np.float64(row[4]) < np.float64(threshold):
                # mg_feat = row[0] if row[2].startswith('MGX') else row[1]
                # unique_mgx_decoy[threshold].add(mg_feat)
                cnt_decoy[threshold] += 1

with open(snakemake.input.mbx_fn) as mbx_file, \
        open(snakemake.input.mgx_fn) as mgx_file:
    n_total_pairs = (len(mbx_file.readlines()) - 1) * \
        (len(mgx_file.readlines()) - 1)

with open(snakemake.output.fdr_fn, 'w', newline='') as fdr_file:
    fdr_writer = csv.writer(fdr_file, delimiter='\t')
    fdr_writer.writerow([
        'FDR',
        'FDR_BH',
        'threshold',
        # 'n_decoy_mgx',
        # 'n_real_mgx',
        'n_decoy_assn',
        'n_real_assn'])
    for threshold in snakemake.config['fdr_thresholds']:
        # n_decoy_mgx = len(unique_mgx_decoy[threshold])
        # n_real_mgx = len(unique_mgx[threshold])
        n_decoy_assn = cnt_decoy[threshold]
        n_real_assn = cnt_real[threshold]
        fdr_writer.writerow([
            # str(n_decoy_mgx / n_real_mgx) if n_real_mgx else 'nan',
            str(n_decoy_assn / n_real_assn) if n_real_assn else 'nan',
            str(n_total_pairs / n_real_assn * max_p[threshold])
            if n_real_assn else 'nan',
            threshold,
            # str(n_decoy_mgx),
            # str(n_real_mgx),
            str(n_decoy_assn),
            str(n_real_assn)])
```
```python
import csv
import numpy as np
from itertools import product


def multiples(n):
    for i in range(1, n + 1):
        if n % i == 0:
            yield i, n // i


def read_features(feat_fn):
    with open(feat_fn, newline='') as feat_file:
        feat_reader = csv.reader(feat_file, delimiter='\t')
        header = next(feat_reader)
        features = list(feat_reader)
    return features, header


mbx, _ = read_features(snakemake.input.mbx_fn)
mbx_decoy, _ = read_features(snakemake.input.mbx_decoy_fn)
mgx, header = read_features(snakemake.input.mgx_fn)

n = snakemake.config['associations_parts']
min_part_size = n + 1  # find the minimal part size to make tmp files small
for a, b in multiples(n):
    if a <= len(mbx) and b <= len(mgx) and n / min(a, b) < min_part_size:
        mbx_coords = np.linspace(0, len(mbx), num=a + 1, dtype=np.int)
        mgx_coords = np.linspace(0, len(mgx), num=b + 1, dtype=np.int)
        min_part_size = n / min(a, b)
        break
else:
    mbx_coords = np.linspace(0, len(mbx), num=n + 1, dtype=np.int)
    mgx_coords = np.linspace(0, len(mgx), num=2, dtype=np.int)

assert (len(mbx_coords) - 1) * (len(mgx_coords) - 1) == n

for (i, j), mbx_part, mbx_decoy_part, mgx_part in zip(
        product(range(len(mbx_coords) - 1), range(len(mgx_coords) - 1)),
        snakemake.output.mbx_parts,
        snakemake.output.mbx_decoy_parts,
        snakemake.output.mgx_parts):
    with open(mgx_part, 'w', newline='') as mgx_file, \
            open(mbx_part, 'w', newline='') as mbx_file, \
            open(mbx_decoy_part, 'w', newline='') as mbx_decoy_file:
        mbx_writer = csv.writer(mbx_file, delimiter='\t')
        mbx_decoy_writer = csv.writer(mbx_decoy_file, delimiter='\t')
        mgx_writer = csv.writer(mgx_file, delimiter='\t')
        mbx_writer.writerow(header)
        mbx_decoy_writer.writerow(header)
        mgx_writer.writerow(header)
        for k in range(mbx_coords[i], mbx_coords[i + 1]):
            mbx_writer.writerow(mbx[k])
            mbx_decoy_writer.writerow(mbx_decoy[k])
        for k in range(mgx_coords[j], mgx_coords[j + 1]):
            mgx_writer.writerow(mgx[k])
```
```python
import numpy as np
import pandas as pd
from skbio import TreeNode

p_threshold = np.float64(snakemake.config['tree_biclustering']['p_threshold'])
clade_threshold = \
    np.float64(snakemake.config['tree_biclustering']['clade_threshold'])

assn = pd.read_table(snakemake.input.assn_fn,
                     index_col=False, header=0, dtype=str)
assn['p-value'] = assn['p-value'].astype(np.float64)
assn = assn[assn['p-value'] < p_threshold]
assn.reset_index(drop=True, inplace=True)
mbx = assn[['feature1', 'annotation1']].drop_duplicates()
mgx = assn[['feature2', 'annotation2']].drop_duplicates()

tree = TreeNode.read(snakemake.input.tree_fn)
tree = tree.shear(mgx['feature2'].values)
tree.assign_ids()
tree_size = tree.count()

clade_size = np.zeros(tree_size)
for tip in tree.tips(include_self=True):
    clade_size[tip.id] += 1
    for anc in tip.ancestors():
        clade_size[anc.id] += 1

biclust = [set() for _ in range(tree_size)]
for mb in mbx['feature1'].values:
    cnt_assn = np.zeros(tree_size)
    for mg in assn['feature2'][assn['feature1'] == mb].values:
        node = tree.find(mg)
        cnt_assn[node.id] += 1
        for anc in node.ancestors():
            cnt_assn[anc.id] += 1
    taken = set()
    for node in tree.preorder(include_self=True):
        if taken.intersection(set(anc.id for anc in node.ancestors())):
            continue
        if node.is_tip() and cnt_assn[node.id] or \
                cnt_assn[node.id] >= clade_threshold * clade_size[node.id]:
            biclust[node.id].add(mb)
            taken.add(node.id)

with open(snakemake.output.biclust_fn, 'w') as biclust_file:
    for node in tree.traverse(include_self=True):
        biclust_mbx = biclust[node.id]
        if not biclust_mbx:
            continue
        biclust_mgx = set(tip.name for tip in node.tips(include_self=True))
        name = 'Bicluster_{}_{}_{}'.format(node.id,
                                           len(biclust_mgx), len(biclust_mbx))
        biclust_file.write(name + '\n')
        for mg in mgx[mgx['feature2'].isin(biclust_mgx)].values:
            biclust_file.write('\t'.join(mg) + '\n')
        for mb in mbx[mbx['feature1'].isin(biclust_mbx)].values:
            biclust_file.write('\t'.join(mb) + '\n')
        biclust_file.write('\n')

assn_biclust = np.zeros(assn.shape[0])
for i, row in assn.iterrows():
    mb, mg = row['feature1'], row['feature2']
    node = tree.find(mg)
    while mb not in biclust[node.id]:
        node = node.parent
    assn_biclust[i] = node.id
assn.assign(bicluster_id=assn_biclust).to_csv(
    snakemake.output.assn_biclust_fn, sep='\t', index=False)

for node in tree.non_tips(include_self=True):
    node.name = str(node.id) + ' (' + str(len(biclust[node.id])) + ')'
for node in tree.tips():
    node.name = node.name + ' (' + str(len(biclust[node.id])) + ')'
tree.write(snakemake.output.tree_biclust_fn)
```
```python
script: 'scripts/decoy_database.py'
```

```python
script: 'scripts/associations.py'
```

```python
script: 'scripts/associations.py'
```

```python
script: 'scripts/split_input.py'
```

```python
script: 'scripts/associations.py'
```

```python
shell:
    '''
    tail_start=1
    for part in {input.assn_parts}
    do
        tail -n+$tail_start $part >>{output.assn_fn}
        tail_start=2
    done
    '''
```

```python
script: 'scripts/associations.py'
```

```python
shell:
    '''
    tail_start=1
    for part in {input.assn_parts}
    do
        tail -n+$tail_start $part >>{output.assn_fn}
        tail_start=2
    done
    '''
```

```python
script: 'scripts/fdr.py'
```

```python
script: 'scripts/tree_biclustering.py'
```