Materials and Code for the micom manuscript.


This repo includes workflows and additional data to reproduce the manuscript:

MICOM: Metagenome-Scale Modeling To Infer Metabolic Interactions in the Gut Microbiota
Christian Diener, Sean M. Gibbons, Osbaldo Resendis-Antonio
mSystems 5:e00606-19
https://doi.org/10.1128/mSystems.00606-19

Additional data are provided in the data subdirectory. For now we will assume you want to start from the abundances and replication rates, since those are the workflow steps that pertain to MICOM. See further below for instructions on how to start from the raw sequencing data instead.
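
Both files are plain CSVs, so you can take a quick look at them with pandas (a quick inspection snippet, not part of the workflows):

import pandas as pd

print(pd.read_csv("data/abundances.csv").head())
print(pd.read_csv("data/replication_rates.csv").head())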

Note that we only support Linux and Mac here. If you have any questions or problems running the code, please don't hesitate to file an issue.

Setting up your conda environment

We will use conda to manage all the required dependencies. If you already have Miniconda or Anaconda you can skip right ahead; otherwise you will need to install Miniconda first (see the official installation instructions).

Open a Terminal and make sure that conda works.

conda info

The output should look something like this:

     active environment : None
            shell level : 0
       user config file : /Users/cdiener/.condarc
 populated config files : /Users/cdiener/.condarc
          conda version : 4.8.0
    conda-build version : 3.18.11
         python version : 3.6.7.final.0
                    ...
           offline mode : False

Now either use git to clone this repo or download the zipped version and unzip it.

git clone https://github.com/micom-dev/paper

Now switch to the directory containing the downloaded code. For instance:

cd paper

Now we can set up an environment that contains almost all of the required software to run the code in this repo (everything except the solver, see below).

conda env create -f micom.yml

Follow this by activating your environment:

conda activate micom

MICOM models all biochemical reactions in all taxa, which means that the optimization problems MICOM solves include hundreds of thousands of variables. Only a few numerical solvers can handle quadratic programming problems of that scale. Right now we support CPLEX and Gurobi, both of which offer free academic licenses but require you to sign up. We hope to change this in the future by making MICOM compatible with an open-source solver (stay tuned!).

CPLEX

After registering and downloading CPLEX Studio for your OS, unpack it (by running the provided installer) to a directory of your choice (we will assume it's called ibm).

Now install the CPLEX python package:

pip install ibm/cplex/python/3.6/x86-64_linux

Substitute x86-64_linux with the folder corresponding to your system (there will be only one subfolder in that directory).
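
You can verify that the CPLEX Python bindings work by printing the solver version (a quick check using the CPLEX Python API):

import cplex

print(cplex.Cplex().get_version())  # should print something like 12.9.0.0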

Gurobi

Gurobi can be installed with conda.

conda install -c gurobi gurobi

You will now have to register the installation using your license key.

grbgetkey YOUR-LICENSE-KEY
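
A quick way to confirm the license is set up correctly is to create an empty model from Python (gurobipy raises a GurobiError if no valid license is found):

import gurobipy

print(gurobipy.gurobi.version())  # e.g. (8, 1, 1)
gurobipy.Model()                  # fails if the license is missing or invalid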

The analyses in the paper were obtained with CPLEX 12.9.

You are now all set to reproduce the paper.

Running the workflows

Download the AGORA models

The only data not supplied in this repository is the AGORA model database, since it would be too big. It can be downloaded from https://www.vmh.life/files/reconstructions/AGORA/1.03/AGORA-1.03-With-Mucins.zip. Unzip the file and transfer all files in reconstructions/sbml to the data/agora folder.
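
If you prefer to script the download, the following sketch does the same thing with the Python standard library (the archive layout reconstructions/sbml is as described above):

import os
import shutil
import urllib.request
import zipfile

url = ("https://www.vmh.life/files/reconstructions/AGORA/1.03/"
       "AGORA-1.03-With-Mucins.zip")
urllib.request.urlretrieve(url, "agora.zip")
os.makedirs("data/agora", exist_ok=True)
with zipfile.ZipFile("agora.zip") as zf:
    for name in zf.namelist():
        # only extract the SBML models, dropping the directory prefix
        if name.startswith("reconstructions/sbml/") and name.endswith(".xml"):
            target = os.path.join("data/agora", os.path.basename(name))
            with zf.open(name) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)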

Run all steps

We use snakemake to define a reproducible workflow that generates all analyses and figure panels from the paper. We recommend using the beefiest machine available to you so snakemake can use as many CPU cores as possible (runtime scales roughly inversely with the number of cores). You will need approximately 1 GB of RAM per core.

To run the entire workflow from start to finish use:

snakemake --cores 16

This will run the workflow with 16 cores, recreating the intermediate files in data and the figures in figures. Snakemake automatically tracks what has already been run and what is outdated, so interrupting the workflow will not cause any problems.
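
If you first want to see which steps would run without executing anything, you can use snakemake's dry-run mode:

snakemake --cores 16 --dry-run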

The individual workflow steps are kept in the workflows directory as simple Python files that you can inspect to see in detail what is happening.

The species-level workflow is not fully run by default; however, species models are always built. If you want to run the species-level tradeoff analyses as well, point the corresponding script entries in the Snakefile to the species_tradeoff*.py files.

Processing the metagenomics data

This entire part is optional. It generates abundances.csv and replication_rates.csv, which are already provided with this repository.

Install R and packages

If you want to run the metagenomic analyses you will also need to install R and some additional dependencies.

conda activate micom
conda install -c r r-base

then open R by typing R and install the packages with:

install.packages(c("BiocManager", "remotes", "drake"))
setRepositories(ind=1:4)
remotes::install_github("gibbons-lab/mbtools")

Download data

Depending on your setup there are various ways to obtain the raw data. The SRA toolkit is installed in the conda environment, so you can download the FASTQ files with fasterq-dump. The SRA IDs can be found in data/recent.csv in the run_accession column. For instance, using R:

library(data.table)
library(futile.logger)

dir.create("data/raw", showWarnings = FALSE)  # base R (`mkdir` is not an R function)
samples = fread("data/recent.csv")
for (sa in samples$run_accession) {
    system2("fasterq-dump",
            c("-O", "data/raw", sa))
    flog.info("Downloaded run %s.", sa)
}
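
If you prefer Python over R, the same loop might look like this (a sketch calling the same fasterq-dump binary):

import subprocess
from pathlib import Path

import pandas as pd

Path("data/raw").mkdir(parents=True, exist_ok=True)
samples = pd.read_csv("data/recent.csv")
for run in samples.run_accession:
    # fasterq-dump writes the FASTQ files for this run into data/raw
    subprocess.run(["fasterq-dump", "-O", "data/raw", run], check=True)
    print(f"Downloaded run {run}.")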

You will also need to download the databases for SLIMM, which we will use to assign taxonomy and obtain strain-resolved coverage. Those can be obtained from here and should be copied to data/refs.

Run the pipeline

The entire pipeline from raw reads to replication rates is implemented in process.R, which uses the workflow manager drake to run the analysis.

Rscript process.R

You can adjust the line options(mc.cores = 8) in process.R to specify your preferred number of cores.

Most steps come from our mbtools package, which is documented here.

Code Snippets

The Snakefile dispatches each step to a Python script via script: directives:

- workflows/genera.py
- workflows/build_models.py (referenced by two rules)
- workflows/tradeoff.py
- workflows/media_and_gcs.py
- workflows/knockouts.py
- workflows/elasticities.py
- workflows/tradeoff_figs.py
- workflows/growth_rate_figs.py
- workflows/knockout_figs.py
- workflows/exchange_figs.py
- workflows/elasticity_figs.py

The full source of each script is reproduced below.
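
The snippets below are labeled with the workflow script each corresponds to; the assignment is inferred from the snippet contents and the Snakefile excerpts above.

workflows/build_models.py (builds a pickled community model for every sample from the genus-level abundances and applies the western diet medium):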
from os import makedirs
from os.path import isfile
import micom
from micom import Community
import pandas as pd
from micom.workflows import workflow

logger = micom.logger.logger
try:
    max_procs = snakemake.threads
except NameError:
    max_procs = 20

makedirs("data/models", exist_ok=True)

taxonomy = pd.read_csv("data/genera.csv").query("relative > 1e-3")
taxonomy["file"] = taxonomy.file.apply(
    lambda ids: ["data/agora/" + i for i in ids.split("|")]
)
taxonomy["name"] = taxonomy.genus
assert not taxonomy.name.str.contains(" ").any()
taxonomy = taxonomy.rename(columns={"name": "id", "reads": "abundance"})

diet = pd.read_csv("data/western_diet.csv")
diet.index = diet.reaction = diet.reaction.str.replace("_e", "_m")
diet = diet.flux * diet.dilution


def build_and_save(args):
    s, tax = args
    filename = "data/models/" + s + ".pickle"
    if isfile(filename):
        return
    com = Community(tax, id=s, progress=False)
    ex_ids = [r.id for r in com.exchanges]
    logger.info(
        "%d/%d import reactions found in model.",
        diet.index.isin(ex_ids).sum(),
        len(diet),
    )
    com.medium = diet[diet.index.isin(ex_ids)]
    com.to_pickle(filename)


samples = taxonomy.samples.unique()
args = [(s, taxonomy[taxonomy.samples == s]) for s in samples]
workflow(build_and_save, args, max_procs)
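
workflows/elasticities.py (computes exchange flux elasticities for three representative samples):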
from os.path import isfile
import micom
from micom import load_pickle
from micom.elasticity import elasticities
from micom.workflows import workflow


logger = micom.logger.logger
try:
    max_procs = snakemake.threads
except NameError:
    max_procs = 20


def elasticity_worker(sam):
    """Get the exchange elasticities for a sample."""
    model_file = "data/models/" + sam + ".pickle"
    out_file = "data/elasticities_" + sam + ".csv"
    if isfile(out_file):
        return
    com = load_pickle(model_file)
    elast = elasticities(com, fraction=0.5)
    elast.to_csv(out_file, index=False)


samples = ["ERR260275", "ERR260214", "ERR260174"]
workflow(elasticity_worker, samples, max_procs)
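
workflows/elasticity_figs.py (summarizes the elasticities for SCFA production, i.e. butyrate, acetate, and propionate, and draws the clustered heatmap):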
import pandas as pd
import networkx as nx
import nxviz
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

samples = {
    "ERR260275": "healthy",
    "ERR260214": "T2D metformin+",
    "ERR260174": "T2D metformin-",
}
metabolites = pd.read_csv("data/metabolites.csv")[
    ["abbreviation", "fullName", "keggId"]
]
SCFAs = {"butyrate": "EX_but(e)", "acetate": "EX_ac(e)", "propionate": "EX_ppa(e)"}


def direction(els, rid):
    return els[els.reaction == rid].direction.unique()[0]


elast = []
for sa in samples:
    e = pd.read_csv("data/elasticities_" + sa + ".csv")
    e["id"] = sa
    elast.append(e)
elast = pd.concat(elast)
elast = elast[elast.direction == "forward"]
elast["scfa"] = float("nan")
for name, pattern in SCFAs.items():
    elast.loc[elast.reaction.str.startswith(pattern), "scfa"] = name

production = (
    elast.groupby(["id", "effector", "scfa"]).elasticity.sum().reset_index()
)
production["type"] = production.effector.apply(
    lambda e: "diet" if e.startswith("EX_") else "abundance"
)
production["scfa"] = production.scfa + " " + production.id.map(samples)
production["abbreviation"] = production.effector.str.replace("(EX_)|(_m)", "")
production = pd.merge(production, metabolites, on="abbreviation", how="left")
production.loc[production.fullName.isna(), "fullName"] = production.loc[
    production.fullName.isna(), "abbreviation"
]

cmap = sns.color_palette()[0:2]
cmap = dict(zip(production.type.unique(), cmap))
production.fullName = production.fullName.apply(
    lambda s: s if len(s) < 24 else s[:24] + "..."
)
ty = production[["fullName", "type"]].drop_duplicates()
ty = pd.Series(ty.type.values, index=ty.fullName)
typecols = ty.map(cmap).rename("type")
production = production.sort_values(by="id")
mat = production.pivot_table(
    index="scfa", columns="fullName", values="elasticity", fill_value=0
)
g = sns.clustermap(
    mat,
    cmap="seismic",
    vmin=-production.elasticity.abs().max(),
    vmax=production.elasticity.abs().max(),
    figsize=(32, 4),
    col_colors=typecols,
    yticklabels=True,
    xticklabels=True,
    row_cluster=False,
    cbar_kws={"fraction": 1.0},
)
for label in cmap:
    g.ax_col_dendrogram.bar(0, 0, color=cmap[label], label=label, linewidth=0)
g.ax_col_dendrogram.legend(loc="best", ncol=3, frameon=False)
g.ax_heatmap.set_xlabel("")
g.ax_heatmap.set_ylabel("")
cold = g.ax_col_dendrogram.get_position()
g.ax_col_dendrogram.set_position(
    [cold.x0, cold.y0, cold.width, cold.height * 5]
)
g.ax_row_dendrogram.set_visible(False)
g.cax.set_position([0.25, 0.5, 0.01, 0.5])
g.savefig("figures/elasticities.png", dpi=300)

but = production[production.scfa.str.startswith("butyrate")].sort_values(
    by="elasticity", ascending=False
)
print(but.head(20))
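
workflows/exchange_figs.py (visualizes the minimal media, SCFA production and consumption fluxes, and the overlap in import fluxes between taxa):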
import pandas as pd
from plotnine import *
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from adjustText import adjust_text
from scipy.stats import ttest_ind
import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist
from skbio.stats.distance import DistanceMatrix, permanova

sample_keep = ["run_accession", "subset", "status", "type"]
SCFAs = {
    "butyrate": "EX_but\\(e\\)",
    "acetate": "EX_ac\\(e\\)",
    "propionate": "EX_ppa\\(e\\)",
}
theme_set(theme_minimal())


def pairwise_tests(x, y, name="tests", ref="CTRL"):
    groups = x.unique()
    combs = combinations(groups, 2)
    tests = [
        pd.Series(
            ttest_ind(
                y[x == g[0]].dropna(), y[x == g[1]].dropna(), equal_var=False
            )
            + (len(y), g[0], g[1]),
            index=["t", "pval", "n", "x", "y"],
        )
        for g in combs
    ]
    tests = pd.DataFrame(tests)
    tests["name"] = name
    return tests


def export_rates_plot(fluxes, groups, samples, log=False):
    dfs = []
    for name, filt in groups.items():
        df = fluxes[fluxes.reaction.str.contains(filt)].copy()
        res = samples.copy()
        df = df.groupby(["sample", "compartment"]).tot_flux.sum().reset_index()
        res["flux"] = df.groupby("sample").tot_flux.sum().abs()
        res["metabolite"] = name
        dfs.append(res)
    fluxes = pd.concat(dfs)
    fluxes.loc[fluxes.status == "ND", "status"] = ""
    fluxes["name"] = fluxes.status + " " + fluxes.type.fillna("")
    fluxes.name = fluxes.name.str.strip()
    fluxes = fluxes.sort_values("name")
    if log:
        fluxes.flux = np.log10(fluxes.flux)
    fluxes.groupby(["metabolite", "subset"]).apply(
        lambda df: print(
            pairwise_tests(df["name"], df.flux, df.name[0] + " " + df.name[1])
        )
    )
    pl = (
        ggplot(fluxes, aes(x="name", y="flux", color="metabolite"))
        + geom_boxplot(outlier_color="none")
        + geom_jitter(width=0.2, height=0)
        + facet_grid("metabolite ~ subset", scales="free")
        + guides(color=False)
        + labs(x="", y="flux [mmol/h]")
        + theme(axis_text_x=element_text(rotation=45, vjust=1, hjust=1))
    )
    return pl


samples = pd.read_csv("data/recent.csv")[
    ["run_accession", "status", "subset", "type"]
]
samples = samples.rename(columns={"run_accession": "sample"})
samples.index = samples["sample"]

media = pd.read_csv("data/minimal_imports.csv", index_col=0).fillna(0.0)
media["sample"] = media.index
media = media.melt(id_vars="sample", var_name="reaction", value_name="flux")
media = media[media.flux > 0]
metabolites = pd.read_csv("data/metabolites.csv")
metabolites["id"] = metabolites.abbreviation + "_m"
media["id"] = media.reaction.str.lstrip("EX_")
media = pd.merge(media, metabolites, on="id")

mat = media.dropna(subset=["flux"]).pivot("id", "sample", "flux").fillna(0)
stype = (
    samples.loc[mat.columns].status.fillna("")
    + " "
    + samples.loc[mat.columns].type.fillna("")
).replace({"ND CTRL": "CTRL"})
m = dict(zip(stype.unique(), sns.color_palette()[0 : (stype.nunique() + 1)]))
stype_cols = stype.map(m)
stype_cols.name = "group"
mat = mat.apply(lambda x: x / x.abs().max(), axis=1)
g = sns.clustermap(
    mat,
    cmap="BuPu",
    figsize=(18, 18),
    col_colors=stype_cols,
    yticklabels=True,
    xticklabels=True,
)
for label in m:
    g.ax_col_dendrogram.bar(0, 0, color=m[label], label=label, linewidth=0)
g.ax_col_dendrogram.legend(loc="best", ncol=2, frameon=False)
g.ax_heatmap.set_xlabel("")
g.ax_heatmap.set_ylabel("")
col = g.ax_col_colors.get_position()
g.ax_col_colors.set_position([col.x0, col.y0, col.width, col.height * 0.3])
cold = g.ax_col_dendrogram.get_position()
g.ax_col_dendrogram.set_position(
    [cold.x0, cold.y0 - col.height * 0.7, cold.width, cold.height]
)
g.ax_heatmap.set_xticklabels(g.ax_heatmap.get_xmajorticklabels(), fontsize=6)
g.ax_heatmap.set_yticklabels(g.ax_heatmap.get_ymajorticklabels(), fontsize=6)
g.ax_col_colors.set_yticklabels(
    g.ax_col_colors.get_ymajorticklabels(), fontsize=12
)
g.savefig("figures/media.png", dpi=300)
plt.close()

fluxes = pd.read_csv("data/minimal_fluxes.csv.gz", compression="gzip")
fluxes = fluxes.melt(
    id_vars=["sample", "compartment"], var_name="reaction", value_name="flux"
)
fluxes = fluxes[
    fluxes.reaction.str.startswith("EX_")
    & (fluxes.compartment != "medium")
    & fluxes.reaction.str.endswith("(e)")
].fillna(0)
fluxes["taxa"] = fluxes.compartment + "_" + fluxes["sample"]
fluxes["name"] = fluxes.compartment.str.replace("_", " ")

species = pd.read_csv("data/genera.csv")[["samples", "genus", "reads"]]
species["name"] = species.genus
totals = species.groupby("samples").reads.sum()
species["relative"] = species.reads / totals[species.samples].values
fluxes = pd.merge(
    fluxes, species, left_on=["sample", "name"], right_on=["samples", "name"]
)
fluxes["tot_flux"] = fluxes.flux * fluxes.relative
print("Production rates:")
pl = export_rates_plot(fluxes[fluxes.tot_flux > 0], SCFAs, samples)
pl.save("figures/scfas_prod.svg", width=4, height=6)

print("Consumption rates:")
pl = export_rates_plot(fluxes[fluxes.tot_flux < 0], SCFAs, samples)
pl.save("figures/scfas_consumption.svg", width=4, height=6)

print("Net rates:")
pl = export_rates_plot(fluxes, SCFAs, samples)
pl.save("figures/scfas_net.svg", width=4, height=6)

scfa = []
for name, filt in SCFAs.items():
    fl = fluxes[fluxes.reaction.str.contains(filt)].copy()
    fl = fl.groupby(["sample", "name", "relative"]).flux.sum().reset_index()
    fl["metabolite"] = name
    scfa.append(fl)
    mat = fl.pivot("sample", "name", "flux")
    mat = mat.loc[
        :, mat.abs().mean().sort_values(ascending=False).index[0:5]
    ].fillna(0)
    sns.clustermap(
        mat, cmap="seismic", center=0, figsize=(6, 12), yticklabels=False
    )
    plt.savefig("figures/" + name + ".svg")
    plt.close()
scfa = pd.concat(scfa)
ord = scfa.groupby(["name"]).relative.mean().sort_values(ascending=False)
scfa.name = pd.Categorical(
    scfa.name.astype("str"), categories=ord.index[::-1], ordered=True
)
pl = (
    ggplot(
        scfa[scfa.name.isin(ord[ord > 0.01].index)],
        aes(x="name", y="flux", fill="metabolite", group="name"),
    )
    + geom_hline(yintercept=0, linetype="dashed")
    + geom_boxplot(outlier_color="none")
    + facet_wrap("~ metabolite", scales="free", nrow=3)
    + guides(fill=False)
    + labs(x="", y="flux [mmol/(h·gDW)]")
    + coord_flip()
)
pl.save("figures/scfas.svg", width=2.5, height=15)

pl = (
    ggplot(
        scfa[scfa.name.isin(ord[ord > 0.01].index)],
        aes(x="name", y="relative"),
    )
    + geom_boxplot(outlier_color="none")
    + labs(x="", y="relative\nabundance")
    + coord_flip()
)
pl.save("figures/relative_abundance.svg", width=2, height=5)


mat = fluxes[fluxes.flux < 0].pivot("taxa", "reaction", "flux").fillna(0.0)
taxa = mat.index.str.split("_ERR").str[0]
tsne = TSNE(n_components=2).fit_transform(mat)
tsne = pd.DataFrame(tsne, columns=["x", "y"], index=mat.index)
tsne["taxa"] = taxa
sns.set(font_scale=1.5, style="ticks")
g = sns.FacetGrid(tsne, hue="taxa", height=10, aspect=16 / 10)
gm = g.map(plt.scatter, "x", "y", alpha=0.25)
means = tsne.groupby(taxa).agg("median").reset_index()
texts = means.apply(
    lambda df: plt.text(df.x, df.y, df.taxa, alpha=0.65), axis=1
)
texts = adjust_text(
    texts,
    force_text=(0.02, 0.5),
    arrowprops=dict(arrowstyle="-|>", alpha=0.5, color="k"),
)
plt.savefig("figures/individual_media.png", dpi=200)
plt.close()

# Some statistics about metabolite usage
# indicator matrix 0 = metabolite not consumed, 1 = metabolite consumed
binary = mat.where(mat < -1e-6, 0).where(mat > -1e-6, 1)

# Jaccard distances = 1 - percent overlap
J = pdist(binary, "jaccard")
print("Jaccard distances:", pd.Series(J).describe(), sep="\n")

# euclidean distances
E = pdist(mat, "euclidean")

# Test whether genus explains a good amount of that variation
p = permanova(DistanceMatrix(E), taxa)
r2 = 1 - 1 / (1 + p[4] * p[3] / (p[2] - p[3] - 1))
p["R2"] = r2
print("PERMANOVA on euclidean distances:", p, sep="\n")
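
workflows/genera.py (matches the genus-level abundances to AGORA reconstructions and writes data/genera.csv):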
import pandas as pd
import micom

keep = [
    "samples",
    "kingdom",
    "phylum",
    "class",
    "order",
    "family",
    "genus",
    "oxygenstat",
    "metabolism",
    "gram",
    "mtype",
    "genes",
    "file",
    "reads",
    "relative",
]


def reduce_group(df):
    new = df.iloc[0, :]
    new["file"] = "|".join(df.id.apply(lambda id: f"{id}.xml"))
    return new


agora = micom.data.agora
agora_genus = agora.groupby("genus").apply(reduce_group).reset_index(drop=True)

genera = pd.read_csv("data/abundances.csv")
genera = genera.groupby(["id", "genus"]).sum().reset_index()

genus_models = pd.merge(genera, agora_genus, on="genus", suffixes=["_x", ""])
genus_models = genus_models.rename(
    columns={"id_x": "samples", "mclass": "class"}
)[keep]
genus_models.to_csv("data/genera.csv", index=False)
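
workflows/growth_rate_figs.py (plots the predicted growth rates per genus and the growth rate vs. abundance relationship):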
import pandas as pd
import numpy as np
from plotnine import *
from scipy.stats import pearsonr

theme_set(theme_minimal())

rates = (
    pd.read_csv("data/tradeoff.csv")
    .dropna(subset=["abundance"])
    .rename(columns={"sample": "id", "compartments": "genus"})
)

rates["log_rates"] = np.log10(rates.growth_rate)
rates.loc[rates.growth_rate <= 0, "log_rates"] = -16

pos = rates.query("growth_rate > 1e-6 and tradeoff == 0.5")
o = pos.groupby("genus").log_rates.mean().sort_values().index
pos.genus = pd.Categorical(pos.genus, categories=o, ordered=True)

pl = (
    ggplot(pos, aes(x="genus", y="growth_rate", color="genus"))
    + geom_jitter(width=0.2, height=0, alpha=0.5, stroke=0, size=2)
    + stat_summary(fun_y=np.mean, geom="point", size=2, fill="white", stroke=1)
    + scale_y_log10()
    + guides(color=False)
    + labs(x="", y="growth rate [1/h]")
    + theme(axis_text_x=element_text(rotation=45, hjust=1, vjust=1))
)
pl.save("figures/gcs.svg", width=12, height=4)

community_growth = (
    rates[rates.tradeoff == 0.5]
    .groupby("id")
    .apply(lambda df: sum(df.abundance * df.growth_rate))
)
print(community_growth.describe())


rs = rates[rates.tradeoff == 0.5]
scale = rs.groupby("id").abundance.apply(
    lambda a: community_growth[a.name] / a.pow(2).sum()
)
x = np.linspace(0, 1, 100)
free = pd.DataFrame(
    {
        "abundance": x,
        "max_rate": scale.max() * x,
        "min_rate": scale.min() * x,
        "mean_rate": scale.mean() * x,
    }
)
pl = (
    ggplot(rs, aes(x="abundance", y="growth_rate"))
    + geom_point(stroke=0, alpha=0.2)
    + labs(x="", y="")
    + theme_classic()
)
pl.save("figures/rate_vs_abundance.png", width=3, height=3, dpi=300)

subset = rs.id.unique()[0:10]
free_subset = (
    rs[rs.id.isin(subset)]
    .groupby("id")
    .apply(
        lambda df: pd.DataFrame(
            {"abundance": x, "growth_rate": scale[df.name] * x}
        )
    )
    .reset_index()
)
pl = (
    ggplot(
        rs[rs.id.isin(subset)], aes(x="abundance", y="growth_rate", color="id")
    )
    + geom_point(size=2)
    + geom_line(data=free_subset, linetype="dashed")
    + labs(x="abundance [%]", y="growth rate [1/h]")
    + guides(color=False)
    + theme_classic()
)
pl.save("figures/rate_vs_abundance_individual.svg", width=6, height=5)
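
workflows/knockout_figs.py (summarizes the genus knockout effects as interaction counts, a circos plot, and an interaction prevalence histogram):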
import pandas as pd
import networkx as nx
import nxviz
import matplotlib.pyplot as plt
from plotnine import *

theme_set(theme_minimal())

ko = pd.read_csv("data/knockouts.csv", index_col=0)
ko["knocked"] = ko.index
ko = ko.melt(
    id_vars=["knocked", "sample"], var_name="genus", value_name="change"
).dropna()

pos = ko.groupby(["knocked", "genus"]).change.mean().reset_index()
pos = pos[pos.knocked != pos.genus].sort_values(by="knocked")
counts = (
    ko[ko.change.abs() > 1e-2]
    .groupby("knocked")
    .apply(
        lambda df: pd.DataFrame(
            {
                "counts": [
                    df.change[df.change > 0].count(),
                    df.change[df.change < 0].count(),
                ],
                "type": ["competitive", "cooperative"],
            }
        )
    )
).reset_index()
counts = counts.sort_values(by="counts", ascending=False)
counts.knocked = pd.Categorical(
    counts.knocked, counts.knocked.unique()[::-1], ordered=True
)
print(counts.groupby("type").sum())

pl = (
    ggplot(counts, aes(x="knocked", y="counts", fill="type"))
    + geom_col()
    + labs(x="", y="strong interactions")
    + coord_flip()
)
pl.save("figures/knockout_counts.svg", width=4, height=12)

graph = nx.from_pandas_edgelist(
    pos[pos.change.abs() > 1e-2], "knocked", "genus", "change"
)

for idx, _ in graph.nodes(data=True):
    graph.node[idx]["degree"] = float(graph.degree(idx))
circos = nxviz.CircosPlot(
    graph,
    node_labels=True,
    rotate_labels=True,
    edge_cmap="bwr",
    edge_color="change",
    edge_limits=(-pos.change.abs().max(), pos.change.abs().max()),
    node_color="degree",
    figsize=(9, 7.5),
)
# circos.figure.set_dpi(200)
circos.draw()
plt.tight_layout(rect=[0.05, 0.15, 0.7, 0.8])
plt.savefig("figures/circos.svg")
plt.close()

ns = (
    ko[(ko.change.abs() > 1e-2) & (ko.knocked != ko.genus)]
    .groupby(["genus", "knocked"])
    .sample.count()
    .reset_index()
)

pl = (
    ggplot(ns, aes(x="sample"))
    + geom_histogram(bins=32, fill="lightgray", color="black")
    + labs(x="observed samples")
)
pl.save("figures/interaction_prevalence.svg", width=5, height=5)
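
workflows/knockouts.py (performs in silico genus knockouts for each sample community):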
import pandas as pd
import micom
from micom import load_pickle
from micom.media import minimal_medium
from micom.workflows import workflow

logger = micom.logger.logger
try:
    max_procs = snakemake.threads
except NameError:
    max_procs = 20


def knockout(sam):
    com = load_pickle("data/models/" + sam + ".pickle")
    ko = com.knockout_species(fraction=0.5)
    ko["sample"] = sam
    return ko


samples = pd.read_csv("data/recent.csv")
kos = workflow(knockout, samples.run_accession, max_procs)
pd.concat(kos).to_csv("data/knockouts.csv")
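
workflows/media_and_gcs.py (obtains growth rates via the cooperative tradeoff, computes sample-specific minimal media, and exports the resulting fluxes):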
import pandas as pd
import micom
from micom import load_pickle
from micom.media import minimal_medium
from micom.workflows import workflow
from micom.logger import logger


logger = micom.logger.logger
try:
    max_procs = snakemake.threads
except NameError:
    max_procs = 20


def media_and_gcs(sam):
    com = load_pickle("data/models/" + sam + ".pickle")

    # Get growth rates
    try:
        sol = com.cooperative_tradeoff(fraction=0.5)
        rates = sol.members["growth_rate"].copy()
        rates["community"] = sol.growth_rate
        rates.name = sam
    except Exception:
        logger.warning("Could not solve cooperative tradeoff for %s." % sam)
        return None

    # Get the minimal medium
    med = minimal_medium(com, 0.95 * sol.growth_rate, exports=True)
    med.name = sam

    # Apply medium and reoptimize
    com.medium = med[med > 0]
    sol = com.cooperative_tradeoff(fraction=0.5, fluxes=True, pfba=False)
    fluxes = sol.fluxes
    fluxes["sample"] = sam
    return {"medium": med, "gcs": rates, "fluxes": fluxes}


samples = pd.read_csv("data/recent.csv")
gcs = pd.DataFrame()
media = pd.DataFrame()
fluxes = pd.DataFrame()

results = workflow(media_and_gcs, samples.run_accession, max_procs)

for r in results:
    gcs = gcs.append(r["gcs"])
    media = media.append(r["medium"])
    fluxes = fluxes.append(r["fluxes"])

gcs.to_csv("data/growth_rates.csv")
media.to_csv("data/minimal_imports.csv")
fluxes.to_csv("data/minimal_fluxes.csv.gz", compression="gzip")
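
workflows/tradeoff_figs.py (compares the predicted growth rates with the metagenomic replication rates across tradeoff values):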
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from plotnine import *
from mizani.formatters import percent_format
from matplotlib.ticker import LogFormatterMathtext
import joypy as jp
from scipy.stats import spearmanr, pearsonr

theme_set(theme_minimal())

rates = (
    pd.read_csv("data/tradeoff.csv")
    .dropna(subset=["abundance"])
    .rename(columns={"sample": "id", "compartments": "genus"})
)
rates["tradeoff"] = rates.tradeoff.round(2).astype("str")
rates.loc[rates.tradeoff == "nan", "tradeoff"] = "none"

replication = (
    pd.read_csv("data/replication_rates.csv")
    .query("intercept > 1")
    .groupby(["id", "genus"])
    .rate.mean()
    .reset_index()
)


def corr(df):
    """Calculates correlations.

    In order to run the analysis with Spearman correlations just substitute
    `spearmanr` for `pearsonr` here.
    """
    if df.shape[0] > 2:
        return pd.Series(
            pearsonr(df.rate, df.growth_rate) + (df.shape[0],),
            index=["rho", "p", "n"],
        )
    else:
        return pd.Series(
            [np.nan, np.nan, df.shape[0]], index=["rho", "p", "n"]
        )


both = pd.merge(rates, replication, on=["id", "genus"])
within_samples = both.groupby(["tradeoff", "id"]).apply(corr).reset_index()
across_samples = both.groupby("tradeoff").apply(corr).reset_index()

pl = (
    ggplot(within_samples[within_samples.n > 6], aes(x="tradeoff", y="rho"))
    + geom_hline(yintercept=0, linetype="dashed")
    + geom_boxplot(outlier_color="none")
    + geom_jitter(width=0.15, height=0, alpha=0.5, stroke=0)
    + labs(x="tradeoff", y="Pearson rho")
)
pl.save("figures/within_sample_correlations.svg", width=5, height=5)
within_samples.to_csv("data/correlation_per_sample.csv")

pl = (
    ggplot(across_samples[across_samples.n > 6], aes(x="tradeoff", y="rho"))
    + geom_hline(yintercept=0, linetype="dashed")
    + geom_point()
    + geom_line(aes(group=1))
    + labs(x="tradeoff", y="Pearson rho")
)
pl.save("figures/across_sample_correlations.svg", width=5, height=5)
across_samples.to_csv("data/correlation_all.csv")

rates["log_rates"] = np.log10(rates.growth_rate)
rates.loc[rates.growth_rate <= 0, "log_rates"] = -16

fig, axes = jp.joyplot(
    rates, by="tradeoff", column="log_rates", color="cornflowerblue"
)
lf = LogFormatterMathtext()
xax = axes[-1].xaxis
xax.set_ticklabels(lf(10.0 ** ex) for ex in xax.get_ticklocs())
plt.xlabel("growth rate [1/h]")
plt.ylabel("tradeoff")
plt.savefig("figures/dists.svg")
plt.close()

non_zero = (
    rates.groupby(["id", "tradeoff"])
    .apply(lambda df: (df.growth_rate > 1e-6).sum() / df.shape[0])
    .reset_index(name="non_zero")
)

pl = (
    ggplot(non_zero, aes(x="tradeoff", y="non_zero"))
    + geom_boxplot(outlier_color="none")
    + geom_jitter(width=0.15, height=0, alpha=0.5, stroke=0)
    + scale_y_continuous(labels=percent_format())
    + labs(x="tradeoff", y="percent taxa growing")
)
pl.save("figures/percent_growing.svg", width=5, height=5)

# Show some representative correlations
comp = both[(both.tradeoff == "0.5")]
w = within_samples[within_samples.tradeoff == "0.5"]
w.index = w.id
comp = comp[comp.id.isin(w.id[(w.n >= 10)])]
comp["rho"] = w.loc[comp.id, "rho"].round(2).values
comp.id = comp.id + " (r=" + comp.rho.astype(str) + ")"

pl = (
    ggplot(comp, aes(x="rate", y="growth_rate"))
    + geom_point()
    + facet_wrap("~ id", nrow=3)
    + scale_x_log10()
    + scale_y_log10()
    + labs(x="replication rate [a.u.]", y="predicted growth rate [1/h]")
)
pl.save("figures/corr_examples.png", width=12, height=6, dpi=300)
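
workflows/tradeoff.py (computes growth rates for every sample across a grid of tradeoff values, plus plain FBA as a reference):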
import micom
from micom import load_pickle
from micom.workflows import workflow
import numpy as np
import pandas as pd


tradeoffs = np.arange(0.1, 1.01, 0.1)
logger = micom.logger.logger
try:
    max_procs = snakemake.threads
except NameError:
    max_procs = 20
logger.info("Using %d threads..." % max_procs)


def growth_rates(sam):
    com = load_pickle("data/models/" + sam + ".pickle")
    sol = com.optimize()
    rates = sol.members
    rates["tradeoff"] = np.nan
    rates["sample"] = sam
    df = [rates]

    # Get growth rates
    try:
        sol = com.cooperative_tradeoff(fraction=tradeoffs)
    except Exception as e:
        logger.warning("Sample %s could not be optimized\n %s" % (sam, str(e)))
        return pd.DataFrame(
            {"tradeoff": tradeoffs, "growth_rate": np.nan, "sample": sam}
        )
    for i, s in enumerate(sol.solution):
        rates = s.members
        rates["tradeoff"] = sol.tradeoff[i]
        rates["sample"] = sam
        df.append(rates)
    df = pd.concat(df)
    return df


samples = pd.read_csv("data/recent.csv")
rates = workflow(growth_rates, samples.run_accession, max_procs)
rates = pd.concat(rates)
rates.to_csv("data/tradeoff.csv")

License: Apache License 2.0