Materials and Code for the micom manuscript.

public 1yr ago 0 bookmarks

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic
Lack of a description for a new keyword tool/pypi/adjustText .

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

This repo includes workflows and additional data to reproduce the manuscript:

MICOM: Metagenome-Scale Modeling To Infer Metabolic Interactions in the Gut Microbiota
Christian Diener, Sean M. Gibbons, Osbaldo Resendis-Antonio
mSystems 5:e00606-19
https://doi.org/10.1128/mSystems.00606-19

Additional data is provided in the data subdirectory. For now we will assume you want to start from the abundances and replication rates as those are workflow steps that pertain to MICOM. See further below on instruction how to go start from the raw sequencing data.

Note that we only support Linux and Mac here. If you have any questions or problems running the code please don't hesitate to file an issue .

Setting up you conda environment

We will use conda to manage all the required dependencies. If you already have Miniconda or Anaconda you can skip right ahead otherwise you will need to install Miniconda using those instructions .

Open a Terminal and make sure that conda works.

conda info

Which should look something like this:

 active environment : None shell level : 0 user config file : /Users/cdiener/.condarc populated config files : /Users/cdiener/.condarc conda version : 4.8.0 conda-build version : 3.18.11 python version : 3.6.7.final.0
... offline mode : False

Now either use git to clone this repo or download the zipped version and unzip it.

git clone https://github.com/micom-dev/paper

Now switch to the directory containing the downloaded code. For instance

cd paper

Now we can set up an environment that contains 99% of the required software to run all code in this repo.

conda env create -f micom.yml

Follow this by activating your environment:

conda activate micom

MICOM models all biochemical reactions in all taxa, which means that the optimization problem MICOM solves included hundreds of thousands of variables. There are only a few numerical solvers that can solve quadratic programming problems of that scale. Right, now we support CPLEX or Gurobi , which both have free academic licenses but will require you to sign up for them. We hope to change this in the future by making MICOM compatible with an open-source solver (stay tuned!).

CPLEX

After registering and downloading the CPLEX studio for your OS unpack it (by running the provided installer) to a directory of your choice (we will assume it's called ibm ).

Now install the CPLEX python package:

pip install ibm/cplex/python/3.6/x86-64_linux

Substitute x86-64_linux with the folder corresponding to your system (there will only be one subfolder in that directory).

Gurobi

Gurobi can be installed with conda.

conda install -c gurobi gurobi

You will now have to register the installation using your license key.

grbgetkey YOUR-LICENSE-KEY

The analyses in the paper were obtained with CPLEX 12.9.

You are now all set to reproduce the paper.

Running the workflows

Download the AGORA models

The only data not supplied in this repository is the AGORA database since it would be to big. It can be downloaded from https://www.vmh.life/files/reconstructions/AGORA/1.03/AGORA-1.03-With-Mucins.zip . Unzip this file an transfer all files in reconstructions/sbml to the data/agora folder.

Run all steps

We use snakemake to define a reproducible workflow that generates all analyses and panels form the paper. We do recommend to use the beefiest equipment available to you to use as many CPU cores as possible (this will speed up everything by a factor of the cores). You will need approximately 1GB of RAM per core available.

The basic workflow we are running looks like this:

To run the entire workflow from start to finish use:

snakemake --cores 16

This will recreate intermediate files in data and figures in figures .

Which will run the workflow with 16 cores and will automatically track what has been run already or what is outdated. So interrupting it will not create any problems.

Individual workflow steps are kept in the workflows as directory as simple Python files you can inspect to see what is happening in detail.

The species level workflow is not fully run by default. However, species models are always built. If you want to run species level tradeoff analyses just change the corresponding files to species_tradeoff*.py in Snakefile .

Processing the metagenomics data

This entire part is optional. It will generate abundances.csv and replication_rates.csv which are provided along with this repository.

Install R and packages

If you want to run the metagenomic analyses as well you will need to install R and some dependencies as well.

conda activate micom
conda install -c r r-base

then open R by typing R and install the packages with

install.packages(c("BiocManager", "remotes", "drake"))
setRepositories(ind=1:4)
remotes::install_github("gibbons-lab/mbtools")

Download data

Depending on your setup there may be various ways. The SRA toolkit is installed in the conda environment so you can download fastq files with fastq-dump . The SRA IDs can be found in data/recent.csv in the run_accession column. For instance using R:

library(data.table)
library(futile.logger)
mkdir("data/raw")
samples = fread("data/recent.csv")
for (sa in samples$run_accession) {
 system2("fasterq-dump",
 c("-O", "data/raw", sa))
 flog.info("Downloaded run %s.", sa)
}

You will also need to download the databases for SLIMM which we will sue to assign taxonomy and obtain strain resolved coverage. Those can be obtained from here and should be copied to data/refs .

Run the pipeline

The entire pipeline from raw reads to replication rates is implemented in process.R which uses the workflow manager drake to run the analysis.

Rscript process.R

You can adjust the line options(mc.cores = 8) to specify your preferred number of cores.

We mostly run steps from our mbtools package which is documented here .

Code Snippets

script:
    "workflows/genera.py"

SnakeMake From line 18 of master/Snakefile

script:
    "workflows/build_models.py"

SnakeMake From line 28 of master/Snakefile

script:
    "workflows/build_models.py"

SnakeMake From line 38 of master/Snakefile

script:
    "workflows/tradeoff.py"

SnakeMake From line 48 of master/Snakefile

script:
    "workflows/media_and_gcs.py"

SnakeMake From line 59 of master/Snakefile

script:
    "workflows/knockouts.py"

SnakeMake From line 68 of master/Snakefile

script:
    "workflows/elasticities.py"

SnakeMake From line 77 of master/Snakefile

script:
    "workflows/tradeoff_figs.py"

SnakeMake From line 91 of master/Snakefile

script:
    "workflows/growth_rate_figs.py"

SnakeMake From line 101 of master/Snakefile

script:
    "workflows/knockout_figs.py"

SnakeMake From line 110 of master/Snakefile

script:
    "workflows/exchange_figs.py"

SnakeMake From line 127 of master/Snakefile

script:
    "workflows/elasticity_figs.py"

SnakeMake From line 136 of master/Snakefile

from os import makedirs
from os.path import isfile
import micom
from micom import Community
import pandas as pd
from micom.workflows import workflow

logger = micom.logger.logger
try:
    max_procs = snakemake.threads
except NameError:
    max_procs = 20

makedirs("data/models", exist_ok=True)

taxonomy = pd.read_csv("data/genera.csv").query("relative > 1e-3")
taxonomy["file"] = taxonomy.file.apply(
    lambda ids: ["data/agora/" + i for i in ids.split("|")]
)
taxonomy["name"] = taxonomy.genus
assert not taxonomy.name.str.contains(" ").any()
taxonomy = taxonomy.rename(columns={"name": "id", "reads": "abundance"})

diet = pd.read_csv("data/western_diet.csv")
diet.index = diet.reaction = diet.reaction.str.replace("_e", "_m")
diet = diet.flux * diet.dilution


def build_and_save(args):
    s, tax = args
    filename = "data/models/" + s + ".pickle"
    if isfile(filename):
        return
    com = Community(tax, id=s, progress=False)
    ex_ids = [r.id for r in com.exchanges]
    logger.info(
        "%d/%d import reactions found in model.",
        diet.index.isin(ex_ids).sum(),
        len(diet),
    )
    com.medium = diet[diet.index.isin(ex_ids)]
    com.to_pickle(filename)


samples = taxonomy.samples.unique()
args = [(s, taxonomy[taxonomy.samples == s]) for s in samples]
workflow(build_and_save, args, max_procs)

Python Pandas micom From line 3 of workflows/build_models.py

from os.path import isfile
import micom
from micom import load_pickle
from micom.elasticity import elasticities
from micom.workflows import workflow


logger = micom.logger.logger
try:
    max_procs = snakemake.threads
except NameError:
    max_procs = 20


def elasticity_worker(sam):
    """Get the exchange elasticities for a sample."""
    model_file = "data/models/" + sam + ".pickle"
    out_file = "data/elasticities_" + sam + ".csv"
    if isfile(out_file):
        return
    com = load_pickle(model_file)
    elast = elasticities(com, fraction=0.5)
    elast.to_csv(out_file, index=False)


samples = ["ERR260275", "ERR260214", "ERR260174"]
workflow(elasticity_worker, samples, max_procs)

Python micom From line 3 of workflows/elasticities.py

import pandas as pd
import networkx as nx
import nxviz
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

samples = {
    "ERR260275": "healthy",
    "ERR260214": "T2D metformin+",
    "ERR260174": "T2D metformin-",
}
metabolites = pd.read_csv("data/metabolites.csv")[
    ["abbreviation", "fullName", "keggId"]
]
SCFAs = {"butyrate": "EX_but(e)", "acetate": "EX_ac(e)", "propionate": "EX_ppa(e)"}


def direction(els, rid):
    return els[els.reaction == rid].direction.unique()[0]


elast = []
for sa in samples:
    e = pd.read_csv("data/elasticities_" + sa + ".csv")
    e["id"] = sa
    elast.append(e)
elast = pd.concat(elast)
elast = elast[elast.direction == "forward"]
elast["scfa"] = float("nan")
for name, pattern in SCFAs.items():
    elast.loc[elast.reaction.str.startswith(pattern), "scfa"] = name

production = (
    elast.groupby(["id", "effector", "scfa"]).elasticity.sum().reset_index()
)
production["type"] = production.effector.apply(
    lambda e: "diet" if e.startswith("EX_") else "abundance"
)
production["scfa"] = production.scfa + " " + production.id.map(samples)
production["abbreviation"] = production.effector.str.replace("(EX_)|(_m)", "")
production = pd.merge(production, metabolites, on="abbreviation", how="left")
production.loc[production.fullName.isna(), "fullName"] = production.loc[
    production.fullName.isna(), "abbreviation"
]

cmap = sns.color_palette()[0:2]
cmap = dict(zip(production.type.unique(), cmap))
production.fullName = production.fullName.apply(
    lambda s: s if len(s) < 24 else s[:24] + "..."
)
ty = production[["fullName", "type"]].drop_duplicates()
ty = pd.Series(ty.type.values, index=ty.fullName)
typecols = ty.map(cmap).rename("type")
production = production.sort_values(by="id")
mat = production.pivot_table(
    index="scfa", columns="fullName", values="elasticity", fill_value=0
)
g = sns.clustermap(
    mat,
    cmap="seismic",
    vmin=-production.elasticity.abs().max(),
    vmax=production.elasticity.abs().max(),
    figsize=(32, 4),
    col_colors=typecols,
    yticklabels=True,
    xticklabels=True,
    row_cluster=False,
    cbar_kws={"fraction": 1.0},
)
for label in cmap:
    g.ax_col_dendrogram.bar(0, 0, color=cmap[label], label=label, linewidth=0)
g.ax_col_dendrogram.legend(loc="best", ncol=3, frameon=False)
g.ax_heatmap.set_xlabel("")
g.ax_heatmap.set_ylabel("")
cold = g.ax_col_dendrogram.get_position()
g.ax_col_dendrogram.set_position(
    [cold.x0, cold.y0, cold.width, cold.height * 5]
)
g.ax_row_dendrogram.set_visible(False)
g.cax.set_position([0.25, 0.5, 0.01, 0.5])
g.savefig("figures/elasticities.png", dpi=300)

but = production[production.scfa.str.startswith("butyrate")].sort_values(
    by="elasticity", ascending=False
)
print(but.head(20))

Python Pandas numpy matplotlib seaborn networkx nxviz From line 3 of workflows/elasticity_figs.py

import pandas as pd
from plotnine import *
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from adjustText import adjust_text
from scipy.stats import ttest_ind
import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist
from skbio.stats.distance import DistanceMatrix, permanova

sample_keep = ["run_accession", "subset", "status", "type"]
SCFAs = {
    "butyrate": "EX_but\\(e\\)",
    "acetate": "EX_ac\\(e\\)",
    "propionate": "EX_ppa\\(e\\)",
}
theme_set(theme_minimal())


def pairwise_tests(x, y, name="tests", ref="CTRL"):
    groups = x.unique()
    combs = combinations(groups, 2)
    tests = [
        pd.Series(
            ttest_ind(
                y[x == g[0]].dropna(), y[x == g[1]].dropna(), equal_var=False
            )
            + (len(y), g[0], g[1]),
            index=["t", "pval", "n", "x", "y"],
        )
        for g in combs
    ]
    tests = pd.DataFrame(tests)
    tests["name"] = name
    return tests


def export_rates_plot(fluxes, groups, samples, log=False):
    dfs = []
    for name, filt in groups.items():
        df = fluxes[fluxes.reaction.str.contains(filt)].copy()
        res = samples.copy()
        df = df.groupby(["sample", "compartment"]).tot_flux.sum().reset_index()
        res["flux"] = df.groupby("sample").tot_flux.sum().abs()
        res["metabolite"] = name
        dfs.append(res)
    fluxes = pd.concat(dfs)
    fluxes.loc[fluxes.status == "ND", "status"] = ""
    fluxes["name"] = fluxes.status + " " + fluxes.type.fillna("")
    fluxes.name = fluxes.name.str.strip()
    fluxes = fluxes.sort_values("name")
    if log:
        fluxes.flux = np.log10(fluxes.flux)
    fluxes.groupby(["metabolite", "subset"]).apply(
        lambda df: print(
            pairwise_tests(df["name"], df.flux, df.name[0] + " " + df.name[1])
        )
    )
    pl = (
        ggplot(fluxes, aes(x="name", y="flux", color="metabolite"))
        + geom_boxplot(outlier_color="none")
        + geom_jitter(width=0.2, height=0)
        + facet_grid("metabolite ~ subset", scales="free")
        + guides(color=False)
        + labs(x="", y="flux [mmol/h]")
        + theme(axis_text_x=element_text(rotation=45, vjust=1, hjust=1))
    )
    return pl


samples = pd.read_csv("data/recent.csv")[
    ["run_accession", "status", "subset", "type"]
]
samples = samples.rename(columns={"run_accession": "sample"})
samples.index = samples["sample"]

media = pd.read_csv("data/minimal_imports.csv", index_col=0).fillna(0.0)
media["sample"] = media.index
media = media.melt(id_vars="sample", var_name="reaction", value_name="flux")
media = media[media.flux > 0]
metabolites = pd.read_csv("data/metabolites.csv")
metabolites["id"] = metabolites.abbreviation + "_m"
media["id"] = media.reaction.str.lstrip("EX_")
media = pd.merge(media, metabolites, on="id")

mat = media.dropna(subset=["flux"]).pivot("id", "sample", "flux").fillna(0)
stype = (
    samples.loc[mat.columns].status.fillna("")
    + " "
    + samples.loc[mat.columns].type.fillna("")
).replace({"ND CTRL": "CTRL"})
m = dict(zip(stype.unique(), sns.color_palette()[0 : (stype.nunique() + 1)]))
stype_cols = stype.map(m)
stype_cols.name = "group"
mat = mat.apply(lambda x: x / x.abs().max(), axis=1)
g = sns.clustermap(
    mat,
    cmap="BuPu",
    figsize=(18, 18),
    col_colors=stype_cols,
    yticklabels=True,
    xticklabels=True,
)
for label in m:
    g.ax_col_dendrogram.bar(0, 0, color=m[label], label=label, linewidth=0)
g.ax_col_dendrogram.legend(loc="best", ncol=2, frameon=False)
g.ax_heatmap.set_xlabel("")
g.ax_heatmap.set_ylabel("")
col = g.ax_col_colors.get_position()
g.ax_col_colors.set_position([col.x0, col.y0, col.width, col.height * 0.3])
cold = g.ax_col_dendrogram.get_position()
g.ax_col_dendrogram.set_position(
    [cold.x0, cold.y0 - col.height * 0.7, cold.width, cold.height]
)
g.ax_heatmap.set_xticklabels(g.ax_heatmap.get_xmajorticklabels(), fontsize=6)
g.ax_heatmap.set_yticklabels(g.ax_heatmap.get_ymajorticklabels(), fontsize=6)
g.ax_col_colors.set_yticklabels(
    g.ax_col_colors.get_ymajorticklabels(), fontsize=12
)
g.savefig("figures/media.png", dpi=300)
plt.close()

fluxes = pd.read_csv("data/minimal_fluxes.csv.gz", compression="gzip")
fluxes = fluxes.melt(
    id_vars=["sample", "compartment"], var_name="reaction", value_name="flux"
)
fluxes = fluxes[
    fluxes.reaction.str.startswith("EX_")
    & (fluxes.compartment != "medium")
    & fluxes.reaction.str.endswith("(e)")
].fillna(0)
fluxes["taxa"] = fluxes.compartment + "_" + fluxes["sample"]
fluxes["name"] = fluxes.compartment.str.replace("_", " ")

species = pd.read_csv("data/genera.csv")[["samples", "genus", "reads"]]
species["name"] = species.genus
totals = species.groupby("samples").reads.sum()
species["relative"] = species.reads / totals[species.samples].values
fluxes = pd.merge(
    fluxes, species, left_on=["sample", "name"], right_on=["samples", "name"]
)
fluxes["tot_flux"] = fluxes.flux * fluxes.relative
print("Production rates:")
pl = export_rates_plot(fluxes[fluxes.tot_flux > 0], SCFAs, samples)
pl.save("figures/scfas_prod.svg", width=4, height=6)

print("Consumption rates:")
pl = export_rates_plot(fluxes[fluxes.tot_flux < 0], SCFAs, samples)
pl.save("figures/scfas_consumption.svg", width=4, height=6)

print("Net rates:")
pl = export_rates_plot(fluxes, SCFAs, samples)
pl.save("figures/scfas_net.svg", width=4, height=6)

scfa = []
for name, filt in SCFAs.items():
    fl = fluxes[fluxes.reaction.str.contains(filt)].copy()
    fl = fl.groupby(["sample", "name", "relative"]).flux.sum().reset_index()
    fl["metabolite"] = name
    scfa.append(fl)
    mat = fl.pivot("sample", "name", "flux")
    mat = mat.loc[
        :, mat.abs().mean().sort_values(ascending=False).index[0:5]
    ].fillna(0)
    sns.clustermap(
        mat, cmap="seismic", center=0, figsize=(6, 12), yticklabels=False
    )
    plt.savefig("figures/" + name + ".svg")
    plt.close()
scfa = pd.concat(scfa)
ord = scfa.groupby(["name"]).relative.mean().sort_values(ascending=False)
scfa.name = pd.Categorical(
    scfa.name.astype("str"), categories=ord.index[::-1], ordered=True
)
pl = (
    ggplot(
        scfa[scfa.name.isin(ord[ord > 0.01].index)],
        aes(x="name", y="flux", fill="metabolite", group="name"),
    )
    + geom_hline(yintercept=0, linetype="dashed")
    + geom_boxplot(outlier_color="none")
    + facet_wrap("~ metabolite", scales="free", nrow=3)
    + guides(fill=False)
    + labs(x="", y="flux [mmol/(h·gDW)]")
    + coord_flip()
)
pl.save("figures/scfas.svg", width=2.5, height=15)

pl = (
    ggplot(
        scfa[scfa.name.isin(ord[ord > 0.01].index)],
        aes(x="name", y="relative"),
    )
    + geom_boxplot(outlier_color="none")
    + labs(x="", y="relative\nabundance")
    + coord_flip()
)
pl.save("figures/relative_abundance.svg", width=2, height=5)


mat = fluxes[fluxes.flux < 0].pivot("taxa", "reaction", "flux").fillna(0.0)
taxa = mat.index.str.split("_ERR").str[0]
tsne = TSNE(n_components=2).fit_transform(mat)
tsne = pd.DataFrame(tsne, columns=["x", "y"], index=mat.index)
tsne["taxa"] = taxa
sns.set(font_scale=1.5, style="ticks")
g = sns.FacetGrid(tsne, hue="taxa", height=10, aspect=16 / 10)
gm = g.map(plt.scatter, "x", "y", alpha=0.25)
means = tsne.groupby(taxa).agg("median").reset_index()
texts = means.apply(
    lambda df: plt.text(df.x, df.y, df.taxa, alpha=0.65), axis=1
)
texts = adjust_text(
    texts,
    force_text=(0.02, 0.5),
    arrowprops=dict(arrowstyle="-|>", alpha=0.5, color="k"),
)
plt.savefig("figures/individual_media.png", dpi=200)
plt.close()

# Some statistics about metabolite usage
# indicator matrix 0 = metabolite not consumed, 1 = metabolite consumed
binary = mat.where(mat < -1e-6, 0).where(mat > -1e-6, 1)

# Jaccard distances = 1 - percent overlap
J = pdist(binary, "jaccard")
print("Jaccard distances:", pd.Series(J).describe(), sep="\n")

# euclidean distances
E = pdist(mat, "euclidean")

# Test whether genus explains a good amount of that variation
p = permanova(DistanceMatrix(E), taxa)
r2 = 1 - 1 / (1 + p[4] * p[3] / (p[2] - p[3] - 1))
p["R2"] = r2
print("PERMANOVA on euclidean distances:", p, sep="\n")

Python Pandas numpy ggplot2 matplotlib seaborn scipy sklearn scikit-bio adjustText From line 1 of workflows/exchange_figs.py

import pandas as pd
import micom

keep = [
    "samples",
    "kingdom",
    "phylum",
    "class",
    "order",
    "family",
    "genus",
    "oxygenstat",
    "metabolism",
    "gram",
    "mtype",
    "genes",
    "file",
    "reads",
    "relative",
]


def reduce_group(df):
    new = df.iloc[0, :]
    new["file"] = "|".join(df.id.apply(lambda id: f"{id}.xml"))
    return new


agora = micom.data.agora
agora_genus = agora.groupby("genus").apply(reduce_group).reset_index(drop=True)

genera = pd.read_csv("data/abundances.csv")
genera = genera.groupby(["id", "genus"]).sum().reset_index()

genus_models = pd.merge(genera, agora_genus, on="genus", suffixes=["_x", ""])
genus_models = genus_models.rename(
    columns={"id_x": "samples", "mclass": "class"}
)[keep]
genus_models.to_csv("data/genera.csv", index=False)

Python Pandas micom From line 1 of workflows/genera.py

import pandas as pd
import numpy as np
from plotnine import *
from scipy.stats import pearsonr

theme_set(theme_minimal())

rates = (
    pd.read_csv("data/tradeoff.csv")
    .dropna(subset=["abundance"])
    .rename(columns={"sample": "id", "compartments": "genus"})
)

rates["log_rates"] = np.log10(rates.growth_rate)
rates.loc[rates.growth_rate <= 0, "log_rates"] = -16

pos = rates.query("growth_rate > 1e-6 and tradeoff == 0.5")
o = pos.groupby("genus").log_rates.mean().sort_values().index
pos.genus = pd.Categorical(pos.genus, categories=o, ordered=True)

pl = (
    ggplot(pos, aes(x="genus", y="growth_rate", color="genus"))
    + geom_jitter(width=0.2, height=0, alpha=0.5, stroke=0, size=2)
    + stat_summary(fun_y=np.mean, geom="point", size=2, fill="white", stroke=1)
    + scale_y_log10()
    + guides(color=False)
    + labs(x="", y="growth rate [1/h]")
    + theme(axis_text_x=element_text(rotation=45, hjust=1, vjust=1))
)
pl.save("figures/gcs.svg", width=12, height=4)

community_growth = (
    rates[rates.tradeoff == 0.5]
    .groupby("id")
    .apply(lambda df: sum(df.abundance * df.growth_rate))
)
print(community_growth.describe())


rs = rates[rates.tradeoff == 0.5]
scale = rs.groupby("id").abundance.apply(
    lambda a: community_growth[a.name] / a.pow(2).sum()
)
x = np.linspace(0, 1, 100)
free = pd.DataFrame(
    {
        "abundance": x,
        "max_rate": scale.max() * x,
        "min_rate": scale.min() * x,
        "mean_rate": scale.mean() * x,
    }
)
pl = (
    ggplot(rs, aes(x="abundance", y="growth_rate"))
    + geom_point(stroke=0, alpha=0.2)
    + labs(x="", y="")
    + theme_classic()
)
pl.save("figures/rate_vs_abundance.png", width=3, height=3, dpi=300)

subset = rs.id.unique()[0:10]
free_subset = (
    rs[rs.id.isin(subset)]
    .groupby("id")
    .apply(
        lambda df: pd.DataFrame(
            {"abundance": x, "growth_rate": scale[df.name] * x}
        )
    )
    .reset_index()
)
pl = (
    ggplot(
        rs[rs.id.isin(subset)], aes(x="abundance", y="growth_rate", color="id")
    )
    + geom_point(size=2)
    + geom_line(data=free_subset, linetype="dashed")
    + labs(x="abundance [%]", y="growth rate [1/h]")
    + guides(color=False)
    + theme_classic()
)
pl.save("figures/rate_vs_abundance_individual.svg", width=6, height=5)

Python Pandas numpy ggplot2 scipy From line 1 of workflows/growth_rate_figs.py

import pandas as pd
import networkx as nx
import nxviz
import matplotlib.pyplot as plt
from plotnine import *

theme_set(theme_minimal())

ko = pd.read_csv("data/knockouts.csv", index_col=0)
ko["knocked"] = ko.index
ko = ko.melt(
    id_vars=["knocked", "sample"], var_name="genus", value_name="change"
).dropna()

pos = ko.groupby(["knocked", "genus"]).change.mean().reset_index()
pos = pos[pos.knocked != pos.genus].sort_values(by="knocked")
counts = (
    ko[ko.change.abs() > 1e-2]
    .groupby("knocked")
    .apply(
        lambda df: pd.DataFrame(
            {
                "counts": [
                    df.change[df.change > 0].count(),
                    df.change[df.change < 0].count(),
                ],
                "type": ["competitive", "cooperative"],
            }
        )
    )
).reset_index()
counts = counts.sort_values(by="counts", ascending=False)
counts.knocked = pd.Categorical(
    counts.knocked, counts.knocked.unique()[::-1], ordered=True
)
print(counts.groupby("type").sum())

pl = (
    ggplot(counts, aes(x="knocked", y="counts", fill="type"))
    + geom_col()
    + labs(x="", y="strong interactions")
    + coord_flip()
)
pl.save("figures/knockout_counts.svg", width=4, height=12)

graph = nx.from_pandas_edgelist(
    pos[pos.change.abs() > 1e-2], "knocked", "genus", "change"
)

for idx, _ in graph.nodes(data=True):
    graph.node[idx]["degree"] = float(graph.degree(idx))
circos = nxviz.CircosPlot(
    graph,
    node_labels=True,
    rotate_labels=True,
    edge_cmap="bwr",
    edge_color="change",
    edge_limits=(-pos.change.abs().max(), pos.change.abs().max()),
    node_color="degree",
    figsize=(9, 7.5),
)
# circos.figure.set_dpi(200)
circos.draw()
plt.tight_layout(rect=[0.05, 0.15, 0.7, 0.8])
plt.savefig("figures/circos.svg")
plt.close()

ns = (
    ko[(ko.change.abs() > 1e-2) & (ko.knocked != ko.genus)]
    .groupby(["genus", "knocked"])
    .sample.count()
    .reset_index()
)

pl = (
    ggplot(ns, aes(x="sample"))
    + geom_histogram(bins=32, fill="lightgray", color="black")
    + labs(x="observed samples")
)
pl.save("figures/interaction_prevalence.svg", width=5, height=5)

Python Pandas ggplot2 matplotlib networkx NG-Circos nxviz From line 3 of workflows/knockout_figs.py

import pandas as pd
import micom
from micom import load_pickle
from micom.media import minimal_medium
from micom.workflows import workflow

logger = micom.logger.logger
try:
    max_procs = snakemake.threads
except NameError:
    max_procs = 20


def knockout(sam):
    com = load_pickle("data/models/" + sam + ".pickle")
    ko = com.knockout_species(fraction=0.5)
    ko["sample"] = sam
    return ko


samples = pd.read_csv("data/recent.csv")
kos = workflow(knockout, samples.run_accession, max_procs)
pd.concat(kos).to_csv("data/knockouts.csv")

Python Pandas micom From line 3 of workflows/knockouts.py

import pandas as pd
import micom
from micom import load_pickle
from micom.media import minimal_medium
from micom.workflows import workflow
from micom.logger import logger


logger = micom.logger.logger
try:
    max_procs = snakemake.threads
except NameError:
    max_procs = 20


def media_and_gcs(sam):
    com = load_pickle("data/models/" + sam + ".pickle")

    # Get growth rates
    try:
        sol = com.cooperative_tradeoff(fraction=0.5)
        rates = sol.members["growth_rate"].copy()
        rates["community"] = sol.growth_rate
        rates.name = sam
    except Exception:
        logger.warning("Could not solve cooperative tradeoff for %s." % sam)
        return None

    # Get the minimal medium
    med = minimal_medium(com, 0.95 * sol.growth_rate, exports=True)
    med.name = sam

    # Apply medium and reoptimize
    com.medium = med[med > 0]
    sol = com.cooperative_tradeoff(fraction=0.5, fluxes=True, pfba=False)
    fluxes = sol.fluxes
    fluxes["sample"] = sam
    return {"medium": med, "gcs": rates, "fluxes": fluxes}


samples = pd.read_csv("data/recent.csv")
gcs = pd.DataFrame()
media = pd.DataFrame()
fluxes = pd.DataFrame()

results = workflow(media_and_gcs, samples.run_accession, max_procs)

for r in results:
    gcs = gcs.append(r["gcs"])
    media = media.append(r["medium"])
    fluxes = fluxes.append(r["fluxes"])

gcs.to_csv("data/growth_rates.csv")
media.to_csv("data/minimal_imports.csv")
fluxes.to_csv("data/minimal_fluxes.csv.gz", compression="gzip")

Python Pandas micom From line 3 of workflows/media_and_gcs.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from plotnine import *
from mizani.formatters import percent_format
from matplotlib.ticker import LogFormatterMathtext
import joypy as jp
from scipy.stats import spearmanr, pearsonr

theme_set(theme_minimal())

rates = (
    pd.read_csv("data/tradeoff.csv")
    .dropna(subset=["abundance"])
    .rename(columns={"sample": "id", "compartments": "genus"})
)
rates["tradeoff"] = rates.tradeoff.round(2).astype("str")
rates.loc[rates.tradeoff == "nan", "tradeoff"] = "none"

replication = (
    pd.read_csv("data/replication_rates.csv")
    .query("intercept > 1")
    .groupby(["id", "genus"])
    .rate.mean()
    .reset_index()
)


def corr(df):
    """Calculates correlations.

    In order to run the analysis with Spearman correlations just substitute
    `spearmanr` for `pearsonr` here.
    """
    if df.shape[0] > 2:
        return pd.Series(
            pearsonr(df.rate, df.growth_rate) + (df.shape[0],),
            index=["rho", "p", "n"],
        )
    else:
        return pd.Series(
            [np.nan, np.nan, df.shape[0]], index=["rho", "p", "n"]
        )


both = pd.merge(rates, replication, on=["id", "genus"])
within_samples = both.groupby(["tradeoff", "id"]).apply(corr).reset_index()
across_samples = both.groupby("tradeoff").apply(corr).reset_index()

pl = (
    ggplot(within_samples[within_samples.n > 6], aes(x="tradeoff", y="rho"))
    + geom_hline(yintercept=0, linetype="dashed")
    + geom_boxplot(outlier_color="none")
    + geom_jitter(width=0.15, height=0, alpha=0.5, stroke=0)
    + labs(x="tradeoff", y="Pearson rho")
)
pl.save("figures/within_sample_correlations.svg", width=5, height=5)
within_samples.to_csv("data/correlation_per_sample.csv")

pl = (
    ggplot(across_samples[across_samples.n > 6], aes(x="tradeoff", y="rho"))
    + geom_hline(yintercept=0, linetype="dashed")
    + geom_point()
    + geom_line(aes(group=1))
    + labs(x="tradeoff", y="Pearson rho")
)
pl.save("figures/across_sample_correlations.svg", width=5, height=5)
across_samples.to_csv("data/correlation_all.csv")

rates["log_rates"] = np.log10(rates.growth_rate)
rates.loc[rates.growth_rate <= 0, "log_rates"] = -16

fig, axes = jp.joyplot(
    rates, by="tradeoff", column="log_rates", color="cornflowerblue"
)
lf = LogFormatterMathtext()
xax = axes[-1].xaxis
xax.set_ticklabels(lf(10.0 ** ex) for ex in xax.get_ticklocs())
plt.xlabel("growth rate [1/h]")
plt.ylabel("tradeoff")
plt.savefig("figures/dists.svg")
plt.close()

non_zero = (
    rates.groupby(["id", "tradeoff"])
    .apply(lambda df: (df.growth_rate > 1e-6).sum() / df.shape[0])
    .reset_index(name="non_zero")
)

pl = (
    ggplot(non_zero, aes(x="tradeoff", y="non_zero"))
    + geom_boxplot(outlier_color="none")
    + geom_jitter(width=0.15, height=0, alpha=0.5, stroke=0)
    + scale_y_continuous(labels=percent_format())
    + labs(x="tradeoff", y="percent taxa growing")
)
pl.save("figures/percent_growing.svg", width=5, height=5)

# Show some representative correlations
comp = both[(both.tradeoff == "0.5")]
w = within_samples[within_samples.tradeoff == "0.5"]
w.index = w.id
comp = comp[comp.id.isin(w.id[(w.n >= 10)])]
comp["rho"] = w.loc[comp.id, "rho"].round(2).values
comp.id = comp.id + " (r=" + comp.rho.astype(str) + ")"

pl = (
    ggplot(comp, aes(x="rate", y="growth_rate"))
    + geom_point()
    + facet_wrap("~ id", nrow=3)
    + scale_x_log10()
    + scale_y_log10()
    + labs(x="replication rate [a.u.]", y="predicted growth rate [1/h]")
)
pl.save("figures/corr_examples.png", width=12, height=6, dpi=300)

Python Pandas numpy ggplot2 matplotlib scipy joypy mizani From line 1 of workflows/tradeoff_figs.py

import micom
from micom import load_pickle
from micom.workflows import workflow
import numpy as np
import pandas as pd


tradeoffs = np.arange(0.1, 1.01, 0.1)
logger = micom.logger.logger
try:
    max_procs = snakemake.threads
except NameError:
    max_procs = 20
logger.info("Using %d threads..." % max_procs)


def growth_rates(sam):
    com = load_pickle("data/models/" + sam + ".pickle")
    sol = com.optimize()
    rates = sol.members
    rates["tradeoff"] = np.nan
    rates["sample"] = sam
    df = [rates]

    # Get growth rates
    try:
        sol = com.cooperative_tradeoff(fraction=tradeoffs)
    except Exception as e:
        logger.warning("Sample %s could not be optimized\n %s" % (sam, str(e)))
        return pd.DataFrame(
            {"tradeoff": tradeoffs, "growth_rate": np.nan, "sample": sam}
        )
    for i, s in enumerate(sol.solution):
        rates = s.members
        rates["tradeoff"] = sol.tradeoff[i]
        rates["sample"] = sam
        df.append(rates)
    df = pd.concat(df)
    return df


samples = pd.read_csv("data/recent.csv")
rates = workflow(growth_rates, samples.run_accession, max_procs)
rates = pd.concat(rates)
rates.to_csv("data/tradeoff.csv")