Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
- Lack of a description for a new keyword tool/pypi/adjustText .
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This repo includes workflows and additional data to reproduce the manuscript:
MICOM: Metagenome-Scale Modeling To Infer Metabolic Interactions in the Gut Microbiota
Christian Diener, Sean M. Gibbons, Osbaldo Resendis-Antonio
mSystems 5:e00606-19
https://doi.org/10.1128/mSystems.00606-19
Additional data is provided in the
data
subdirectory. For now we will assume you
want to start from the abundances and replication rates as those are workflow
steps that pertain to MICOM. See further below on instruction how to go start
from the raw sequencing data.
Note that we only support Linux and Mac here. If you have any questions or problems running the code please don't hesitate to file an issue .
Setting up you conda environment
We will use conda to manage all the required dependencies. If you already have Miniconda or Anaconda you can skip right ahead otherwise you will need to install Miniconda using those instructions .
Open a Terminal and make sure that conda works.
conda info
Which should look something like this:
active environment : None shell level : 0 user config file : /Users/cdiener/.condarc populated config files : /Users/cdiener/.condarc conda version : 4.8.0 conda-build version : 3.18.11 python version : 3.6.7.final.0
... offline mode : False
Now either use git to clone this repo or download the zipped version and unzip it.
git clone https://github.com/micom-dev/paper
Now switch to the directory containing the downloaded code. For instance
cd paper
Now we can set up an environment that contains 99% of the required software to run all code in this repo.
conda env create -f micom.yml
Follow this by activating your environment:
conda activate micom
MICOM models all biochemical reactions in all taxa, which means that the optimization problem MICOM solves included hundreds of thousands of variables. There are only a few numerical solvers that can solve quadratic programming problems of that scale. Right, now we support CPLEX or Gurobi , which both have free academic licenses but will require you to sign up for them. We hope to change this in the future by making MICOM compatible with an open-source solver (stay tuned!).
CPLEX
After registering and downloading the CPLEX studio for your OS unpack it (by running the provided installer) to a directory of your choice (we will assume it's called
ibm
).
Now install the CPLEX python package:
pip install ibm/cplex/python/3.6/x86-64_linux
Substitute x86-64_linux with the folder corresponding to your system (there will only be one subfolder in that directory).
Gurobi
Gurobi can be installed with conda.
conda install -c gurobi gurobi
You will now have to register the installation using your license key.
grbgetkey YOUR-LICENSE-KEY
The analyses in the paper were obtained with CPLEX 12.9.
You are now all set to reproduce the paper.
Running the workflows
Download the AGORA models
The only data not supplied in this repository is the AGORA database since it would be
to big. It can be downloaded from https://www.vmh.life/files/reconstructions/AGORA/1.03/AGORA-1.03-With-Mucins.zip .
Unzip this file an transfer all files in
reconstructions/sbml
to the
data/agora
folder.
Run all steps
We use snakemake to define a reproducible workflow that generates all analyses and panels form the paper. We do recommend to use the beefiest equipment available to you to use as many CPU cores as possible (this will speed up everything by a factor of the cores). You will need approximately 1GB of RAM per core available.
The basic workflow we are running looks like this:
To run the entire workflow from start to finish use:
snakemake --cores 16
This will recreate intermediate files in
data
and figures in
figures
.
Which will run the workflow with 16 cores and will automatically track what has been run already or what is outdated. So interrupting it will not create any problems.
Individual workflow steps are kept in the
workflows
as directory as simple
Python files you can inspect to see what is happening in detail.
The species level workflow is not fully run by default. However, species models are always built. If you want to run species level tradeoff analyses just change the corresponding files to
species_tradeoff*.py
inSnakefile
.
Processing the metagenomics data
This entire part is optional. It will generate
abundances.csv
andreplication_rates.csv
which are provided along with this repository.
Install R and packages
If you want to run the metagenomic analyses as well you will need to install R and some dependencies as well.
conda activate micom
conda install -c r r-base
then open R by typing
R
and install the packages with
install.packages(c("BiocManager", "remotes", "drake"))
setRepositories(ind=1:4)
remotes::install_github("gibbons-lab/mbtools")
Download data
Depending on your setup there may be various ways. The SRA toolkit is installed in the conda environment so you can download fastq files with
fastq-dump
. The SRA IDs can be found in
data/recent.csv
in the
run_accession
column. For instance using R:
library(data.table)
library(futile.logger)
mkdir("data/raw")
samples = fread("data/recent.csv")
for (sa in samples$run_accession) {
system2("fasterq-dump",
c("-O", "data/raw", sa))
flog.info("Downloaded run %s.", sa)
}
You will also need to download the databases for
SLIMM
which we will sue to assign taxonomy and
obtain strain resolved coverage. Those can be obtained
from here
and should be copied to
data/refs
.
Run the pipeline
The entire pipeline from raw reads to replication rates is implemented
in
process.R
which uses the workflow manager
drake
to run the analysis.
Rscript process.R
You can adjust the line
options(mc.cores = 8)
to specify your preferred number of cores.
We mostly run steps from our
mbtools
package which is
documented here
.
Code Snippets
18 19 | script: "workflows/genera.py" |
28 29 | script: "workflows/build_models.py" |
38 39 | script: "workflows/build_models.py" |
48 49 | script: "workflows/tradeoff.py" |
59 60 | script: "workflows/media_and_gcs.py" |
68 69 | script: "workflows/knockouts.py" |
77 78 | script: "workflows/elasticities.py" |
91 92 | script: "workflows/tradeoff_figs.py" |
101 102 | script: "workflows/growth_rate_figs.py" |
110 111 | script: "workflows/knockout_figs.py" |
127 128 | script: "workflows/exchange_figs.py" |
136 137 | script: "workflows/elasticity_figs.py" |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 | from os import makedirs from os.path import isfile import micom from micom import Community import pandas as pd from micom.workflows import workflow logger = micom.logger.logger try: max_procs = snakemake.threads except NameError: max_procs = 20 makedirs("data/models", exist_ok=True) taxonomy = pd.read_csv("data/genera.csv").query("relative > 1e-3") taxonomy["file"] = taxonomy.file.apply( lambda ids: ["data/agora/" + i for i in ids.split("|")] ) taxonomy["name"] = taxonomy.genus assert not taxonomy.name.str.contains(" ").any() taxonomy = taxonomy.rename(columns={"name": "id", "reads": "abundance"}) diet = pd.read_csv("data/western_diet.csv") diet.index = diet.reaction = diet.reaction.str.replace("_e", "_m") diet = diet.flux * diet.dilution def build_and_save(args): s, tax = args filename = "data/models/" + s + ".pickle" if isfile(filename): return com = Community(tax, id=s, progress=False) ex_ids = [r.id for r in com.exchanges] logger.info( "%d/%d import reactions found in model.", diet.index.isin(ex_ids).sum(), len(diet), ) com.medium = diet[diet.index.isin(ex_ids)] com.to_pickle(filename) samples = taxonomy.samples.unique() args = [(s, taxonomy[taxonomy.samples == s]) for s in samples] workflow(build_and_save, args, max_procs) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | from os.path import isfile import micom from micom import load_pickle from micom.elasticity import elasticities from micom.workflows import workflow logger = micom.logger.logger try: max_procs = snakemake.threads except NameError: max_procs = 20 def elasticity_worker(sam): """Get the exchange elasticities for a sample.""" model_file = "data/models/" + sam + ".pickle" out_file = "data/elasticities_" + sam + ".csv" if isfile(out_file): return com = load_pickle(model_file) elast = elasticities(com, fraction=0.5) elast.to_csv(out_file, index=False) samples = ["ERR260275", "ERR260214", "ERR260174"] workflow(elasticity_worker, samples, max_procs) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 | import pandas as pd import networkx as nx import nxviz import seaborn as sns import numpy as np import matplotlib.pyplot as plt samples = { "ERR260275": "healthy", "ERR260214": "T2D metformin+", "ERR260174": "T2D metformin-", } metabolites = pd.read_csv("data/metabolites.csv")[ ["abbreviation", "fullName", "keggId"] ] SCFAs = {"butyrate": "EX_but(e)", "acetate": "EX_ac(e)", "propionate": "EX_ppa(e)"} def direction(els, rid): return els[els.reaction == rid].direction.unique()[0] elast = [] for sa in samples: e = pd.read_csv("data/elasticities_" + sa + ".csv") e["id"] = sa elast.append(e) elast = pd.concat(elast) elast = elast[elast.direction == "forward"] elast["scfa"] = float("nan") for name, pattern in SCFAs.items(): elast.loc[elast.reaction.str.startswith(pattern), "scfa"] = name production = ( elast.groupby(["id", "effector", "scfa"]).elasticity.sum().reset_index() ) production["type"] = production.effector.apply( lambda e: "diet" if e.startswith("EX_") else "abundance" ) production["scfa"] = production.scfa + " " + production.id.map(samples) production["abbreviation"] = production.effector.str.replace("(EX_)|(_m)", "") production = pd.merge(production, metabolites, on="abbreviation", how="left") production.loc[production.fullName.isna(), "fullName"] = production.loc[ production.fullName.isna(), "abbreviation" ] cmap = sns.color_palette()[0:2] cmap = dict(zip(production.type.unique(), cmap)) production.fullName = production.fullName.apply( lambda s: s if len(s) < 24 else s[:24] + "..." ) ty = production[["fullName", "type"]].drop_duplicates() ty = pd.Series(ty.type.values, index=ty.fullName) typecols = ty.map(cmap).rename("type") production = production.sort_values(by="id") mat = production.pivot_table( index="scfa", columns="fullName", values="elasticity", fill_value=0 ) g = sns.clustermap( mat, cmap="seismic", vmin=-production.elasticity.abs().max(), vmax=production.elasticity.abs().max(), figsize=(32, 4), col_colors=typecols, yticklabels=True, xticklabels=True, row_cluster=False, cbar_kws={"fraction": 1.0}, ) for label in cmap: g.ax_col_dendrogram.bar(0, 0, color=cmap[label], label=label, linewidth=0) g.ax_col_dendrogram.legend(loc="best", ncol=3, frameon=False) g.ax_heatmap.set_xlabel("") g.ax_heatmap.set_ylabel("") cold = g.ax_col_dendrogram.get_position() g.ax_col_dendrogram.set_position( [cold.x0, cold.y0, cold.width, cold.height * 5] ) g.ax_row_dendrogram.set_visible(False) g.cax.set_position([0.25, 0.5, 0.01, 0.5]) g.savefig("figures/elasticities.png", dpi=300) but = production[production.scfa.str.startswith("butyrate")].sort_values( by="elasticity", ascending=False ) print(but.head(20)) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 | import pandas as pd from plotnine import * import seaborn as sns import matplotlib.pyplot as plt from sklearn.manifold import TSNE from adjustText import adjust_text from scipy.stats import ttest_ind import numpy as np from itertools import combinations from scipy.spatial.distance import pdist from skbio.stats.distance import DistanceMatrix, permanova sample_keep = ["run_accession", "subset", "status", "type"] SCFAs = { "butyrate": "EX_but\\(e\\)", "acetate": "EX_ac\\(e\\)", "propionate": "EX_ppa\\(e\\)", } theme_set(theme_minimal()) def pairwise_tests(x, y, name="tests", ref="CTRL"): groups = x.unique() combs = combinations(groups, 2) tests = [ pd.Series( ttest_ind( y[x == g[0]].dropna(), y[x == g[1]].dropna(), equal_var=False ) + (len(y), g[0], g[1]), index=["t", "pval", "n", "x", "y"], ) for g in combs ] tests = pd.DataFrame(tests) tests["name"] = name return tests def export_rates_plot(fluxes, groups, samples, log=False): dfs = [] for name, filt in groups.items(): df = fluxes[fluxes.reaction.str.contains(filt)].copy() res = samples.copy() df = df.groupby(["sample", "compartment"]).tot_flux.sum().reset_index() res["flux"] = df.groupby("sample").tot_flux.sum().abs() res["metabolite"] = name dfs.append(res) fluxes = pd.concat(dfs) fluxes.loc[fluxes.status == "ND", "status"] = "" fluxes["name"] = fluxes.status + " " + fluxes.type.fillna("") fluxes.name = fluxes.name.str.strip() fluxes = fluxes.sort_values("name") if log: fluxes.flux = np.log10(fluxes.flux) fluxes.groupby(["metabolite", "subset"]).apply( lambda df: print( pairwise_tests(df["name"], df.flux, df.name[0] + " " + df.name[1]) ) ) pl = ( ggplot(fluxes, aes(x="name", y="flux", color="metabolite")) + geom_boxplot(outlier_color="none") + geom_jitter(width=0.2, height=0) + facet_grid("metabolite ~ subset", scales="free") + guides(color=False) + labs(x="", y="flux [mmol/h]") + theme(axis_text_x=element_text(rotation=45, vjust=1, hjust=1)) ) return pl samples = pd.read_csv("data/recent.csv")[ ["run_accession", "status", "subset", "type"] ] samples = samples.rename(columns={"run_accession": "sample"}) samples.index = samples["sample"] media = pd.read_csv("data/minimal_imports.csv", index_col=0).fillna(0.0) media["sample"] = media.index media = media.melt(id_vars="sample", var_name="reaction", value_name="flux") media = media[media.flux > 0] metabolites = pd.read_csv("data/metabolites.csv") metabolites["id"] = metabolites.abbreviation + "_m" media["id"] = media.reaction.str.lstrip("EX_") media = pd.merge(media, metabolites, on="id") mat = media.dropna(subset=["flux"]).pivot("id", "sample", "flux").fillna(0) stype = ( samples.loc[mat.columns].status.fillna("") + " " + samples.loc[mat.columns].type.fillna("") ).replace({"ND CTRL": "CTRL"}) m = dict(zip(stype.unique(), sns.color_palette()[0 : (stype.nunique() + 1)])) stype_cols = stype.map(m) stype_cols.name = "group" mat = mat.apply(lambda x: x / x.abs().max(), axis=1) g = sns.clustermap( mat, cmap="BuPu", figsize=(18, 18), col_colors=stype_cols, yticklabels=True, xticklabels=True, ) for label in m: g.ax_col_dendrogram.bar(0, 0, color=m[label], label=label, linewidth=0) g.ax_col_dendrogram.legend(loc="best", ncol=2, frameon=False) g.ax_heatmap.set_xlabel("") g.ax_heatmap.set_ylabel("") col = g.ax_col_colors.get_position() g.ax_col_colors.set_position([col.x0, col.y0, col.width, col.height * 0.3]) cold = g.ax_col_dendrogram.get_position() g.ax_col_dendrogram.set_position( [cold.x0, cold.y0 - col.height * 0.7, cold.width, cold.height] ) g.ax_heatmap.set_xticklabels(g.ax_heatmap.get_xmajorticklabels(), fontsize=6) g.ax_heatmap.set_yticklabels(g.ax_heatmap.get_ymajorticklabels(), fontsize=6) g.ax_col_colors.set_yticklabels( g.ax_col_colors.get_ymajorticklabels(), fontsize=12 ) g.savefig("figures/media.png", dpi=300) plt.close() fluxes = pd.read_csv("data/minimal_fluxes.csv.gz", compression="gzip") fluxes = fluxes.melt( id_vars=["sample", "compartment"], var_name="reaction", value_name="flux" ) fluxes = fluxes[ fluxes.reaction.str.startswith("EX_") & (fluxes.compartment != "medium") & fluxes.reaction.str.endswith("(e)") ].fillna(0) fluxes["taxa"] = fluxes.compartment + "_" + fluxes["sample"] fluxes["name"] = fluxes.compartment.str.replace("_", " ") species = pd.read_csv("data/genera.csv")[["samples", "genus", "reads"]] species["name"] = species.genus totals = species.groupby("samples").reads.sum() species["relative"] = species.reads / totals[species.samples].values fluxes = pd.merge( fluxes, species, left_on=["sample", "name"], right_on=["samples", "name"] ) fluxes["tot_flux"] = fluxes.flux * fluxes.relative print("Production rates:") pl = export_rates_plot(fluxes[fluxes.tot_flux > 0], SCFAs, samples) pl.save("figures/scfas_prod.svg", width=4, height=6) print("Consumption rates:") pl = export_rates_plot(fluxes[fluxes.tot_flux < 0], SCFAs, samples) pl.save("figures/scfas_consumption.svg", width=4, height=6) print("Net rates:") pl = export_rates_plot(fluxes, SCFAs, samples) pl.save("figures/scfas_net.svg", width=4, height=6) scfa = [] for name, filt in SCFAs.items(): fl = fluxes[fluxes.reaction.str.contains(filt)].copy() fl = fl.groupby(["sample", "name", "relative"]).flux.sum().reset_index() fl["metabolite"] = name scfa.append(fl) mat = fl.pivot("sample", "name", "flux") mat = mat.loc[ :, mat.abs().mean().sort_values(ascending=False).index[0:5] ].fillna(0) sns.clustermap( mat, cmap="seismic", center=0, figsize=(6, 12), yticklabels=False ) plt.savefig("figures/" + name + ".svg") plt.close() scfa = pd.concat(scfa) ord = scfa.groupby(["name"]).relative.mean().sort_values(ascending=False) scfa.name = pd.Categorical( scfa.name.astype("str"), categories=ord.index[::-1], ordered=True ) pl = ( ggplot( scfa[scfa.name.isin(ord[ord > 0.01].index)], aes(x="name", y="flux", fill="metabolite", group="name"), ) + geom_hline(yintercept=0, linetype="dashed") + geom_boxplot(outlier_color="none") + facet_wrap("~ metabolite", scales="free", nrow=3) + guides(fill=False) + labs(x="", y="flux [mmol/(h·gDW)]") + coord_flip() ) pl.save("figures/scfas.svg", width=2.5, height=15) pl = ( ggplot( scfa[scfa.name.isin(ord[ord > 0.01].index)], aes(x="name", y="relative"), ) + geom_boxplot(outlier_color="none") + labs(x="", y="relative\nabundance") + coord_flip() ) pl.save("figures/relative_abundance.svg", width=2, height=5) mat = fluxes[fluxes.flux < 0].pivot("taxa", "reaction", "flux").fillna(0.0) taxa = mat.index.str.split("_ERR").str[0] tsne = TSNE(n_components=2).fit_transform(mat) tsne = pd.DataFrame(tsne, columns=["x", "y"], index=mat.index) tsne["taxa"] = taxa sns.set(font_scale=1.5, style="ticks") g = sns.FacetGrid(tsne, hue="taxa", height=10, aspect=16 / 10) gm = g.map(plt.scatter, "x", "y", alpha=0.25) means = tsne.groupby(taxa).agg("median").reset_index() texts = means.apply( lambda df: plt.text(df.x, df.y, df.taxa, alpha=0.65), axis=1 ) texts = adjust_text( texts, force_text=(0.02, 0.5), arrowprops=dict(arrowstyle="-|>", alpha=0.5, color="k"), ) plt.savefig("figures/individual_media.png", dpi=200) plt.close() # Some statistics about metabolite usage # indicator matrix 0 = metabolite not consumed, 1 = metabolite consumed binary = mat.where(mat < -1e-6, 0).where(mat > -1e-6, 1) # Jaccard distances = 1 - percent overlap J = pdist(binary, "jaccard") print("Jaccard distances:", pd.Series(J).describe(), sep="\n") # euclidean distances E = pdist(mat, "euclidean") # Test whether genus explains a good amount of that variation p = permanova(DistanceMatrix(E), taxa) r2 = 1 - 1 / (1 + p[4] * p[3] / (p[2] - p[3] - 1)) p["R2"] = r2 print("PERMANOVA on euclidean distances:", p, sep="\n") |
1
of
workflows/exchange_figs.py
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 | import pandas as pd import micom keep = [ "samples", "kingdom", "phylum", "class", "order", "family", "genus", "oxygenstat", "metabolism", "gram", "mtype", "genes", "file", "reads", "relative", ] def reduce_group(df): new = df.iloc[0, :] new["file"] = "|".join(df.id.apply(lambda id: f"{id}.xml")) return new agora = micom.data.agora agora_genus = agora.groupby("genus").apply(reduce_group).reset_index(drop=True) genera = pd.read_csv("data/abundances.csv") genera = genera.groupby(["id", "genus"]).sum().reset_index() genus_models = pd.merge(genera, agora_genus, on="genus", suffixes=["_x", ""]) genus_models = genus_models.rename( columns={"id_x": "samples", "mclass": "class"} )[keep] genus_models.to_csv("data/genera.csv", index=False) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 | import pandas as pd import numpy as np from plotnine import * from scipy.stats import pearsonr theme_set(theme_minimal()) rates = ( pd.read_csv("data/tradeoff.csv") .dropna(subset=["abundance"]) .rename(columns={"sample": "id", "compartments": "genus"}) ) rates["log_rates"] = np.log10(rates.growth_rate) rates.loc[rates.growth_rate <= 0, "log_rates"] = -16 pos = rates.query("growth_rate > 1e-6 and tradeoff == 0.5") o = pos.groupby("genus").log_rates.mean().sort_values().index pos.genus = pd.Categorical(pos.genus, categories=o, ordered=True) pl = ( ggplot(pos, aes(x="genus", y="growth_rate", color="genus")) + geom_jitter(width=0.2, height=0, alpha=0.5, stroke=0, size=2) + stat_summary(fun_y=np.mean, geom="point", size=2, fill="white", stroke=1) + scale_y_log10() + guides(color=False) + labs(x="", y="growth rate [1/h]") + theme(axis_text_x=element_text(rotation=45, hjust=1, vjust=1)) ) pl.save("figures/gcs.svg", width=12, height=4) community_growth = ( rates[rates.tradeoff == 0.5] .groupby("id") .apply(lambda df: sum(df.abundance * df.growth_rate)) ) print(community_growth.describe()) rs = rates[rates.tradeoff == 0.5] scale = rs.groupby("id").abundance.apply( lambda a: community_growth[a.name] / a.pow(2).sum() ) x = np.linspace(0, 1, 100) free = pd.DataFrame( { "abundance": x, "max_rate": scale.max() * x, "min_rate": scale.min() * x, "mean_rate": scale.mean() * x, } ) pl = ( ggplot(rs, aes(x="abundance", y="growth_rate")) + geom_point(stroke=0, alpha=0.2) + labs(x="", y="") + theme_classic() ) pl.save("figures/rate_vs_abundance.png", width=3, height=3, dpi=300) subset = rs.id.unique()[0:10] free_subset = ( rs[rs.id.isin(subset)] .groupby("id") .apply( lambda df: pd.DataFrame( {"abundance": x, "growth_rate": scale[df.name] * x} ) ) .reset_index() ) pl = ( ggplot( rs[rs.id.isin(subset)], aes(x="abundance", y="growth_rate", color="id") ) + geom_point(size=2) + geom_line(data=free_subset, linetype="dashed") + labs(x="abundance [%]", y="growth rate [1/h]") + guides(color=False) + theme_classic() ) pl.save("figures/rate_vs_abundance_individual.svg", width=6, height=5) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 | import pandas as pd import networkx as nx import nxviz import matplotlib.pyplot as plt from plotnine import * theme_set(theme_minimal()) ko = pd.read_csv("data/knockouts.csv", index_col=0) ko["knocked"] = ko.index ko = ko.melt( id_vars=["knocked", "sample"], var_name="genus", value_name="change" ).dropna() pos = ko.groupby(["knocked", "genus"]).change.mean().reset_index() pos = pos[pos.knocked != pos.genus].sort_values(by="knocked") counts = ( ko[ko.change.abs() > 1e-2] .groupby("knocked") .apply( lambda df: pd.DataFrame( { "counts": [ df.change[df.change > 0].count(), df.change[df.change < 0].count(), ], "type": ["competitive", "cooperative"], } ) ) ).reset_index() counts = counts.sort_values(by="counts", ascending=False) counts.knocked = pd.Categorical( counts.knocked, counts.knocked.unique()[::-1], ordered=True ) print(counts.groupby("type").sum()) pl = ( ggplot(counts, aes(x="knocked", y="counts", fill="type")) + geom_col() + labs(x="", y="strong interactions") + coord_flip() ) pl.save("figures/knockout_counts.svg", width=4, height=12) graph = nx.from_pandas_edgelist( pos[pos.change.abs() > 1e-2], "knocked", "genus", "change" ) for idx, _ in graph.nodes(data=True): graph.node[idx]["degree"] = float(graph.degree(idx)) circos = nxviz.CircosPlot( graph, node_labels=True, rotate_labels=True, edge_cmap="bwr", edge_color="change", edge_limits=(-pos.change.abs().max(), pos.change.abs().max()), node_color="degree", figsize=(9, 7.5), ) # circos.figure.set_dpi(200) circos.draw() plt.tight_layout(rect=[0.05, 0.15, 0.7, 0.8]) plt.savefig("figures/circos.svg") plt.close() ns = ( ko[(ko.change.abs() > 1e-2) & (ko.knocked != ko.genus)] .groupby(["genus", "knocked"]) .sample.count() .reset_index() ) pl = ( ggplot(ns, aes(x="sample")) + geom_histogram(bins=32, fill="lightgray", color="black") + labs(x="observed samples") ) pl.save("figures/interaction_prevalence.svg", width=5, height=5) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | import pandas as pd import micom from micom import load_pickle from micom.media import minimal_medium from micom.workflows import workflow logger = micom.logger.logger try: max_procs = snakemake.threads except NameError: max_procs = 20 def knockout(sam): com = load_pickle("data/models/" + sam + ".pickle") ko = com.knockout_species(fraction=0.5) ko["sample"] = sam return ko samples = pd.read_csv("data/recent.csv") kos = workflow(knockout, samples.run_accession, max_procs) pd.concat(kos).to_csv("data/knockouts.csv") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 | import pandas as pd import micom from micom import load_pickle from micom.media import minimal_medium from micom.workflows import workflow from micom.logger import logger logger = micom.logger.logger try: max_procs = snakemake.threads except NameError: max_procs = 20 def media_and_gcs(sam): com = load_pickle("data/models/" + sam + ".pickle") # Get growth rates try: sol = com.cooperative_tradeoff(fraction=0.5) rates = sol.members["growth_rate"].copy() rates["community"] = sol.growth_rate rates.name = sam except Exception: logger.warning("Could not solve cooperative tradeoff for %s." % sam) return None # Get the minimal medium med = minimal_medium(com, 0.95 * sol.growth_rate, exports=True) med.name = sam # Apply medium and reoptimize com.medium = med[med > 0] sol = com.cooperative_tradeoff(fraction=0.5, fluxes=True, pfba=False) fluxes = sol.fluxes fluxes["sample"] = sam return {"medium": med, "gcs": rates, "fluxes": fluxes} samples = pd.read_csv("data/recent.csv") gcs = pd.DataFrame() media = pd.DataFrame() fluxes = pd.DataFrame() results = workflow(media_and_gcs, samples.run_accession, max_procs) for r in results: gcs = gcs.append(r["gcs"]) media = media.append(r["medium"]) fluxes = fluxes.append(r["fluxes"]) gcs.to_csv("data/growth_rates.csv") media.to_csv("data/minimal_imports.csv") fluxes.to_csv("data/minimal_fluxes.csv.gz", compression="gzip") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 | import pandas as pd import numpy as np import matplotlib.pyplot as plt from plotnine import * from mizani.formatters import percent_format from matplotlib.ticker import LogFormatterMathtext import joypy as jp from scipy.stats import spearmanr, pearsonr theme_set(theme_minimal()) rates = ( pd.read_csv("data/tradeoff.csv") .dropna(subset=["abundance"]) .rename(columns={"sample": "id", "compartments": "genus"}) ) rates["tradeoff"] = rates.tradeoff.round(2).astype("str") rates.loc[rates.tradeoff == "nan", "tradeoff"] = "none" replication = ( pd.read_csv("data/replication_rates.csv") .query("intercept > 1") .groupby(["id", "genus"]) .rate.mean() .reset_index() ) def corr(df): """Calculates correlations. In order to run the analysis with Spearman correlations just substitute `spearmanr` for `pearsonr` here. """ if df.shape[0] > 2: return pd.Series( pearsonr(df.rate, df.growth_rate) + (df.shape[0],), index=["rho", "p", "n"], ) else: return pd.Series( [np.nan, np.nan, df.shape[0]], index=["rho", "p", "n"] ) both = pd.merge(rates, replication, on=["id", "genus"]) within_samples = both.groupby(["tradeoff", "id"]).apply(corr).reset_index() across_samples = both.groupby("tradeoff").apply(corr).reset_index() pl = ( ggplot(within_samples[within_samples.n > 6], aes(x="tradeoff", y="rho")) + geom_hline(yintercept=0, linetype="dashed") + geom_boxplot(outlier_color="none") + geom_jitter(width=0.15, height=0, alpha=0.5, stroke=0) + labs(x="tradeoff", y="Pearson rho") ) pl.save("figures/within_sample_correlations.svg", width=5, height=5) within_samples.to_csv("data/correlation_per_sample.csv") pl = ( ggplot(across_samples[across_samples.n > 6], aes(x="tradeoff", y="rho")) + geom_hline(yintercept=0, linetype="dashed") + geom_point() + geom_line(aes(group=1)) + labs(x="tradeoff", y="Pearson rho") ) pl.save("figures/across_sample_correlations.svg", width=5, height=5) across_samples.to_csv("data/correlation_all.csv") rates["log_rates"] = np.log10(rates.growth_rate) rates.loc[rates.growth_rate <= 0, "log_rates"] = -16 fig, axes = jp.joyplot( rates, by="tradeoff", column="log_rates", color="cornflowerblue" ) lf = LogFormatterMathtext() xax = axes[-1].xaxis xax.set_ticklabels(lf(10.0 ** ex) for ex in xax.get_ticklocs()) plt.xlabel("growth rate [1/h]") plt.ylabel("tradeoff") plt.savefig("figures/dists.svg") plt.close() non_zero = ( rates.groupby(["id", "tradeoff"]) .apply(lambda df: (df.growth_rate > 1e-6).sum() / df.shape[0]) .reset_index(name="non_zero") ) pl = ( ggplot(non_zero, aes(x="tradeoff", y="non_zero")) + geom_boxplot(outlier_color="none") + geom_jitter(width=0.15, height=0, alpha=0.5, stroke=0) + scale_y_continuous(labels=percent_format()) + labs(x="tradeoff", y="percent taxa growing") ) pl.save("figures/percent_growing.svg", width=5, height=5) # Show some representative correlations comp = both[(both.tradeoff == "0.5")] w = within_samples[within_samples.tradeoff == "0.5"] w.index = w.id comp = comp[comp.id.isin(w.id[(w.n >= 10)])] comp["rho"] = w.loc[comp.id, "rho"].round(2).values comp.id = comp.id + " (r=" + comp.rho.astype(str) + ")" pl = ( ggplot(comp, aes(x="rate", y="growth_rate")) + geom_point() + facet_wrap("~ id", nrow=3) + scale_x_log10() + scale_y_log10() + labs(x="replication rate [a.u.]", y="predicted growth rate [1/h]") ) pl.save("figures/corr_examples.png", width=12, height=6, dpi=300) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | import micom from micom import load_pickle from micom.workflows import workflow import numpy as np import pandas as pd tradeoffs = np.arange(0.1, 1.01, 0.1) logger = micom.logger.logger try: max_procs = snakemake.threads except NameError: max_procs = 20 logger.info("Using %d threads..." % max_procs) def growth_rates(sam): com = load_pickle("data/models/" + sam + ".pickle") sol = com.optimize() rates = sol.members rates["tradeoff"] = np.nan rates["sample"] = sam df = [rates] # Get growth rates try: sol = com.cooperative_tradeoff(fraction=tradeoffs) except Exception as e: logger.warning("Sample %s could not be optimized\n %s" % (sam, str(e))) return pd.DataFrame( {"tradeoff": tradeoffs, "growth_rate": np.nan, "sample": sam} ) for i, s in enumerate(sol.solution): rates = s.members rates["tradeoff"] = sol.tradeoff[i] rates["sample"] = sam df.append(rates) df = pd.concat(df) return df samples = pd.read_csv("data/recent.csv") rates = workflow(growth_rates, samples.run_accession, max_procs) rates = pd.concat(rates) rates.to_csv("data/tradeoff.csv") |
Support
- Future updates