Data analysis related to the Bioconda paper


This Snakemake workflow automatically generates all results and figures from the Bioconda paper.

Requirements

Any 64-bit Linux installation with GLIBC 2.5 or newer (i.e. any Linux distribution that is newer than CentOS 6). Note that restricting this workflow to Linux is purely a design decision (to save space and ensure reproducibility) and is not related to Conda/Bioconda; in general, Bioconda packages are available for both Linux and macOS.
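
If you are unsure which GLIBC version your system provides, one way to check on glibc-based distributions is

ldd --version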

Usage

This workflow can be used to recreate all results found in the Bioconda paper.

Step 1: Setup system

Variant a: Installing Miniconda on your system

If you are on a Linux system with GLIBC 2.5 or newer (i.e. any Linux distribution that is newer than CentOS 6), you can simply install Miniconda3 with

curl -o /tmp/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash /tmp/miniconda.sh

Make sure to answer "yes" when asked whether your PATH variable should be modified. Afterwards, open a new shell/terminal.
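
In the new shell, you can verify the installation with

conda --version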

Variant b: Use a Docker container

Otherwise, e.g., on macOS, or if you don't want to modify your system setup, install Docker and run

docker run -it continuumio/miniconda3 /bin/bash

and execute all the following steps within that container.
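
Alternatively, you can bind-mount a host directory into the container, e.g. with

docker run -it -v "$(pwd)":/bioconda-workflow continuumio/miniconda3 /bin/bash

(here, /bioconda-workflow is just a mount point of your choice). Results written below that directory are then directly visible on the host, which makes the docker cp step in step 5 unnecessary.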

Variant c: Use an existing Miniconda installation

If you want to use an existing Miniconda installation, please be aware that this is only possible if it uses Python 3 by default. You can check this via

python --version

Further, ensure it is up to date with

conda update --all

Step 2: Setup Bioconda channel

Set up Bioconda with

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
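
Each --add command puts the given channel at the highest priority, so after these three commands the order is bioconda, conda-forge, defaults. You can verify the resulting order with

conda config --show channels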

Step 3: Install bioconda-utils and Snakemake

Install bioconda-utils and Snakemake >=4.6.0 with

conda install bioconda-utils snakemake

If you already have an older version of Snakemake, please make sure it is updated to >=4.6.0.
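
You can check the installed version with

snakemake --version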

Step 4: Download the workflow

First, create a working directory:

mkdir bioconda-workflow
cd bioconda-workflow

Then, download the workflow archive from https://doi.org/10.5281/zenodo.1068297 and unpack it with

tar -xf bioconda-paper-workflow.tar.gz
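
If you prefer to download via the command line (e.g., inside the Docker container), you can use curl with the direct file URL shown on the Zenodo record page (the placeholder below is not a real URL):

curl -L -o bioconda-paper-workflow.tar.gz <archive-url-from-zenodo-record>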

Step 5: Run the workflow

Execute the analysis workflow with Snakemake

snakemake --use-conda
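
Depending on your Snakemake version, you may additionally have to specify the number of cores to use (mandatory in newer releases), e.g.

snakemake --use-conda --cores 4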

Please wait a few minutes for the analysis to finish. Results can be found in the folder figs/. If you have been running the workflow in the Docker container (see above), you can obtain the results with

docker cp <container-id>:/bioconda-workflow/figs .

with <container-id> being the ID of the container.
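
If you do not know the container ID, you can list all containers (including stopped ones) with

docker ps -a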

Known errors

  • If you see an error like

    ImportError: No module named 'appdirs'
    

    when starting Snakemake, you are likely suffering from a bug in an older conda version. Make sure to update your conda installation with

    conda update --all
    

    and then reinstall the appdirs and snakemake packages with

    conda install -f appdirs snakemake
    
  • If you see an error like

    ImportError: Missing required dependencies ['numpy']
    

    you are likely suffering from a bug in an older conda version. Make sure to update your conda installation with

    conda update --all
    

    and then reinstall the snakemake package with

    conda install -f snakemake
    

Code Snippets
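
The snippets below are the Python scripts driving the workflow, followed by excerpts from the Snakefile; the script names are taken from the script directives in those excerpts. Within each script, the snakemake object is injected by Snakemake's script directive and provides the rule's input files, output files, wildcards, and parameters.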

scripts/author-list.py:
import os

import pandas as pd
from github import Github

authors = pd.read_table(snakemake.input[0], index_col=0)
authors["commits"] = 0
def add(commit):
    if commit.author and commit.author.login in authors.index:
        authors.loc[commit.author.login, "commits"] += 1

# add commits
github = Github(os.environ["GITHUB_TOKEN"])
repo = github.get_repo("bioconda/bioconda-recipes")
utils_repo = github.get_repo("bioconda/bioconda-utils")

for commit in repo.get_commits():
    add(commit)

for commit in utils_repo.get_commits():
    add(commit)

# order by commits
authors.sort_values("commits", inplace=True, ascending=False)

first_authors = ["bgruening", "daler"]
core_authors = ["chapmanb", "jerowe", "tomkinsc", "rvalieris", "druvus"]
last_author = ["johanneskoester"]

# put core into the right order
authors = pd.concat([
    authors.loc[first_authors],
    authors.loc[core_authors].sort_values("commits", ascending=False),
    authors.loc[~authors.index.isin(first_authors + core_authors + last_author)],
    authors.loc[last_author]])

authors.to_csv(snakemake.output.table, sep="\t")

scripts/collect-pkg-data.py:
import json
import pandas as pd


packages = []
ndownloads = []
ecosystem = []
versions = []
deps = []
for path in snakemake.input:
    with open(path) as f:
        meta = json.load(f)
        ndownloads.append(sum(f["ndownloads"] for f in meta["files"]))
        name = meta["full_name"].split("/")[1]
        assert name not in packages, "duplicate package: {}".format(name)
        packages.append(name)
        versions.append(len(meta["versions"]))

        if name.startswith("bioconductor-"):
            ecosystem.append("Bioconductor")
        elif name.startswith("r-"):
            ecosystem.append("R")
        else:
            def check_for_dep(dep):
                for f in meta["files"]:
                    for d in f["dependencies"]["depends"]:
                        if d["name"] == dep:
                            return True
                return False
            if check_for_dep("python"):
                ecosystem.append("Python")
            elif check_for_dep("perl") or check_for_dep("perl-threaded"):
                ecosystem.append("Perl")
            else:
                ecosystem.append("Other")

        # store the number of dependencies, based on the first file (index 0) of the package
        deps.append(len(meta['files'][0]['attrs']['depends']))


packages = pd.DataFrame({
    "package": packages,
    "downloads": ndownloads,
    "ecosystem": ecosystem,
    "versions": versions,
    "deps": deps
}, columns=["package", "ecosystem", "downloads", "versions", "deps"])

packages.sort_values("downloads", ascending=False, inplace=True)

packages.to_csv(snakemake.output[0], sep="\t", index=False)

scripts/collect-pr-data.py:
import os
import pandas as pd
from github import Github

github = Github(os.environ["GITHUB_TOKEN"])

repo = github.get_repo("bioconda/bioconda-recipes")

prs = []
titles = []
files = []
spans = []
for pr in repo.get_pulls(state="closed"):
    print(pr)
    if pr.merged:
        prs.append(pr.id)
        titles.append(pr.title)
        files.append(pr.changed_files)
        spans.append(pr.merged_at - pr.created_at)

prs = pd.DataFrame({
    "id": prs,
    "title": titles,
    "changed_files": files,
    "span": spans
})

prs.to_csv(snakemake.output[0], sep="\t", index=False)

scripts/collect-summaries-and-urls.py:
import os
import pandas
from bioconda_utils import utils

repo_dir = os.path.dirname(os.path.dirname(snakemake.input[0]))
recipes = list(utils.get_recipes(os.path.join(repo_dir, 'recipes')))
config = os.path.join(repo_dir, 'config.yml')
df = []
for r in recipes:
    meta = next(utils.load_all_meta(r, config))
    d = dict(
        not_bio_related=" ",
        summary=meta.get('about', {}).get('summary', "").replace('\n', ''),
        name=meta['package']['name'],
        url=meta.get('about', {}).get('home', ""),
    )
    df.append(d)
df = pandas.DataFrame(df).drop_duplicates('name')
df = df.sort_values('name')
df = df[['not_bio_related', 'name', 'summary', 'url']]
df.to_csv(snakemake.output[0], sep='\t', index=False)

scripts/color-dag.py:
import matplotlib
matplotlib.use("agg")
import pandas as pd
import seaborn as sns
import networkx as nx
from networkx.drawing.nx_pydot import read_dot, graphviz_layout
from matplotlib.colors import rgb2hex
import matplotlib.pyplot as plt
from matplotlib.ticker import NullLocator

packages = pd.read_table(snakemake.input.pkg, index_col=0)
packages.loc[packages.ecosystem == 'Bioconductor', 'ecosystem'] = 'Bioconductor/R'
packages.loc[packages.ecosystem == 'R', 'ecosystem'] = 'Bioconductor/R'
lookup = packages['ecosystem'].to_dict()
colors = dict(zip(['Bioconductor/R', 'Other', 'Python', 'Perl'], sns.color_palette('colorblind')))
g = read_dot(snakemake.input.dag)
# reduce to largest connected component
g = max(nx.weakly_connected_component_subgraphs(g), key=len)
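# Note: weakly_connected_component_subgraphs was removed in networkx 2.4.
# On newer networkx versions, an equivalent is:
#   g = g.subgraph(max(nx.weakly_connected_components(g), key=len)).copy()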

pkg = snakemake.wildcards.pkg
# obtain dependencies
deps = set(nx.ancestors(g, pkg))
sub = deps | {pkg}

pos = graphviz_layout(g, prog='neato')

plt.figure(figsize=(6,6))
# draw DAG
nx.draw_networkx_edges(g, pos, edge_color='#777777', alpha=0.5, arrows=False)
nx.draw_networkx_nodes(g, pos, node_color='#333333', alpha=0.5, node_size=6)

# draw induced subdag
nx.draw_networkx_edges(g, pos,
                       edgelist=[(u, v) for u, v in g.edges(sub) if u in sub and v in sub],
                       edge_color='k', width=3.0, arrows=False)
nx.draw_networkx_nodes(g, pos, nodelist=deps,
                       node_color=[rgb2hex(colors[lookup[v]]) for v in deps],
                       linewidths=0, node_size=120)
nx.draw_networkx_nodes(g, pos, nodelist=[pkg],
                       node_color=rgb2hex(colors[lookup[pkg]]),
                       linewidths=0, node_size=120, node_shape='s')
xs = [x for x, y in pos.values()]
ys = [y for x, y in pos.values()]
plt.xlim((min(xs) - 10, max(xs) + 10))
plt.ylim((min(ys) - 10, max(ys) + 10))
# remove whitespace
plt.axis('off')
plt.gca().xaxis.set_major_locator(NullLocator())
plt.gca().yaxis.set_major_locator(NullLocator())

plt.savefig(snakemake.output[0], bbox_inches='tight')

scripts/fig1.py:
from svgutils.compose import *
from common import label

Figure(
    "22cm", "6cm",
    Panel(SVG(snakemake.input.ecosystems), label("a")),
    Panel(SVG(snakemake.input.downloads), label("b")).move(285, 0),
    Panel(SVG(snakemake.input.comp).scale(0.9).move(10, 0), label("c")).move(560, 0),
    Panel(SVG(snakemake.input.age).scale(0.9).move(19, 0)).move(560, 90),
    # Grid(40, 40)
).save(snakemake.output[0])

scripts/fig2.py:
from svgutils.compose import *
from common import label

Figure(
    "24cm", "6.1cm",
    Panel(SVG(snakemake.input.contributions), label("a")),
    Panel(SVG(snakemake.input.add_del), label("b").move(0, -10)).move(0, 90),
    Panel(SVG(snakemake.input.dag).scale(0.6), label("c")).move(285, 0),
    Panel(SVG(snakemake.input.workflow).scale(0.5), label("d")).move(505, 0),
    Panel(SVG(snakemake.input.turnaround).scale(0.9).move(5, 0), label("e")).move(505, 50),
    Panel(SVG(snakemake.input.usage).scale(0.5), label("f")).move(505, 130)
    #Grid(40, 40)
).save(snakemake.output[0])

scripts/parse-log.py:
import pandas
import os
import datetime

infile = snakemake.input[0]
outfile = snakemake.output[0]


class chunk(object):
    def __init__(self, block):
        commit, author, time = block[0].split('\t')
        self.author = author
        self.time = datetime.datetime.strptime(time.split('T')[0], "%Y-%m-%d")
        self._block = block
        self.recipes = self._parse_recipes(block[1:])

    def _parse_recipes(self, block):
        recipes = []
        for i in block:
            if not i.startswith('recipes/'):
                continue
            if os.path.basename(i) != 'meta.yaml':
                continue
            recipes.append(os.path.dirname(i.replace('recipes/', '')))
        return set(recipes)


def gen():
    lines = []
    for line in open(infile):
        line = line.strip()
        if len(line) == 0:
            yield chunk(lines)
            lines = []
            continue
        lines.append(line.strip())
    yield chunk(lines)


dfs = []
cumulative_recipes = set()
cumulative_authors = set()
for i in sorted(gen(), key=lambda x: x.time):
    if len(i.recipes) == 0:
        continue

    unique_recipes = i.recipes.difference(cumulative_recipes)
    if len(unique_recipes) > 0:
        dfs.append(
            {
                'time': i.time,
                'author': i.author,
                'recipes': unique_recipes,
                'nadded': len(unique_recipes),
                'new_author': i.author not in cumulative_authors
            },
        )
    cumulative_recipes.update(i.recipes)
    cumulative_authors.update([i.author])

df = pandas.DataFrame(dfs)
df['cumulative_authors'] = df.new_author.astype(int).cumsum()
df['cumulative_recipes'] = df.nadded.cumsum()
df["time"] = pandas.to_datetime(df["time"])
df.to_csv(outfile, sep='\t')

scripts/plot-add-del.py:
import os
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import seaborn as sns
from github import Github
import matplotlib.dates as mdates

import common

github = Github(os.environ["GITHUB_TOKEN"])

repo = github.get_repo("bioconda/bioconda-recipes")

weeks = []
additions = []
deletions = []
print(repo.get_stats_participation().all)
for freq in repo.get_stats_code_frequency():
    weeks.append(freq.week)
    additions.append(freq.additions)
    deletions.append(abs(freq.deletions))


plt.figure(figsize=(4,1.2))

plt.semilogy(weeks, additions, "-", label="additions")
plt.semilogy(weeks, deletions, "-", label="deletions")
plt.ylabel("count per week")
plt.legend(bbox_to_anchor=(0.68, 0.65))

plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.xticks(rotation=45, ha="right")

sns.despine()

plt.savefig(snakemake.output[0], bbox_inches="tight")

scripts/plot-age-vs-downloads.py:
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import datetime
import numpy as np
import seaborn as sns
import pandas as pd

import common

try:
    log = snakemake.input.log
    pkg = snakemake.input.pkg
    outfile = snakemake.output[0]
except NameError:
    # run in the scripts dir for interactive clicking of points
    log = '../git-log/parsed-log.tsv'
    pkg = '../package-data/all.tsv'
    outfile = None

c = pd.read_table(log)
d = pd.read_table(pkg)

s = c.apply(lambda x: pd.Series(list(eval(x['recipes']))), axis=1).stack().reset_index(level=1, drop=True)
s.name = 'recipe'
cc = c.join(s)[['recipe', 'time']]
cc['package'] = cc.recipe.apply(lambda x: x.split('/')[0])
e = cc.groupby('package')['time'].agg('min')
df = d.set_index('package')
df['time'] = pd.to_datetime(e)
df['time'] -= pd.Timestamp(datetime.datetime.now())
df['days'] = df.dropna().time.apply(lambda x: -x.days)
df['log10 downloads'] = np.log10(df['downloads'] + 1)

# note we have to dropna ahead of time so that when interactively picking
# points, the event ind matches the df ind
df = df.dropna()

def callback(event):
    print(df.iloc[event.ind])


fig = plt.figure()
ax = fig.add_subplot(111)

sns.regplot('days', 'log10 downloads', df, ax=ax, scatter_kws=dict(picker=5, s=2, color='k', alpha=0.6))
plt.gca().set_xlabel('Package age (days)')
sns.despine()

if outfile:
    plt.savefig(outfile)
else:
    plt.gcf().canvas.mpl_connect('pick_event', callback)
    plt.show()

scripts/plot-comparison.py:
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from datetime import datetime
import matplotlib.dates as mdates
import numpy as np

import common


summary = pd.read_table(snakemake.input[0])
bio_related = summary.shape[0] - (summary["not_bio_related"] == "x").sum()
# Counts from October 2017
data = pd.DataFrame.from_dict({
    "Bioconda": [bio_related, "2015-09"],
    "Debian Med": [882, "2002-05"],
    "Gentoo Science": [480, "2005-10"],   # category sci-biology
    "EasyBuild": [371, "2012-03"], # moduleclass bio
    "Biolinux": [308, "2006"],
    "Homebrew Science": [297, "2009-10"], # tag bioinformatics
    "GNU Guix": [254, "2014-12"],         # category bioinformatics
    "BioBuilds": [118, "2015-11"]}, orient="index").reset_index()
data.columns = ["source", "count", "date"]
data["date"] = pd.to_datetime(data["date"])
# age in years
data["age"] = pd.to_timedelta(datetime.now() - data["date"]).astype('timedelta64[M]') / 12

plt.figure(figsize=(4,1))

sns.barplot(x="source", y="count", data=data)
plt.gca().set_xticklabels([])
plt.xlabel("")
plt.ylabel("Number of explicitly\nbio-related packages")

# set maximum tick to be that of bioconda
yticks = plt.gca().get_yticks()
yticks[-1] = bio_related
plt.gca().set_yticks(yticks)

sns.despine()
plt.savefig(snakemake.output.counts, bbox_inches="tight")

plt.figure(figsize=(4,1))

sns.barplot(x="source", y="age", data=data)
plt.xlabel("")
plt.ylabel("\nage in years")
plt.xticks(rotation=45, ha="right")
#plt.gca().yaxis.set_major_formatter(mdates.AutoDateFormatter(mdates.AutoDateLocator()))


sns.despine()
plt.savefig(snakemake.output.age, bbox_inches="tight")

# store results as csv
data[["source", "count", "age"]].to_csv(snakemake.output.csv, sep="\t", index=False)

scripts/plot-contributions.py:
import os
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import datetime
import pandas as pd
import matplotlib.dates as mdates

import common

infile = snakemake.input[0]
outfile = snakemake.output[0]

df = pd.read_table(infile)
df["time"] = pd.to_datetime(df["time"])
fig = plt.figure(figsize=(4,1))
plt.semilogy('time', 'cumulative_authors', data=df, label="contributors")
plt.semilogy('time', 'cumulative_recipes', data=df, label="recipes")
plt.legend()
plt.ylabel("count")
plt.xlabel("")

# deactivate xticks because we have them in the plot below in the figure
plt.xticks([])
sns.despine()

fig.savefig(outfile, bbox_inches="tight")

scripts/plot-downloads.py:
import os
import glob
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

import common

plt.figure(figsize=(4,2))
packages = pd.read_table(snakemake.input[0])
total_downloads = packages["downloads"].sum()
packages.loc[packages.ecosystem == 'Bioconductor', 'ecosystem'] = 'Bioconductor/R'
packages.loc[packages.ecosystem == 'R', 'ecosystem'] = 'Bioconductor/R'

# In case we want to filter downloads by whether or not a current recipe exists
recipes = set(map(os.path.basename, glob.glob('bioconda-recipes/recipes/*')))

sns.boxplot(x="ecosystem",
            y="downloads",
            data=packages,
            color="white",
            whis=False,
            showfliers=False,
            order=['Bioconductor/R', 'Other', 'Python', 'Perl'],
           )
sns.stripplot(x="ecosystem",
              y="downloads",
              data=packages,
              jitter=True,
              alpha=0.5,
              order=['Bioconductor/R', 'Other', 'Python', 'Perl'],
             )
plt.gca().set_yscale("log")
plt.ylabel("downloads (total: {:,})".format(total_downloads))
sns.despine()


plt.savefig(snakemake.output[0], bbox_inches="tight")

# Violin plots to see a little more structure (e.g., 3 tiers of downloads in
# Perl, BioC, R) and lower-limits (e.g., all BioC downloaded at least once, but
# some Perl, Python, R never downloaded).
#
# Take the log10 ahead of time so the KDE works well.
packages['log10 downloads'] = np.log10(packages.downloads + 1)

fig = plt.figure(figsize=(4, 3))
ax = fig.add_subplot(1, 1, 1)
sns.violinplot(
    x="ecosystem",
    y="log10 downloads",
    alpha=0.5,
    cut=0,
    data=packages,
    ax=ax,
    order=['Bioconductor/R', 'Other', 'Python', 'Perl'],
)
ax.text(x=0.5, y=1.0, s="Total downloads: {:,}".format(total_downloads),
         horizontalalignment="center", verticalalignment="top",
        transform=ax.transAxes)
ax.set_xlabel('')
plt.ylabel("downloads")
ax.set_yticklabels(["$10^{{{:.0f}}}$".format(y) for y in ax.get_yticks()])

# make a little room for the "total" text
ax.axis(ymax=6)
fig.tight_layout()
sns.despine()
plt.savefig(snakemake.output[1], bbox_inches="tight")

scripts/plot-ecosystems.py:
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import glob
import os

import common

plt.figure(figsize=(4,2))
summary = pd.read_table(snakemake.input.bio)
not_bio_related = summary['name'][summary.not_bio_related == 'x']

packages = pd.read_table(snakemake.input.pkg_data)
packages.loc[packages.ecosystem == 'Bioconductor', 'ecosystem'] = 'Bioconductor/R'
packages.loc[packages.ecosystem == 'R', 'ecosystem'] = 'Bioconductor/R'
recipes = set(map(os.path.basename, glob.glob('bioconda-recipes/recipes/*')))

packages['has_current_recipe'] = packages['package'].isin(recipes)
packages['not_bio_related'] = packages['package'].isin(not_bio_related)

fig = plt.figure(figsize=(4, 3))
ax = fig.add_subplot(1, 1, 1)
all_cnts = packages.ecosystem.value_counts()
bio_cnts = packages[~packages.not_bio_related].ecosystem.value_counts()
non_cnts = packages[packages.not_bio_related].ecosystem.value_counts()

x = range(len(all_cnts))
ax.bar(x=x, height=bio_cnts, color=sns.color_palette())
ax.bar(x=x, height=non_cnts, bottom=bio_cnts, color=sns.color_palette(sns.color_palette(), desat=0.5))
ax.set_ylabel('Available packages')
ax.set_xticks(x)
ax.set_xticklabels(list(all_cnts.index))
ax.set_ylabel("count")
ax.text(x=0.5, y=1, s="Total packages: {}".format(packages.shape[0]),
         horizontalalignment="center", verticalalignment="top",
        transform=ax.transAxes)
sns.despine()
fig.tight_layout()

plt.savefig(snakemake.output[0], bbox_inches="tight")

scripts/plot-package-degrees.py:
import glob
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import seaborn as sns
import json
import pandas as pd

import common

plt.figure(figsize=(4,2))
packages = pd.read_table(snakemake.input[0])

deps = packages["deps"]


plt.hist(deps, range(0,30), lw=1)
plt.xlim([0,30])
plt.grid()
plt.xlabel("Package degree", fontsize=16)


plt.savefig(snakemake.output[0], bbox_inches="tight")

scripts/plot-turnaround.py:
from datetime import timedelta
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

import common

# Default palette was a little too dark for the text to show up in the last
# block; increasing available colors lets us stay on the lighter side of the
# palette.
sns.set_palette("Greys", n_colors=8)

prs = pd.read_table(snakemake.input[0])
prs.span = pd.to_timedelta(prs.span)

categories = pd.Series([timedelta(minutes=0), timedelta(minutes=30),
                        timedelta(hours=1), timedelta(hours=5),
                        timedelta(days=1),
                        timedelta(days=365)])
labels = [r"$\leq 30$ min", r"$\leq 1$ hour",
          r"$\leq 5$ hours", r"$\leq 1$ day", r"$ > 1$ day"]
binning = pd.cut(prs.span.dt.total_seconds(),
                 categories.dt.total_seconds(),
                 labels=labels)
counts = binning.value_counts()

# fix order
counts = counts[labels]

perc = counts / counts.sum()
fig = plt.figure(figsize=(5, 1.1))
ax = fig.add_subplot(1, 1, 1)
left = 0
for label, x in perc.items():
    ax.barh(y=0, width=x, left=left, label=label)
    ax.text(left + x/2, 0, label, horizontalalignment='center', verticalalignment='center')
    left += x

sns.despine(top=True, left=True, right=True, trim=True)
ax.set_xlabel('Fraction of pull requests merged')
ax.yaxis.set_visible(False)
fig.tight_layout()
fig.subplots_adjust(top=0.9)

#plt.pie(counts, shadow=False, labels=counts.index, autopct="%.0f%%")
plt.savefig(snakemake.output[0], bbox_inches="tight")

scripts/stats.py:
import os
import pandas as pd
import glob
import csv

packages = pd.read_table(snakemake.input.pkg)

# restrict to existing recipes
recipes = set(map(os.path.basename, glob.glob('bioconda-recipes/recipes/*')))
packages['has_current_recipe'] = packages['package'].isin(recipes)
packages = packages[packages.has_current_recipe]

with open(snakemake.output[0], "w") as out:
    out = csv.writer(out, delimiter="\t")
    out.writerow(["downloads", packages["downloads"].sum()])
    out.writerow(["versions", packages["versions"].sum()])
    out.writerow(["packages", packages.shape[0]])

From line 50 of master/Snakefile:
shell:
    "curl -X GET --header 'Accept: application/json' "
    "https://api.anaconda.org/package/bioconda/{wildcards.package} "
    "> {output} && sleep 1"

From line 63 of master/Snakefile:
script:
    "scripts/collect-pkg-data.py"

From line 72 of master/Snakefile:
shell:
    "rm -rf bioconda-recipes; "
    "git clone https://github.com/bioconda/bioconda-recipes.git bioconda-recipes; "
    "cd bioconda-recipes; "
    "git reset --hard d819a66147566d31316198f89e7744b7a36356fe"

From line 86 of master/Snakefile:
shell:
    '(cd bioconda-recipes && '
    'git log '
    '--pretty=format:'
    '"%h\t%aN\t%aI" '
    '--name-only '

From line 105 of master/Snakefile:
shell:
    "cd bioconda-recipes; "
    "bioconda-utils dag --hide-singletons --format dot "
    "recipes config.yml > ../{output}"

From line 118 of master/Snakefile:
script:
    "scripts/parse-log.py"

From line 127 of master/Snakefile:
script:
    "scripts/collect-pr-data.py"

From line 139 of master/Snakefile:
script:
    "scripts/collect-summaries-and-urls.py"

From line 151 of master/Snakefile:
script:
    "scripts/plot-add-del.py"

From line 162 of master/Snakefile:
script:
    "scripts/plot-package-degrees.py"

From line 173 of master/Snakefile:
shell:
    "set +o pipefail; ccomps -zX#0 {input} | neato -Tsvg -o {output} "
    '-Nlabel="" -Nstyle=filled -Nfillcolor="#1f77b4" '
    '-Ecolor="#3333335f" -Nwidth=0.2 -LC10 -Gsize="12,12" '
    "-Nshape=circle -Npenwidth=0"

From line 188 of master/Snakefile:
script:
    'scripts/color-dag.py'

From line 215 of master/Snakefile:
script:
    "scripts/plot-downloads.py"

From line 228 of master/Snakefile:
script:
    "scripts/plot-ecosystems.py"

From line 241 of master/Snakefile:
script:
    "scripts/plot-comparison.py"

From line 252 of master/Snakefile:
script:
    "scripts/plot-contributions.py"

From line 264 of master/Snakefile:
script:
    "scripts/plot-age-vs-downloads.py"

From line 275 of master/Snakefile:
script:
    "scripts/plot-turnaround.py"

From line 289 of master/Snakefile:
script:
    "scripts/stats.py"

From line 300 of master/Snakefile:
script:
    "scripts/author-list.py"

From line 312 of master/Snakefile:
script:
    "scripts/author-tex.py"

From line 328 of master/Snakefile:
script:
    "scripts/fig1.py"

From line 344 of master/Snakefile:
script:
    "scripts/fig2.py"

From line 355 of master/Snakefile:
shell:
    "cairosvg -f {wildcards.fmt} {input} -o {output}"

(36 more snippets from the Snakefile are not shown here.)

URL: https://github.com/bioconda/bioconda-paper
Name: bioconda-paper
Version: 1
License: MIT License