Data analysis related to the Bioconda paper


This Snakemake workflow automatically generates all results and figures from the Bioconda paper.

Requirements

Any 64-bit Linux installation with GLIBC 2.5 or newer (i.e. any Linux distribution that is newer than CentOS 6). Note that restricting this workflow to Linux is purely a design decision (to save space and ensure reproducibility) and is not related to Conda/Bioconda; in general, Bioconda packages are available for both Linux and macOS.
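
If you are unsure which GLIBC version your system provides, one way to check on glibc-based distributions is

ldd --version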

Usage

This workflow can be used to recreate all results found in the Bioconda paper.

Step 1: Setup system

Variant a: Installing Miniconda on your system

If you are on a Linux system with GLIBC 2.5 or newer (i.e. any Linux distribution that is newer than CentOS 6), you can simply install Miniconda3 with

curl -o /tmp/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash /tmp/miniconda.sh

Make sure to answer "yes" when asked whether your PATH variable should be modified. Afterwards, open a new shell/terminal.
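
In the new shell, you can verify the installation with

conda --version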

Variant b: Use a Docker container

Otherwise, e.g., on macOS, or if you don't want to modify your system setup, install Docker and run

docker run -it continuumio/miniconda3 /bin/bash

and execute all the following steps within that container.
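
Alternatively, you can bind-mount a host directory into the container, e.g. with

docker run -it -v "$(pwd)":/bioconda-workflow continuumio/miniconda3 /bin/bash

(here, /bioconda-workflow is just a mount point of your choice). Results written below that directory are then directly visible on the host, which makes the docker cp step in step 5 unnecessary.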

Variant c: Use an existing Miniconda installation

If you want to use an existing Miniconda installation, please be aware that this is only possible if it uses Python 3 by default. You can check this via

python --version

Further, ensure it is up to date with

conda update --all

Step 2: Setup Bioconda channel

Set up Bioconda with

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
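
Each --add command puts the given channel at the highest priority, so after these three commands the order is bioconda, conda-forge, defaults. You can verify the resulting order with

conda config --show channels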

Step 3: Install bioconda-utils and Snakemake

Install bioconda-utils and Snakemake >=4.6.0 with

conda install bioconda-utils snakemake

If you already have an older version of Snakemake, please make sure it is updated to >=4.6.0.
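
You can check the installed version with

snakemake --version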

Step 4: Download the workflow

First, create a working directory:

mkdir bioconda-workflow
cd bioconda-workflow

Then, download the workflow archive from https://doi.org/10.5281/zenodo.1068297 and unpack it with

tar -xf bioconda-paper-workflow.tar.gz
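
If you prefer to download via the command line (e.g., inside the Docker container), you can use curl with the direct file URL shown on the Zenodo record page (the placeholder below is not a real URL):

curl -L -o bioconda-paper-workflow.tar.gz <archive-url-from-zenodo-record>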

Step 5: Run the workflow

Execute the analysis workflow with Snakemake

snakemake --use-conda
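
Depending on your Snakemake version, you may additionally have to specify the number of cores to use (mandatory in newer releases), e.g.

snakemake --use-conda --cores 4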

Please wait a few minutes for the analysis to finish. Results can be found in the folder figs/. If you have been running the workflow in the Docker container (see above), you can obtain the results with

docker cp <container-id>:/bioconda-workflow/figs .

with <container-id> being the ID of the container.
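
If you do not know the container ID, you can list all containers (including stopped ones) with

docker ps -a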

Known errors

  • If you see an error like

    ImportError: No module named 'appdirs'
    

    when starting Snakemake, you are likely suffering from a bug in an older conda version. Make sure to update your conda installation with

    conda update --all
    

    and then reinstall the appdirs and snakemake packages with

    conda install -f appdirs snakemake
    
  • If you see an error like

    ImportError: Missing required dependencies ['numpy']
    

    you are likely suffering from a bug in an older conda version. Make sure to update your conda installation with

    conda update --all
    

    and then reinstall the snakemake package with

    conda install -f snakemake
    

Code Snippets
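
The snippets below are the Python scripts driving the workflow, followed by excerpts from the Snakefile; the script names are taken from the script directives in those excerpts. Within each script, the snakemake object is injected by Snakemake's script directive and provides the rule's input files, output files, wildcards, and parameters.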

scripts/author-list.py:
import os

import pandas as pd
from github import Github

authors = pd.read_table(snakemake.input[0], index_col=0)
authors["commits"] = 0
def add(commit):
    if commit.author and commit.author.login in authors.index:
        authors.loc[commit.author.login, "commits"] += 1

# add commits
github = Github(os.environ["GITHUB_TOKEN"])
repo = github.get_repo("bioconda/bioconda-recipes")
utils_repo = github.get_repo("bioconda/bioconda-utils")

for commit in repo.get_commits():
    add(commit)

for commit in utils_repo.get_commits():
    add(commit)

# order by commits
authors.sort_values("commits", inplace=True, ascending=False)

first_authors = ["bgruening", "daler"]
core_authors = ["chapmanb", "jerowe", "tomkinsc", "rvalieris", "druvus"]
last_author = ["johanneskoester"]

# put core into the right order
authors = pd.concat([
    authors.loc[first_authors],
    authors.loc[core_authors].sort_values("commits", ascending=False),
    authors.loc[~authors.index.isin(first_authors + core_authors + last_author)],
    authors.loc[last_author]])

authors.to_csv(snakemake.output.table, sep="\t")

scripts/collect-pkg-data.py:
import json
import pandas as pd


packages = []
ndownloads = []
ecosystem = []
versions = []
deps = []
for path in snakemake.input:
    with open(path) as f:
        meta = json.load(f)
        ndownloads.append(sum(f["ndownloads"] for f in meta["files"]))
        name = meta["full_name"].split("/")[1]
        assert name not in packages, "duplicate package: {}".format(name)
        packages.append(name)
        versions.append(len(meta["versions"]))

        if name.startswith("bioconductor-"):
            ecosystem.append("Bioconductor")
        elif name.startswith("r-"):
            ecosystem.append("R")
        else:
            def check_for_dep(dep):
                for f in meta["files"]:
                    for d in f["dependencies"]["depends"]:
                        if d["name"] == dep:
                            return True
                return False
            if check_for_dep("python"):
                ecosystem.append("Python")
            elif check_for_dep("perl") or check_for_dep("perl-threaded"):
                ecosystem.append("Perl")
            else:
                ecosystem.append("Other")

        # store the number of dependencies, based on the first file (index 0) of the package
        deps.append(len(meta['files'][0]['attrs']['depends']))


packages = pd.DataFrame({
    "package": packages,
    "downloads": ndownloads,
    "ecosystem": ecosystem,
    "versions": versions,
    "deps": deps
}, columns=["package", "ecosystem", "downloads", "versions", "deps"])

packages.sort_values("downloads", ascending=False, inplace=True)

packages.to_csv(snakemake.output[0], sep="\t", index=False)

scripts/collect-pr-data.py:
import os
import pandas as pd
from github import Github

github = Github(os.environ["GITHUB_TOKEN"])

repo = github.get_repo("bioconda/bioconda-recipes")

prs = []
titles = []
files = []
spans = []
for pr in repo.get_pulls(state="closed"):
    print(pr)
    if pr.merged:
        prs.append(pr.id)
        titles.append(pr.title)
        files.append(pr.changed_files)
        spans.append(pr.merged_at - pr.created_at)

prs = pd.DataFrame({
    "id": prs,
    "title": titles,
    "changed_files": files,
    "span": spans
})

prs.to_csv(snakemake.output[0], sep="\t", index=False)

scripts/collect-summaries-and-urls.py:
import os
import pandas
from bioconda_utils import utils

repo_dir = os.path.dirname(os.path.dirname(snakemake.input[0]))
recipes = list(utils.get_recipes(os.path.join(repo_dir, 'recipes')))
config = os.path.join(repo_dir, 'config.yml')
df = []
for r in recipes:
    meta = next(utils.load_all_meta(r, config))
    d = dict(
        not_bio_related=" ",
        summary=meta.get('about', {}).get('summary', "").replace('\n', ''),
        name=meta['package']['name'],
        url=meta.get('about', {}).get('home', ""),
    )
    df.append(d)
df = pandas.DataFrame(df).drop_duplicates('name')
df = df.sort_values('name')
df = df[['not_bio_related', 'name', 'summary', 'url']]
df.to_csv(snakemake.output[0], sep='\t', index=False)

scripts/color-dag.py:
import matplotlib
matplotlib.use("agg")
import pandas as pd
import seaborn as sns
import networkx as nx
from networkx.drawing.nx_pydot import read_dot, graphviz_layout
from matplotlib.colors import rgb2hex
import matplotlib.pyplot as plt
from matplotlib.ticker import NullLocator

packages = pd.read_table(snakemake.input.pkg, index_col=0)
packages.loc[packages.ecosystem == 'Bioconductor', 'ecosystem'] = 'Bioconductor/R'
packages.loc[packages.ecosystem == 'R', 'ecosystem'] = 'Bioconductor/R'
lookup = packages['ecosystem'].to_dict()
colors = dict(zip(['Bioconductor/R', 'Other', 'Python', 'Perl'], sns.color_palette('colorblind')))
g = read_dot(snakemake.input.dag)
# reduce to largest connected component
g = max(nx.weakly_connected_component_subgraphs(g), key=len)
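# Note: weakly_connected_component_subgraphs was removed in networkx 2.4.
# On newer networkx versions, an equivalent is:
#   g = g.subgraph(max(nx.weakly_connected_components(g), key=len)).copy()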

pkg = snakemake.wildcards.pkg
# obtain dependencies
deps = set(nx.ancestors(g, pkg))
sub = deps | {pkg}

pos = graphviz_layout(g, prog='neato')

plt.figure(figsize=(6,6))
# draw DAG
nx.draw_networkx_edges(g, pos, edge_color='#777777', alpha=0.5, arrows=False)
nx.draw_networkx_nodes(g, pos, node_color='#333333', alpha=0.5, node_size=6)

# draw induced subdag
nx.draw_networkx_edges(g, pos,
                       edgelist=[(u, v) for u, v in g.edges(sub) if u in sub and v in sub],
                       edge_color='k', width=3.0, arrows=False)
nx.draw_networkx_nodes(g, pos, nodelist=deps,
                       node_color=[rgb2hex(colors[lookup[v]]) for v in deps],
                       linewidths=0, node_size=120)
nx.draw_networkx_nodes(g, pos, nodelist=[pkg],
                       node_color=rgb2hex(colors[lookup[pkg]]),
                       linewidths=0, node_size=120, node_shape='s')
xs = [x for x, y in pos.values()]
ys = [y for x, y in pos.values()]
plt.xlim((min(xs) - 10, max(xs) + 10))
plt.ylim((min(ys) - 10, max(ys) + 10))
# remove whitespace
plt.axis('off')
plt.gca().xaxis.set_major_locator(NullLocator())
plt.gca().yaxis.set_major_locator(NullLocator())

plt.savefig(snakemake.output[0], bbox_inches='tight')

scripts/fig1.py:
from svgutils.compose import *
from common import label

Figure(
    "22cm", "6cm",
    Panel(SVG(snakemake.input.ecosystems), label("a")),
    Panel(SVG(snakemake.input.downloads), label("b")).move(285, 0),
    Panel(SVG(snakemake.input.comp).scale(0.9).move(10, 0), label("c")).move(560, 0),
    Panel(SVG(snakemake.input.age).scale(0.9).move(19, 0)).move(560, 90),
    # Grid(40, 40)
).save(snakemake.output[0])

scripts/fig2.py:
from svgutils.compose import *
from common import label

Figure(
    "24cm", "6.1cm",
    Panel(SVG(snakemake.input.contributions), label("a")),
    Panel(SVG(snakemake.input.add_del), label("b").move(0, -10)).move(0, 90),
    Panel(SVG(snakemake.input.dag).scale(0.6), label("c")).move(285, 0),
    Panel(SVG(snakemake.input.workflow).scale(0.5), label("d")).move(505, 0),
    Panel(SVG(snakemake.input.turnaround).scale(0.9).move(5, 0), label("e")).move(505, 50),
    Panel(SVG(snakemake.input.usage).scale(0.5), label("f")).move(505, 130)
    #Grid(40, 40)
).save(snakemake.output[0])

scripts/parse-log.py:
import pandas
import os
import datetime

infile = snakemake.input[0]
outfile = snakemake.output[0]


class chunk(object):
    def __init__(self, block):
        commit, author, time = block[0].split('\t')
        self.author = author
        self.time = datetime.datetime.strptime(time.split('T')[0], "%Y-%m-%d")
        self._block = block
        self.recipes = self._parse_recipes(block[1:])

    def _parse_recipes(self, block):
        recipes = []
        for i in block:
            if not i.startswith('recipes/'):
                continue
            if os.path.basename(i) != 'meta.yaml':
                continue
            recipes.append(os.path.dirname(i.replace('recipes/', '')))
        return set(recipes)


def gen():
    lines = []
    for line in open(infile):
        line = line.strip()
        if len(line) == 0:
            yield chunk(lines)
            lines = []
            continue
        lines.append(line.strip())
    yield chunk(lines)


dfs = []
cumulative_recipes = set()
cumulative_authors = set()
for i in sorted(gen(), key=lambda x: x.time):
    if len(i.recipes) == 0:
        continue

    unique_recipes = i.recipes.difference(cumulative_recipes)
    if len(unique_recipes) > 0:
        dfs.append(
            {
                'time': i.time,
                'author': i.author,
                'recipes': unique_recipes,
                'nadded': len(unique_recipes),
                'new_author': i.author not in cumulative_authors
            },
        )
    cumulative_recipes.update(i.recipes)
    cumulative_authors.update([i.author])

df = pandas.DataFrame(dfs)
df['cumulative_authors'] = df.new_author.astype(int).cumsum()
df['cumulative_recipes'] = df.nadded.cumsum()
df["time"] = pandas.to_datetime(df["time"])
df.to_csv(outfile, sep='\t')

scripts/plot-add-del.py:
import os
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import seaborn as sns
from github import Github
import matplotlib.dates as mdates

import common

github = Github(os.environ["GITHUB_TOKEN"])

repo = github.get_repo("bioconda/bioconda-recipes")

weeks = []
additions = []
deletions = []
print(repo.get_stats_participation().all)
for freq in repo.get_stats_code_frequency():
    weeks.append(freq.week)
    additions.append(freq.additions)
    deletions.append(abs(freq.deletions))


plt.figure(figsize=(4,1.2))

plt.semilogy(weeks, additions, "-", label="additions")
plt.semilogy(weeks, deletions, "-", label="deletions")
plt.ylabel("count per week")
plt.legend(bbox_to_anchor=(0.68, 0.65))

plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.xticks(rotation=45, ha="right")

sns.despine()

plt.savefig(snakemake.output[0], bbox_inches="tight")

scripts/plot-age-vs-downloads.py:
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import datetime
import numpy as np
import seaborn as sns
import pandas as pd

import common

try:
    log = snakemake.input.log
    pkg = snakemake.input.pkg
    outfile = snakemake.output[0]
except NameError:
    # run in the scripts dir for interactive clicking of points
    log = '../git-log/parsed-log.tsv'
    pkg = '../package-data/all.tsv'
    outfile = None

c = pd.read_table(log)
d = pd.read_table(pkg)

s = c.apply(lambda x: pd.Series(list(eval(x['recipes']))), axis=1).stack().reset_index(level=1, drop=True)
s.name = 'recipe'
cc = c.join(s)[['recipe', 'time']]
cc['package'] = cc.recipe.apply(lambda x: x.split('/')[0])
e = cc.groupby('package')['time'].agg('min')
df = d.set_index('package')
df['time'] = pd.to_datetime(e)
df['time'] -= pd.Timestamp(datetime.datetime.now())
df['days'] = df.dropna().time.apply(lambda x: -x.days)
df['log10 downloads'] = np.log10(df['downloads'] + 1)

# note we have to dropna ahead of time so that when interactively picking
# points, the event ind matches the df ind
df = df.dropna()

def callback(event):
    print(df.iloc[event.ind])


fig = plt.figure()
ax = fig.add_subplot(111)

sns.regplot('days', 'log10 downloads', df, ax=ax, scatter_kws=dict(picker=5, s=2, color='k', alpha=0.6))
plt.gca().set_xlabel('Package age (days)')
sns.despine()

if outfile:
    plt.savefig(outfile)
else:
    plt.gcf().canvas.mpl_connect('pick_event', callback)
    plt.show()

scripts/plot-comparison.py:
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from datetime import datetime
import matplotlib.dates as mdates
import numpy as np

import common


summary = pd.read_table(snakemake.input[0])
bio_related = summary.shape[0] - (summary["not_bio_related"] == "x").sum()
# Counts from October 2017
data = pd.DataFrame.from_dict({
    "Bioconda": [bio_related, "2015-09"],
    "Debian Med": [882, "2002-05"],
    "Gentoo Science": [480, "2005-10"],   # category sci-biology
    "EasyBuild": [371, "2012-03"], # moduleclass bio
    "Biolinux": [308, "2006"],
    "Homebrew Science": [297, "2009-10"], # tag bioinformatics
    "GNU Guix": [254, "2014-12"],         # category bioinformatics
    "BioBuilds": [118, "2015-11"]}, orient="index").reset_index()
data.columns = ["source", "count", "date"]
data["date"] = pd.to_datetime(data["date"])
# age in years
data["age"] = pd.to_timedelta(datetime.now() - data["date"]).astype('timedelta64[M]') / 12

plt.figure(figsize=(4,1))

sns.barplot(x="source", y="count", data=data)
plt.gca().set_xticklabels([])
plt.xlabel("")
plt.ylabel("Number of explicitly\nbio-related packages")

# set maximum tick to be that of bioconda
yticks = plt.gca().get_yticks()
yticks[-1] = bio_related
plt.gca().set_yticks(yticks)

sns.despine()
plt.savefig(snakemake.output.counts, bbox_inches="tight")

plt.figure(figsize=(4,1))

sns.barplot(x="source", y="age", data=data)
plt.xlabel("")
plt.ylabel("\nage in years")
plt.xticks(rotation=45, ha="right")
#plt.gca().yaxis.set_major_formatter(mdates.AutoDateFormatter(mdates.AutoDateLocator()))


sns.despine()
plt.savefig(snakemake.output.age, bbox_inches="tight")

# store results as csv
data[["source", "count", "age"]].to_csv(snakemake.output.csv, sep="\t", index=False)

scripts/plot-contributions.py:
import os
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import datetime
import pandas as pd
import matplotlib.dates as mdates

import common

infile = snakemake.input[0]
outfile = snakemake.output[0]

df = pd.read_table(infile)
df["time"] = pd.to_datetime(df["time"])
fig = plt.figure(figsize=(4,1))
plt.semilogy('time', 'cumulative_authors', data=df, label="contributors")
plt.semilogy('time', 'cumulative_recipes', data=df, label="recipes")
plt.legend()
plt.ylabel("count")
plt.xlabel("")

# deactivate xticks because we have them in the plot below in the figure
plt.xticks([])
sns.despine()

fig.savefig(outfile, bbox_inches="tight")

scripts/plot-downloads.py:
import os
import glob
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

import common

plt.figure(figsize=(4,2))
packages = pd.read_table(snakemake.input[0])
total_downloads = packages["downloads"].sum()
packages.loc[packages.ecosystem == 'Bioconductor', 'ecosystem'] = 'Bioconductor/R'
packages.loc[packages.ecosystem == 'R', 'ecosystem'] = 'Bioconductor/R'

# In case we want to filter downloads by whether or not a current recipe exists
recipes = set(map(os.path.basename, glob.glob('bioconda-recipes/recipes/*')))

sns.boxplot(x="ecosystem",
            y="downloads",
            data=packages,
            color="white",
            whis=False,
            showfliers=False,
            order=['Bioconductor/R', 'Other', 'Python', 'Perl'],
           )
sns.stripplot(x="ecosystem",
              y="downloads",
              data=packages,
              jitter=True,
              alpha=0.5,
              order=['Bioconductor/R', 'Other', 'Python', 'Perl'],
             )
plt.gca().set_yscale("log")
plt.ylabel("downloads (total: {:,})".format(total_downloads))
sns.despine()


plt.savefig(snakemake.output[0], bbox_inches="tight")

# Violin plots to see a little more structure (e.g., 3 tiers of downloads in
# Perl, BioC, R) and lower-limits (e.g., all BioC downloaded at least once, but
# some Perl, Python, R never downloaded).
#
# Take the log10 ahead of time so the KDE works well.
packages['log10 downloads'] = np.log10(packages.downloads + 1)

fig = plt.figure(figsize=(4, 3))
ax = fig.add_subplot(1, 1, 1)
sns.violinplot(
    x="ecosystem",
    y="log10 downloads",
    alpha=0.5,
    cut=0,
    data=packages,
    ax=ax,
    order=['Bioconductor/R', 'Other', 'Python', 'Perl'],
)
ax.text(x=0.5, y=1.0, s="Total downloads: {:,}".format(total_downloads),
         horizontalalignment="center", verticalalignment="top",
        transform=ax.transAxes)
ax.set_xlabel('')
plt.ylabel("downloads")
ax.set_yticklabels(["$10^{{{:.0f}}}$".format(y) for y in ax.get_yticks()])

# make a little room for the "total" text
ax.axis(ymax=6)
fig.tight_layout()
sns.despine()
plt.savefig(snakemake.output[1], bbox_inches="tight")

scripts/plot-ecosystems.py:
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import glob
import os

import common

plt.figure(figsize=(4,2))
summary = pd.read_table(snakemake.input.bio)
not_bio_related = summary['name'][summary.not_bio_related == 'x']

packages = pd.read_table(snakemake.input.pkg_data)
packages.loc[packages.ecosystem == 'Bioconductor', 'ecosystem'] = 'Bioconductor/R'
packages.loc[packages.ecosystem == 'R', 'ecosystem'] = 'Bioconductor/R'
recipes = set(map(os.path.basename, glob.glob('bioconda-recipes/recipes/*')))

packages['has_current_recipe'] = packages['package'].isin(recipes)
packages['not_bio_related'] = packages['package'].isin(not_bio_related)

fig = plt.figure(figsize=(4, 3))
ax = fig.add_subplot(1, 1, 1)
all_cnts = packages.ecosystem.value_counts()
bio_cnts = packages[~packages.not_bio_related].ecosystem.value_counts()
non_cnts = packages[packages.not_bio_related].ecosystem.value_counts()

x = range(len(all_cnts))
ax.bar(x=x, height=bio_cnts, color=sns.color_palette())
ax.bar(x=x, height=non_cnts, bottom=bio_cnts, color=sns.color_palette(sns.color_palette(), desat=0.5))
ax.set_ylabel('Available packages')
ax.set_xticks(x)
ax.set_xticklabels(list(all_cnts.index))
ax.set_ylabel("count")
ax.text(x=0.5, y=1, s="Total packages: {}".format(packages.shape[0]),
         horizontalalignment="center", verticalalignment="top",
        transform=ax.transAxes)
sns.despine()
fig.tight_layout()

plt.savefig(snakemake.output[0], bbox_inches="tight")

scripts/plot-package-degrees.py:
import glob
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import seaborn as sns
import json
import pandas as pd

import common

plt.figure(figsize=(4,2))
packages = pd.read_table(snakemake.input[0])

deps = packages["deps"]


plt.hist(deps, range(0,30), lw=1)
plt.xlim([0,30])
plt.grid()
plt.xlabel("Package degree", fontsize=16)


plt.savefig(snakemake.output[0], bbox_inches="tight")

scripts/plot-turnaround.py:
from datetime import timedelta
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

import common

# Default palette was a little too dark for the text to show up in the last
# block; increasing available colors lets us stay on the lighter side of the
# palette.
sns.set_palette("Greys", n_colors=8)

prs = pd.read_table(snakemake.input[0])
prs.span = pd.to_timedelta(prs.span)

categories = pd.Series([timedelta(minutes=0), timedelta(minutes=30),
                        timedelta(hours=1), timedelta(hours=5),
                        timedelta(days=1),
                        timedelta(days=365)])
labels = [r"$\leq 30$ min", r"$\leq 1$ hour",
          r"$\leq 5$ hours", r"$\leq 1$ day", r"$ > 1$ day"]
binning = pd.cut(prs.span.dt.total_seconds(),
                 categories.dt.total_seconds(),
                 labels=labels)
counts = binning.value_counts()

# fix order
counts = counts[labels]

perc = counts / counts.sum()
fig = plt.figure(figsize=(5, 1.1))
ax = fig.add_subplot(1, 1, 1)
left = 0
for label, x in perc.items():
    ax.barh(y=0, width=x, left=left, label=label)
    ax.text(left + x/2, 0, label, horizontalalignment='center', verticalalignment='center')
    left += x

sns.despine(top=True, left=True, right=True, trim=True)
ax.set_xlabel('Fraction of pull requests merged')
ax.yaxis.set_visible(False)
fig.tight_layout()
fig.subplots_adjust(top=0.9)

#plt.pie(counts, shadow=False, labels=counts.index, autopct="%.0f%%")
plt.savefig(snakemake.output[0], bbox_inches="tight")

scripts/stats.py:
import os
import pandas as pd
import glob
import csv

packages = pd.read_table(snakemake.input.pkg)

# restrict to existing recipes
recipes = set(map(os.path.basename, glob.glob('bioconda-recipes/recipes/*')))
packages['has_current_recipe'] = packages['package'].isin(recipes)
packages = packages[packages.has_current_recipe]

with open(snakemake.output[0], "w") as out:
    out = csv.writer(out, delimiter="\t")
    out.writerow(["downloads", packages["downloads"].sum()])
    out.writerow(["versions", packages["versions"].sum()])
    out.writerow(["packages", packages.shape[0]])

From line 50 of master/Snakefile:
shell:
    "curl -X GET --header 'Accept: application/json' "
    "https://api.anaconda.org/package/bioconda/{wildcards.package} "
    "> {output} && sleep 1"

From line 63 of master/Snakefile:
script:
    "scripts/collect-pkg-data.py"

From line 72 of master/Snakefile:
shell:
    "rm -rf bioconda-recipes; "
    "git clone https://github.com/bioconda/bioconda-recipes.git bioconda-recipes; "
    "cd bioconda-recipes; "
    "git reset --hard d819a66147566d31316198f89e7744b7a36356fe"

From line 86 of master/Snakefile:
shell:
    '(cd bioconda-recipes && '
    'git log '
    '--pretty=format:'
    '"%h\t%aN\t%aI" '
    '--name-only '

From line 105 of master/Snakefile:
shell:
    "cd bioconda-recipes; "
    "bioconda-utils dag --hide-singletons --format dot "
    "recipes config.yml > ../{output}"

From line 118 of master/Snakefile:
script:
    "scripts/parse-log.py"

From line 127 of master/Snakefile:
script:
    "scripts/collect-pr-data.py"

From line 139 of master/Snakefile:
script:
    "scripts/collect-summaries-and-urls.py"

From line 151 of master/Snakefile:
script:
    "scripts/plot-add-del.py"

From line 162 of master/Snakefile:
script:
    "scripts/plot-package-degrees.py"

From line 173 of master/Snakefile:
shell:
    "set +o pipefail; ccomps -zX#0 {input} | neato -Tsvg -o {output} "
    '-Nlabel="" -Nstyle=filled -Nfillcolor="#1f77b4" '
    '-Ecolor="#3333335f" -Nwidth=0.2 -LC10 -Gsize="12,12" '
    "-Nshape=circle -Npenwidth=0"

From line 188 of master/Snakefile:
script:
    'scripts/color-dag.py'

From line 215 of master/Snakefile:
script:
    "scripts/plot-downloads.py"

From line 228 of master/Snakefile:
script:
    "scripts/plot-ecosystems.py"

From line 241 of master/Snakefile:
script:
    "scripts/plot-comparison.py"

From line 252 of master/Snakefile:
script:
    "scripts/plot-contributions.py"

From line 264 of master/Snakefile:
script:
    "scripts/plot-age-vs-downloads.py"

From line 275 of master/Snakefile:
script:
    "scripts/plot-turnaround.py"

From line 289 of master/Snakefile:
script:
    "scripts/stats.py"

From line 300 of master/Snakefile:
script:
    "scripts/author-list.py"

From line 312 of master/Snakefile:
script:
    "scripts/author-tex.py"

From line 328 of master/Snakefile:
script:
    "scripts/fig1.py"

From line 344 of master/Snakefile:
script:
    "scripts/fig2.py"

From line 355 of master/Snakefile:
shell:
    "cairosvg -f {wildcards.fmt} {input} -o {output}"

(36 more snippets from the Snakefile are not shown here.)

URL: https://github.com/bioconda/bioconda-paper
Name: bioconda-paper
Version: 1
License: MIT License