Snakemake workflow for comparison of differential abundance ranks

public public 1yr ago Version: v0.3.0a2 0 bookmarks
(Pronounced ka-da-bra )

Qadabra is a Snakemake workflow for running and comparing several differential abundance (DA) methods on the same microbiome dataset.

Importantly, Qadabra focuses on both FDR corrected p-values and feature ranks and generates visualizations of differential abundance results.

Schematic

Installation

pip install qadabra

Qadabra requires the following dependencies:

  • snakemake

  • click

  • biom-format

  • pandas

  • numpy

  • cython

  • iow

Usage

1. Creating the workflow directory

Qadabra can be used on multiple datasets at once. First, we want to create the workflow directory to perfrom differential abundance with all methods:

qadabra create-workflow --workflow-dest <directory_name>

This command will initialize the workflow, but we still need to point to our dataset(s) of interest.

2. Adding a dataset

We can add datasets one-by-one with the add-dataset command:

qadabra add-dataset \
 --workflow-dest <directory_name> \
 --table <directory_name>/data/table.biom \
 --metadata <directory_name>/data/metadata.tsv \
 --tree <directory_name>/data/my_tree.nwk \
 --name my_dataset \
 --factor-name case_control \
 --target-level case \
 --reference-level control \
 --confounder confounding_variable(s) <confounding_var> \
 --verbose

Let's walkthrough the arguments provided here, which represent the inputs to Qadabra:

  • workflow-dest : The location of the workflow that we created earlier

  • table : Feature table (features by samples) in BIOM format

  • metadata : Sample metadata in TSV format

  • tree : Phylogenetic tree in .nwk or other tree format (optional)

  • name : Name to give this dataset

  • factor-name : Metadata column to use for differential abundance

  • target-level : The value in the chosen factor to use as the target

  • reference-level : The reference level to which we want to compare our target

  • confounder : Any confounding variable metadata columns (optional)

  • verbose : Flag to show all preprocessing performed by Qadabra

Your dataset should now be added as a line in my_qadabra/config/datasets.tsv .

You can use qadabra add-dataset --help for more details. To add another dataset, just run this command again with the new dataset information.

3. Running the workflow

The previous commands will create a subdirectory, my_qadabra in which the workflow structure is contained. From the command line, execute the following to start the workflow:

snakemake --use-conda --cores <number of cores preferred> <other options>

Please read the Snakemake documentation for how to run Snakemake best on your system.

When this process is completed, you should have directories figures , results , and log . Each of these directories will have a separate folder for each dataset you added.

4. Generating a report

After Qadabra has finished running, you can generate a Snakemake report of the workflow with the following command:

snakemake --report report.zip

This will create a zipped directory containing the report. Unzip this file and open the report.html file to view the report containing results and visualizations in your browser.

Tutorial

See the tutorial page for a walkthroughon using Qadabra workflow with a microbiome dataset.

FAQs

Coming soon: An FAQs page of commonly asked question on the statistics and code pertaining to Qadabra.

Citation

The manuscript for Qadabra is currently in progress. Please cite this GitHub page if Qadabra is used for your analysis. This project is licensed under the MIT License. See the license file for details.

Code Snippets

19
20
script:
    "../scripts/R/deseq2.R"
35
36
script:
    "../scripts/R/ancombc.R"
51
52
script:
    "../scripts/R/aldex2.R"
67
68
script:
    "../scripts/R/edger.R"
86
87
88
89
90
91
92
93
94
95
96
97
98
99
shell:
    """
    songbird multinomial \
        --input-biom {input.table} \
        --metadata-file {input.metadata} \
        --formula "{params.formula}" \
        --epochs {params.epochs} \
        --differential-prior {params.diff_prior} \
        --summary-interval 1 \
        --min-feature-count 0 \
        --min-sample-count 0 \
        --random-seed 1 \
        --summary-dir {params.outdir} > {log} 2>&1
    """
114
115
script:
    "../scripts/R/maaslin2.R"
SnakeMake From line 114 of rules/diffab.smk
130
131
script:
    "../scripts/R/metagenomeseq.R"
146
147
script:
    "../scripts/R/corncob.R"
SnakeMake From line 146 of rules/diffab.smk
161
162
script:
    "../scripts/process_differentials.py"
SnakeMake From line 161 of rules/diffab.smk
177
178
script:
    "../scripts/concatenate_differentials.py"
SnakeMake From line 177 of rules/diffab.smk
192
193
script:
    "../scripts/process_pvalues.py"
SnakeMake From line 192 of rules/diffab.smk
208
209
script:
    "../scripts/concatenate_pvalues.py"
SnakeMake From line 208 of rules/diffab.smk
222
223
script:
    "../scripts/concatenate_all_results.py"
SnakeMake From line 222 of rules/diffab.smk
12
13
script:
    "../scripts/run_pca.py"
SnakeMake From line 12 of rules/ml.smk
27
28
script:
    "../scripts/get_pctile_feats.py"
SnakeMake From line 27 of rules/ml.smk
40
41
script:
    "../scripts/get_pctile_feats.py"
SnakeMake From line 40 of rules/ml.smk
54
55
script:
    "../scripts/get_log_ratios.py"
SnakeMake From line 54 of rules/ml.smk
72
73
script:
    "../scripts/logistic_regression.py"
SnakeMake From line 72 of rules/ml.smk
20
21
script:
    "../scripts/plot_differentials.py"
38
39
script:
    "../scripts/plot_diff_kendall_heatmap.py"
56
57
script:
    "../scripts/plot_pvalue_kendall_heatmap.py"
87
88
script:
    "../scripts/plot_upset.py"
109
110
script:
    "../scripts/plot_roc.py"
131
132
script:
    "../scripts/plot_prc.py"
151
152
script:
    "../scripts/plot_pca.py"
169
170
script:
    "../scripts/plot_rank_comparison.py"
187
188
script:
    "../scripts/plot_pvalue_pw_comparisons.py"
209
210
script:
    "../scripts/plot_pvalue_volcanoes.py"
229
230
231
232
233
234
235
236
shell:
    """
    qurro \
        --ranks {input.ranks} \
        --table {input.table} \
        --sample-metadata {input.metadata} \
        --output-dir {output} > {log} 2>&1
    """
254
255
256
257
258
259
260
shell:
    """
    empress tree-plot \
        --tree {input.tree} \
        --feature-metadata {input.ranks} \
        --output-dir {output} > {log} 2>&1
    """
277
278
script:
    "../scripts/create_diff_table.py"
294
295
script:
    "../scripts/create_pval_table.py"
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
import logging
import pandas as pd

logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logging.captureWarnings(True)
logging.getLogger("py.warnings").addHandler(fh)

logger.info("Loading differentials...")
concatenated_diff_file = pd.read_csv(snakemake.input["concatenated_differentials"], sep="\t", index_col=0)

for col in concatenated_diff_file.columns:
    new_col_name = col + " differentials"
    concatenated_diff_file.rename(columns={col: new_col_name}, inplace=True)

logger.info("Loading p-values...")
concatenated_pvalue_file = pd.read_csv(snakemake.input["concatenated_pvalues"], sep="\t", index_col=0)

for col in concatenated_pvalue_file.columns:
    new_col_name = col + " significant features"
    concatenated_pvalue_file.rename(columns={col: new_col_name}, inplace=True)


qadabra_all_result = pd.concat([concatenated_diff_file, concatenated_pvalue_file], axis=1)

for index, row in qadabra_all_result.iterrows():
    count = 0
    for col in qadabra_all_result.columns:
        if col.endswith(" significant features"):
            if row[col] < 0.05:
                count += 1
                qadabra_all_result.at[index, col] = "p<0.05"
            else:
                qadabra_all_result.at[index, col] = "ns"
    qadabra_all_result.at[index, "# of tools p > 0.05"] = count


qadabra_all_result.to_csv(snakemake.output[0], sep="\t", index=True)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
import logging
import pandas as pd

logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logging.captureWarnings(True)
logging.getLogger("py.warnings").addHandler(fh)

logger.info("Loading differentials...")
diffs_files = [
    pd.read_table(x, sep="\t", index_col=0) for x in snakemake.input
]
concat_diffs = pd.concat(diffs_files, axis=1)

concat_diffs.to_csv(snakemake.output[0], sep="\t", index=True)
logger.info(f"Saved to {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
import logging
import pandas as pd

logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logging.captureWarnings(True)
logging.getLogger("py.warnings").addHandler(fh)

logger.info("Loading p-values...")
pvalue_files = [
    pd.read_table(x, sep="\t", index_col=0) for x in snakemake.input
]
concat_pvalues = pd.concat(pvalue_files, axis=1)

concat_pvalues.to_csv(snakemake.output[0], sep="\t", index=True)
logger.info(f"Saved to {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import logging
import pandas as pd
from bokeh.plotting import output_file, save
from bokeh.models import DataTable, TableColumn, ColumnDataSource


logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

output_file(filename=snakemake.output[0], title="Table")
diff_df = pd.read_table(snakemake.input[0], sep="\t", index_col=0)

diff_df = diff_df.round(2)
diff_df = diff_df.reset_index()

source = ColumnDataSource(diff_df)
columns = [
    TableColumn(field=x, title=x)
    for x in diff_df.columns
]
data_table = DataTable(source=source, columns=columns,
                       frozen_columns=1,
                       sizing_mode="stretch_height")
save(data_table)
logger.info(f"Saved table to {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import logging
import pandas as pd
from bokeh.plotting import output_file, save
from bokeh.models import DataTable, TableColumn, ColumnDataSource


logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

output_file(filename=snakemake.output[0], title="P-value table")
pval_df = pd.read_table(snakemake.input[0], sep="\t", index_col=0)

pval_df = pval_df.round(2)
pval_df = pval_df.reset_index()

source = ColumnDataSource(pval_df)
columns = [
    TableColumn(field=x, title=x)
    for x in pval_df.columns
]
data_table = DataTable(source=source, columns=columns,
                       frozen_columns=1,
                       sizing_mode="stretch_height")
save(data_table)
logger.info(f"Saved table to {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
import logging

import biom
import numpy as np
import pandas as pd


logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logger.info("Loading table...")
table = biom.load_table(snakemake.input["table"])
table = table.to_dataframe(dense=True).T

def log_ratio(table, top_feats, bot_feats):
    num_sum = table.loc[:, top_feats].sum(axis=1)
    denom_sum = table.loc[:, bot_feats].sum(axis=1)
    lr_df = pd.concat([num_sum, denom_sum], axis=1)
    lr_df.columns = ["num", "denom"]
    lr_df = lr_df.dropna(how="all")
    lr_df = lr_df + 1
    lr_df["log_ratio"] = np.log(lr_df["num"]/lr_df["denom"]).to_frame()
    return lr_df


feat_df = pd.read_table(snakemake.input["feats"], sep="\t")
if not feat_df.empty:
    top_feats = feat_df.query("location == 'numerator'")["feature_id"]
    bot_feats = feat_df.query("location == 'denominator'")["feature_id"]

    logger.info("Creating log-ratios...")
    lr_df = log_ratio(table, top_feats, bot_feats).reset_index()
    lr_df["pctile"] = snakemake.wildcards["pctile"]
    lr_df["num_feats"] = feat_df["num_feats"].unique().item()

    if snakemake.wildcards.get("tool") is not None:
        lr_df["tool"] = snakemake.wildcards["tool"]
    else:
        lr_df["tool"] = "pca_pc1"

    lr_df = lr_df.rename(columns={"index": "sample_name"})
    lr_df.to_csv(snakemake.output[0], sep="\t", index=False)
    logger.info(f"Saved to {snakemake.output[0]}")
else:
    logger.info("The feature DataFrame is empty. Skipping the rest of the script.")
    lr_df = pd.DataFrame()
    lr_df.to_csv(snakemake.output[0], sep="\t", index=False)
    logger.info(f"Saved to {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
import logging

import pandas as pd


logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logger.info("Loading differentials...")
diffs = pd.read_table(snakemake.input[0], sep="\t", index_col=0)
if snakemake.wildcards.get("tool") is not None:
    tool = snakemake.wildcards["tool"]
else:
    tool = "PC1"
pctile = int(snakemake.wildcards["pctile"])

sorted_diffs = diffs.sort_values(by=tool, ascending=False)

n = int(diffs.shape[0]*pctile/100)
top_feats = sorted_diffs.head(n).index.tolist()
bot_feats = sorted_diffs.tail(n).index.tolist()
top_feats = pd.DataFrame(
    dict(feature_id=top_feats, location="numerator")
)
bot_feats = pd.DataFrame(
    dict(feature_id=bot_feats, location="denominator")
)
df = pd.concat([top_feats, bot_feats])
df["pctile"] = pctile
df["num_feats"] = n

df.to_csv(snakemake.output[0], sep="\t", index=False)
logger.info(f"Saved to {snakemake.output[0]}")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
import logging

from joblib import dump
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (roc_curve, auc, precision_recall_curve,
                             average_precision_score)
from sklearn.model_selection import RepeatedStratifiedKFold


logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logging.captureWarnings(True)
logging.getLogger("py.warnings").addHandler(fh)

covariate = snakemake.params["factor_name"]
target = snakemake.params["target"]

logger.info(f"Covariate: {covariate}")
logger.info(f"Target: {target}")

metadata = pd.read_table(snakemake.input["metadata"], sep="\t",
                         index_col=0).dropna(subset=[covariate])
try:
    log_ratios = pd.read_table(snakemake.input["log_ratios"], sep="\t", index_col=0).join(metadata, how="inner")
    X = log_ratios["log_ratio"].values.reshape(-1, 1)
    y = (log_ratios[covariate] == target).astype(int).values

    mean_fpr = np.linspace(0, 1, 100)
    mean_rec = np.linspace(0, 1, 100)

    n_splits = snakemake.config["ml_params"]["n_splits"]
    n_repeats = snakemake.config["ml_params"]["n_repeats"]
    logger.info(
        f"Using repeated stratified k-fold CV with {n_splits} splits & "
        f"{n_repeats} repeats"
    )
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                random_state=1)

    model = LogisticRegression(penalty="none")
    folds = [(train, test) for train, test in cv.split(X, y)]
    model_data = {
        "folds": {
            f"fold_{i+1}": {"train_idx": train, "test_idx": test}
            for i, (train, test) in enumerate(folds)
        }
    }
    model_data["fprs"] = mean_fpr
    model_data["recalls"] = mean_rec

    model_data["log_ratios"] = X
    model_data["truth"] = y

    # https://stackoverflow.com/a/67968945
    tprs = []
    precs = []
    roc_aucs = []
    pr_aucs = []
    for i, (train, test) in enumerate(folds):
        prediction = model.fit(X[train], y[train]).predict_proba(X[test])

        fpr, tpr, _ = roc_curve(y[test], prediction[:, 1])
        tpr_interp = np.interp(mean_fpr, fpr, tpr)
        tprs.append(tpr_interp)
        roc_auc = auc(mean_fpr, tpr_interp)
        roc_aucs.append(roc_auc)

        prec, rec, _ = precision_recall_curve(y[test], prediction[:, 1])
        prec = prec[::-1]
        rec = rec[::-1]
        prec_interp = np.interp(mean_rec, rec, prec)
        precs.append(prec_interp)
        pr_auc = auc(mean_rec, prec_interp)
        pr_aucs.append(pr_auc)

        logger.info(
            f"Fold {i+1}/{len(folds)}: ROC AUC = {roc_auc:.2f}, "
            f"PR AUC = {pr_auc:.2f}"
        )

    model_data["tprs"] = tprs
    model_data["precisions"] = precs
    model_data["roc_aucs"] = roc_aucs
    model_data["pr_aucs"] = pr_aucs

    mean_roc_auc = np.mean(roc_aucs)
    mean_pr_auc = np.mean(pr_aucs)

    logger.info(f"Mean ROC AUC = {mean_roc_auc:.2f}")
    logger.info(f"Mean PR AUC = {mean_pr_auc:.2f}")

    dump(model_data, snakemake.output[0], compress=True)
    logger.info(f"Saving model to {snakemake.output[0]}")
except:
    logger.error(f"The file {snakemake.input['log_ratios']} is empty or does not exist")
    model_data = {}
    dump(model_data, snakemake.output[0], compress=True)
    logger.info(f"Saving empty model to {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
import logging
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from bokeh.plotting import figure, save
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.io import output_file

logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logging.captureWarnings(True)
logging.getLogger("py.warnings").addHandler(fh)

plt.style.use(snakemake.config["stylesheet"])

logger.info("Loading differentials...")
diffs = pd.read_table(snakemake.input[0], sep="\t", index_col=0).squeeze()
values = diffs.sort_values().values

feature_names = diffs.sort_values().index

logger.info("Plotting differentials...")

# Create a ColumnDataSource to store the data
source = ColumnDataSource(data=dict(x=np.arange(len(feature_names)), feature_names=feature_names, values=values))

# Create the figure
# p = figure(x_range=(0, len(values)), height=400, width=600, toolbar_location=None)
p = figure(x_range=(0, len(values)), toolbar_location=None)

# Plot the bars
p.vbar(x='x', top='values', width=0.8, source=source)

# Add HoverTool to display information on hover
hover = HoverTool(tooltips=[("Feature Name", "@feature_names")])
p.add_tools(hover)

# Set plot properties
p.xaxis.axis_label = "Rank"
p.xgrid.grid_line_color = 'gray'
p.ygrid.grid_line_color = 'gray'

# Configure the title
tool_name = snakemake.wildcards["tool"]
p.title.text = f"{tool_name} Differentials"
p.title.align = 'center'
p.title.text_font_size = '18pt'

# Save plot
output_file(snakemake.output[0])
save(p)

logger.info(f"Saved interactive rank plot to {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
import logging

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logging.captureWarnings(True)
logging.getLogger("py.warnings").addHandler(fh)

plt.style.use(snakemake.config["stylesheet"])

logger.info("Loading differentials...")
diffs_df = pd.read_table(snakemake.input[0], sep="\t", index_col=0)

corr = diffs_df.corr("kendall")
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

fig, ax = plt.subplots(1, 1)
sns.heatmap(
    corr,
    annot=True,
    mask=mask,
    square=True,
    ax=ax
)
ax.set_title("Kendall Rank Correlations")
plt.savefig(snakemake.output[0])
logger.info(f"Saved to {snakemake.output[0]}")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
import logging

import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import numpy as np
import pandas as pd
import seaborn as sns


logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logging.captureWarnings(True)
logging.getLogger("py.warnings").addHandler(fh)

plt.style.use(snakemake.config["stylesheet"])

logger.info("Loading PCA results...")
feat_df = pd.read_table(snakemake.input["features"], sep="\t", index_col=0)
tool_df = pd.read_table(snakemake.input["tools"], sep="\t", index_col=0)
prop_exp = pd.read_table(snakemake.input["prop_exp"], sep="\t", index_col=0)
prop_exp = prop_exp.squeeze()

pc_cols = tool_df.columns.tolist()
tool_list = tool_df.index.tolist()

arrow_palette = dict(zip(
    tool_list,
    sns.color_palette("colorblind", len(tool_list)).as_hex()
))

fig, ax = plt.subplots(1, 1)

logger.info("Plotting scatterplot...")
sns.scatterplot(
    data=feat_df,
    x="PC1",
    y="PC2",
    color="lightgray",
    edgecolor="black",
    linewidth=0.1,
    ax=ax
)

xlim = ax.get_xlim()
ylim = ax.get_ylim()

xrange = xlim[1] - xlim[0]
yrange = ylim[1] - ylim[0]

logger.info("Plotting arrows...")
for i, row in tool_df.iterrows():
    tool_name = row.name
    color = arrow_palette[tool_name]
    ax.arrow(
        x=0, y=0,
        dx=row["PC1"], dy=row["PC2"],
        facecolor=color,
        linewidth=0.5,
        head_width=xrange*0.005*5,
        width=xrange*0.005,
        edgecolor="black",
    )


patches = []
for i, row in tool_df.iterrows():
    tool_name = row.name
    color = arrow_palette[tool_name]
    ax.arrow(
        x=0, y=0,
        dx=row["PC1"], dy=row["PC2"],
        facecolor=color,
        linewidth=0.5,
        head_width=xrange*0.005*5,
        width=xrange*0.005,
        edgecolor="black",
    )
    tool_patch = Patch(
        facecolor=color,
        edgecolor="black",
        label=tool_name,
        linewidth=0.5,
    )
    patches.append(tool_patch)

ax.legend(
    handles=patches,
    bbox_to_anchor=[1, 1],
    loc="upper left",
    frameon=False
)

prop_exp_labels = [
    f"{i} ({x*100:.2f}%)"
    for i, x in prop_exp.items()
]
ax.set_xlabel(prop_exp_labels[0])
ax.set_ylabel(prop_exp_labels[1])

ax.set_title("PCA")
plt.savefig(snakemake.output[0])
logger.info(f"Saved to {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
from collections import defaultdict
import logging

import joblib
from matplotlib.lines import Line2D
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logging.captureWarnings(True)
logging.getLogger("py.warnings").addHandler(fh)

plt.style.use(snakemake.config["stylesheet"])

tools = snakemake.config["tools"] + ["pca_pc1"]
palette = dict(zip(
    tools, sns.color_palette("colorblind", len(tools))
))

pctile = snakemake.wildcards["pctile"]

models = defaultdict(dict)
for tool, model_loc in zip(tools, snakemake.input):
    logger.info(f"Loading {model_loc}...")
    models[tool] = joblib.load(model_loc)

fig, ax = plt.subplots(1, 1)

best_tool = None
best_auc = 0
leg_lines = []
for tool in tools:
    this_model_data = models[tool]
    mean_recalls = this_model_data["recalls"]
    mean_precs = np.mean(this_model_data["precisions"], axis=0)

    pr_aucs = this_model_data["pr_aucs"]
    mean_auc = np.mean(pr_aucs)
    std_auc = np.std(pr_aucs)

    if mean_auc > best_auc:
        best_auc = mean_auc
        best_tool = tool

    color = palette[tool]
    line = Line2D([0], [0], color=color, lw=2,
                  label=f"{tool} ({mean_auc:.2f} $\pm$ {std_auc:.2f})")
    leg_lines.append(line)

    ax.plot(mean_recalls, mean_precs, lw=2, color=color)
    logger.info(f"{tool} Mean AUC = {mean_auc:.2f} +- {std_auc:.2f}")

leg = ax.legend(
    handles=leg_lines,
    loc="upper left",
    frameon=False,
    bbox_to_anchor=[1, 1]
)
for text in leg.get_texts():
    if best_tool in text._text:
        text.set_weight("bold")

logger.info(f"Best model: {best_tool} ({best_auc:.2f})")

ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.set_title(f"{pctile}% Features")

plt.savefig(snakemake.output[0])
logger.info(f"Saved to {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
import logging

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logging.captureWarnings(True)
logging.getLogger("py.warnings").addHandler(fh)

plt.style.use(snakemake.config["stylesheet"])

logger.info("Loading pvalues...")
diffs_df = pd.read_table(snakemake.input[0], sep="\t", index_col=0)

corr = diffs_df.corr("kendall")
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

fig, ax = plt.subplots(1, 1)

palette = sns.color_palette("rocket", as_cmap=True)

sns.heatmap(
    corr,
    annot=True,
    mask=mask,
    square=True,
    ax=ax,
    cmap=palette,
    vmin=0,
    vmax=1
)
ax.set_title("Kendall Rank Correlations of P-values")
plt.savefig(snakemake.output[0])
logger.info(f"Saved to {snakemake.output[0]}")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
import logging

import numpy as np
import pandas as pd
import seaborn as sns

from bokeh.plotting import figure, output_file, save
from bokeh.layouts import column, row
from bokeh.models import ColumnDataSource, Select, HoverTool
from bokeh.models.callbacks import CustomJS


logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logging.captureWarnings(True)
logging.getLogger("py.warnings").addHandler(fh)

output_file(filename=snakemake.output[0], title="Rank Comparison")

pval_df = pd.read_table(snakemake.input[0], sep="\t", index_col=0)
rank_df = (
    pval_df.rename(columns={f"{x}": f"{x}_pvalue" for x in pval_df.columns})
    .reset_index()
)

tool_list = pval_df.columns.tolist()
tool_rank_list = [x + "_pvalue" for x in tool_list]

# Rank Comparison
# logger.info("Creating rank comparison panel...")
chosen_tool_1 = Select(options=tool_rank_list, value=tool_rank_list[0], title="x-axis")
chosen_tool_2 = Select(options=tool_rank_list, value=tool_rank_list[1], title="y-axis")

rank_df["x"] = rank_df[chosen_tool_1.value]
rank_df["y"] = rank_df[chosen_tool_2.value]

source = ColumnDataSource(rank_df)

hover = HoverTool(mode="mouse", tooltips=[("Name", "points")], attachment="below")

hover.tooltips = (
    [("Feature ID", "@feature_id")] +
    [(f"{x}", f"@{x}") for x in rank_df.columns if x not in ["feature_id", "x", "y"]]
)

tools = [
    hover,
    "box_zoom",
    "reset",
    "pan",
    "save"
]

# plot = figure(tools=tools)
plot = figure(x_range=(0, 1), y_range=(0, 1), tools=tools)

plot.circle(
    source=source,
    x="x",
    y="y",
    color="lightblue",
    line_width=0.5,
    line_color="black",
    size=10,
    name="points"
)
plot.line(
    x=np.arange(0, rank_df.shape[0]),
    y=np.arange(0, rank_df.shape[0]),
    line_width=3,
    line_dash="dashed",
    color="black"
)

tool_callback = CustomJS(args=dict(source=source, plot=plot), code="""
const data = source.data[cb_obj.value];
if (cb_obj.title == 'x-axis') {
    plot.below[0].axis_label = cb_obj.value;
    source.data.x = data;
} else {
    plot.left[0].axis_label = cb_obj.value;
    source.data.y = data;
}

source.change.emit();
""")
chosen_tool_1.js_on_change("value", tool_callback)
chosen_tool_2.js_on_change("value", tool_callback)

plot.title.text = "P-value Comparison"
plot.title.text_font_size = "20pt"

plot.xaxis.axis_label = chosen_tool_1.value
plot.yaxis.axis_label = chosen_tool_2.value

for ax in [plot.xaxis, plot.yaxis]:
    ax.axis_label_text_font_size = "14pt"
    ax.axis_label_text_font_style = "normal"
    ax.major_label_text_font_size = "10pt"

layout = row(column(chosen_tool_1, chosen_tool_2), plot)
save(layout)

logger.info(f"Saved rank comparisons to {snakemake.output[0]}")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
import logging
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bokeh.plotting import figure, save
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.io import output_file

# Add logging 
logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logging.captureWarnings(True)
logging.getLogger("py.warnings").addHandler(fh)

plt.style.use(snakemake.config["stylesheet"])

# Read in tsv containing p-values and LFCs
logger.info("Loading p-values and LFCs...")
data = pd.read_table(snakemake.input[0], sep="\t", index_col=0).squeeze()
pval_col = snakemake.params["pval_col"]
diff_ab_col = snakemake.params["diff_ab_col"]

# Take negative log of pval_col column and add as new column
data['negative_log_pval'] = -np.log10(data[pval_col])

# Create a ColumnDataSource object
source_blue = ColumnDataSource(data=dict(
    diff_ab_col=data[diff_ab_col],
    pval=data['negative_log_pval'],
    feature=data.index
))

source_red = ColumnDataSource(data=dict(
    diff_ab_col=data[diff_ab_col],
    pval=data['negative_log_pval'],
    feature=data.index
))

# Create a figure
tool_name = snakemake.wildcards["tool"]
p = figure(title=f"{tool_name} Volcano Plot", x_axis_label='coefficient', y_axis_label='-log10(pval_col)')

p.title.text_font_size = '20pt'

# Define the desired y-axis range
y_min = data['negative_log_pval'].min() - 1  # Minimum value
y_max = data['negative_log_pval'].max() + 1  # Maximum value

# Set the y-axis range
p.y_range.start = y_min
p.y_range.end = y_max

# Define the desired x-axis range
x_min = data[diff_ab_col].min() - 1  # Minimum value
x_max = data[diff_ab_col].max() + 1  # Maximum value

# Set the y-axis range
p.x_range.start = x_min
p.x_range.end = x_max

# # Add a vertical line at x = -1
p.line(x=[-1, -1], y=[y_min, y_max], line_color='gray', line_width=1, line_dash='dashed')
p.line(x=[1, 1], y=[y_min, y_max], line_color='gray', line_width=1, line_dash='dashed')
p.line(x=[x_min,x_max], y=[-np.log10(0.05), -np.log10(0.05)], line_color='gray', line_width=1, line_dash='dashed')

# Add hover tool
hover = HoverTool(
    tooltips=[
        ('Feature', '@feature'),
        ('coefficient', '@diff_ab_col'),
        ('-log10(pval_col)', '@pval'),
    ]
)
p.add_tools(hover)

# Select points where y >= 1 and x >= -log10(0.05)
pos_sig_diff = (data[diff_ab_col] >= 1) & (data['negative_log_pval'] >= -np.log10(0.05))
# Select points where y <= -1 and x >= -log10(0.05)
neg_sig_diff = (data[diff_ab_col] <= -1) & (data['negative_log_pval'] >= -np.log10(0.05))

# Apply selection to the data source
source_blue.selected.indices = np.where(neg_sig_diff)[0]
# Apply selection to the data source for negative significant differences (blue)
p.circle('diff_ab_col', 'pval', size=5, fill_alpha=0.6, nonselection_line_color='black', nonselection_fill_color='black', source=source_blue, selection_color='blue')

# Apply selection to the data source
source_red.selected.indices = np.where(pos_sig_diff)[0]
# Apply selection to the data source for positive significant differences (red)
p.circle('diff_ab_col', 'pval', size=5, fill_alpha=0.6, nonselection_line_color='black', nonselection_fill_color='black', source=source_red, selection_color='red')


output_file(snakemake.output[0])
save(p)

logger.info(f"Saved volocano plot to {snakemake.output[0]}")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
import logging

import numpy as np
import pandas as pd
import seaborn as sns

from bokeh.plotting import figure, output_file, save
from bokeh.layouts import column, row
from bokeh.models import ColumnDataSource, Select, HoverTool
from bokeh.models.callbacks import CustomJS


logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logging.captureWarnings(True)
logging.getLogger("py.warnings").addHandler(fh)

output_file(filename=snakemake.output[0], title="Rank Comparison")

diff_df = pd.read_table(snakemake.input[0], sep="\t", index_col=0)
rank_df = (
    diff_df.rank(ascending=False)
    .rename(columns={f"{x}": f"{x}_rank" for x in diff_df.columns})
    .join(diff_df)
    .reset_index()
)

tool_list = diff_df.columns.tolist()
tool_rank_list = [x + "_rank" for x in tool_list]

# Rank Comparison
logger.info("Creating rank comparison panel...")
chosen_tool_1 = Select(options=tool_rank_list, value=tool_rank_list[0], title="x-axis")
chosen_tool_2 = Select(options=tool_rank_list, value=tool_rank_list[1], title="y-axis")

rank_df["x"] = rank_df[chosen_tool_1.value]
rank_df["y"] = rank_df[chosen_tool_2.value]

source = ColumnDataSource(rank_df)

hover = HoverTool(mode="mouse", tooltips=[("Name", "points")], attachment="below")

hover.tooltips = (
    [("Feature ID", "@feature_id")] +
    [(f"{x}", f"@{x}") for x in rank_df.columns if x not in ["feature_id", "x", "y"]]
)

tools = [
    hover,
    "box_zoom",
    "reset",
    "pan",
    "save"
]

plot = figure(tools=tools)
plot.circle(
    source=source,
    x="x",
    y="y",
    color="lightgreen",
    line_width=0.5,
    line_color="black",
    size=10,
    name="points"
)
plot.line(
    x=np.arange(0, rank_df.shape[0]),
    y=np.arange(0, rank_df.shape[0]),
    line_width=3,
    line_dash="dashed",
    color="black"
)

tool_callback = CustomJS(args=dict(source=source, plot=plot), code="""
const data = source.data[cb_obj.value];
if (cb_obj.title == 'x-axis') {
    plot.below[0].axis_label = cb_obj.value;
    source.data.x = data;
} else {
    plot.left[0].axis_label = cb_obj.value;
    source.data.y = data;
}

source.change.emit();
""")
chosen_tool_1.js_on_change("value", tool_callback)
chosen_tool_2.js_on_change("value", tool_callback)

plot.title.text = "Differential Rank Comparison"
plot.title.text_font_size = "20pt"

plot.xaxis.axis_label = chosen_tool_1.value
plot.yaxis.axis_label = chosen_tool_2.value

for ax in [plot.xaxis, plot.yaxis]:
    ax.axis_label_text_font_size = "14pt"
    ax.axis_label_text_font_style = "normal"
    ax.major_label_text_font_size = "10pt"

layout = row(column(chosen_tool_1, chosen_tool_2), plot)
save(layout)
logger.info(f"Saved rank comparisons to {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
from collections import defaultdict
import logging

import joblib
from matplotlib.lines import Line2D
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logging.captureWarnings(True)
logging.getLogger("py.warnings").addHandler(fh)

plt.style.use(snakemake.config["stylesheet"])

tools = snakemake.config["tools"] + ["pca_pc1"]
palette = dict(zip(
    tools, sns.color_palette("colorblind", len(tools))
))

pctile = snakemake.wildcards["pctile"]

models = defaultdict(dict)
for tool, model_loc in zip(tools, snakemake.input):
    logger.info(f"Loading {model_loc}...")
    models[tool] = joblib.load(model_loc)

fig, ax = plt.subplots(1, 1)
ax.set_aspect("equal")

best_tool = None
best_auc = 0
leg_lines = []
for tool in tools:
    this_model_data = models[tool]
    mean_fprs = this_model_data["fprs"]
    mean_tprs = np.mean(this_model_data["tprs"], axis=0)

    roc_aucs = this_model_data["roc_aucs"]
    mean_auc = np.mean(roc_aucs)
    std_auc = np.std(roc_aucs)

    if mean_auc > best_auc:
        best_auc = mean_auc
        best_tool = tool

    color = palette[tool]
    line = Line2D([0], [0], color=color, lw=2,
                  label=f"{tool} ({mean_auc:.2f} $\pm$ {std_auc:.2f})")
    leg_lines.append(line)

    ax.plot(mean_fprs, mean_tprs, lw=2, color=color)
    logger.info(f"{tool} Mean AUC = {mean_auc:.2f} +- {std_auc:.2f}")

leg = ax.legend(handles=leg_lines, loc="lower right", frameon=False)
for text in leg.get_texts():
    if best_tool in text._text:
        text.set_weight("bold")

logger.info(f"Best model: {best_tool} ({best_auc:.2f})")

ax.plot([0, 1], [0, 1], linestyle="--", lw=2, color="black")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title(f"{pctile}% Features")

plt.savefig(snakemake.output[0])
logger.info(f"Saved to {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
import logging

import matplotlib.pyplot as plt
import pandas as pd
from upsetplot import UpSet, from_contents


logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logging.captureWarnings(True)
logging.getLogger("py.warnings").addHandler(fh)

plt.style.use(snakemake.config["stylesheet"])

tool_names = snakemake.config["tools"]
pctile = snakemake.wildcards["pctile"]
logger.info("Loading percentile features...")
feat_df_dict = {
    tool: pd.read_table(f, sep="\t", index_col=0)
    for tool, f in zip(tool_names, snakemake.input)
}

num_list = {
    tool: df.query("location == 'numerator'").index.tolist()
    for tool, df in feat_df_dict.items()
}

denom_list = {
    tool: df.query("location == 'denominator'").index.tolist()
    for tool, df in feat_df_dict.items()
}

try:
    num_plt = UpSet(from_contents(num_list), subset_size="count", show_counts=True).plot()
    plt.title(f"Numerator - {pctile}%")
except IndexError:
    plt.figure()
    plt.title(f"Numerator - {pctile}% (empty)")
    logger.info("num_list is empty. Creating empty numerator plot.")
finally:
    plt.savefig(snakemake.output["numerator"])
    logger.info(f"Saved to {snakemake.output['numerator']}")

try:
    denom_plt = UpSet(from_contents(denom_list), subset_size="count", show_counts=True).plot()
    plt.title(f"Denominator - {pctile}%")
except IndexError:
    plt.figure()
    plt.title(f"Denominator - {pctile}% (empty)")
    logger.info("denom_list is empty. Creating empty denominator plot.")
finally:
    plt.savefig(snakemake.output["denominator"])
    logger.info(f"Saved to {snakemake.output['denominator']}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
import logging

import pandas as pd


logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logger.info(f"Loading {snakemake.wildcards['tool']} differentials...")
col = snakemake.params["col"]
logger.info(f"Using '{col}' as column")
diffs = pd.read_table(snakemake.input[0], sep="\t", index_col=0)[col]
diffs.name = snakemake.wildcards["tool"]
diffs.index.name = "feature_id"
diffs.to_csv(snakemake.output[0], sep="\t", index=True)
logger.info(f"Saved to {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
import logging
import pandas as pd

logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logger.info(f"Loading {snakemake.wildcards['tool']} p-values...")
col = snakemake.params["col"]
logger.info(f"Using '{col}' as column")
pvalues = pd.read_table(snakemake.input[0], sep="\t", index_col=0)[col]
pvalues.name = snakemake.wildcards["tool"]
pvalues.index.name = "feature_id"
pvalues.to_csv(snakemake.output[0], sep="\t", index=True)
logger.info(f"Saved to {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
library(biomformat)
library(ALDEx2)

# Set logging information
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

# Load the input table
print("Loading table...")
table <- biomformat::read_biom(snakemake@input[["table"]])
table <- as.matrix(biomformat::biom_data(table))

# Load the metadata
print("Loading metadata...")
metadata <- read.table(snakemake@input[["metadata"]], sep="\t", header=T,
                       row.names=1)

covariate <- snakemake@params[[1]][["factor_name"]]
target <- snakemake@params[[1]][["target_level"]]
reference <- snakemake@params[[1]][["reference_level"]]
confounders <- snakemake@params[[1]][["confounders"]]

# Harmonize table and metadata samples
print("Harmonizing table and metadata samples...")
samples <- colnames(table)
metadata <- subset(metadata, rownames(metadata) %in% samples)
metadata[[covariate]] <- as.factor(metadata[[covariate]])
metadata[[covariate]] <- relevel(metadata[[covariate]], reference)
sample_order <- row.names(metadata)
table <- table[, sample_order]
row.names(table) <- paste0("F_", row.names(table)) # Append F_ to features to avoid R renaming

# Create the design formula
print("Creating design formula...")
design.formula <- paste0("~", covariate)
if (length(confounders) != 0) {
    confounders_form = paste(confounders, collapse=" + ")
    design.formula <- paste0(design.formula, " + ", confounders_form)
}
design.formula <- as.formula(design.formula)
mm <- model.matrix(design.formula, metadata)

# Run ALDEx2
print("Running ALDEx2...")
x <- ALDEx2::aldex.clr(table, mm) # perform CLR transformation
aldex2.results <- ALDEx2::aldex.glm(x) # apply generalized linear model
saveRDS(aldex2.results, snakemake@output[[2]])
print("Saved RDS!")

# Modify result names
row.names(aldex2.results) <- gsub("^F_", "", row.names(aldex2.results))
target <- paste(covariate, target, sep = "")

# Save results to output file
write.table(aldex2.results, file=snakemake@output[[1]], sep="\t")
print("Saved differentials!")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
library(biomformat)
library(ANCOMBC)
library(phyloseq)

# Set logging information
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

# Load the input table
print("Loading table...")
table <- biomformat::read_biom(snakemake@input[["table"]])
table <- as.matrix(biomformat::biom_data(table))

# Load the metadata
print("Loading metadata...")
metadata <- read.table(snakemake@input[["metadata"]], sep="\t", header=T,
                       row.names=1)

covariate <- snakemake@params[[1]][["factor_name"]]
target <- snakemake@params[[1]][["target_level"]]
reference <- snakemake@params[[1]][["reference_level"]]
confounders <- snakemake@params[[1]][["confounders"]]

# Harmonize table and metadata samples
print("Harmonizing table and metadata samples...")
samples <- colnames(table)
metadata <- subset(metadata, rownames(metadata) %in% samples)
metadata[[covariate]] <- as.factor(metadata[[covariate]])
metadata[[covariate]] <- relevel(metadata[[covariate]], reference)
sample_order <- row.names(metadata)
table <- table[, sample_order]
row.names(table) <- paste0("F_", row.names(table)) # Append F_ to features to avoid R renaming

# Convert to phyloseq object
print("Converting to phyloseq...")
taxa <- phyloseq::otu_table(table, taxa_are_rows=T)
meta <- phyloseq::sample_data(metadata)
physeq <- phyloseq::phyloseq(taxa, meta)

# Create the design formula
print("Creating design formula...")
design.formula <- covariate
if (length(confounders) != 0) {
    confounders_form = paste(confounders, collapse=" + ")
    design.formula <- paste0(design.formula, " + ", confounders_form)
}

# Run ANCOMBC
print("Running ANCOMBC...")
ancombc.results <- ANCOMBC::ancombc(phyloseq=physeq, formula=design.formula, zero_cut=1.0, p_adj_method = "BH",)
saveRDS(ancombc.results, snakemake@output[[2]])
print("Saved RDS!")

# Access coefficients and p-values from the ANCOMBC results
coef_col <- paste("coefs", covariate, target, sep = ".")
pval_col <- paste("pvals", covariate, target, sep = ".")
coefs <- ancombc.results$res$beta
pvals <- ancombc.results$res$p_val
# qvals <- ancombc.results$res$q_val

# Modify result names
row.names(coefs) <- gsub("^F_", "", row.names(coefs))
row.names(pvals) <- gsub("^F_", "", row.names(pvals))
# row.names(qvals) <- gsub("^F_", "", row.names(qvals))

results_all <- data.frame(coefs=coefs, pvals=pvals)
colnames(results_all) <- c("coefs", "pvals")

# Save results to output file
write.table(results_all, file=snakemake@output[[1]], sep="\t")
print("Saved differentials!")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
library(biomformat)
library(corncob)
library(phyloseq)

# Set logging information
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

# Load the input table
print("Loading table...")
table <- biomformat::read_biom(snakemake@input[["table"]])
table <- as.matrix(biomformat::biom_data(table))

# Load the metadata
print("Loading metadata...")
metadata <- read.table(snakemake@input[["metadata"]], sep="\t", header=T,
                       row.names=1)

covariate <- snakemake@params[[1]][["factor_name"]]
target <- snakemake@params[[1]][["target_level"]]
reference <- snakemake@params[[1]][["reference_level"]]
confounders <- snakemake@params[[1]][["confounders"]]

# Harmonize table and metadata samples
print("Harmonizing table and metadata samples...")
samples <- colnames(table)
metadata <- subset(metadata, rownames(metadata) %in% samples)
metadata[[covariate]] <- as.factor(metadata[[covariate]])
metadata[[covariate]] <- relevel(metadata[[covariate]], reference)
sample_order <- row.names(metadata)
table <- table[, sample_order]
row.names(table) <- paste0("F_", row.names(table)) # Append F_ to features to avoid R renaming

# Convert to phyloseq object
print("Converting to phyloseq...")
taxa <- phyloseq::otu_table(table, taxa_are_rows=T)
meta <- phyloseq::sample_data(metadata)
physeq <- phyloseq::phyloseq(taxa, meta)

# Create the design formula
print("Creating design formula...")
design.formula <- paste0("~", covariate)
if (length(confounders) != 0) {
    confounders_form = paste(confounders, collapse=" + ")
    design.formula <- paste0(design.formula, " + ", confounders_form)
}
design.formula <- as.formula(design.formula)
print(design.formula)

# Run Corncob
print("Running corncob...")
fit <- corncob::differentialTest(
    formula=design.formula,
    formula_null=~1,
    phi.formula=design.formula,
    phi.formula_null=design.formula,
    test="Wald",
    data=physeq,
    boot=F,
    filter_discriminant=F,
    full_output=T
)
saveRDS(fit, snakemake@output[[2]])
print("Saved RDS!")

# Create results table
taxa_names <- names(fit$p)
colname <- paste0("mu.", covariate, target)

print("Aggregating models...")
all_models <- fit$all_models
coefs <- c()
for (mod in all_models) {
    coef <- mod$coefficients[c(colname), "Estimate"]
    coefs <- c(coefs, coef)
}
coefs <- as.data.frame(coefs)
row.names(coefs) <- taxa_names
row.names(coefs) <- gsub("^F_", "", row.names(coefs))

pvals <- as.data.frame(fit$p)
adjusted_p_values <- p.adjust(fit$p, method = "BH")
pval_BH_adj <- as.data.frame(adjusted_p_values)

results_all <- data.frame(coefs=coefs, pvals=pvals, pval_BH_adj=pval_BH_adj)

# Save results to output file
write.table(results_all, snakemake@output[[1]], sep="\t")
print("Saved differentials!")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
library(biomformat)
library(DESeq2)

# Set logging information
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

# Load the input table
print("Loading table...")
table <- biomformat::read_biom(snakemake@input[["table"]])
table <- as.matrix(biomformat::biom_data(table))

# Load the metadata
print("Loading metadata...")
metadata <- read.table(snakemake@input[["metadata"]], sep="\t", header=T,
                       row.names=1)

covariate <- snakemake@params[[1]][["factor_name"]]
target <- snakemake@params[[1]][["target_level"]]
reference <- snakemake@params[[1]][["reference_level"]]
confounders <- snakemake@params[[1]][["confounders"]]

# Harmonize table and metadata samples
print("Harmonizing table and metadata samples...")
samples <- colnames(table)
metadata <- subset(metadata, rownames(metadata) %in% samples)
metadata[[covariate]] <- as.factor(metadata[[covariate]])
metadata[[covariate]] <- relevel(metadata[[covariate]], reference)
sample_order <- row.names(metadata)
table <- table[, sample_order]
row.names(table) <- paste0("F_", row.names(table)) # Append F_ to features to avoid R renaming

# Create the design formula
print("Creating design formula...")
design.formula <- paste0("~", covariate)
if (length(confounders) != 0) {
    confounders_form = paste(confounders, collapse=" + ")
    design.formula <- paste0(design.formula, " + ", confounders_form)
}
design.formula <- as.formula(design.formula)

# Run DESeq2
print("Running DESeq2...")
dds <- DESeq2::DESeqDataSetFromMatrix(
    countData=table,
    colData=metadata,
    design=design.formula
)
dds.results <- DESeq2::DESeq(dds, sfType="poscounts")
saveRDS(dds.results, snakemake@output[[2]])
print("Saved RDS!")

# Create results table and modify result names
results <- DESeq2::results(
    dds.results,
    format="DataFrame",
    tidy=TRUE,
    cooksCutoff=FALSE,
    pAdjustMethod="BH",
    contrast=c(covariate, target, reference)
)
row.names(results) <- gsub("^F_", "", row.names(table))

# Save results to output file
write.table(results, file=snakemake@output[[1]], sep="\t")
print("Saved differentials!")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
library(biomformat)
library(edgeR)

# Set logging information
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

# Load the input table
print("Loading table...")
table <- biomformat::read_biom(snakemake@input[["table"]])
table <- as.matrix(biomformat::biom_data(table))

# Load the metadata
print("Loading metadata...")
metadata <- read.table(snakemake@input[["metadata"]], sep="\t", header=T,
                       row.names=1)

covariate <- snakemake@params[[1]][["factor_name"]]
target <- snakemake@params[[1]][["target_level"]]
reference <- snakemake@params[[1]][["reference_level"]]
confounders <- snakemake@params[[1]][["confounders"]]

# Harmonize table and metadata samples
print("Harmonizing table and metadata samples...")
samples <- colnames(table)
metadata <- subset(metadata, rownames(metadata) %in% samples)
metadata[[covariate]] <- as.factor(metadata[[covariate]])
metadata[[covariate]] <- relevel(metadata[[covariate]], reference)
sample_order <- row.names(metadata)
table <- table[, sample_order]
row.names(table) <- paste0("F_", row.names(table)) # Append F_ to features to avoid R renaming

# Create the design formula
print("Creating design formula...")
design.formula <- paste0("~", covariate)
if (length(confounders) != 0) {
    confounders_form = paste(confounders, collapse=" + ")
    design.formula <- paste0(design.formula, " + ", confounders_form)
}
design.formula <- as.formula(design.formula)
print(design.formula)
mm <- model.matrix(design.formula, metadata)

# Run edgeR
print("Running edgeR...")
d <- edgeR::DGEList(counts=table, group=metadata[[covariate]])
d <- edgeR::calcNormFactors(d)
d <- edgeR::estimateDisp(d, design=mm)
fit <- edgeR::glmQLFit(d, mm, robust=T)
saveRDS(fit, snakemake@output[[2]])
print("Saved RDS!")

# Obtain coefficients using glmQLFit
coeff_results <- fit$coefficients
row.names(coeff_results) <- gsub("^F_", "", row.names(coeff_results))

# Obtain p-values and corrected p-values using glmQLFTest
res <- edgeR::glmQLFTest(fit, coef=2)
pval_results <- res$table
adjusted_p_values <- p.adjust(pval_results$PValue, method = "BH")
pval_results$PValue_BH_adj <- adjusted_p_values
row.names(pval_results) <- gsub("^F_", "", row.names(pval_results))

# Create results table
results_all <- cbind(coeff_results, pval_results[match(rownames(coeff_results), rownames(pval_results)),])

# Save results to output file
write.table(results_all, file=snakemake@output[[1]], sep="\t")
print("Saved differentials!")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
library(biomformat)
library(dplyr)
library(Maaslin2)

# Set logging information
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

# Load the input table
print("Loading table...")
table <- biomformat::read_biom(snakemake@input[["table"]])
table <- as.matrix(biomformat::biom_data(table))

# Load the metadata
print("Loading metadata")
metadata <- read.table(snakemake@input[["metadata"]], sep="\t", header=T,
                       row.names=1)

covariate <- snakemake@params[[1]][["factor_name"]]
target <- snakemake@params[[1]][["target_level"]]
reference <- snakemake@params[[1]][["reference_level"]]
confounders <- snakemake@params[[1]][["confounders"]]

# Harmonize table and metadata samples
print("Harmonizing table and metadata samples...")
samples <- colnames(table)
metadata <- subset(metadata, rownames(metadata) %in% samples)
metadata[[covariate]] <- as.factor(metadata[[covariate]])
metadata[[covariate]] <- relevel(metadata[[covariate]], reference)
sample_order <- row.names(metadata)

# Create results table and modify result names
row.names(table) <- paste0("F_", row.names(table)) # Append F_ to features to avoid R renaming
table <- t(table[, sample_order])
fixed.effects <- c(covariate, confounders)
specify.reference <- paste(covariate, reference, sep = ",", collapse = ",")

# Run maAsLin
print("Running MaAsLin...")
fit.data <- Maaslin2::Maaslin2(
    input_data=table,
    input_metadata=metadata,
    output=snakemake@output[["out_dir"]],
    fixed_effects=fixed.effects,
    reference=specify.reference,
    min_prevalence=0,
    min_abundance=0,
    plot_scatter=F,
    plot_heatmap=F
)
results <- fit.data$results %>% dplyr::filter(metadata==covariate)
results <- results %>% distinct(feature, .keep_all = TRUE)

row.names(results) <- gsub("^F_", "", results$feature)
results <- results %>% select(-c("feature"))

# Extract coefficient, p-value, and adjusted p-value results
adjusted_p_values <- p.adjust(results$pval, method = "BH")
results$pval_BH_adj <- adjusted_p_values
# results$coef <- results$coef

# Save results to output file
write.table(results, file=snakemake@output[["diff_file"]], sep="\t")
print("Saved differentials!")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
library(biomformat)
library(Biobase)
library(metagenomeSeq)

# Set logging information
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

# Load the input table
print("Loading table...")
table <- biomformat::read_biom(snakemake@input[["table"]])
table <- as.matrix(biomformat::biom_data(table))
table <- as.data.frame(table)

# Load the metadata
print("Loading metadata...")
metadata <- read.table(snakemake@input[["metadata"]], sep="\t", header=T,
                       row.names=1)

covariate <- snakemake@params[[1]][["factor_name"]]
target <- snakemake@params[[1]][["target_level"]]
reference <- snakemake@params[[1]][["reference_level"]]
confounders <- snakemake@params[[1]][["confounders"]]

# Harmonize table and metadata samples
print("Harmonizing table and metadata samples...")
samples <- colnames(table)
metadata <- subset(metadata, rownames(metadata) %in% samples)
metadata[[covariate]] <- as.factor(metadata[[covariate]])
metadata[[covariate]] <- relevel(metadata[[covariate]], reference)
sample_order <- row.names(metadata)
table <- table[, sample_order]
row.names(table) <- paste0("F_", row.names(table)) # Append F_ to features to avoid R renaming

# Create AnnotatedDataFrame from metadata
pheno <- Biobase::AnnotatedDataFrame(metadata)

# Create MRexperiment object
experiment <- metagenomeSeq::newMRexperiment(
    counts=table,
    phenoData=pheno,
)

# Perform cumulative sum scaling (CSS) normalization
pctile <- metagenomeSeq::cumNormStat(experiment, pFlag=T)
experiment_norm <- metagenomeSeq::cumNorm(experiment, p=pctile)
pd <- Biobase::pData(experiment_norm)

# Create design formula for model matrix
print("Creating design formula...")
design.formula <- paste0("~", covariate)
if (length(confounders) != 0) {
    for (c in confounders) {
        metadata[[c]] <- as.factor(metadata[[c]])
    }
    confounders_form = paste(confounders, collapse=" + ")
    design.formula <- paste0(design.formula, " + ", confounders_form)
}
design.formula <- as.formula(design.formula)

mm <- model.matrix(design.formula, data=pd)

# Run MetagenomeSeq
print("Running metagenomeSeq...")
fit <- metagenomeSeq::fitZig(experiment_norm, mm)
saveRDS(fit, snakemake@output[[2]])
print("Saved RDS!")

# Create results table and modify results names
results <- metagenomeSeq::MRcoefs(
    obj=fit,
    number=dim(table)[1],
    adjustMethod="BH",
)
row.names(results) <- gsub("^F_", "", row.names(results))

# Save results to output file
write.table(results, file=snakemake@output[[1]], sep="\t")
print("Saved differentials!")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
import logging

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


logger = logging.getLogger("qadabra")
logger.setLevel(logging.INFO)
fh = logging.FileHandler(snakemake.log[0], mode="w")
formatter = logging.Formatter(
    f"[%(asctime)s - {snakemake.rule}] :: %(message)s"
)
fh.setFormatter(formatter)
logger.addHandler(fh)

logger.info("Loading differentials...")
diff_df = pd.read_table(snakemake.input[0], sep="\t", index_col=0)
rank_df = (
    diff_df.rank(ascending=False)
    .rename(columns={f"{x}": f"{x}_rank" for x in diff_df.columns})
    .join(diff_df)
)

tool_list = diff_df.columns.tolist()
tool_rank_list = [x + "_rank" for x in tool_list]

logger.info("Scaling values...")
scaled_diff_values = StandardScaler().fit_transform(diff_df.values)

pc_cols = [f"PC{x+1}" for x in range(len(tool_list))]
logger.info("Running PCA...")
pca = PCA(n_components=len(tool_list)).fit(scaled_diff_values)
feat_df = pd.DataFrame(
    pca.transform(scaled_diff_values),
    index=diff_df.index,
    columns=pc_cols
).join(rank_df)
max_vals = feat_df[pc_cols].max()

tool_df = pd.DataFrame(
    np.dot(pca.components_, np.diag(max_vals)),
    index=diff_df.columns,
    columns=pc_cols
)
tool_df.index.name = "tool"

feat_df.to_csv(snakemake.output["features"], sep="\t", index=True)
tool_df.to_csv(snakemake.output["tools"], sep="\t", index=True)

logger.info(f"Saved feature PCs to {snakemake.output['features']}")
logger.info(f"Saved tool PCs to {snakemake.output['tools']}")

prop_exp = pca.explained_variance_ratio_
prop_exp = pd.Series(prop_exp, index=pc_cols)
prop_exp.name = "proportion_var_explained"
prop_exp.to_csv(snakemake.output["prop_exp"], sep="\t", index=True)
logger.info(f"Saved explained variance to {snakemake.output['prop_exp']}")
ShowHide 48 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/gibsramen/qadabra
Name: qadabra
Version: v0.3.0a2
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: BSD 3-Clause "New" or "Revised" License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...