A Snakemake workflow for standardised sc/snRNAseq analysis


A [Snakemake][sm] workflow for standardised sc/snRNAseq analysis.

Find us in the "Standardized Usage" Section of the [Snakemake Workflow Catalog][sm_wc]

Every single cell analysis is slightly different. This workflow represents what I would call a "core" analysis, as nearly every analysis I perform starts with something very akin to this. Given the custom nature of single cell analysis, this workflow is not designed to be all-encompassing. Rather, it aims to be extensible, modular, and reproducible. Any given step can be easily modified, as each is a self-contained script, and a new rule can be easily added - see the downstream rules for an example, or the sketch below. Finally, by taking advantage of the integrated [Conda][conda] and [Singularity][sing] support, we can run the whole thing in an isolated environment.
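
To add a step, mirror the downstream rules: a small rule whose `script:` directive points at a self-contained script that reads `snakemake.input` and writes `snakemake.output`. A minimal sketch follows - the rule name, paths, env file, and script are all hypothetical placeholders, not part of the workflow:

rule my_downstream_step:
    input:
        data="results/cluster/adata.h5ad",  # hypothetical upstream output
    output:
        data="results/my_step/adata.h5ad",
    log:
        "results/logs/my_step.log",
    conda:
        "../envs/my_step.yaml"  # hypothetical env file
    threads: 4
    script:
        "../scripts/my_step.py"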

Notes on Installation

A full walkthrough on how to install and use this pipeline can be found [here][sm_wc].

To take advantage of Singularity, you'll need to install it separately. If you are running on a Linux system, Singularity can be installed from Conda like so:

conda install -n snakemake -c conda-forge singularity

It's a bit more challenging on other operating systems; your best bet is to follow the official instructions [here][sing_install]. But don't worry! Singularity is not required! Snakemake will still run each step in its own Conda environment (via `--use-conda`); it just won't put each Conda environment in a container (via `--use-singularity`).

Get the Source Code

Alternatively, you may grab the source code. You likely will not need these steps unless you are planning to contribute.

Navigate to our [releases][releases] page on GitHub and download the most recent version. The following will do the trick:

curl -s https://api.github.com/repos/IMS-Bio2Core-Facility/single_snake_sequencing/releases/latest |
grep tarball_url |
cut -d " " -f 4 |
tr -d '",' |
xargs -n1 curl -sL |
tar xzf -

After querying the GitHub API for the most recent release information, we grep for the desired URL, split the line and extract the field, trim the superfluous quotes and comma, hand the result to xargs, which pipes it to curl (allowing for redirects), and un-tar the files. Easy!
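If you'd rather not scrape JSON with grep and cut, a rough Python equivalent using only the standard library might look like this (an illustrative sketch, not part of the workflow):

import io
import json
import tarfile
import urllib.request

API = (
    "https://api.github.com/repos/"
    "IMS-Bio2Core-Facility/single_snake_sequencing/releases/latest"
)

# Ask the GitHub API for the latest release, then fetch and unpack its tarball
with urllib.request.urlopen(API) as resp:
    tarball_url = json.load(resp)["tarball_url"]

with urllib.request.urlopen(tarball_url) as resp:  # urllib follows the redirect
    with tarfile.open(fileobj=io.BytesIO(resp.read()), mode="r:gz") as tar:
        tar.extractall()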

Alternatively, for the bleeding edge, please clone the repo like so:

git clone https://github.com/IMS-Bio2Core-Facility/single_snake_sequencing

:warning: Heads Up! The bleeding edge may not be stable, as it contains all active development.

However you choose to install it, cd into the directory.

Notes on Data

This pipeline expects de-multiplexed fastq.gz files, normally produced by some derivative of bcl2fastq after sequencing. Technically, they can be placed anywhere, but we recommend creating a data directory in your project for them.
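For example, a single sample named sampleA sequenced on one lane would look something like this, assuming the standard bcl2fastq naming scheme:

data/
├── sampleA_S1_L001_R1_001.fastq.gz
└── sampleA_S1_L001_R2_001.fastq.gz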

Notes on the tools

The analysis pipeline was run using Snakemake v6.6.1. The full version and software lists can be found in the relevant yaml files under workflow/envs. All reasonable efforts have been made to ensure that the repository adheres to the best practices outlined [here][sm_bp].

Notes on the analysis

For a full discussion of the analysis methods, please see the technical documentation.

Briefly: the count matrix is produced with Cellranger; droplets are called with DropletUtils::emptyDrops; doublets are detected with SOLO, from the scVI family; batch effects are removed with harmonypy; and general analysis and data handling are done with scanpy.

Future work

  • [ ] Supply tests

  • [ ] Track lane in samples that have been pooled and de-multiplexed

  • [ ] Parallelise emptyDrops

  • [ ] Support custom references

  • [ ] Support SCTransform?

Code Snippets


Dimensionality reduction rule (excerpt):
script:
    "../scripts/dim_reduc.py"

Clustering rule (excerpt):
script:
    "../scripts/cluster.py"

Cellranger count rule (excerpt):
shell:
    """
    {input.bin} \
    count \
    --nosecondary \
    {params.introns} \
    --id {wildcards.sample} \
    --transcriptome {input.genome} \
    --fastqs data \
    --sample {wildcards.sample} \
    --expect-cells {params.n_cells} \
    --localcores {threads} \
    --localmem {params.mem} \
    &> {log} && \
    mv {wildcards.sample} results/counts/{wildcards.sample}
    """

Embedding densities rule (excerpt):
script:
    "../scripts/densities.py"

Empty droplet filtering rule (excerpt):
script:
    "../scripts/filter_empty.R"

QC rule (excerpt):
script:
    "../scripts/qc.py"

Doublet detection rule (excerpt):
script:
    "../scripts/doublets.py"

FastQC rule (excerpt):
wrapper:
    "0.79.0/bio/fastqc"

MultiQC rule (excerpt):
wrapper:
    "0.79.0/bio/multiqc"

Cellranger download rule (excerpt):
shell:
    """
    wget -O resources/cellranger.tar.gz "{params.url}" &> {log}
    tar -xzf resources/cellranger.tar.gz -C resources &>> {log} && \
    rm -rf resources/cellranger.tar.gz
    mv resources/cellranger-* resources/cellranger
    """

Reference genome download rule (excerpt):
shell:
    """
    wget -O resources/genome.tar.gz "{params.url}" &> {log}
    tar -xzf resources/genome.tar.gz -C resources &>> {log} && \
    rm -rf resources/genome.tar.gz
    mv resources/refdata-* resources/genome
    """

bcl2fastq unpacking rule (excerpt):
shell:
    """
    unzip {input.bcl_zip} -d results/bcl2fastq &> {log}
    """

Output move rule (excerpt):
shell:
    """
    mv {input} {output} &> {log}
    """

scripts/cluster.py:
if __name__ == "__main__":
    from pathlib import Path

    import matplotlib.pyplot as plt
    import pandas as pd
    import scanpy as sc
    from helpers.logs.get_logger import get_logger
    from helpers.logs.sc_logs import set_sc_log
    from helpers.write_de import get_de
    from matplotlib.backends.backend_pdf import PdfPages

    LOG = snakemake.log[0]  # noqa: F821
    PARAMS = snakemake.params  # noqa: F821
    INPUT = snakemake.input  # noqa: F821
    OUTPUT = snakemake.output  # noqa: F821
    THREADS = snakemake.threads  # noqa: F821

    logger = get_logger(__name__, LOG)
    sc.settings = set_sc_log(sc.settings, logfile=LOG)
    sc.settings.n_jobs = THREADS

    adata = sc.read_h5ad(INPUT["data"])
    logger.info(f"Adata read from {INPUT['data']}")
    logger.info(f"Input data: {adata}")

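    # Neighbour graph on the harmony-corrected embedding, then UMAP and Leiden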
    sc.pp.neighbors(adata, use_rep="X_harmony")
    sc.tl.umap(adata)
    sc.tl.leiden(adata, resolution=PARAMS["res"])
    logger.info(
        f"{len(adata.obs.leiden.unique())} clusters identified at resolution {PARAMS['res']}"
    )

    _ = sc.pl.umap(
        adata,
        color=[
            "lane",
            "sample",
            "leiden",
            "total_counts",
            "pct_counts_mt",
            "pct_counts_ribo_prot",
        ],
        use_raw=True,
        show=False,
        save=False,
        ncols=2,
    )
    plt.savefig(OUTPUT["umap"], dpi=300, bbox_inches="tight")
    plt.close()

    # Marker genes
    if len(PARAMS["markers"]) == 0:
        Path(OUTPUT["markers"]).touch(exist_ok=True)
        logger.info("No markers provided. File touched, but empty.")
    else:
        with PdfPages(OUTPUT["markers"]) as pdf:
            for i in range(0, len(PARAMS["markers"]), 9):
                genes = PARAMS["markers"][i : i + 9]
                genes = [x for x in genes if x in adata.var_names]
                fig = sc.pl.umap(
                    adata,
                    color=genes,
                    use_raw=True,
                    return_fig=True,
                    save=False,
                    ncols=3,
                )
                pdf.savefig(fig, dpi=300, bbox_inches="tight")
                plt.close()
                logger.info(f"Markers {genes} plotted.")

    # Run dendrogram
    sc.tl.dendrogram(
        adata,
        use_rep="X_harmony",
        groupby="leiden",
        optimal_ordering=True,
    )

    # Differential expression
    # Not ranking by abs will return the 100 highest scoring upregulated genes
    sc.tl.rank_genes_groups(
        adata,
        groupby="leiden",
        use_raw=True,
        method="t-test_overestim_var",
        pts=True,
    )

    # And plot
    _ = sc.pl.rank_genes_groups_dotplot(
        adata,
        n_genes=5,
        values_to_plot="logfoldchanges",
        save=False,
        show=False,
        cmap="coolwarm",
        vmin=-4,
        vmax=4,
        colorbar_title="log2FC",
    )
    plt.savefig(OUTPUT["dot"], dpi=300, bbox_inches="tight")
    plt.close()

    # Write results
    results = get_de(adata)
    with pd.ExcelWriter(OUTPUT["de"], engine="openpyxl") as writer:
        for j, df in results:
            df.to_excel(writer, sheet_name=f"cluster_{j}")

    logger.info(f"Final adata: {adata}")
    adata.write_h5ad(OUTPUT["data"])

scripts/densities.py:
if __name__ == "__main__":
    import matplotlib.pyplot as plt
    import scanpy as sc
    from helpers.logs.get_logger import get_logger
    from helpers.logs.sc_logs import set_sc_log
    from matplotlib.backends.backend_pdf import PdfPages

    LOG = snakemake.log[0]  # noqa: F821
    PARAMS = snakemake.params  # noqa: F821
    INPUT = snakemake.input  # noqa: F821
    OUTPUT = snakemake.output  # noqa: F821
    THREADS = snakemake.threads  # noqa: F821

    logger = get_logger(__name__, LOG)
    sc.settings = set_sc_log(sc.settings, logfile=LOG)
    sc.settings.n_jobs = THREADS

    adata = sc.read_h5ad(INPUT["data"])
    logger.info(f"Adata read from {INPUT['data']}")
    logger.info(f"Input data: {adata}")

    # Calculate and plot embedding densities for each requested feature
    with PdfPages(OUTPUT["densities"]) as pdf:
        for feature in PARAMS["features"]:
            sc.tl.embedding_density(adata, groupby=feature)
            fig = sc.pl.embedding_density(
                adata,
                groupby=feature,
                group="all",
                color_map="plasma",
                ncols=3,
                save=False,
                return_fig=True,
            )
            fig.suptitle(f"Densities for feature: {feature}")
            pdf.savefig(fig, dpi=300, bbox_inches="tight")
            plt.close()
            logger.info(f"Feature {feature} plotted")

    # Save data
    logger.info(f"Final data: {adata}")
    adata.write_h5ad(OUTPUT["data"])

scripts/dim_reduc.py:
if __name__ == "__main__":
    import anndata as ad
    import matplotlib.pyplot as plt
    import scanpy as sc
    import scanpy.external as sce
    from helpers.logs.get_logger import get_logger
    from helpers.logs.sc_logs import set_sc_log
    from helpers.select_pcs import select_pcs

    LOG = snakemake.log[0]  # noqa: F821
    PARAMS = snakemake.params  # noqa: F821
    INPUT = snakemake.input  # noqa: F821
    OUTPUT = snakemake.output  # noqa: F821
    THREADS = snakemake.threads  # noqa: F821

    logger = get_logger(__name__, LOG)
    sc.settings = set_sc_log(sc.settings, logfile=LOG)
    sc.settings.n_jobs = THREADS

    # Concatenate samples
    adata = ad.concat(
        [sc.read_h5ad(path) for path in INPUT["data"]],
        join="outer",
        merge="same",
        label=None,
    )
    adata.obs_names_make_unique()
    logger.info(f"Adata read from {INPUT['data']}")
    logger.info(f"Input data: {adata}")

    # HVGs
    # Before normalisation as seurat_v3 expects raw counts
    if not PARAMS["nHVG"]:
        nHVG = int(max(min(len(adata.obs) / 2, 10000), 1000))
        logger.info(f"nHVG not provided. Using {nHVG}.")
    else:
        nHVG = PARAMS["nHVG"]
    sc.pp.highly_variable_genes(
        adata,
        n_top_genes=nHVG,
        flavor="seurat_v3",
        batch_key="lane",
        subset=False,
    )
    _ = sc.pl.highly_variable_genes(
        adata,
        log=False,
        show=False,
        save=False,
    )
    plt.savefig(OUTPUT["hvg"], dpi=300, bbox_inches="tight")
    plt.close()

    # Normalise
    # Exclude highly expressed to prevent skew of normalisation
    sc.pp.normalize_total(adata, exclude_highly_expressed=True)
    sc.pp.log1p(adata)

    # Save raw and filter
    adata.raw = adata

    # Regress and scale
    # No batch term in the regression; batch effects are handled by harmony below
    sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"], n_jobs=None)
    sc.pp.scale(adata, max_value=10)

    # PCA
    sc.tl.pca(adata, n_comps=50, use_highly_variable=True)
    _ = sc.pl.pca_variance_ratio(adata, n_pcs=50, show=False, save=False)
    plt.savefig(OUTPUT["elbow"], dpi=300, bbox_inches="tight")
    plt.close()

    # Harmony for batch correction
    # Harmony runs on all PCs present, so first trim X_pca to the desired number
    npc, adata = select_pcs(adata, threshold=PARAMS["var_thresh"])
    logger.info(f"{npc} PCs used.")

    sce.pp.harmony_integrate(
        adata,
        key="lane",
        adjusted_basis="X_harmony",
        max_iter_harmony=50,
    )

    # And save
    adata.write_h5ad(OUTPUT["data"])
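
The select_pcs helper is not shown on this page. A minimal sketch of a cumulative-variance threshold selection, as a hypothetical re-implementation for illustration rather than the workflow's actual helper, could look like:

import numpy as np
from anndata import AnnData


def select_pcs(adata: AnnData, threshold: float) -> tuple[int, AnnData]:
    """Trim X_pca to the first PCs reaching `threshold` cumulative variance."""
    ratios = adata.uns["pca"]["variance_ratio"]
    # Normalise to the PCs actually computed, then find the cut point
    cum = np.cumsum(ratios) / np.sum(ratios)
    npc = int(np.searchsorted(cum, threshold)) + 1
    adata.obsm["X_pca"] = adata.obsm["X_pca"][:, :npc]
    return npc, adata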

scripts/doublets.py:
if __name__ == "__main__":
    import sys

    import anndata as ad
    import matplotlib.pyplot as plt
    import pandas as pd
    import scanpy as sc
    import scvi
    from helpers.logs.get_logger import get_logger
    from pandas.plotting import table

    LOG = snakemake.log[0]  # noqa: F821
    PARAMS = snakemake.params  # noqa: F821
    INPUT = snakemake.input  # noqa: F821
    OUTPUT = snakemake.output  # noqa: F821
    WILDS = snakemake.wildcards  # noqa: F821

    logger = get_logger(__name__, LOG)
    scvi.settings.seed = 42

    # Get data
    adata = ad.concat(
        [sc.read_h5ad(path) for path in INPUT["data"]],
        join="outer",
        merge="same",
        label=None,
    )
    adata.obs_names_make_unique()
    adata.obs["lane"] = str(WILDS["lane"])
    logger.info(f"Adata read from {INPUT['data']}")
    logger.info(f"Input data: {adata}")

    # SCVI doesn't seem to have a way to specify where the output logs to,
    # So we have to wrap and redirect
    with open(LOG, "a") as file:
        sys.stderr = sys.stdout = file

        # No covariates - not used with SOLO
        scvi.data.setup_anndata(
            adata,
        )
        vae = scvi.model.SCVI(adata)
        vae.train(
            early_stopping=True, check_val_every_n_epoch=2, early_stopping_patience=20
        )

        solo = scvi.external.SOLO.from_scvi_model(
            vae,
        )
        solo.train(train_size=0.9, early_stopping=True, early_stopping_patience=20)
        preds = solo.predict(soft=True)

    # SCVI appends a '-0' to the index, so match it before transferring scores
    adata.obs.index = adata.obs.index + "-0"
    adata.obs["singlet_score"] = preds.loc[:, "singlet"]
    adata.obs["doublet_score"] = preds.loc[:, "doublet"]

    # Save a results table
    preds = (
        pd.DataFrame.from_dict(
            {
                "Sample": adata.obs["sample"],
                "Singlet": adata.obs["singlet_score"] > adata.obs["doublet_score"],
                "Doublet": adata.obs["singlet_score"] < adata.obs["doublet_score"],
            }
        )
        .groupby("Sample")
        .sum()
    )

    ax = plt.subplot(111, frame_on=False)
    ax.xaxis.set_visible(False)
    ax.yaxis.set_visible(False)
    _ = table(ax, preds, cellLoc="center")
    plt.savefig(OUTPUT["table"], dpi=300, bbox_inches="tight")
    plt.close()

    # Filter
    adata = adata[adata.obs.singlet_score > adata.obs.doublet_score, :]
    logger.info(f"Data after doublet calls: {adata}")

    # Clean up after scvi
    del adata.uns
    adata.obs = adata.obs.loc[:, ~adata.obs.columns.str.startswith("_")].copy()
    adata.obs["sample"] = adata.obs["sample"].astype("category")
    adata.obs["lane"] = adata.obs["lane"].astype("category")

    adata.write_h5ad(OUTPUT["data"])
    logger.info(f"Data saved to {OUTPUT['data']}")

scripts/filter_empty.R:
packages <- c("DropletUtils", "logger", "Matrix")
invisible(
  suppressPackageStartupMessages(
    lapply(packages, library, character.only = TRUE)
  )
)

set.seed(0)

# Configure logger
style <- layout_glue_generator("{time} :: {level} :: {namespace} :: {msg}")
logfile <- snakemake@log[[1]]
log_appender(appender_file(logfile))
log_layout(style)
log_threshold(INFO)

# Read in data; note that readMM fails on float-valued matrices
counts <- readMM(snakemake@input[["mtx"]])
log_success("Data read from {snakemake@input[['counts']]}")
log_info("Dimensions: {dim(counts)}")

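# Keep only small droplets (25 < nUMI < 2000) as an estimate of the
# ambient, empty-droplet distribution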
totals <- colSums(counts)
totals <- totals[totals > 25]
totals <- totals[totals < 2000]

# The bound beneath which all droplets are "empty"
lower <- mean(totals) + (2 * sd(totals))
log_info("Lower threshold: {lower}")

# Tail end plot
png(snakemake@output[["tail"]], res = 300, height = 4, width = 4, units = "in")
invisible(hist(totals, breaks = 500, xlab = "nUMI", main = NULL, col = "black"))
invisible(abline(v = lower, col = "darkgreen", lty = 2))
invisible(dev.off())
log_success("Tail end plot saved at {snakemake@output[['tail']]}")

# Knee plot
bcrank <- barcodeRanks(counts)
uniq <- !duplicated(bcrank$rank)
png(snakemake@output[["knee"]], res = 300, height = 4, width = 4, units = "in")
invisible(plot(bcrank$rank[uniq], bcrank$total[uniq],
  log = "xy",
  xlab = "Rank", ylab = "Total UMI count", cex.lab = 1.2
))
invisible(abline(h = lower, col = "darkgreen", lty = 2))
invisible(dev.off())
log_success("Knee plot saved at {snakemake@output[['knee']]}")

# And run emptyDrops
results <- emptyDrops(
  counts,
  lower = lower, niters = snakemake@params[["niters"]]
)
write.csv(results, snakemake@output[["empty"]])
log_success("Data saved at {snakemake@output[['empty']]}")

scripts/qc.py:
if __name__ == "__main__":
    from pathlib import Path

    import matplotlib.pyplot as plt
    import pandas as pd
    import scanpy as sc
    from helpers.logs.get_logger import get_logger
    from helpers.logs.sc_logs import set_sc_log

    LOG = snakemake.log[0]  # noqa: F821
    PARAMS = snakemake.params  # noqa: F821
    INPUT = snakemake.input  # noqa: F821
    OUTPUT = snakemake.output  # noqa: F821
    WILDS = snakemake.wildcards  # noqa: F821

    logger = get_logger(__name__, LOG)
    sc.settings = set_sc_log(sc.settings, logfile=LOG)

    adata = sc.read_10x_mtx(INPUT["dir"], make_unique=True)
    adata.obs_names_make_unique()
    logger.info(f"Adata read from {INPUT['dir']}")
    logger.info(f"Input data: {adata}")

    empty = pd.read_csv(INPUT["empty"], index_col=0)
    empty.index = adata.obs_names
    empty.FDR = empty.FDR.fillna(1)
    logger.info(f"Empty read from {INPUT['empty']}")

    # Transfer call
    adata.obs["is_cell"] = empty.FDR < 0.005

    # Initial filtering
    adata = adata[adata.obs["is_cell"], :]
    sc.pp.filter_genes(adata, min_cells=1, inplace=True)

    # Mark mitochondrial and ribosomal genes
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    adata.var["ribo_prot"] = adata.var_names.str.startswith("RPS", "RPL")

    # Calculate qc metrics
    sc.pp.calculate_qc_metrics(
        adata, qc_vars=["mt", "ribo_prot"], percent_top=None, log1p=False, inplace=True
    )

    # Various plots
    # Specify various sc figure settings
    sc.set_figure_params(dpi_save=300, fontsize=12, figsize=(6, 6), format="png")
    sc.settings.figdir = Path(".")

    _ = sc.pl.violin(
        adata,
        [
            "total_counts",
            "n_genes_by_counts",
            "pct_counts_mt",
        ],
        multi_panel=True,
        show=False,
        save=False,
    )
    plt.savefig(OUTPUT["qc_violin"], dpi=300, bbox_inches="tight")
    plt.close()

    _ = sc.pl.scatter(
        adata,
        "total_counts",
        "pct_counts_mt",
        show=False,
        save=False,
    )
    plt.savefig(OUTPUT["mt_scatter"], dpi=300, bbox_inches="tight")
    plt.close()

    _ = sc.pl.scatter(
        adata,
        "total_counts",
        "n_genes_by_counts",
        show=False,
        save=False,
    )
    plt.savefig(OUTPUT["n_genes"], dpi=300, bbox_inches="tight")
    plt.close()

    # Filter cells
    adata = adata[adata.obs.pct_counts_mt < PARAMS["pct_counts_mt"], :]
    adata = adata[adata.obs.total_counts <= PARAMS["total_counts"], :]
    adata = adata[
        adata.obs.n_genes_by_counts >= PARAMS["n_genes_by_counts"],
        :,
    ]

    # Filter genes
    # If introns, then single nucleus -> mt is artefact, so drop
    if PARAMS["introns"]:
        adata = adata[:, ~adata.var.mt]

    # Final plot...
    _ = sc.pl.highest_expr_genes(adata, n_top=50, show=False, save=False)
    plt.savefig(OUTPUT["n_top"], dpi=300, bbox_inches="tight")
    plt.close()

    # Mark and save
    adata.obs["sample"] = WILDS["sample"]
    adata.write_h5ad(OUTPUT["data"])
    logger.info(f"Filtered data: {adata}")
    logger.info(f"Data saved to {OUTPUT['data']}")

FastQC wrapper (0.79.0/bio/fastqc):
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)

    return base


# Run fastqc. Since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")

MultiQC wrapper (0.79.0/bio/multiqc):
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
Name: single_snake_sequencing
Version: v0.1.2
Maintainers: public
URL: https://github.com/IMS-Bio2Core-Facility/single_snake_sequencing
License: MIT