Snakemake workflow: rna-seq-kallisto-sleuth


This workflow performs a differential expression analysis with Kallisto and Sleuth.

Authors

  • Johannes Köster (https://koesterlab.github.io)

  • David Lähnemann

Usage

In any case, if you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).

Step 1: Obtain a copy of this workflow

  1. Create a new GitHub repository using this workflow as a template.

  2. Clone the newly created repository to your local system, into the place where you want to perform the data analysis, for example as sketched below.
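For example, assuming you named your new repository my-rnaseq-analysis under your own GitHub account (both names are placeholders):

git clone git@github.com:<your-account>/my-rnaseq-analysis.git
cd my-rnaseq-analysis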

Step 2: Configure workflow

Configure the workflow according to your needs by editing the file config/config.yaml, as well as the sample sheet config/samples.tsv and the unit sheet config/units.tsv.
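The sample sheet has one row per sample plus the covariate columns used in the sleuth model, and the unit sheet has one row per sequencing unit (e.g., lane or run) of a sample; the workflow joins the two on the sample column. A minimal sketch, assuming a single condition covariate and paired-end FASTQ columns named fq1/fq2 (these column names are illustrative; check the example sheets in .test and the schemas referenced from config.yaml for the exact layout):

config/samples.tsv (tab-separated):

sample	condition
A	untreated
B	treated

config/units.tsv (tab-separated):

sample	unit	fq1	fq2
A	lane1	raw/A_R1.fastq.gz	raw/A_R2.fastq.gz
B	lane1	raw/B_R1.fastq.gz	raw/B_R2.fastq.gz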

Step 3: Execute workflow

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores, or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

or

snakemake --use-conda --drmaa --jobs 100

If you want to fix not only the software stack but also the underlying OS, use

snakemake --use-conda --use-singularity

in combination with any of the modes above. See the Snakemake documentation for further details.
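For example, a run that pins both the conda environments and the container OS could look like this (a sketch combining the flags shown above; adjust the core count to your machine):

snakemake --use-conda --use-singularity --cores $N

or, in a cluster environment:

snakemake --use-conda --use-singularity --cluster qsub --jobs 100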

Step 4: Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here.

Step 5: Commit changes

Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:

git commit -a
git push

Step 6: Obtain updates from upstream

Whenever you want to synchronize your workflow copy with new developments from upstream, do the following (a consolidated command sequence is sketched after the list).

  1. Once, register the upstream repository in your local copy: git remote add -f upstream git@github.com:snakemake-workflows/rna-seq-kallisto-sleuth.git or, if you have not set up SSH keys, git remote add -f upstream https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth.git.

  2. Update the upstream version: git fetch upstream.

  3. Create a diff with the current version: git diff HEAD upstream/master workflow > upstream-changes.diff.

  4. Investigate the changes: vim upstream-changes.diff.

  5. Apply the modified diff via: git apply upstream-changes.diff.

  6. Carefully check whether you need to update the config files: git diff HEAD upstream/master config. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.
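Put together, one update round using the commands from the list above looks roughly like this (inspect the diff before applying it):

git fetch upstream
git diff HEAD upstream/master workflow > upstream-changes.diff
vim upstream-changes.diff
git apply upstream-changes.diff
git diff HEAD upstream/master config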

Step 7: Contribute back

In case you have also changed or added steps, please consider contributing them back to the original repository (a rough command sketch follows the list):

  1. Fork the original repo to a personal or lab account.

  2. Clone the fork to your local system, to a different place than where you ran your analysis.

  3. Copy the modified files from your analysis to the clone of your fork, e.g., cp -r workflow path/to/fork. Make sure to not accidentally copy config file contents or sample sheets.

  4. Adapt the example config files if needed.

  5. Commit and push your changes to your fork.

  6. Create a pull request against the original repository.
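A rough shell sketch of this procedure (the account name and paths are placeholders; adapt them to your fork):

git clone git@github.com:<your-account>/rna-seq-kallisto-sleuth.git path/to/fork
cp -r workflow path/to/fork
cd path/to/fork
git add workflow
git commit -m "describe your changes"
git push
# then open a pull request against the original repository on GitHub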

Testing

Test cases are in the subfolder .test and are automatically executed via continuous integration with GitHub Actions. As creating a workflow repository from the template does not include the submodule with the testing data, getting the GitHub Actions tests to pass requires adding the submodule:

cd .test
git submodule add -f https://github.com/snakemake-workflows/ngs-test-data.git
git commit -m "add ngs-test-data submodule to get GitHub Actions tests to pass"

Code Snippets

run:
    samples_ = units[["sample", "unit"]].merge(samples, on="sample")
    samples_["sample"] = samples_.apply(
        lambda row: "{}-{}".format(row["sample"], row["unit"]), axis=1)
    samples_["path"] = kallisto_output
    del samples_["unit"]
    samples_.to_csv(output[0], sep="\t")

script:
    "../scripts/sleuth-init.R"

script:
    "../scripts/sleuth-diffexp.R"

script:
    "../scripts/ihw-fdr-control.R"

script:
    "../scripts/plot-bootstrap.R"

script:
    "../scripts/plot-pca.R"

script:
    "../scripts/plot-diffexp-heatmap.R"

script:
    "../scripts/plot-diffexp-pval-hist.R"

script:
    "../scripts/sleuth-to-matrix.R"

script:
    "../scripts/plot-group-density.R"

script:
    "../scripts/plot-scatter.R"

script:
    "../scripts/plot-fld.R"

script:
    "../scripts/plot-variances.R"

shell:
    "conda create --quiet --yes -p {params.path} --channel bioconda bioconductor-{wildcards.package}={params.version}"

script:
    "../scripts/spia.R"

script:
    "../scripts/fgsea.R"

script:
    "../scripts/plot-fgsea-gene-sets.R"

script:
    "../scripts/ens_gene_to_go.R"

shell:
    "( curl --silent -o {output} {params.download} ) 2> {log}"

script:
    "../scripts/goatools-go-enrichment-analysis.py"

shell:
    "kallisto index -i {output} {input} 2> {log}"

shell:
    "kallisto quant -i {input.idx} -o {output} "
    "{params.extra} {input.fq} 2> {log}"

wrapper:
    "0.31.1/bio/cutadapt/pe"
(from line 12 of rules/trim.smk)

wrapper:
    "0.31.1/bio/cutadapt/se"
(from line 26 of rules/trim.smk)

scripts/ens_gene_to_go.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

# provides `tidyverse` and load_bioconductor_package()
source( file.path(snakemake@scriptdir, 'common.R') )

pkg <- snakemake@params[["bioc_pkg"]]
load_bioconductor_package(snakemake@input[["species_anno"]], pkg)

ens_gene_to_go <- AnnotationDbi::select(  get(pkg),
                                          keys=keys(get(pkg), keytype="ENSEMBL"),
                                          columns=c("GO"),
                                          keytype="ENSEMBL"
                                          ) %>%
                  # rows with empty ENSEMBL or GO IDs are useless and will lead to trouble below
                  drop_na(ENSEMBL,GO) %>%
                  dplyr::rename(ensembl_gene_id = ENSEMBL) %>%
                  group_by(ensembl_gene_id) %>%
                  summarise(go_ids = str_c(GO, collapse = ";"))


write_tsv(ens_gene_to_go, path = snakemake@output[[1]], col_names = FALSE)

scripts/fgsea.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("fgsea")

# provides library("tidyverse") and functions load_bioconductor_package() and
# get_prefix_col(), the latter requires snakemake@output[["samples"]] and
# snakemake@params[["covariate"]]
source( file.path(snakemake@scriptdir, 'common.R') )

pkg <- snakemake@params[["bioc_pkg"]]
load_bioconductor_package(snakemake@input[["species_anno"]], pkg)

gene_sets <- gmtPathways(snakemake@input[["gene_sets"]])
diffexp <- read_tsv(snakemake@input[["diffexp"]]) %>%
                  drop_na(ext_gene) %>%
                  mutate(ext_gene = str_to_upper(ext_gene)) %>%
                  # resolve multiple occurrences of the same ext_gene, usually
                  # meaning that several Ensembl genes map to the same gene
                  # symbol -- so we may lose resolution here, as long as gene
                  # symbols are used
                  group_by(ext_gene) %>%
                    filter( qval == min(qval, na.rm = TRUE) ) %>%
                    filter( pval == min(pval, na.rm = TRUE) ) %>%
                    # for the case of min(qval) == 1 and min(pval) == 1, we
                    # need something else to select a unique entry per gene
                    filter( mean_obs == max(mean_obs, na.rm = TRUE) ) %>%
                    mutate(target_id = str_c(target_id, collapse=",")) %>%
                    mutate(ens_gene = str_c(ens_gene, collapse=",")) %>%
                  distinct()

signed_pi <- get_prefix_col("signed_pi_value", colnames(diffexp))

ranked_genes <- diffexp %>%
                  dplyr::select(ext_gene, !!signed_pi) %>%
                  deframe()

# get and write out rank values that are tied -- a way to check up on respective warnings
rank_ties <- enframe(ranked_genes) %>%
               mutate(dup = duplicated(value) | duplicated(value, fromLast = TRUE) ) %>%
               filter(dup == TRUE) %>%
               dplyr::select(ext_gene = name, !!signed_pi := value)
write_tsv(rank_ties, snakemake@output[["rank_ties"]])

fgsea_res <- fgsea(pathways = gene_sets,
                    stats = ranked_genes,
                    minSize=10,
                    maxSize=700,
                    nproc=snakemake@threads,
                    nperm=snakemake@params[["nperm"]]
                    ) %>%
                as_tibble()

# Annotation is impossible without any entries, so then just write out empty files
if ( (fgsea_res %>% count() %>% pull(n)) == 0 ) {

    write_tsv(fgsea_res, path = snakemake@output[["enrichment"]])
    write_tsv(fgsea_res, path = snakemake@output[["significant"]])

} else {

    # add further annotation
    annotated <- fgsea_res %>%
                    unnest(leadingEdge) %>%
                    mutate(
                        leading_edge_symbol = str_to_title(leadingEdge),
                        leading_edge_entrez_id = mapIds(leading_edge_symbol, x=get(pkg), keytype="SYMBOL", column="ENTREZID"),
                        leading_edge_ens_gene = mapIds(leading_edge_symbol, x=get(pkg), keytype="SYMBOL", column="ENSEMBL")
                        ) %>%
                    group_by(pathway) %>%
                    summarise(
                        leadingEdge = str_c(leadingEdge, collapse = ','),
                        leading_edge_symbol = str_c(leading_edge_symbol, collapse = ','),
                        leading_edge_entrez_id = str_c(leading_edge_entrez_id, collapse = ','),
                        leading_edge_ens_gene = str_c(leading_edge_ens_gene, collapse = ',')
                    ) %>%
                    inner_join(fgsea_res %>% dplyr::select(-leadingEdge), by = "pathway") %>%
                    dplyr::select(-leadingEdge, -leading_edge_symbol,
                           -leading_edge_entrez_id, -leading_edge_ens_gene,
                            leading_edge_symbol, leading_edge_ens_gene,
                            leading_edge_entrez_id, leadingEdge)

    # write out fgsea results for all gene sets
    write_tsv(annotated, path = snakemake@output[["enrichment"]])

    # select significant pathways
    sig_gene_sets <- annotated %>%
                       filter( padj < snakemake@params[["gene_set_fdr"]] )

    # write out fgsea results for gene sets found to be significant
    write_tsv(sig_gene_sets, path = snakemake@output[["significant"]])
}

height = .7 * (length(gene_sets) + 2)

# table plot of all gene sets
tg <- plotGseaTable(
            pathway = gene_sets,
            stats = ranked_genes,
            fgseaRes = fgsea_res,
            gseaParam = 1,
            render = FALSE
        )
ggsave(filename = snakemake@output[["plot"]], plot = tg, width = 12, height = height, limitsize=FALSE)

scripts/goatools-go-enrichment-analysis.py:

import sys
sys.stderr = open(snakemake.log[0], "w")

import pandas as pd
import matplotlib.pyplot as plt
from goatools.obo_parser import GODag
from goatools.anno.idtogos_reader import IdToGosReader
from goatools.goea.go_enrichment_ns import GOEnrichmentStudyNS
from goatools.godag_plot import plot_results#, plot_goid2goobj, plot_gos


# read in directed acyclic graph of GO terms / IDs
obodag = GODag(snakemake.input.obo)


# read in mapping gene ids from input to GO terms / IDs
objanno = IdToGosReader(snakemake.input.ens_gene_to_go, godag = obodag)


# extract namespace(?) -> id2gos mapping
ns2assoc = objanno.get_ns2assc()

for nspc, id2gos in ns2assoc.items():
    print("{NS} {N:,} annotated genes".format(NS=nspc, N=len(id2gos)))

# read gene diffexp table
all_genes = pd.read_table(snakemake.input.diffexp)

# select genes significantly differentially expressed according to BH FDR of sleuth
fdr_level_gene = float(snakemake.params.gene_fdr)
sig_genes = all_genes[all_genes['qval']<fdr_level_gene]


# initialize GOEA object
fdr_level_go_term = float(snakemake.params.go_term_fdr)

goeaobj = GOEnrichmentStudyNS(
    # list of 'population' of genes looked at in total
    pop = all_genes['ens_gene'].tolist(),
    # geneid -> GO ID mapping
    ns2assoc = ns2assoc,
    # ontology DAG
    godag = obodag,
    propagate_counts = False,
    # multiple testing correction method (fdr_bh is false discovery rate control with Benjamini-Hochberg)
    methods = ['fdr_bh'],
    # significance cutoff for method named above
    alpha = fdr_level_go_term
    )

goea_results_all = goeaobj.run_study(sig_genes['ens_gene'].tolist())


# write results to text file
goeaobj.wr_tsv(snakemake.output.enrichment, goea_results_all)


# plot results
ensembl_id_to_symbol = dict(zip(all_genes['ens_gene'], all_genes['ext_gene']))


# from first plot output file name, create generic file name to trigger
# separate plots for each of the gene ontology name spaces
outplot_generic = snakemake.output.plot[0].replace('_BP.','_{NS}.', 1).replace('_CC.','_{NS}.', 1).replace('_MF.', '_{NS}.', 1)

goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < fdr_level_go_term]

plot_results(
    outplot_generic,
    # use pvals for coloring
    goea_results=goea_results_sig,
    # print general gene symbol instead of Ensembl ID
    id2symbol=ensembl_id_to_symbol,
    # number of genes to print, or True for printing all (including count of genes)
    study_items=True,
    # number of genes to print per line
    items_p_line=6,
    # p-values to use for coloring of GO term nodes (argument name determined from code and testing against value "p_uncorrected")
    pval_name="p_fdr_bh"
    )

# for all name spaces
for ns in ns2assoc.keys():
    # check if no GO terms were found to be significant
    if len([r for r in goea_results_sig if r.NS == ns]) == 0:
        fig = plt.figure(figsize=(12, 2))
        text = fig.text(0.5, 0.5,
                        "No plot generated, because no GO terms were found significant\n"
                        "for name space {} and significance levels: genes ({}), GO terms ({}).\n"
                        "You might want to check those levels and/or your intermediate data.".format(
                        ns, fdr_level_gene, fdr_level_go_term),
                        ha='center', va='center', size=20)
        fig.savefig( outplot_generic.replace('_{NS}.', "_{}.".format(ns)) )

scripts/ihw-fdr-control.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("tidyverse")
library("ggpubr")
library("IHW")

number_of_groups <- 7

gene_data <- read_tsv(snakemake@input[[1]])
level_name <- snakemake@wildcards[["level"]]

# calculate covariate mean_obs for gene.aggregated data
if(level_name == "genes-aggregated") {
  gene_data <- gene_data %>%
    mutate(mean_obs = if_else(num_aggregated_transcripts > 0, sum_mean_obs_counts / num_aggregated_transcripts, 0), ens_gene = target_id)
}

# determine the appropriate number of groups for grouping in ihw calculation
tested_number_of_groups <- number_of_groups
ihw_test_grouping <- function(x){
  tryCatch(
    expr = {
      ihw(pval ~ mean_obs, data = gene_data, alpha = 0.1, nbins = x)
      return(TRUE)
    },
    error = function(e) {
      print(str_c("Number of groups was set too hight, trying grouping by ", tested_number_of_groups - 1, " groups."))
      return(FALSE)
    })
}

while(!ihw_test_grouping(tested_number_of_groups) && tested_number_of_groups > 0){
  tested_number_of_groups <- tested_number_of_groups -1
  ihw_test_grouping(tested_number_of_groups)
}

if (tested_number_of_groups < number_of_groups) {
  print(str_c("The chosen number of groups for ", level_name, " dataset was too large, instead IHW was calculated on ", tested_number_of_groups, " groups."))
  number_of_groups <- tested_number_of_groups
}

gene_data <- gene_data %>%
  drop_na(pval, mean_obs) %>%
  select(ens_gene, ext_gene, pval, mean_obs) %>%
  mutate(grouping = groups_by_filter(mean_obs, number_of_groups))

### diagnostic plots for covariate and grouping
#dispersion plot
dispersion <- ggplot(gene_data, aes(x = percent_rank(mean_obs), y = -log10(pval))) +
  geom_point() +
  ggtitle("IHW: scatter plot of p-values vs. mean of counts") +
  theme(plot.title = element_text(size = 12, face = "bold", hjust = 0.5)) +
  xlab("percent rank of mean_obs") +
  ylab(expression(-log[10](p-value)))

ggsave(snakemake@output[["dispersion"]], dispersion)

#histograms
hist_overall <- ggplot(gene_data, aes(x = pval)) +
  geom_histogram(binwidth = 0.025, boundary = 0) +
  xlab("p-values without grouping") +
  ylab("density")

hist_groups <- ggplot(gene_data, aes(x = pval)) +
  geom_histogram(binwidth = 0.025, boundary = 0) +
  xlab("p-values of the individual groups") +
  ylab("density") +
  facet_wrap(~ grouping, nrow = ceiling(number_of_groups / 4)) 

plots_agg_mean_obs <- ggarrange(hist_overall, hist_groups, nrow = 1)

histograms <- annotate_figure(plots_agg_mean_obs,
                              top = text_grob("IHW: histograms for p-values of mean of counts",
                                              color = "black",
                                              face = "bold",
                                              size = 12))

ggexport(histograms,
         filename = snakemake@output[["histograms"]],
         width=14)

# ihw calculation
ihw_results_mean <- ihw(pval ~ mean_obs, data = gene_data, alpha = 0.1, nbins = tested_number_of_groups)

# merging ens_gene-IDs and ext_gene-names
ihw_mean <- as.data.frame(ihw_results_mean) %>% 
  # TODO remove ugly hack if ihw in future allows annotation columns (feature requested here: https://support.bioconductor.org/p/129972/)
  right_join(gene_data, by = c(pvalue = "pval", covariate = "mean_obs", group = "grouping")) %>%
  unique() %>%
  select(ens_gene, ext_gene, everything())

write_tsv(ihw_mean, snakemake@output[["results"]])

### diagnostic plots for ihw calculation
# plot of trends of the covariate
plot_trend_mean <- plot(ihw_results_mean) +
  ggtitle("IHW: Plot of trends of mean of counts") +
  theme(plot.title = element_text(size=12))
ggsave(snakemake@output[["trends"]], plot_trend_mean)

# plots of decision boundaries for unweighted p-values as a function of the covariate
plot_decision_mean <- plot(ihw_results_mean, what = "decisionboundary") +
  ggtitle("IHW: Decision boundaries for unweighted p-values vs. mean of counts") +
  theme(plot.title = element_text(size=12))
ggsave(snakemake@output[["decision"]], plot_decision_mean)

# p-values vs. adjusted p-values
plot_ihw_pval_mean <- ggplot(ihw_mean, aes(x = pvalue, y = adj_pvalue, col = group)) +
  geom_point(size = 0.25) +
  ggtitle("IHW: raw p-values vs. adjusted p-values from ihw-analysis") +
  theme(plot.title = element_text(size=12)) +
  scale_colour_hue(l = 70, c = 150, drop = FALSE)
ggsave(snakemake@output[["adj_pvals"]], plot_ihw_pval_mean)

scripts/plot-bootstrap.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("tidyverse")
library("sleuth")

so <- sleuth_load(snakemake@input[["so"]])

top_n <- -strtoi(snakemake@params["top_n"])

results <- read_tsv(snakemake@input[["transcripts"]])

top_genes <- results %>%
    filter(qval <= snakemake@params[["fdr"]]) %>%
    top_n(top_n, qval) %>%
    dplyr::select(ext_gene) %>%
    drop_na() %>%
    distinct(ext_gene)

genes_of_interest <- tibble( ext_gene = snakemake@params[["genes"]]) %>%
    distinct(ext_gene)

genes <- full_join(top_genes, genes_of_interest, by = 'ext_gene') %>%
    add_row(ext_gene = "Custom") %>%
    pull(ext_gene)

dir.create( snakemake@output[[1]] )

for (gene in genes) {
    transcripts <- results %>%
        filter(ext_gene == gene) %>%
        drop_na() %>%
        pull(target_id)

    if ( length(transcripts) > 0 ) {
        for (transcript in transcripts) {
            plot_bootstrap(so, transcript, color_by = snakemake@params[["color_by"]], units = "tpm")
            ggsave(file = str_c(snakemake@output[[1]], "/", gene, ".", transcript, ".", snakemake@wildcards[["model"]] , ".bootstrap.pdf"))
        }
    }
}

scripts/plot-diffexp-heatmap.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")

so <- sleuth_load(snakemake@input[["so"]])

diffexp <- read.table(snakemake@input[["diffexp"]], sep = "\t", header = TRUE)

pdf(file = snakemake@output[[1]], width = 14)

plot_transcript_heatmap(so, transcripts = diffexp$target_id[1:20], main="Top 20 differentially expressed transcripts (TPM)")

# TODO, once pachterlab/sleuth#214 is merged, add this to get gene names
# , labels_row = diffexp$ext_gene[1:20])
dev.off()

scripts/plot-diffexp-pval-hist.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")
library("ggplot2")
library("tidyr")

diffexp <- sleuth_load(snakemake@input[["diffexp_rds"]]) %>% drop_na(pval)

ggplot(diffexp) + geom_histogram(aes(pval), bins = 100)
ggsave(file = snakemake@output[[1]], width = 14)

scripts/plot-fgsea-gene-sets.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("fgsea")

# provides library("tidyverse") and function get_prefix_col()
# the latter requires snakemake@output[["samples"]] and
# snakemake@params[["covariate"]]
source( file.path(snakemake@scriptdir, 'common.R') )


gene_sets <- gmtPathways(snakemake@input[["gene_sets"]])
sig_gene_sets <- read_tsv(snakemake@input[["sig_gene_sets"]])
diffexp <- read_tsv(snakemake@input[["diffexp"]]) %>%
                  drop_na(ext_gene) %>%
                  mutate(ext_gene = str_to_upper(ext_gene)) %>%
                  group_by(ext_gene) %>%
                  filter( qval == min(qval, na.rm = TRUE) ) %>%
                  mutate(target_id = str_c(target_id, collapse=",")) %>%
                  mutate(ens_gene = str_c(ens_gene, collapse=",")) %>%
                  distinct()

signed_pi <- get_prefix_col("signed_pi_value", colnames(diffexp))

ranked_genes <- diffexp %>%
                  dplyr::select(ext_gene, !!signed_pi) %>%
                  deframe()

dir.create( snakemake@output[[1]] )

for ( set in names(gene_sets) ) {
  # plot gene set enrichment
  p <- plotEnrichment(gene_sets[[set]], ranked_genes) +
         ggtitle(str_c("gene set: ", set),
           subtitle = str_c(
             sig_gene_sets %>% filter(pathway == set) %>% pull(size)," genes;  ",
             "p-value (BH-adjusted): ", sig_gene_sets %>% filter(pathway ==  set) %>% pull(padj), "\n",
             "normalized enrichment score (NES):", sig_gene_sets %>% filter(pathway == set) %>% pull(NES)
             )
         ) +
         xlab("gene rank") +
         theme_bw( base_size = 16 )
  ggsave(
    file = str_c( snakemake@output[[1]], "/", snakemake@wildcards[["model"]], ".", set, ".gene-set-plot.pdf"),
    width = 10,
    height = 7
  )
}

scripts/plot-fld.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")

so <- sleuth_load(snakemake@input[[1]])

pdf(file = snakemake@output[[1]])
plot_fld(so, paste0(snakemake@wildcards[["sample"]], "-", snakemake@wildcards[["unit"]]))
dev.off()

scripts/plot-group-density.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")
library("tidyverse")


so <- sleuth_load(snakemake@input[[1]])

plot_group_density(so,
                   use_filtered = TRUE,
                   units = "est_counts",
                   trans = "log",
                   grouping = setdiff(colnames(so$sample_to_covariates), "sample"),
                   offset = 1)
ggsave(snakemake@output[[1]])

scripts/plot-pca.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")
library("ggpubr")

so <- sleuth_load(snakemake@input[[1]])

#principal components
pc <- 4

# plot pca
pc12 <- plot_pca(so, color_by = snakemake@wildcards[["covariate"]], units = "est_counts", text_labels = TRUE)
pc34 <- plot_pca(so, pc_x = 3, pc_y = 4, color_by = snakemake@wildcards[["covariate"]], units = "est_counts", text_labels = TRUE)

ggarrange(pc12, pc34, ncol = 2, common.legend = TRUE)
ggsave(snakemake@output[["pca"]], width=14)

# plot pc variance
plot_pc_variance(so, use_filtered = TRUE, units = "est_counts", pca_number = pc)
ggsave(snakemake@output[["pc_var"]], width=14)

# plot loadings
pc_loading_plots <- list()
for(i in 1:pc) {
  pc_loading_plots[[paste0("pc", i, "_loading")]] <- plot_loadings(so, 
                                                                   scale = TRUE, 
                                                                   pc_input = i,
                                                                   pc_count = 10, 
                                                                   units = "est_counts") +
    ggtitle(paste0(snakemake@wildcards[["covariate"]], ": plot loadings for principal component ", i)) +
    theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
}
plots_loading <- ggarrange(plotlist = pc_loading_plots, ncol = 1, nrow = 1, common.legend = TRUE)
ggexport(plots_loading, filename = snakemake@output[["loadings"]], width = 14)

scripts/plot-scatter.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")
library("tidyverse")
library("ggpubr")

so <- sleuth_load(snakemake@input[[1]])
plot_list <- list()

# combinations between samples without repetition 

# sample_matrix <- t(combn(so$sample_to_covariates$sample, 2))
# combinations <- as.numeric(row(sample_matrix))
# 
# for(i in combinations) {
#   sample1 <- sample_matrix[i, 1]
#   sample2 <- sample_matrix[i, 2]
#   
#   plot_list[[i]] <- plot_scatter(so, 
#                                  sample_x = sample1, 
#                                  sample_y = sample2, 
#                                  use_filtered = TRUE, 
#                                  units = "est_counts", 
#                                  point_alpha = 0.2, 
#                                  xy_line = TRUE, 
#                                  xy_line_color = "red") +
#     labs(x = sample1, y = sample2)
#   }


# all combinations between samples (with repetition)

samples <- so$sample_to_covariates$sample
sample_matrix <- crossing(x = samples, y = samples) # creates all combinations between samples
combinations <- as.numeric(row(sample_matrix))

for(i in combinations) {
  sample1 <- sample_matrix[i, 1]
  sample2 <- sample_matrix[i, 2]
  if(sample1 != sample2) {
    plot_list[[i]] <- plot_scatter(so,
                                   sample_x = sample1,
                                   sample_y = sample2,
                                   use_filtered = TRUE,
                                   units = "est_counts",
                                   point_alpha = 0.2,
                                   xy_line = TRUE,
                                   xy_line_color = "red") +
      labs(x = sample1, y = sample2)

  } else {
    x_val <- y_val <- c(0,0)
    test <- tibble(x_val, y_val)
    plot_list[[i]] <- ggplot(test, aes(x = x_val, y = y_val)) +
      theme_void() +
      geom_text(label = "") +
      annotate("text", label = sample1, x=0, y=0, colour = "blue", fontface = "bold")
  }
}

all_plots <- ggarrange(plotlist = plot_list)
plot_figure <- annotate_figure(all_plots, 
                               top = text_grob("log transformed scatter plots of different samples",
                                               color = "black", 
                                               face = "bold", 
                                               size = 14))
ggsave(snakemake@output[[1]], plot_figure)

scripts/plot-variances.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")

sl <- snakemake@params[['sig_level']]

so <- sleuth_load(snakemake@input[[1]])

# colors from colorblind-safe Brewer palette "Dark2":
# http://colorbrewer2.org/#type=qualitative&scheme=Dark2&n=3
cb_safe_green <- "#1B9E77"
cb_safe_red <- "#D95F02"

p <- plot_vars(so,
        test = NULL,
        test_type = "lrt",
        which_model = "full",
        sig_level = sl,
        point_alpha = 0.2,
        sig_color = cb_safe_red,
        xy_line = TRUE,
        xy_line_color = cb_safe_red,
        highlight = NULL,
        highlight_color = cb_safe_green
    )
ggsave(filename = snakemake@output[[1]], width = 7, height = 7)

scripts/sleuth-diffexp.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")
library("tidyverse")
library("fs")
library("grid")
library("gridExtra")

model <- snakemake@params[["model"]]

sleuth_object <- sleuth_load(snakemake@input[[1]])

sleuth_object <- sleuth_fit(sleuth_object, as.formula(model[["full"]]), 'full')
sleuth_object <- sleuth_fit(sleuth_object, as.formula(model[["reduced"]]), 'reduced')
sleuth_object <- sleuth_lrt(sleuth_object, "reduced", "full")

# plot mean-variance
mean_var_plot <- plot_mean_var(sleuth_object, 
                               which_model = "full", 
                               point_alpha = 0.4,
                               point_size = 2, 
                               point_colors = c("black", "dodgerblue"),
                               smooth_alpha = 1, 
                               smooth_size = 0.75, 
                               smooth_color = "red")
ggsave(snakemake@output[["mean_var_plot"]], mean_var_plot)

write_results <- function(so, mode, output, output_all) {
    so$pval_aggregate <- FALSE
    if(mode == "aggregate") {
      # workaround the following bug-request:
      # https://github.com/pachterlab/sleuth/pull/240
      # TODO renaming can be removed when fixed
      g_col <- so$gene_column
      so$gene_column <- NULL
      so$pval_aggregate <- TRUE
      so$gene_column <- g_col
    }

    plot_model <- snakemake@wildcards[["model"]]

    # list for qq-plots to make a multipage pdf-file as output
    qq_list <- list()

    print("Performing likelihood ratio test")
    all <- sleuth_results(so, "reduced:full", "lrt", show_all = TRUE, pval_aggregate = so$pval_aggregate) %>%
            arrange(pval)

    covariates <- colnames(design_matrix(so, "full"))
    covariates <- covariates[covariates != "(Intercept)"]

    # iterate over all covariates and perform wald test in order to obtain beta estimates
    if(!so$pval_aggregate) {

      # lists for volcano and ma-plots to make a multipage pdf-file as output
      volcano_list <- list()
      ma_list <- list()

      for(covariate in covariates) {
            print(str_c("Performing wald test for ", covariate))
            so <- sleuth_wt(so, covariate, "full")

            volc_plot_title <- str_c(plot_model, ": volcano plot for ", covariate)
            ma_plot_title <- str_c(plot_model, ": ma-plot for ", covariate)
            qq_plot_title <- str_c(plot_model, ": qq-plot from wald test for ", covariate)

            # volcano plot
            print(str_c("Performing volcano plot for ", covariate))
            volcano <- plot_volcano(so, covariate, "wt", "full",
                                    sig_level = snakemake@params[["sig_level_volcano"]], 
                                    point_alpha = 0.2, 
                                    sig_color = "red",
                                    highlight = NULL) +
              ggtitle(volc_plot_title) +
              theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5))
            volcano_list[[volc_plot_title]] <- volcano

            # ma-plot
            print(str_c("Performing ma-plot for ", covariate))
            ma_plot <- plot_ma(so, covariate, "wt", "full",
                               sig_level = snakemake@params[["sig_level_ma"]], 
                               point_alpha = 0.2, 
                               sig_color = "red",
                               highlight = NULL, 
                               highlight_color = "green") +
              ggtitle(ma_plot_title) +
              theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5))
            ma_list[[ma_plot_title]] <- ma_plot

            # qq-plots from wald test
            print(str_c("Performing qq-plot from wald test for ", covariate))
            qq_plot <- plot_qq(so, covariate, "wt", "full", 
                               sig_level = snakemake@params[["sig_level_qq"]],
                               point_alpha = 0.2, 
                               sig_color = "red", 
                               highlight = NULL, 
                               highlight_color = "green",
                               line_color = "blue") +
              ggtitle(qq_plot_title) +
              theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5))
            qq_list[[qq_plot_title]] <- qq_plot


            beta_col_name <- str_c("b", covariate, sep = "_")
            beta_se_col_name <- str_c(beta_col_name, "se", sep = "_")
            all_wald <- sleuth_results(so, covariate, "wt", show_all = TRUE, pval_aggregate = FALSE) %>%
                        dplyr::select( target_id = target_id,
                                !!beta_col_name := b,
                                !!beta_se_col_name := se_b)
            signed_pi_col_name <- str_c("signed_pi_value", covariate, sep = "_")
            all <- inner_join(all, all_wald, by = "target_id") %>%
                    # calculate a signed version of the pi-value from:
                    # https://dx.doi.org/10.1093/bioinformatics/btr671
                    # e.g. useful for GSEA ranking
                    mutate( !!signed_pi_col_name := -log10(pval) * !!sym(beta_col_name) )
      }
      # saving volcano plots
      marrange_volcano <- marrangeGrob(grobs=volcano_list, nrow=1, ncol=1, top = NULL)
      ggsave(snakemake@output[["volcano_plots"]], plot = marrange_volcano, width = 14)

      # saving ma-plots
      marrange_ma <- marrangeGrob(grobs=ma_list, nrow=1, ncol=1, top = NULL)
      ggsave(snakemake@output[["ma_plots"]], plot = marrange_ma, width = 14)
    }

    if(mode == "mostsignificant") {
        # for each gene, select the most significant transcript (this is equivalent to sleuth_gene_table, but with bug fixes)
      all <- all %>%
                drop_na(ens_gene) %>%
                group_by(ens_gene) %>%
                filter( qval == min(qval, na.rm = TRUE) ) %>%
                # ties in qval (e.g. min(qval) == 1) can lead to multiple entries per gene
                filter( pval == min(pval, na.rm = TRUE) ) %>%
                # for min(qval) == 1, then usually also min(pval) == 1, and
                # we need something else to select a unique entry per gene
                filter( mean_obs == max(mean_obs, na.rm = TRUE) ) %>%
                # sometimes we still get two transcript variants with exactly
                # the same values, i.e. they have exactly the same sequence
                # but (slightly) different annotations -- then we retain a string
                # with a comma-separated list of them
                mutate(target_id = str_c(target_id, collapse=",")) %>%
                distinct() %>%
                # useful sort for scrolling through output by increasing q-values
                arrange(qval)

      # qq-plot from likelihood ratio test
      print(str_c("Performing qq-plot from likelihood ratio test"))
      qq_plot_title_trans <- str_c(plot_model, ": qq-plot from likelihood ratio test")
      qq_plot_trans <- plot_qq(so, 
                               test = 'reduced:full', 
                               test_type = 'lrt', 
                               sig_level = snakemake@params[["sig_level_qq"]],
                               point_alpha = 0.2, 
                               sig_color = "red", 
                               highlight = NULL, 
                               highlight_color = "green",
                               line_color = "blue") +
        ggtitle(qq_plot_title_trans) +
        theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5))
      qq_list[[qq_plot_title_trans]] <- qq_plot_trans
    }

    # saving qq-plots
    marrange_qq <- marrangeGrob(grobs=qq_list, nrow=1, ncol=1, top = NULL)
    ggsave(snakemake@output[["qq_plots"]], plot = marrange_qq, width = 14)

    write_tsv(all, path = output, quote_escape = "none")
    write_rds(all, path = output_all, compress = "none")
}
write_results(sleuth_object, "transcripts", snakemake@output[["transcripts"]], snakemake@output[["transcripts_rds"]])
write_results(sleuth_object, "aggregate", snakemake@output[["genes_aggregated"]], snakemake@output[["genes_aggregated_rds"]])
write_results(sleuth_object, "mostsignificant", snakemake@output[["genes_mostsigtrans"]], snakemake@output[["genes_mostsigtrans_rds"]])

scripts/sleuth-init.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")
library("biomaRt")
library("tidyverse")

model <- snakemake@params[["model"]]

# this variable holds a mirror name until
# useEnsembl succeeds ("www" is last, because 
# of very frequent "Internal Server Error"s)
mart <- "useast"
rounds <- 0
while ( class(mart)[[1]] != "Mart" ) {
  mart <- tryCatch(
    {
      # done here, because error function does not
      # modify outer scope variables, I tried
      if (mart == "www") rounds <- rounds + 1
      # equivalent to useMart, but you can choose
      # the mirror instead of specifying a host
      biomaRt::useEnsembl(
        biomart = "ENSEMBL_MART_ENSEMBL",
        dataset = str_c(snakemake@params[["species"]], "_gene_ensembl"),
        mirror = mart
      )
    },
    error = function(e) {
      # change or make configurable if you want more or
      # less rounds of tries of all the mirrors
      if (rounds >= 3) {
        stop(
          str_c(
            "Have tried all 4 available Ensembl biomaRt mirrors ",
            rounds,
            " times. You might have a connection problem, or no mirror is responsive."
          )
        )
      }
      # hop to next mirror
      mart <- switch(mart,
                     useast = "uswest",
                     uswest = "asia",
                     asia = "www",
                     www = {
                       # wait before starting another round through the mirrors,
                       # hoping that intermittent problems disappear
                       Sys.sleep(30)
                       "useast"
                     }
              )
    }
  )
}

t2g <- biomaRt::getBM(
            attributes = c( "ensembl_transcript_id",
                            "ensembl_gene_id",
                            "external_gene_name"),
            mart = mart,
            useCache = FALSE
            ) %>%
        rename( target_id = ensembl_transcript_id,
                ens_gene = ensembl_gene_id,
                ext_gene = external_gene_name
                )

samples <- read_tsv(snakemake@input[["samples"]], na = "", col_names = TRUE) %>%
            # make everything except the index, sample name and path string a factor
            mutate_at(  vars(-X1, -sample, -path),
                        list(~factor(.))
                        )

if(!is.null(snakemake@params[["exclude"]])) {
    samples <- samples %>%
                filter( !sample %in% snakemake@params[["exclude"]] )
}

if(!is.null(model)) {
    # retrieve the model formula
    formula <- as.formula(model)
    # extract variables from the formula and unnest any nested variables
    variables <- labels(terms(formula)) %>%
                    strsplit('[:*]') %>%
                    unlist()
    # remove samples with an NA value in any of the columns
    # relevant for sleuth under the current model
    samples <- samples %>%
                drop_na(
                    c( sample, path, all_of(variables) )
                )
}


so <- sleuth_prep(  samples,
                    extra_bootstrap_summary = TRUE,
                    target_mapping = t2g,
                    aggregation_column = "ens_gene",
                    read_bootstrap_tpm = TRUE,
                    transform_fun_counts = function(x) log2(x + 0.5),
                    num_cores = snakemake@threads
                    )

custom_transcripts <- so$obs_raw %>%
                        # find transcripts not in the target_mapping
                        filter(!target_id %in% so$target_mapping$target_id) %>%
                        # make it a succinct list with no repetitions
                        distinct(target_id) %>%
                        # pull it out into a vector
                        pull(target_id)

if(!length(custom_transcripts) == 0) {
    so$target_mapping <- so$target_mapping %>%
                        # add those custom transcripts as rows to the target mapping
                        add_row(ens_gene = NA, ext_gene = "Custom", target_id = custom_transcripts)
}

sleuth_save(so, snakemake@output[[1]])

scripts/sleuth-to-matrix.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")

so <- sleuth_load(snakemake@input[[1]])

tpm <- sleuth_to_matrix(so, "obs_norm", "tpm")
target_mapping <- so$target_mapping
rownames(target_mapping) <- target_mapping$target_id
tpm <- cbind(ext_gene = target_mapping[rownames(tpm), "ext_gene"], tpm)

write.table(tpm, file = snakemake@output[[1]], col.names = NA, row.names = TRUE)

scripts/spia.R:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("SPIA")
library("graphite")

# provides library("tidyverse") and functions load_bioconductor_package() and
# get_prefix_col(), the latter requires snakemake@output[["samples"]] and
# snakemake@params[["covariate"]]
source( file.path(snakemake@scriptdir, 'common.R') )

pkg <- snakemake@params[["bioc_pkg"]]
load_bioconductor_package(snakemake@input[["species_anno"]], pkg)



pw_db <- snakemake@params[["pathway_db"]]

db <- pathways(snakemake@params[["species"]], pw_db)
db <- convertIdentifiers(db, "ENSEMBL")

options(Ncpus = snakemake@threads)

diffexp <- read_tsv(snakemake@input[["diffexp"]]) %>%
            drop_na(ens_gene) %>%
            mutate(ens_gene = str_c("ENSEMBL:", ens_gene))
universe <- diffexp %>% pull(var = ens_gene)
sig_genes <- diffexp %>% filter(qval <= 0.05)

# get logFC equivalent (the sum of beta scores of covariates of interest)

beta_col <- get_prefix_col("b", colnames(sig_genes))

beta <- sig_genes %>%
            dplyr::select(ens_gene, !!beta_col) %>%
            deframe()

t <- tempdir(check=TRUE)
olddir <- getwd()
setwd(t)
prepareSPIA(db, pw_db)
res <- runSPIA(de = beta, all = universe, pw_db, plots = TRUE, verbose = TRUE)
setwd(olddir)

file.copy(file.path(t, "SPIAPerturbationPlots.pdf"), snakemake@output[["plots"]])
write_tsv(res, snakemake@output[["table"]])

wrapper 0.31.1/bio/cutadapt/pe:

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "cutadapt"
    " {snakemake.params}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}")

wrapper 0.31.1/bio/cutadapt/se:

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "cutadapt"
    " {snakemake.params}"
    " -o {snakemake.output.fastq}"
    " {snakemake.input[0]}"
    " > {snakemake.output.qc} {log}")