Snakemake workflow: rna-seq-kallisto-sleuth


This workflow performs a differential expression analysis with Kallisto and Sleuth.

Authors

  • Johannes Köster (https://koesterlab.github.io)

  • David Lähnemann

Usage

In any case, if you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).

Step 1: Obtain a copy of this workflow

  1. Create a new GitHub repository using this workflow as a template.

  2. Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

Step 2: Configure workflow

Configure the workflow according to your needs by editing the file config/config.yaml, as well as the sample sheet config/samples.tsv and the unit sheet config/units.tsv.
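As the snippets further below show, both sheets are tab-separated and linked by a sample column, with units.tsv adding a unit column for each sequencing run of a sample; covariates such as condition live in samples.tsv. A minimal, purely illustrative consistency check of the two sheets with pandas (a sketch only; anything beyond the sample and unit columns depends on your configuration):

import pandas as pd

# read both sheets as tab-separated tables
samples = pd.read_csv("config/samples.tsv", sep="\t", dtype=str)
units = pd.read_csv("config/units.tsv", sep="\t", dtype=str)

# every unit must refer to a sample that is defined in samples.tsv
unknown = set(units["sample"]) - set(samples["sample"])
assert not unknown, f"units.tsv references undefined samples: {unknown}"

# sample/unit combinations must be unique, otherwise outputs would clash
assert not units.duplicated(["sample", "unit"]).any(), "duplicate sample/unit rows in units.tsv"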

Step 3: Execute workflow

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores, or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

or

snakemake --use-conda --drmaa --jobs 100

If you want to fix not only the software stack but also the underlying OS, use

snakemake --use-conda --use-singularity

in combination with any of the modes above. See the Snakemake documentation for further details.

Step 4: Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here.

Step 5: Commit changes

Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:

git commit -a
git push

Step 6: Contribute back

In case you have also changed or added steps, please consider contributing them back to the original repository:

  1. Fork the original repo to a personal or lab account.

  2. Clone the fork to your local system, to a different place than where you ran your analysis.

  3. Copy the modified files from your analysis to the clone of your fork, e.g., cp -r workflow path/to/fork. Make sure not to accidentally copy config file contents or sample sheets.

  4. Commit and push your changes to your fork.

  5. Create a pull request against the original repository.

Testing

Test cases are in the subfolder .test. They are automatically executed via continuous integration with GitHub Actions.

Code Snippets

run:
    samples_ = units[["sample", "unit"]].merge(samples, on="sample")
    samples_["sample"] = samples_.apply(
        lambda row: "{}-{}".format(row["sample"], row["unit"]), axis=1)
    samples_["path"] = kallisto_output
    del samples_["unit"]
    samples_.to_csv(output[0], sep="\t")
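The run block above merges the unit sheet into the sample sheet (units, samples and kallisto_output are defined elsewhere in the workflow), renames each entry to <sample>-<unit>, attaches the corresponding kallisto output directory as path, and writes the table that sleuth-init.R (further below) reads as its samples input, where the written row index appears as the X1 column. A standalone toy illustration of the resulting shape (all values made up):

import pandas as pd

units = pd.DataFrame({"sample": ["A", "A", "B"], "unit": ["1", "2", "1"]})
samples = pd.DataFrame({"sample": ["A", "B"], "condition": ["treated", "untreated"]})
kallisto_output = ["results/kallisto/A-1", "results/kallisto/A-2", "results/kallisto/B-1"]

samples_ = units[["sample", "unit"]].merge(samples, on="sample")
samples_["sample"] = samples_.apply(lambda row: "{}-{}".format(row["sample"], row["unit"]), axis=1)
samples_["path"] = kallisto_output
del samples_["unit"]
print(samples_)
# roughly:
#   sample  condition                  path
# 0    A-1    treated  results/kallisto/A-1
# 1    A-2    treated  results/kallisto/A-2
# 2    B-1  untreated  results/kallisto/B-1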
script:
    "../scripts/sleuth-init.R"
script:
    "../scripts/sleuth-diffexp.R"
script:
    "../scripts/plot-bootstrap.R"
script:
    "../scripts/plot-pca.R"
script:
    "../scripts/plot-diffexp-heatmap.R"
script:
    "../scripts/plot-diffexp-pval-hist.R"
script:
    "../scripts/sleuth-to-matrix.R"
script:
    "../scripts/plot-fld.R"
script:
    "../scripts/plot-variances.R"
script:
    "../scripts/spia.R"
script:
    "../scripts/fgsea.R"
script:
    "../scripts/fgsea_plot_gene_set.R"
script:
    "../scripts/biomart-ens_gene_to_go.R"
shell:
    "( curl --silent -o {output} {params.download} ) 2> {log}"
script:
    "../scripts/goatools-go-enrichment-analysis.py"
shell:
    "kallisto index -i {output} {input} 2> {log}"
shell:
    "kallisto quant -i {input.idx} -o {output} "
    "{params.extra} {input.fq} 2> {log}"
wrapper:
    "0.31.1/bio/cutadapt/pe"
(from line 12 of rules/trim.smk)

wrapper:
    "0.31.1/bio/cutadapt/se"
(from line 26 of rules/trim.smk)

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("biomaRt")
library("tidyverse")

# create an ensembl biomart for the species specified in the params field
mart <- biomaRt::useMart(
  biomart = "ENSEMBL_MART_ENSEMBL",
  dataset = str_c(snakemake@params[["species"]], "_gene_ensembl"),
  host = 'ensembl.org')

# get all ensembl_gene_id - go_id pairs and collapse into key->values mapping:
#     ensembl_gene_id -> go_id;go_id;go_id
ens_gene_to_go <- biomaRt::getBM(attributes = c("ensembl_gene_id", "go_id"), mart = mart) %>%
                    # rows with empty go_id are useless and will lead to extra semicolons below
                    filter(go_id != "") %>%
                    group_by(ensembl_gene_id) %>%
                    summarise(go_ids = str_c(go_id, collapse = ";"))

write_tsv(ens_gene_to_go, path = snakemake@output[[1]], col_names = FALSE)
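The mapping written here is a header-less two-column TSV of the form ensembl_gene_id<TAB>go_id;go_id;..., which the GO enrichment script further below parses with goatools' IdToGosReader. A small sketch for reading such a file back by hand, e.g. for a quick inspection (the path is illustrative):

# parse the gene -> GO mapping produced above into a dict of GO-ID sets
gene_to_gos = {}
with open("resources/ens_gene_to_go.tsv") as handle:
    for line in handle:
        gene, gos = line.rstrip("\n").split("\t")
        gene_to_gos[gene] = set(gos.split(";"))

print(len(gene_to_gos), "genes with at least one GO annotation")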
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("fgsea")

# provides library("tidyverse") and function get_prefix_col()
# the latter requires snakemake@output[["samples"]] and
# snakemake@params[["covariate"]]
source( file.path(snakemake@scriptdir, 'common.R') )


gene_sets <- gmtPathways(snakemake@input[["gene_sets"]])
sig_gene_sets <- read_tsv(snakemake@input[["sig_gene_sets"]])
diffexp <- read_tsv(snakemake@input[["diffexp"]]) %>%
                  drop_na(ext_gene) %>%
                  mutate(ext_gene = str_to_upper(ext_gene)) %>%
                  group_by(ext_gene) %>%
                  filter( qval == min(qval, na.rm = TRUE) ) %>%
                  mutate(target_id = str_c(target_id, collapse=",")) %>%
                  mutate(ens_gene = str_c(ens_gene, collapse=",")) %>%
                  distinct()

signed_pi <- get_prefix_col("signed_pi_value", colnames(diffexp))

ranked_genes <- diffexp %>%
                  dplyr::select(ext_gene, !!signed_pi) %>%
                  deframe()

set <- snakemake@wildcards[["gene_set"]]

# plot gene set enrichment
p <- plotEnrichment(gene_sets[[set]], ranked_genes) +
            ggtitle(str_c("gene set: ", set),
                subtitle = str_c(
                    sig_gene_sets %>% filter(pathway == set) %>% pull(size), " genes;  ",
                    "p-value (BH-adjusted): ", sig_gene_sets %>% filter(pathway ==  set) %>% pull(padj), "\n",
                    "normalized enrichment score (NES):", sig_gene_sets %>% filter(pathway == set) %>% pull(NES)
                    )
            ) +
            xlab("gene rank") + 
            theme_bw(base_size=16)
ggsave(snakemake@output[[1]], width=10, height=7)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("fgsea")
library("AnnotationDbi")

# provides library("tidyverse") and function get_prefix_col()
# the latter requires snakemake@output[["samples"]] and
# snakemake@params[["covariate"]]
source( file.path(snakemake@scriptdir, 'common.R') )

# get the species for adding annotations below
species <- str_to_title( str_sub(snakemake@params[["species"]], 1, 2) )

# construct the respective package name
pkg <- str_c("org.", species, ".eg.db")

# check if the package is installed in the fgsea environment and load,
# give a useful error message otherwise; installed per default are:
# * org.Mm.eg.db
# * org.Hs.eg.db
if ( pkg %in% rownames( installed.packages() ) ) {
    library(pkg, character.only = TRUE)
} else {
    stop(
        str_c(
            "\n",
            "Package not installed: ", pkg, "\n",
            "Package name inferred from species name: ", species, "\n",
            "Check if package 'bioconductor-", pkg, "' exists and add to envs/fgsea.yaml\n",
            "\n"
        )
    )
}


gene_sets <- gmtPathways(snakemake@input[["gene_sets"]])
diffexp <- read_tsv(snakemake@input[["diffexp"]]) %>%
                  drop_na(ext_gene) %>%
                  mutate(ext_gene = str_to_upper(ext_gene)) %>%
                  group_by(ext_gene) %>%
                  filter( qval == min(qval, na.rm = TRUE) ) %>%
                  mutate(target_id = str_c(target_id, collapse=",")) %>%
                  mutate(ens_gene = str_c(ens_gene, collapse=",")) %>%
                  distinct()

signed_pi <- get_prefix_col("signed_pi_value", colnames(diffexp))

ranked_genes <- diffexp %>%
                  dplyr::select(ext_gene, !!signed_pi) %>%
                  deframe()

# get and write out rank values that are tied -- a way to check up on respective warnings
rank_ties <- enframe(ranked_genes) %>%
               mutate(dup = duplicated(value) | duplicated(value, fromLast = TRUE) ) %>%
               filter(dup == TRUE) %>%
               dplyr::select(ext_gene = name, !!signed_pi := value)
write_tsv(rank_ties, snakemake@output[["rank_ties"]])

fgsea_res <- fgsea(pathways = gene_sets, 
                    stats = ranked_genes,
                    minSize=10,
                    maxSize=700,
                    nproc=snakemake@threads,
                    nperm=snakemake@params[["nperm"]]
                    ) %>%
                as_tibble() 

# Annotation is impossible without any entries, so then just write out empty files
if ( (fgsea_res %>% count() %>% pull(n)) == 0 ) {

    write_tsv(fgsea_res, path = snakemake@output[["enrichment"]])
    write_tsv(fgsea_res, path = snakemake@output[["significant"]])

} else {

    # add further annotation
    annotated <- fgsea_res %>%
                    unnest(leadingEdge) %>%
                    mutate(
                        leading_edge_symbol = str_to_title(leadingEdge),
                        leading_edge_entrez_id = mapIds(leading_edge_symbol, x=get(pkg), keytype="SYMBOL", column="ENTREZID"),
                        leading_edge_ens_gene = mapIds(leading_edge_symbol, x=get(pkg), keytype="SYMBOL", column="ENSEMBL")
                        ) %>%
                    group_by(pathway) %>%
                    summarise(
                        leadingEdge = str_c(leadingEdge, collapse = ','),
                        leading_edge_symbol = str_c(leading_edge_symbol, collapse = ','),
                        leading_edge_entrez_id = str_c(leading_edge_entrez_id, collapse = ','),
                        leading_edge_ens_gene = str_c(leading_edge_ens_gene, collapse = ',')
                    ) %>%
                    inner_join(fgsea_res %>% select(-leadingEdge), by = "pathway") %>%
                    select(-leadingEdge, -leading_edge_symbol,
                           -leading_edge_entrez_id, -leading_edge_ens_gene,
                            leading_edge_symbol, leading_edge_ens_gene,
                            leading_edge_entrez_id, leadingEdge)

    # write out fgsea results for all gene sets
    write_tsv(annotated, path = snakemake@output[["enrichment"]])

    # select significant pathways
    sig_gene_sets <- annotated %>%
                       filter( padj < snakemake@params[["gene_set_fdr"]] )

    # write out fgsea results for gene sets found to be significant
    write_tsv(sig_gene_sets, path = snakemake@output[["significant"]])
}

height = .7 * (length(gene_sets) + 2)

# table plot of all gene sets
tg <- plotGseaTable(
            pathway = gene_sets,
            stats = ranked_genes,
            fgseaRes = fgsea_res,
            gseaParam = 1,
            render = FALSE
        )
ggsave(filename = snakemake@output[["plot"]], plot = tg, width = 12, height = height, limitsize=FALSE)
import sys
sys.stderr = open(snakemake.log[0], "w")

import pandas as pd
import matplotlib.pyplot as plt
from goatools.obo_parser import GODag
from goatools.anno.idtogos_reader import IdToGosReader
from goatools.goea.go_enrichment_ns import GOEnrichmentStudyNS
from goatools.godag_plot import plot_results#, plot_goid2goobj, plot_gos


# read in directed acyclic graph of GO terms / IDs
obodag = GODag(snakemake.input.obo)


# read in mapping gene ids from input to GO terms / IDs
objanno = IdToGosReader(snakemake.input.ens_gene_to_go, godag = obodag)


# extract the mapping of GO namespace -> { gene id -> GO ids }
ns2assoc = objanno.get_ns2assc()

for nspc, id2gos in ns2assoc.items():
    print("{NS} {N:,} annotated genes".format(NS=nspc, N=len(id2gos)))

# read gene diffexp table
all_genes = pd.read_table(snakemake.input.diffexp)

# select genes significantly differentially expressed according to BH FDR of sleuth
fdr_level_gene = float(snakemake.params.gene_fdr)
sig_genes = all_genes[all_genes['qval']<fdr_level_gene]


# initialize GOEA object
fdr_level_go_term = float(snakemake.params.go_term_fdr)

goeaobj = GOEnrichmentStudyNS(
    # list of 'population' of genes looked at in total
    pop = all_genes['ens_gene'].tolist(), 
    # geneid -> GO ID mapping
    ns2assoc = ns2assoc, 
    # ontology DAG
    godag = obodag, 
    propagate_counts = False,
    # multiple testing correction method (fdr_bh is false discovery rate control with Benjamini-Hochberg)
    methods = ['fdr_bh'],
    # significance cutoff for method named above
    alpha = fdr_level_go_term
    )

goea_results_all = goeaobj.run_study(sig_genes['ens_gene'].tolist())


# write results to text file
goeaobj.wr_tsv(snakemake.output.enrichment, goea_results_all)


# plot results
ensembl_id_to_symbol = dict(zip(all_genes['ens_gene'], all_genes['ext_gene']))


# from first plot output file name, create generic file name to trigger
# separate plots for each of the gene ontology name spaces
outplot_generic = snakemake.output.plot[0].replace('_BP.','_{NS}.', 1).replace('_CC.','_{NS}.', 1).replace('_MF.', '_{NS}.', 1)

goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < fdr_level_go_term]

plot_results(
    outplot_generic,
    # use pvals for coloring
    goea_results=goea_results_sig,
    # print general gene symbol instead of Ensembl ID
    id2symbol=ensembl_id_to_symbol, 
    # number of genes to print, or True for printing all (including count of genes)
    study_items=True,
    # number of genes to print per line
    items_p_line=6,
    # p-values to use for coloring of GO term nodes (argument name determined from code and testing against value "p_uncorrected")
    pval_name="p_fdr_bh"
    )

# for all name spaces
for ns in ns2assoc.keys():
    # check if no GO terms were found to be significant
    if len([r for r in goea_results_sig if r.NS == ns]) == 0:
        fig = plt.figure(figsize=(12, 2))
        text = fig.text(0.5, 0.5,
                        "No plot generated, because no GO terms were found significant\n"
                        "for name space {} and significance levels: genes ({}), GO terms ({}).\n"
                        "You might want to check those levels and/or your intermediate data.".format(
                        ns, fdr_level_gene, fdr_level_go_term),
                        ha='center', va='center', size=20)
        fig.savefig( outplot_generic.replace('_{NS}.', "_{}.".format(ns)) )
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")

so <- sleuth_load(snakemake@input[[1]])

pdf(file = snakemake@output[[1]])
plot_bootstrap(so, snakemake@wildcards[["transcript"]], color_by = snakemake@params[["color_by"]], units = "tpm")
dev.off()
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")

so <- sleuth_load(snakemake@input[["so"]])

diffexp <- read.table(snakemake@input[["diffexp"]], sep = "\t", header = TRUE)

pdf(file = snakemake@output[[1]], width = 14)
plot_transcript_heatmap(so, transcripts = diffexp$target_id[1:20]) 
# TODO, once pachterlab/sleuth#214 is merged, add this to get gene names
# , labels_row = diffexp$ext_gene[1:20])
dev.off()
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")
library("ggplot2")
library("tidyr")

diffexp <- sleuth_load(snakemake@input[["diffexp_rds"]]) %>% drop_na(pval)

ggplot(diffexp) + geom_histogram(aes(pval), bins = 100)
ggsave(file = snakemake@output[[1]], width = 14)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")

so <- sleuth_load(snakemake@input[[1]])

pdf(file = snakemake@output[[1]])
plot_fld(so, paste0(snakemake@wildcards[["sample"]], "-", snakemake@wildcards[["unit"]]))
dev.off()
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")

so <- sleuth_load(snakemake@input[[1]])
pdf(file = snakemake@output[[1]])
plot_pca(so, color_by = snakemake@wildcards[["covariate"]], units = "tpm", text_labels = TRUE)
dev.off()
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")

sl <- snakemake@params[['sig_level']]

so <- sleuth_load(snakemake@input[[1]])

# colors from colorblind-safe Brewer palette "Dark2":
# http://colorbrewer2.org/#type=qualitative&scheme=Dark2&n=3
cb_safe_green <- "#1B9E77"
cb_safe_red <- "#D95F02"

p <- plot_vars(so,
        test = NULL,
        test_type = "lrt",
        which_model = "full",
        sig_level = sl,
        point_alpha = 0.2,
        sig_color = cb_safe_red,
        xy_line = TRUE,
        xy_line_color = cb_safe_red,
        highlight = NULL,
        highlight_color = cb_safe_green
    )
ggsave(filename = snakemake@output[[1]], width = 7, height = 7)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")
library("tidyverse")

model <- snakemake@params[["model"]]

so <- sleuth_load(snakemake@input[[1]])

so <- sleuth_fit(so, as.formula(model[["full"]]), 'full')
so <- sleuth_fit(so, as.formula(model[["reduced"]]), 'reduced')
so <- sleuth_lrt(so, "reduced", "full")

write_results <- function(mode, output, output_all) {
    aggregate <- FALSE
    if(mode == "aggregate") {
        aggregate <- TRUE
    }
    print("Performing likelihood ratio test")
    all <- sleuth_results(so, "reduced:full", "lrt", show_all = TRUE, pval_aggregate = aggregate) %>%
            arrange(pval)

    covariates <- colnames(design_matrix(so, "full"))
    covariates <- covariates[covariates != "(Intercept)"]

    # iterate over all covariates and perform wald test in order to obtain beta estimates
    if(!aggregate) {
        for(covariate in covariates) {
            print(str_c("Performing wald test for ", covariate))
            so <- sleuth_wt(so, covariate, "full")

            beta_col_name <- str_c("b", covariate, sep = "_")
            beta_se_col_name <- str_c(beta_col_name, "se", sep = "_")
            all_wald <- sleuth_results(so, covariate, "wt", show_all = TRUE, pval_aggregate = FALSE) %>%
                        select( target_id = target_id,
                                !!beta_col_name := b,
                                !!beta_se_col_name := se_b)
            signed_pi_col_name <- str_c("signed_pi_value", covariate, sep = "_")
            all <- inner_join(all, all_wald, by = "target_id") %>%
                    # calculate a signed version of the pi-value from:
                    # https://dx.doi.org/10.1093/bioinformatics/btr671
                    # e.g. useful for GSEA ranking
                    mutate( !!signed_pi_col_name := -log10(pval) * !!sym(beta_col_name) )
        }
    }

    if(mode == "mostsignificant") {
        # for each gene, select the most significant transcript (this is equivalent to sleuth_gene_table, but with bug fixes)
        all <- all %>%
                drop_na(ens_gene) %>%
                group_by(ens_gene) %>%
                filter( qval == min(qval, na.rm = TRUE) ) %>%
                # ties in qval (e.g. min(qval) == 1) can lead to multiple entries per gene
                filter( pval == min(pval, na.rm = TRUE) ) %>%
                # for min(qval) == 1, then usually also min(pval) == 1, and
                # we need something else to select a unique entry per gene
                filter( mean_obs == max(mean_obs, na.rm = TRUE) ) %>%
                # sometimes we still get two transcript variants with exactly
                # the same values, i.e. they have exactly the same sequence
                # but (slightly) different annotations -- then we retain a string
                # with a comma-separated list of them
                mutate(target_id = str_c(target_id, collapse=",")) %>%
                distinct() %>%
                # useful sort for scrolling through output by increasing q-values
                arrange(qval)
    }

    write_tsv(all, path = output, quote_escape = "none")
    write_rds(all, path = output_all, compress = "none")
}

write_results("transcripts", snakemake@output[["transcripts"]], snakemake@output[["transcripts_rds"]])
write_results("aggregate", snakemake@output[["genes_aggregated"]], snakemake@output[["genes_aggregated_rds"]])
write_results("mostsignificant", snakemake@output[["genes_mostsigtrans"]], snakemake@output[["genes_mostsigtrans_rds"]])
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")
library("biomaRt")
library("tidyverse")

model <- snakemake@params[["model"]]

mart <- biomaRt::useMart(
            biomart = "ENSEMBL_MART_ENSEMBL",
            dataset = str_c(snakemake@params[["species"]], "_gene_ensembl"),
            host = 'ensembl.org'
            )
t2g <- biomaRt::getBM(
            attributes = c( "ensembl_transcript_id",
                            "ensembl_gene_id",
                            "external_gene_name"),
            mart = mart
            ) %>%
        rename( target_id = ensembl_transcript_id,
                ens_gene = ensembl_gene_id,
                ext_gene = external_gene_name
                )

samples <- read_tsv(snakemake@input[["samples"]], na = "", col_names = TRUE) %>%
            # make everything except the index, sample name and path string a factor
            mutate_at(  vars(-X1, -sample, -path),
                        list(~factor(.))
                        )

if(!is.null(snakemake@params[["exclude"]])) {
    samples <- samples %>%
                filter( !sample %in% snakemake@params[["exclude"]] )
}

if(!is.null(model)) {
    # retrieve the model formula
    formula <- as.formula(snakemake@params[["model"]])
    # extract variables from the formula and unnest any nested variables
    variables <- labels(terms(formula)) %>%
                    strsplit('[:*]') %>%
                    unlist()
    # select the columns required by sleuth and filter to all samples where
    # none of the given variables are NA
    samples <- samples %>%
                select(sample, path, condition, variables) %>%
                drop_na()
}

so <- sleuth_prep(  samples,
                    extra_bootstrap_summary = TRUE,
                    target_mapping = t2g,
                    aggregation_column = "ens_gene",
                    read_bootstrap_tpm = TRUE,
                    transform_fun_counts = function(x) log2(x + 0.5)
                    )

custom_transcripts <- so$obs_raw %>%
                        # find transcripts not in the target_mapping
                        filter(!target_id %in% so$target_mapping$target_id) %>%
                        # make it a succinct list with no repetitions
                        distinct(target_id) %>%
                        # pull it out into a vector
                        pull(target_id)
so$target_mapping <- so$target_mapping %>%
                        # add those custom transcripts as rows to the target mapping
                        add_row(ens_gene = NA, ext_gene = "Custom", target_id = custom_transcripts)

sleuth_save(so, snakemake@output[[1]])
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("sleuth")

so <- sleuth_load(snakemake@input[[1]])

tpm <- sleuth_to_matrix(so, "obs_norm", "tpm")
target_mapping <- so$target_mapping
rownames(target_mapping) <- target_mapping$target_id
tpm <- cbind(ext_gene = target_mapping[rownames(tpm), "ext_gene"], tpm)

write.table(tpm, file = snakemake@output[[1]], col.names = NA, row.names = TRUE)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("SPIA")
library("graphite")
library("AnnotationDbi")

# provides library("tidyverse") and function get_prefix_col()
# the latter requires snakemake@output[["samples"]] and
# snakemake@params[["covariate"]]
source( file.path(snakemake@scriptdir, 'common.R') )

options(Ncpus = snakemake@threads)

pw_db <- snakemake@params[["pathway_db"]]

db <- pathways(snakemake@params[["species"]], pw_db)
db <- convertIdentifiers(db, "ENSEMBL")

prepareSPIA(db, pw_db)


diffexp <- read_tsv(snakemake@input[["diffexp"]]) %>%
            drop_na(ens_gene) %>%
            mutate(ens_gene = str_c("ENSEMBL", ens_gene))
universe <- diffexp %>% pull(var = ens_gene)
sig_genes <- diffexp %>% filter(qval <= 0.05)

# get logFC equivalent (the sum of beta scores of covariates of interest)

beta_col <- get_prefix_col("b", colnames(sig_genes))

beta <- sig_genes %>%
            select(ens_gene, !!beta_col) %>%
            deframe()

res <- runSPIA(de = beta, all = universe, pw_db, plots = TRUE)

write_tsv(res, snakemake@output[[1]])
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "cutadapt"
    " {snakemake.params}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}")
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "cutadapt"
    " {snakemake.params}"
    " -o {snakemake.output.fastq}"
    " {snakemake.input[0]}"
    " > {snakemake.output.qc} {log}")


Name: rna-seq-kallisto-sleuth
Version: 5
URL: https://github.com/fuzhiliang/rna-seq-kallisto-sleuth
License: MIT License