RASflow: RNA-Seq Analysis Snakemake Workflow

public public 1yr ago Version: Version 1 0 bookmarks

RASflow: RNA-Seq Analysis Snakemake Workflow

RASflow is a modular, flexible and user-friendly RNA-Seq analysis workflow.

RASflow can be applied to both model and non-model organisms. It supports mapping RNA-Seq raw reads to both genome and transcriptome (can be downloaded from public database or can be homemade by users) and it can do both transcript- and gene-level Differential Expression Analysis (DEA) when transcriptome is used as mapping reference. It requires little programming skill for basic use. If you're good at programming, you can do more magic with RASflow!

Code Snippets

 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
import yaml
import pandas as pd
import numpy as np
# import ipdb

def load_globals():

    global samples; global groups; global config; global input_path; global output_path

    with open('configs/config_main.yaml') as yamlfile:
        config = yaml.load(yamlfile)

    samples = np.array(pd.read_table(config["METAFILE"], header = 0)['sample'])

    groups = np.array(pd.read_table(config["METAFILE"], header = 0)['group'])

    input_path = config["FINALOUTPUT"] + "/" + config["PROJECT"] + "/genome/countFile"
    output_path = config["FINALOUTPUT"] + "/" + config["PROJECT"] + "/genome/dea"


def main():

    load_globals()

    groups_unique = np.unique(groups)

    num_samples = len(samples)
    num_groups_unique = len(groups_unique)

    id_list = get_id_list()

    for name_group in groups_unique:
        combine_group(name_group, id_list)

def combine_group(name_group, id_list):
    index_group = [index for index, element in enumerate(groups) if element == name_group]  # all the index of this group
    samples_group = samples[index_group]  # the samples belonging to this group

    group_count = pd.DataFrame()
    group_count['ID'] = id_list

    for sample in samples_group:
        group_count["sample_" + str(sample)] = np.array(pd.read_table(input_path + "/" + str(sample) + "_count.tsv", header = None))[:, 1]

    # write to file
    group_count.to_csv(output_path + "/countGroup/" + str(name_group) + "_gene_count.tsv", sep = "\t", header = True, index = False)

def get_id_list():

    # extract the id list from the count file
    id_list = np.array(pd.read_table(input_path + '/' + str(samples[0]) + "_count.tsv", header = None))[:, 0]
    return id_list


if __name__ == "__main__":
    main()
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
library(biomaRt)
library(yaml)
library(tximport)

# ====================== define some functions ======================

# remove the version in the transcript ID
remove_version <- function(files) {  # input files (file names with directory) are output from Salmon
  for (i in c(1:length(files))) {
    quant.file <- files[i]
    quant.table <- read.table(quant.file, header = TRUE, stringsAsFactors = FALSE)
    trans.id.version <- quant.table$Name
    trans.id <- rep('ID', length(trans.id.version))
    for (j in c(1:length(trans.id.version))) {
      trans.id[j] <- strsplit(trans.id.version[j], ".", fixed = TRUE)[[1]][1]
    }
    quant.table$Name <- trans.id
    write.table(quant.table, file.path(input.path, samples[i], "quant_noVersion.sf"), sep = "\t", quote = FALSE, row.names = FALSE)
  }
  return(trans.id)
}

# ====================== load parameters in config file ======================
# load the config file
yaml.file <- yaml.load_file('configs/config_main.yaml')

# extract the information from the yaml file
project <- yaml.file$PROJECT  # project name
input.path <- file.path(yaml.file$FINALOUTPUT, project, "trans/quant")
gene.level <- yaml.file$GENE_LEVEL
controls <- yaml.file$CONTROL  # all groups used as control
treats <- yaml.file$TREAT  # all groups used as treat, should correspond to control
meta.file <- yaml.file$METAFILE
ENSEMBL <- yaml.file$ENSEMBL
dataset <- yaml.file$EnsemblDataSet
tx2gene.file <- yaml.file$TX2GENE
output.path <- file.path(yaml.file$FINALOUTPUT, project, "trans/dea")

num.control <- length(controls)  # number of comparisons that the user wants to do
num.treat <- length(treats)  # should equals to num.control

if (num.control != num.treat) {
  message("Error: Control groups don't mathch with treat groups!")
  message("Please check config_dea.yaml")
  quit(save = 'no')
}

# extract the metadata
meta.data <- read.table(meta.file, header = TRUE, sep = '\t', stringsAsFactors = FALSE)
samples <- meta.data$sample
group.all <- meta.data$group
subject.all <- meta.data$subject

# ====================== prepare the quant files ======================

# the original quant files from Salmon
files <- file.path(input.path, samples, "quant.sf")
names(files) <- samples

# get the normalized abundance tables for all samples (scaled using the average transcript length over samples and then the library size (lengthScaledTPM))

# ====================== prepare the tx2gene table ======================
if (gene.level) {
    if (ENSEMBL) {
        ensembl <- useEnsembl(biomart = "ensembl", dataset = dataset)
        datasets <- listDatasets(ensembl)

        attributes <- listAttributes(mart = ensembl)

        # remove the version to get more matches
        trans.id <- remove_version(files)

        files.noVersion <- file.path(input.path, samples, "quant_noVersion.sf")
        names(files.noVersion) <- samples

        tx2gene <- getBM(attributes=c('ensembl_transcript_id', 'ensembl_gene_id'),
                        filters = 'ensembl_transcript_id', values = trans.id, mart = ensembl)
    } else {
        tx2gene <- read.csv(tx2gene.file, header = FALSE, sep = "\t")
    }
    # save tx2gene
    output.file.tx2gene <- file.path(output.path, "countGroup", 'tx2gene.RData')
    save(tx2gene, file = output.file.tx2gene)
}

# ====================== get raw and normalized abundance tables ======================

trans.matrix <- tximport(files, type = "salmon", txOut = TRUE, countsFromAbundance = "no")
trans.count <- trans.matrix$counts

trans.matrix.tpm <- tximport(files, type = "salmon", txOut = TRUE, countsFromAbundance = "lengthScaledTPM")
trans.count.tpm <- trans.matrix.tpm$counts

if (gene.level) {
    if (ENSEMBL) {
        gene.matrix <- tximport(files.noVersion, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "no")
        gene.count <- gene.matrix$counts

        gene.matrix.tpm <- tximport(files.noVersion, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
        gene.count.tpm <- gene.matrix.tpm$counts
    } else {
        gene.matrix <- tximport(files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "no")
        gene.count <- gene.matrix$counts

        gene.matrix.tpm <- tximport(files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
        gene.count.tpm <- gene.matrix.tpm$counts
    }
}

# ====================== combine the samples to groups ======================

for (group in unique(group.all)) {
  index.group = which(group.all == group)  # all the index of this group
  samples.group = samples[index.group]  # the samples belonging to this group

  group.count.trans <- trans.count[, index.group]
  group.count.trans.tpm <- trans.count.tpm[, index.group]

  if (gene.level) {
    group.count.gene <- gene.count[, index.group]
    group.count.gene.tpm <- gene.count.tpm[, index.group]
  }

  # write to files
  output.file.trans <- file.path(output.path, "countGroup", paste(group, "_trans_abundance.tsv", sep = ""))
  write.table(group.count.trans, output.file.trans, sep = '\t', quote = FALSE, row.names = TRUE, col.names = TRUE)

  output.file.trans.norm <- file.path(output.path, "countGroup", paste(group, "_trans_norm.tsv", sep = ""))
  write.table(group.count.trans.tpm, output.file.trans.norm, sep = '\t', quote = FALSE, row.names = TRUE, col.names = TRUE)

  if (gene.level) {
    output.file.gene <- file.path(output.path, "countGroup", paste(group, "_gene_abundance.tsv", sep = ""))
    write.table(group.count.gene, output.file.gene, sep = '\t', quote = FALSE, row.names = TRUE, col.names = TRUE)

    output.file.gene.norm <- file.path(output.path, "countGroup", paste(group, "_gene_norm.tsv", sep = ""))
    write.table(group.count.gene.tpm, output.file.gene.norm, sep = '\t', quote = FALSE, row.names = TRUE, col.names = TRUE)
  }
}
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
library(yaml)
library(edgeR)
library(DESeq2)

# ====================== define the function of DEA ======================

DEA <- function(control, treat) {
  count.control <- read.table(paste(output.path, '/countGroup/', control, '_gene_count.tsv', sep = ''), header = TRUE, row.names = 1)
  count.treat <- read.table(paste(output.path, '/countGroup/', treat, '_gene_count.tsv', sep = ''), header = TRUE, row.names = 1)
  count.table <- cbind(count.control, count.treat)  # merge the control and treat tables together

  # number of samples in control and treat groups (should be the same if it's a pair test)
  num.sample <- ncol(count.table)
  num.sample.control <- ncol(count.control)
  num.sample.treat <- ncol(count.treat)

  # samples of two groups
  sample.control <- colnames(count.control)
  sample.treat <- colnames(count.treat)

  # save gene list in gene.list for extracting gene names later
  gene.list <- rownames(count.table)

  # get the sample id
  samples <- colnames(count.table)

  # define the subject and group; define the reference group (control)
  subject <- factor(subject.all[c(which(group.all == control), which(group.all == treat))])
  group <- relevel(factor(group.all[c(which(group.all == control), which(group.all == treat))]), ref = control)

  # The design matrix
  if (pair.test) {
    design <- model.matrix(~subject+group)
  } else {
    design <- model.matrix(~group)
  }

  if (dea.tool == 'edgeR') {  # use edgeR for DEA
    # normalize the two groups and save the normalized count table
    y.control <- DGEList(counts = count.control, genes = gene.list)
    y.treat <- DGEList(counts = count.treat, genes = gene.list)

    y.control <- calcNormFactors(y.control, method="TMM")
    count.table.control.norm <- cpm(y.control)
    write.table(count.table.control.norm, paste(output.path, '/countGroup/', control, '_gene_norm.tsv', sep = ''), quote = FALSE, sep = "\t")

    y.treat <- calcNormFactors(y.treat, method="TMM")
    count.table.treat.norm <- cpm(y.treat)
    write.table(count.table.treat.norm, paste(output.path, '/countGroup/', treat, '_gene_norm.tsv', sep = ''), quote = FALSE, sep = "\t")

    # Put the data into a DGEList object
    y <- DGEList(counts = count.table, genes = gene.list)

    # do DEA

    # Filtering
    if (filter.need) {
      countsPerMillion <- cpm(y)
      countCheck <- countsPerMillion > 1
      keep <- which(rowSums(countCheck) > 1)
      y <- y[keep, ]
    }

    # Normalization
    y <- calcNormFactors(y, method="TMM")

    y$samples$subject <- subject
    y$samples$group <- group

    rownames(design) <- colnames(y)

    # Estimating the dispersion

    # estimate the NB dispersion for the dataset
    y <- estimateDisp(y, design, robust = TRUE)

    # Differential expression

    # determine differentially expressed genes
    # fit genewise glms
    fit <- glmFit(y, design)

    # conduct likelihood ratio tests for tumour vs normal tissue differences and show the top genes
    lrt <- glmLRT(fit)

    # the DEA result for all the genes
    # dea <- lrt$table
    toptag <- topTags(lrt, n = nrow(y$genes), p.value = 1)
    dea <- toptag$table  # just to add one more column of FDR
    dea <- dea[order(dea$FDR, -abs(dea$logFC), decreasing = FALSE), ]  # sort the table: ascending of FDR then descending of absolute valued of logFC

    # differentially expressed genes
    toptag <- topTags(lrt, n = nrow(y$genes), p.value = fdr.thr)
    deg <- toptag$table
    if (!is.null(deg)) {
      deg <- deg[order(deg$FDR, -abs(deg$logFC), decreasing = FALSE), ]  # sort the table: ascending of FDR then descending of absolute valued of logFC
    }

    # save the DEA result and DEGs to files
    write.table(dea, paste(output.path, '/DEA/dea_', control, '_', treat, '.tsv', sep = ''), row.names = F, quote = FALSE, sep = '\t')
    write.table(deg, paste(output.path, '/DEA/deg_', control, '_', treat, '.tsv', sep = ''), row.names = F, quote = FALSE, sep = '\t') 
  } else if (dea.tool == "DESeq2") {  # use DESeq2 for DEA

    ## create the DESeqDataSet
    colData = data.frame(samples, subject, group)
    dds <- DESeqDataSetFromMatrix(count.table, colData = colData, design = design)

    # generate normalized counts
    dds <- estimateSizeFactors(dds)
    normalized_counts <- counts(dds, normalized=TRUE)

    normalized_counts.control <- normalized_counts[, which(colnames(normalized_counts) %in% sample.control)]
    write.table(normalized_counts.control, paste(output.path, '/countGroup/', control, '_gene_norm.tsv', sep = ''), quote = FALSE, sep = "\t")

    normalized_counts.treat <- normalized_counts[, which(colnames(normalized_counts) %in% sample.treat)]
    write.table(normalized_counts.treat, paste(output.path, '/countGroup/', treat, '_gene_norm.tsv', sep = ''), quote = FALSE, sep = "\t")

    ## Filtering
    if (filter.need) {
      keep <- rowSums(counts(dds)) >= 10
      dds <- dds[keep,]
    }

    ## perform DEA
    dds <- DESeq(dds)

    ## export the results
    res.dea <- results(dds)
    res.dea <- res.dea[complete.cases(res.dea), ]  # remove any rows with NA

    dea <- as.data.frame(res.dea)
    dea <- dea[order(dea$padj, -abs(dea$log2FoldChange), decreasing = FALSE), ]  # sort the table: ascending of FDR then descending of absolute valued of logFC

    deg <- dea[dea$padj < fdr.thr, ]
    if (nrow(deg) > 1) {
      deg <- deg[order(deg$padj, -abs(deg$log2FoldChange), decreasing = FALSE), ]  # sort the table: ascending of FDR then descending of absolute valued of logFC
    }

    # save the DEA result and DEGs to files
    write.table(dea, paste(output.path, '/DEA/dea_', control, '_', treat, '.tsv', sep = ''), row.names = T, quote = FALSE, sep = '\t')
    write.table(deg, paste(output.path, '/DEA/deg_', control, '_', treat, '.tsv', sep = ''), row.names = T, quote = FALSE, sep = '\t')
  }
}

# load the config file
yaml.file <- yaml.load_file('configs/config_main.yaml')

# extract the information from the yaml file
project <- yaml.file$PROJECT  # project name of this analysis
dea.tool <- yaml.file$DEATOOL  # tool used for DEA
controls <- yaml.file$CONTROL  # all groups used as control
treats <- yaml.file$TREAT  # all groups used as treat, should correspond to control
filter.need <- yaml.file$FILTER$yesOrNo
pair.test <- yaml.file$PAIR
fdr.thr <- yaml.file$FDR  # threshold of FDR/adjusted P-value for significantlly differentially expressed genes
meta.file <- yaml.file$METAFILE
output.path <- file.path(yaml.file$FINALOUTPUT, project, "genome/dea")

# extract the metadata
meta.data <- read.csv(meta.file, header = TRUE, sep = '\t')
group.all <- meta.data$group
subject.all <- meta.data$subject

num.control <- length(controls)  # number of comparisons that the user wants to do
num.treat <- length(treats)  # should equals to num.control

if (num.control != num.treat) {
  message("Error: Control groups don't mathch with treat groups!")
  message("Please check config_dea.yaml")
  quit(save = 'no')
}

num.comparison <- num.control

# Do DEA

# the main function
for (ith.comparison in c(1:num.comparison)) {
  control <- controls[ith.comparison]
  treat <- treats[ith.comparison]
  DEA(control, treat)
}
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
library(biomaRt)
library(yaml)
library(edgeR)
library(DESeq2)
library(tximport)

# ====================== define the function of DEA ======================

DEA <- function(control, treat, file.control, file.treat, output.path.dea) {
  count.control <- read.table(file.control, header = TRUE, row.names = 1, check.names = FALSE)
  count.treat <- read.table(file.treat, header = TRUE, row.names = 1, check.names = FALSE)
  count.table <- cbind(count.control, count.treat)  # merge the control and treat tables together

  # number of samples in control and treat groups (should be the same if it's a pair test)
  num.sample.control <- ncol(count.control)
  num.sample.treat <- ncol(count.treat)

  # save gene list in gene.list for extracting gene names later
  gene.list <- rownames(count.table)

  # get the sample id
  samples <- colnames(count.table)

  # define the subject and group; define the reference group (control)
  subject <- factor(subject.all[c(which(group.all == control), which(group.all == treat))])
  group <- relevel(factor(group.all[c(which(group.all == control), which(group.all == treat))]), ref = control)

  # The design matrix
  if (pair.test) {
    design <- model.matrix(~subject+group)
  } else {
    design <- model.matrix(~group)
  }

  if (dea.tool == 'edgeR') {  # use edgeR for DEA
    # Put the data into a DGEList object
    y <- DGEList(counts = count.table, genes = gene.list)

    # Filtering
    if (filter.need) {
      countsPerMillion <- cpm(y)
      countCheck <- countsPerMillion > 1
      keep <- which(rowSums(countCheck) > 1)
      y <- y[keep, ]
    }

    # Normalization
    y <- calcNormFactors(y, method="TMM")

    y$samples$subject <- subject
    y$samples$group <- group


    rownames(design) <- colnames(y)

    # Estimating the dispersion

    # estimate the NB dispersion for the dataset
    y <- estimateDisp(y, design, robust = TRUE)

    # Differential expression

    # determine differentially expressed genes
    # fit genewise glms
    fit <- glmFit(y, design)

    # conduct likelihood ratio tests for tumour vs normal tissue differences and show the top genes
    lrt <- glmLRT(fit)

    # the DEA result for all the genes
    # dea <- lrt$table
    toptag <- topTags(lrt, n = nrow(y$genes), p.value = 1)
    dea <- toptag$table  # just to add one more column of FDR
    dea <- dea[order(dea$FDR, -abs(dea$logFC), decreasing = FALSE), ]  # sort the table: ascending of FDR then descending of absolute valued of logFC

    # differentially expressed genes
    toptag <- topTags(lrt, n = nrow(y$genes), p.value = fdr.thr)
    deg <- toptag$table
    if (!is.null(deg)) {
      deg <- deg[order(deg$FDR, -abs(deg$logFC), decreasing = FALSE), ]  # sort the table: ascending of FDR then descending of absolute valued of logFC
    }

    # save the DEA result and DEGs to files
    write.table(dea, paste(output.path.dea, '/dea_', control, '_', treat, '.tsv', sep = ''), row.names = F, quote = FALSE, sep = '\t')
    write.table(deg, paste(output.path.dea, '/deg_', control, '_', treat, '.tsv', sep = ''), row.names = F, quote = FALSE, sep = '\t')
  } else if (dea.tool == "DESeq2") {  # use DESeq2 for DEA

    ## prepare txi

    if (gene.level & gene.level.flag) {
      ### load tx2gene
      load(file.path(output.path, "countGroup", 'tx2gene.RData'))

      if (ENSEMBL) {
          files.noVersion <- file.path(quant.path, samples, "quant_noVersion.sf")
          names(files.noVersion) <- samples

          ### import quantification as txi
          txi <- tximport(files.noVersion, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "no")
      } else {
          ### the original quant files from Salmon
          files <- file.path(quant.path, samples, "quant.sf")
          names(files) <- samples

          ### import quantification as txi
          txi <- tximport(files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "no")
      }
    } else {
      ### the original quant files from Salmon
      files <- file.path(quant.path, samples, "quant.sf")
      names(files) <- samples

      ### import quantification as txi
      txi <- tximport(files, type = "salmon", txOut = TRUE, countsFromAbundance = "no")
    }

    ## create the DESeqDataSet
    colData = data.frame(samples, subject, group)
    dds <- DESeqDataSetFromTximport(txi, colData = colData, design = design)

    # Filtering
    if (filter.need) {
      keep <- rowSums(counts(dds)) >= 10
      dds <- dds[keep,]
    }

    ## specify the control group
    dds$group <- relevel(dds$group, ref = control)

    ## perform DEA
    dds <- DESeq(dds)

    ## export the results
    res.dea <- results(dds)
    res.dea <- res.dea[complete.cases(res.dea), ]  # remove any rows with NA

    dea <- as.data.frame(res.dea)
    dea <- dea[order(dea$padj, -abs(dea$log2FoldChange), decreasing = FALSE), ]  # sort the table: ascending of FDR then descending of absolute valued of logFC

    deg <- dea[dea$padj < fdr.thr, ]
    if (nrow(deg) > 1) {
      deg <- deg[order(deg$padj, -abs(deg$log2FoldChange), decreasing = FALSE), ]  # sort the table: ascending of FDR then descending of absolute valued of logFC
    }    

    # save the DEA result and DEGs to files
    write.table(dea, paste(output.path.dea, '/dea_', control, '_', treat, '.tsv', sep = ''), row.names = T, quote = FALSE, sep = '\t')
    write.table(deg, paste(output.path.dea, '/deg_', control, '_', treat, '.tsv', sep = ''), row.names = T, quote = FALSE, sep = '\t')
  }
}

# ====================== load parameters in config file ======================

# load the config file
yaml.file <- yaml.load_file('configs/config_main.yaml')

# extract the information from the yaml file
project <- yaml.file$PROJECT  # project name
dea.tool <- yaml.file$DEATOOL  # tool used for DEA
quant.path <- file.path(yaml.file$FINALOUTPUT, project, "trans/quant")
gene.level <- yaml.file$GENE_LEVEL  # whether to do gene-level analysis
controls <- yaml.file$CONTROL  # all groups used as control
treats <- yaml.file$TREAT  # all groups used as treat, should correspond to control
filter.need <- yaml.file$FILTER$yesOrNo
pair.test <- yaml.file$PAIR
meta.file <- yaml.file$METAFILE
ENSEMBL <- yaml.file$ENSEMBL
dataset <- yaml.file$EnsemblDataSet
output.path <- file.path(yaml.file$FINALOUTPUT, project, "trans/dea")
fdr.thr <- yaml.file$FDR  # threshold of FDR/adjusted P-value for significantlly differentially expressed genes
num.control <- length(controls)  # number of comparisons that the user wants to do
num.treat <- length(treats)  # should equals to num.control

if (num.control != num.treat) {
  message("Error: Control groups don't mathch with treat groups!")
  message("Please check config_dea.yaml")
  quit(save = 'no')
}

num.comparison <- num.control

# extract the metadata
meta.data <- read.csv(meta.file, header = TRUE, sep = '\t', stringsAsFactors = FALSE)
samples <- meta.data$sample
group.all <- meta.data$group
subject.all <- meta.data$subject

# ====================== Do DEA ======================

for (ith.comparison in c(1:num.comparison)) {
  control <- controls[ith.comparison]
  treat <- treats[ith.comparison]

  # --------------------- On transctipt level ---------------------
  file.control <- paste(output.path, '/countGroup/', control, '_trans_abundance.tsv', sep = '')
  file.treat <- paste(output.path, '/countGroup/', treat, '_trans_abundance.tsv', sep = '')
  output.path.dea <- paste(output.path, '/DEA/transcript-level', sep = '')

  gene.level.flag <- FALSE
  DEA(control, treat, file.control, file.treat, output.path.dea)

  # --------------------- On gene level ---------------------
  if (gene.level) {
    file.control <- paste(output.path, '/countGroup/', control, '_gene_abundance.tsv', sep = '')
    file.treat <- paste(output.path, '/countGroup/', treat, '_gene_abundance.tsv', sep = '')
    output.path.dea <- paste(output.path, '/DEA/gene-level', sep = '')

    gene.level.flag <- TRUE
    DEA(control, treat, file.control, file.treat, output.path.dea)
  }
}
9
awk -F'"' '{print $2"\t"$6}' $1 > $2
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
if (!require("plotscale")) install.packages('scripts/plotscale_0.1.6.tar.gz', repos = NULL, type="source")
library(yaml)
library(hash)
library(mygene)
library(EnhancedVolcano)
library(plotscale)

# ====================== load parameters in config file ======================

# passing the params from command line
args <- commandArgs(TRUE)
norm.path <- args[1]
dea.path <- args[2]
out.path <- args[3]

# load the config file
if (length(args) > 3) {  # this script is used in  visualize_test.rules
    yaml.file <- yaml.load_file(args[4])
} else {  # this script is used in visualize.rules
    yaml.file <- yaml.load_file('configs/config_main.yaml')
}

# extract the information from the yaml file
controls <- yaml.file$CONTROL
treats <- yaml.file$TREAT
dea.tool <- yaml.file$DEATOOL

# check the number of comparisons
num.control <- length(controls)  # number of comparisons that the user wants to do
num.treat <- length(treats)  # should equals to num.control

if (num.control != num.treat) {
  message("Error: Control groups don't mathch with treat groups!")
  message("Please check config_dea.yaml")
  quit(save = 'no')
}

num.comparison <- num.control

convert.id2symbol <- function(gene.id) {
  gene.symbol <- gene.id  # initialize the gene symbol with the gene id

  # it may happen that no symbol can be found for any id. In that case, "queryMany" will throw an error
  # so "try" is used here to take care of that error
  try({
    gene.symbol.all <- queryMany(gene.id, scopes = 'ensembl.gene', fields = 'symbol')

    h <- hash()
    for (i in 1:nrow(gene.symbol.all)) {
      query <- gene.symbol.all$query[i]
      symbol <- gene.symbol.all$symbol[i]
      if (has.key(query, h)) {  # if there's duplicate for the same query
        h[[query]] <- paste(hash::values(h, keys = query), symbol, sep = ', ')
      } else {
        if (is.na(symbol)) {  # if there's no hit for the query, keep the original id
          h[[query]] <- query
        } else {
          h[[query]] <- symbol
        }
      }
    }

    for (i in c(1:length(gene.symbol))) {
      gene.symbol[i] <- h[[gene.id[i]]]
    }
  })

  return(gene.symbol)
}

# function to plot volcano plot and heatmap
plot.volcano.heatmap <- function(name.control, name.treat) {
  file.dea.table <- paste(dea.path, "/dea_", name.control, "_", name.treat, ".tsv", sep = "")
  norm.control <- paste(norm.path, "/", name.control, "_gene_norm.tsv", sep = "")  # normalized table of control
  norm.treat <- paste(norm.path, "/", name.treat, "_gene_norm.tsv", sep = "")  # normalized table of treat

  dea.table <- read.table(file.dea.table, header = TRUE, row.names = 1)
  # sort the dea table: ascending of FDR then descending of absolute valued of logFC
  if (dea.tool == 'edgeR') {
    dea.table <- dea.table[order(dea.table$FDR, -abs(dea.table$logFC), decreasing = FALSE), ]  
  } else if (dea.tool == 'DESeq2') {
    dea.table <- dea.table[order(dea.table$padj, -abs(dea.table$log2FoldChange), decreasing = FALSE), ]
  }

  gene.id.dea <- row.names(dea.table)

  # convert the gene id to gene symbol, and keep the ID if no symbol can be found for that ID
  gene.dea <- convert.id2symbol(gene.id.dea)

  # volcano plot
  dea.table.volcano <- dea.table  # for better volcano plot, 0 FDRs/padj will be changed to a very low value

  if (dea.tool == 'edgeR') {
    # change the 0 FDR to a low value (100 times smaller than the minumum non-zero value)
    FDR <- dea.table.volcano$FDR
    FDR.min.non.zero <- min(FDR[FDR>0])
    dea.table.volcano$FDR[FDR==0] <- FDR.min.non.zero/100
    ## define the range of x-axis and y-axis
    log2FC_lim <- max(abs(dea.table.volcano$logFC))
    FDR_lim <- -log10(min(dea.table.volcano$FDR)) # NAs already removed from dea.table

    fig.volcano <- EnhancedVolcano(dea.table.volcano, lab = gene.dea, xlab = bquote(~Log[2]~ "fold change"), x = 'logFC', y = 'FDR', pCutoff = 10e-5, col = c("grey30", "orange2", "royalblue", "red2"),
                                   FCcutoff = 1, xlim = c(-log2FC_lim-1, log2FC_lim+1), ylim = c(0, FDR_lim), title = NULL, subtitle = NULL)
  } else if (dea.tool == 'DESeq2') {
    # change the 0 padj to a low value (100 times smaller than the minumum non-zero value)
    padj <- dea.table.volcano$padj
    padj.min.non.zero <- min(padj[padj>0])
    dea.table.volcano$padj[padj==0] <- padj.min.non.zero/100
    ## define the range of x-axis and y-axis
    log2FC_lim <- max(abs(dea.table.volcano$log2FoldChange))
    padj_lim <- -log10(min(dea.table.volcano$padj)) # NAs already removed from dea.table

    fig.volcano <- EnhancedVolcano(dea.table.volcano, lab = gene.dea, xlab = bquote(~Log[2]~ "fold change"), x = 'log2FoldChange', y = 'padj', pCutoff = 10e-5, col = c("grey30", "orange2", "royalblue", "red2"),
                                   FCcutoff = 1, xlim = c(-log2FC_lim-1, log2FC_lim+1), ylim = c(0, padj_lim), title = NULL, subtitle = NULL)
  }

  # ## if the range of x-axis or y-axis gets crazy, you may also manually define it to show a subset of the genes (but to be noted, the genes out of range will not be shown)
  # if (dea.tool == 'edgeR') {
  #   fig.volcano <- EnhancedVolcano(dea.table, lab = gene.dea, xlab = bquote(~Log[2]~ "fold change"), x = 'logFC', y = 'FDR', pCutoff = 10e-5, col = c("grey30", "orange2", "royalblue", "red2"),
  #                                FCcutoff = 1, xlim = c(-5, 5), ylim = c(0, 10), transcriptPointSize = 1.5, title = NULL, subtitle = NULL)
  # } else if (dea.tool == 'DESeq2') {
  #   fig.volcano <- EnhancedVolcano(dea.table, lab = gene.dea, xlab = bquote(~Log[2]~ "fold change"), x = 'log2FoldChange', y = 'padj', pCutoff = 10e-5, col = c("grey30", "orange2", "royalblue", "red2"),
  #                                FCcutoff = 1, xlim = c(-5, 5), ylim = c(0, 10), transcriptPointSize = 1.5, title = NULL, subtitle = NULL)
  # }

  as.pdf(fig.volcano, width = 9, height = 6, scaled = TRUE, file = file.path(out.path, paste('volcano_plot_', name.control, '_', name.treat, '.pdf', sep = '')))

  # heatmap
  norm.table.control <- read.table(norm.control, header = TRUE, row.names = 1)
  norm.table.treat <- read.table(norm.treat, header = TRUE, row.names = 1)

  num.control <- dim(norm.table.control)[2]
  num.treat <- dim(norm.table.treat)[2]

  norm.table <- cbind(norm.table.control, norm.table.treat)
  groups <- c(name.control, name.treat)

  # instead using all genes, only use the top 20 genes in dea.table
  index.deg <- which(row.names(norm.table) %in% gene.id.dea[1:20])
  norm.table.deg <- norm.table[index.deg,]

  gene.id.norm.table <- rownames(norm.table.deg)

  # convert the gene id to gene symbol, and keep the ID if no symbol can be found for that ID
  gene.norm.table <- convert.id2symbol(gene.id.norm.table)

  palette <- c("#999999", "#377EB8")
  palette.group <- c(rep(palette[1], num.control), rep(palette[2], num.treat))

  ## draw heatmap
  pdf(file = file.path(out.path, paste('heatmap_', name.control, '_', name.treat, '.pdf', sep = '')), width = 15, height = 12, title = 'Heatmap using the top features')
  heatmap(as.matrix(norm.table.deg), ColSideColors = palette.group, margins = c(9,5.5), labRow = gene.norm.table, cexRow = 1.9, cexCol = 1.9)
  legend("topleft", title = 'Group', legend=groups, text.font = 15,
         col = palette, fill = palette, cex=1.8)
  dev.off()
}

# the main function
for (ith.comparison in c(1:num.comparison)) {
  name.control <- controls[ith.comparison]
  name.treat <- treats[ith.comparison]
  plot.volcano.heatmap(name.control, name.treat)
}
18
19
shell:
    "hisat2-build -p {config[NCORE]} {input.trans} {params.index}"
27
28
29
run:
    shell("scp -i {input.key} {config[NELSIN]}/{wildcards.sample}_*_R1_001.fastq.gz {output.forward}")
    shell("scp -i {input.key} {config[NELSIN]}/{wildcards.sample}_*_R2_001.fastq.gz {output.reverse}")
43
44
45
shell:
    "hisat2 -p {config[NCORE]} -x {params.index} -1 {input.forward} -2 {input.reverse} -S {output.sam}"
    " && samtools view -b -S {output.sam} > {output.bam}"
52
53
shell:
    "samtools sort {input.bam} -o {output.sort} && samtools index {output.sort}"
60
61
shell:
    "samtools idxstats {input.sort} > {output.count}"
68
69
shell:
    "cd ../scripts && javac -cp opencsv-1.8.jar:. sumgenescod.java && java -cp opencsv-1.8.jar:. sumgenescod codgenelist.csv {input}"
76
77
shell:
    "sh ../scripts/formatCount.sh {input.geneCount} {output.formatCount}"
28
29
30
run:
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R1*.f*q.gz {output.forward}"),
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R2*.f*q.gz {output.reverse}")
39
40
41
42
43
shell:
    """
    shopt -s extglob
    scp -i {params.key} {params.input_path}/{wildcards.sample}?(_*)?(.*).f*q.gz {output.read}
    """
53
54
55
shell:
    "mkdir {output.indexes} && hisat2-build -p {config[NCORE]} {input.genome} {params.index}"
    "&& hisat2_extract_splice_sites.py {config[ANNOTATION]} > {output.splicesites}"
71
72
73
run:
    shell("hisat2 -p {config[NCORE]} --known-splicesite-infile {input.splicesites} -x {params.index} -1 {input.forward} -2 {input.reverse} -S {output.sam}")
    shell("samtools view -@ {config[NCORE]} -b -S {output.sam} > {output.bam}")
87
88
89
run:
    shell("hisat2 -p {config[NCORE]} --known-splicesite-infile {input.splicesites} -x {params.index} -U {input.forward} -S {output.sam}")
    shell("samtools view -@ {config[NCORE]} -b -S {output.sam} > {output.bam}")
96
97
shell:
    "samtools sort -@ {config[NCORE]} {input.bam} -o {output.sort}"
106
107
108
109
110
111
112
113
run:
    if config["COUNTER"]=="featureCounts":
        if config["END"]=="pair":
            shell("featureCounts -p -T {config[NCORE]} -t exon -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
        else:
            shell("featureCounts -T {config[NCORE]} -t exon -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
    elif config["COUNTER"]=="htseq-count":
        shell("htseq-count -f bam -i {config[ATTRIBUTE]} -s no -t exon {input.sort} {input.annotation} | sed '/^__/ d' > {output.count}")
121
122
shell:
    "qualimap bamqc -bam {input.sort} -nt {config[NCORE]} --java-mem-size=6G -outdir {output.bamqc}"
130
131
shell:
    "multiqc {input.bamqc} {input.count_summary} --filename {output.report}"
138
139
shell:
    "multiqc {input.count_summary} --filename {output.report}"
26
27
shell:
    "python scripts/combine2group_genome.py"
37
38
shell:
    "Rscript scripts/dea_genome.R"
27
28
shell:
    "Rscript scripts/combine2group_trans.R"
39
40
shell:
    "Rscript scripts/dea_trans.R"
53
54
shell:
    "Rscript scripts/combine2group_trans.R"
62
63
shell:
    "Rscript scripts/dea_trans.R"
23
24
25
run:
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R1*.f*q.gz {output.forward}"),
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R2*.f*q.gz {output.reverse}")
36
37
38
shell:
    "fastqc -t $(({config[NCORE]}+0)) -o {params.outputpath} {input.forward} && "
    "fastqc -t $(({config[NCORE]}+0)) -o {params.outputpath} {input.reverse}"
48
49
shell:
    "multiqc {params.path} --filename {output.report}"
58
59
60
61
62
shell:
    """
    shopt -s extglob
    scp -i {params.key} {params.input_path}/{wildcards.sample}?(_*)?(.*).f*q.gz {output.read}
    """
71
72
shell:
    "fastqc -t $(({config[NCORE]}+0)) -o {params.outputpath} {input.read}"
81
82
shell:
    "multiqc {params.path} --filename {output.report}"
27
28
29
run:
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R1*.f*q.gz {output.forward_read}"),
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R2*.f*q.gz {output.reverse_read}")
38
39
40
41
42
shell:
    """
    shopt -s extglob
    scp -i {params.key} {params.input_path}/{wildcards.sample}?(_*)?(.*).f*q.gz {output.read}
    """
49
50
shell:
    "salmon index -t {input} -i {output} --type quasi -k 31 -p {config[NCORE]}"
61
62
63
64
65
shell:
    """
    salmon quant -i {input.index} -l A -1 {input.forward_read} -2 {input.reverse_read} -o {output.quant_dir} -p {config[NCORE]} --seqBias --useVBOpt --validateMappings
    awk 'NR==1{{next}}{{print $1"\\t"$4}}' {output.quant_dir}/quant.sf > {output.tpm}
    """
74
75
76
77
78
shell:
    """
    salmon quant -i {input.index} -l A -r {input.read} -o {output.quant_dir} -p {config[NCORE]} --seqBias --useVBOpt --validateMappings
    awk 'NR==1{{next}}{{print $1"\\t"$4}}' {output.quant_dir}/quant.sf > {output.tpm}
    """
85
86
shell:
    "multiqc {input} --filename {output}"
31
32
33
run:
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R1*.f*q.gz {output.forward}"),
    shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R2*.f*q.gz {output.reverse}")
44
45
shell:
    "trim_galore --fastqc -j 4 --paired --basename {wildcards.sample} -o {params.outputpath} {input.forward} {input.reverse}"
54
55
56
57
58
shell:
    """
    shopt -s extglob
    scp -i {params.key} {params.input_path}/{wildcards.sample}?(_*)?(.*).f*q.gz {output.read}
    """
67
68
shell:
    "trim_galore --fastqc -j 4 --basename {wildcards.sample} -o {params.outputpath} {input.read}"
77
78
shell:
    "multiqc {params.path} --filename {output.report}"
30
31
shell:
    "Rscript scripts/visualize.R {input.norm_path} {input.dea_path} {params.output_path}"
52
53
shell:
    "Rscript scripts/visualize.R {input.norm_path} {input.dea_path} {params.output_path}"
ShowHide 35 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/zhxiaokang/RASflow.git
Name: rasflow-rna-seq-analysis-snakemake-workflow
Version: Version 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Other Versions:
Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...