A Snakemake pipeline to analyse RNA-Seq reads

public public 1yr ago Version: v0.1.1 0 bookmarks

RNA_seq_Snakemake

A snakemake pipeline for the analysis of RNA-seq data that makes use of hisat2 and DESeq2 .

Aim

To align, count, normalize counts and compute gene differential expressions between conditions using paired-end Illumina RNA-seq data.

Description

This pipeline analyses the raw RNA-seq data and produces a file containing normalized counts, differential expression, numbers of the clusters the genes have been assigned to and functions of transcripts. The raw fastq files will be trimmed for adaptors and quality checked with trimmomatic. Next, the necessary genome sequence fastas will be downloaded to be used for the mapping of the trimmed reads using hisat2. With stringtie and a downloaded reference annotation a new annotation will be created. This new annotation will be used to obtain the raw counts and do a local blast to a reference transcriptome fasta containing predicted functions. The counts are normalized and differential expressions are calculated using DESeq2. significantly expressed genes are used to create heatmaps and plots both in total and clustered. finally the DESeq2, blast and clustering data are combined to get the final results table.

Prerequisites: what you should be able to do before using this Snakemake pipeline

  • Some command of the Unix Shell to connect to a remote server where you will execute the pipeline (e.g. SURF Lisa Cluster). You can find a good tutorial from the Software Carpentry Foundation here and another one from Berlin Bioinformatics here .

  • Some command of the Unix Shell to transfer datasets to and from a remote server (to transfer sequencing files and retrieve the results/). The Berlin Bioinformatics Unix begginer guide available [here] should be sufficient for that (check the wget and scp commands).

  • An understanding of the steps of a canonical RNA-Seq analysis (trimming, alignment, etc.). You can find some info here .

Content

  • Snakefile : a master file that contains the desired outputs and the rules to generate them from the input files.

  • config.yaml : the configuration files making the Snakefile adaptable to any input files, genome and parameter for the rules.

  • data/ : a folder containing samples.txt (sample descriptions) and subsetted paired-end fastq files used to test locally the pipeline. Generated using Seqtk : seqtk sample -s100 {inputfile(can be gzipped)} 250000 > {output(always gunzipped)} This folder should contain the fastq of the paired-end RNA-seq data, you want to run.

  • envs/ : a folder containing the environments needed for the conda package manager. If run with the --use-conda command, Snakemake will install the necessary softwares and packages using the conda environment files.

  • samples.tsv : a file containing information about the names, the paths and the conditions of the samples used as input for the pipeline. This file has to be adapted to your sample names before running the pipeline .

Usage

Download or clone the Github repository

You will need a local copy of the Snakemake_hisat-DESeq on your machine. You can either:

  • use git in the shell: git clone [email protected]:KoesGroup/Snakemake_hisat-DESeq.git

  • click on "Clone or download" and select download

Installing and activating a virtual environment

First, you need to create an environment where Snakemake and the python pandas package will be installed. To do that, we will use the conda package manager.

  1. Create a virtual environment named rnaseq using the global_env.yaml file with the following command: conda env create --name rnaseq --file envs/global_env.yaml Then, activate this virtual environment with source activate rnaseq

The Snakefile will then take care of installing and loading the packages and softwares required by each step of the pipeline.

Configuration file

Make sure you have changed the parameters in the config.yaml file that specifies where to find the sample data file, the genomic and transcriptomic reference fasta files to use and the parameters for certains rules etc.
This file is used so the Snakefile does not need to be changed when locations or parameters need to be changed.

Snakemake execution

The Snakemake pipeline/workflow management system reads a master file (often called Snakefile ) to list the steps to be executed and defining their order. It has many rich features. Read more here .

Dry run

From the folder containing the Snakefile , use the command snakemake --use-conda -np to perform a dry run that prints out the rules and commands.

Real run

Simply type Snakemake --use-conda and provide the number of cores with --cores 10 for ten cores for instance.
For cluster execution, please refer to the Snakemake reference .

Main outputs

  • the RNA-Seq read alignment files *.bam

  • the fastqc report files *.html

  • the unscaled RNA-Seq read counts: counts.txt

  • the differential expression file results.tsv

  • the combined results file final.txt

Directed Acyclic Graph of jobs

dag

Code Snippets

  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
library(DESeq2)
library(optparse)

# arguments to provide
option_list = list(
  make_option(c("-c", "--counts"), type="character", default="results/counts.txt", help="counts tabulated file from Feature Counts", metavar="character"),
  make_option(c("-s", "--samplefile"), type="character", default="data/samples2.tsv", help="sample files used to get conditions for DESEq2 model fit", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default="results/deseqNew.csv", help="where to place differential expression files", metavar="character"),
  make_option(c("-f", "--helperfile"), type="character", default="results/helperfile.csv", help="helper file needed for the creation of the clustering and the heatmaps", metavar="character"),
  make_option(c("-m", "--maxfraction"), type="double", default=1.0, help="maximum fraction of the total number of genes to be allowed to be differential between two conditions to be included (number between 0 and 1)", metavar="double")
) 

# parse the command-line arguments and pass them to a list called 'opt'
opt_parser = OptionParser(option_list=option_list);
opt = parse_args(opt_parser);

######## Import and wrangle counts produced by FeatureCounts #######
countdata <- read.table(opt$counts, header=TRUE, skip=1, row.names=1,stringsAsFactors = F, check.names = F)
countdata <- countdata[ ,6:ncol(countdata)] # Remove first five columns (chr, start, end, strand, length)

# Remove .bam or .sam from filenames
colnames(countdata) <- gsub("\\.[sb]am$", "", colnames(countdata))

# Convert to matrix
countdata <- as.matrix(countdata)

######### Extract experimental conditions from the file specifying the samples
samplefile = read.delim(file = opt$samplefile,header = T,stringsAsFactors = F)
col.names = colnames(samplefile)                                                # extract all column names
columns.to.discard = c("sample","fq1","fq2")                                    # column we don't need to extract all conditions
colsForConditions = col.names[! col.names %in% columns.to.discard]                     # only keeping the column of interest

# one condition
if (length(colsForConditions) == 1){
  condition <- factor(samplefile[,colsForConditions])
  # two conditions
} else if (length(colsForConditions == 2)){
  # two conditions --> make a third column that is the product of the two
  samplefile$conditions = paste(samplefile[,colsForConditions[1]],samplefile[,colsForConditions[2]],sep = ".")
  condition <- factor(x = samplefile[,"conditions"],levels = unique(samplefile[,"conditions"]))
} else if (length(conditions > 2)){
  print("too many conditions to compare, skipping")
}

# create helper file for downstream use in the pipeline.
helperFile <- as.data.frame(condition)
rownames(helperFile) <- samplefile$sample
colnames(helperFile) <- NULL
print(helperFile)
write.csv(helperFile, file = opt$helperfile, quote = F, col.names = F, row.names = T)

# Analysis with DESeq2 ----------------------------------------------------
# Create a coldata frame and instantiate the DESeqDataSet. See ?DESeqDataSetFromMatrix
coldata <- data.frame(row.names=colnames(countdata), condition)
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~condition)

# Run the DESeq pipeline
dds <- DESeq(dds)

# create dataframe containing normalized counts, to which the differential expression values will be added
resdata <- as.data.frame(counts(dds, normalized=TRUE))

## function for Volcano plot with "significant" genes labeled
volcanoplot <- function (res, lfcthresh=2, sigthresh=0.05, main="Volcano Plot", legendpos="bottomright", labelsig=TRUE, textcx=1, ...) {
  with(res, plot(log2FoldChange, -log10(padj), pch=20, main=main, ...))
  with(subset(res, padj<sigthresh ), points(log2FoldChange, -log10(padj), pch=20, col="red", ...))
  with(subset(res, abs(log2FoldChange)>lfcthresh), points(log2FoldChange, -log10(padj), pch=20, col="orange", ...))
  with(subset(res, padj<sigthresh & abs(log2FoldChange)>lfcthresh), points(log2FoldChange, -log10(padj), pch=20, col="green", ...))
  if (labelsig) {
    require(calibrate)
    with(subset(res, padj<sigthresh & abs(log2FoldChange)>lfcthresh), textxy(log2FoldChange, -log10(padj), labs=colnames(res), cex=textcx, ...))
  }
  legend(legendpos, xjust=1, yjust=1, legend=c(paste("FDR<",sigthresh,sep=""), paste("|LogFC|>",lfcthresh,sep=""), "both"), pch=20, col=c("red","orange","green"))
}

# open file that will be containing a vulcano plot for each of the combinations of conditions
#pdf(file="vulcanoplots.pdf")

# iterate through the list of conditions to create differential expression (DE) values for all possible pairs
x <- 1
for(i in levels(condition)){
  x <- x + 1
  if (x <= length(levels(condition))){
    for(j in x:length(levels(condition))){
      res <- results(dds, contrast=c("condition", i,  levels(condition)[j]))                      # i = first in pair, levels(condition)[j] is the second in pair.
      d <- paste(i, levels(condition)[j], sep="&")                                                # paste the two conditions in one character, to be used as the pair name
      resP <- as.data.frame(table(res$padj<0.05))                                                 # get number of DE values with P-value < 0.05
      if(resP[1,2]< opt$maxfraction*nrow(resdata)){                                               # only continue with the pair if it is less then the maximum fraction(set by user in commandline) differentially expressed 
        #volcanoplot(res, lfcthresh=1, sigthresh=0.05, textcx=.8, xlim=c(-10, 10), main=d)
        print(c(d,"Number of differentials is within accepted limit"))
        colnames(res) = paste(d,c(colnames(res)),sep = "-")                                       # paste the pair name to the column name
        resdata <- merge(as.data.frame(resdata), as.data.frame(res), by="row.names", sort=FALSE)  # merge the DE values to the matrix
        rownames(resdata) <- resdata$Row.names                                                    # redifine rownames as they have disapeared?? in the merging??
        resdata$Row.names <- NULL                                                                 # delete column containing the rownames, to avoid the same colname in the next iteration
      }
      else{
        print(c(d,"More differentials then allowed"))
      }
    }
  }
}
#dev.off()
# create first column containing the genenames.
resdata$genes <- rownames(resdata)
resdata <- merge(as.data.frame(resdata["genes"]), as.data.frame(resdata[,2:ncol(resdata)-1]), by="row.names", sort=FALSE)
resdata$Row.names <- NULL

# write the data to a file.
write.table(resdata, file=opt$outdir,sep = "\t",quote=F,row.names=F)
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
from optparse import OptionParser 
import glob
import os

# get the command line arguments
parser = OptionParser(description="Script that combines the normalized and differential expression data as outputted by DESeq2.R, the clusters as outputted by plotscript.R and the results from the blast against the reference proteome. Resulting in a combined tab-delimeted data file")
parser.add_option('-r', '--result_file', 
                    type=str,
                    default="results/result.csv",
                    metavar="",
                    help = "path and name of of the input file, being the output file of the R-script DESeq2.R, default = results/result.csv")
parser.add_option('-a', '--anno_file', 
                    type=str,
                    default="annotations/Petunia_axillaris_v1.6.2_proteins.mapman_annotation.txt",
                    metavar="",
                    help = "path and name of the stringtie transcriptome fasta file, a file containing the fastas of the de novo generated stringtie transcriptome, default = result/helperFile.csv")
parser.add_option('-c', '--cluster_file', 
                    type=str,
                    default="results/clusters.txt",
                    metavar="",
                    help = "path and name of the clusters file, a tab delimited file containing cluster numbers and depending on the method of clustering correlation values or membership values., default = result/helperFile.csv")
parser.add_option('-o', '--output_file', 
                    type=str,
                    default="results/final.txt",
                    metavar="",
                    help = "path and name of the output file. A tab delimimited file containing normalised reads, differential expressions and (adjusted)p-values, number of clusters the genes partisioned to and the hypothetical function as found by a blast. default = result/final.txt")
parser.add_option('-p', '--working_dir', 
                    type=str,
                    default="temp/mapped/",
                    metavar="",
                    help = "path to bam files, to be removed from the sample names in the header of the final output file. default = temp/mapped/")


(options, args) = parser.parse_args()

clusts = open(options.cluster_file, "r")
clusters = {}
for line in clusts:
    line = line.rstrip()
    line = line.split("\t")
    if len(line) == 2:
        line.append("NaN")
    clusters[line[0].lower()] = "\t".join(line[1:])

clusts.close()

# if there is an annotation file available add annotaions.
if options.anno_file in glob.glob(options.anno_file):
    if glob.glob(options.anno_file)[0].endswith(".gz"):
        os.system(f"gunzip {options.anno_file}")
        annoFile = ".".join(options.anno_file.split(".")[:-1])
    else:
        annoFile = options.anno_file
    fa = open(annoFile, "r")
    annos = {}
    count = 0
    for l in fa:
        l = l.split("\t")
        l = [x.strip("'") for x in l]
        if len(l)>2:
            if len(l[2])>1:
                if l[2].lower() not in annos:
                    annos[l[2].lower()] = ".".join(l[1].split(".")[:3]).replace(" ", "_")
                else:
                    annos[l[2].lower()] += (";" + ".".join(l[1].split(".")[:3]).replace(" ", "_"))
    fa.close()

    inFile  = open(options.result_file, "r")
    uitFile = open(options.output_file, "w")
    for l in inFile:
        l = l.rstrip().split("\t")
        if "genes" in l[0]:
            l = [x.replace(options.working_dir, "") for x in l]    # remove path from sample names
            l.append("\t".join([clusters["gene"], "annotation"]))
        elif l[0].lower() in clusters:
            if l[0].lower() in annos:
                l.append(clusters[l[0].lower()] + "\t" + annos[l[0].lower()])
            else:
                l.append(clusters[l[0].lower()] + "\tno annotation available.")
        elif l[0].lower() in annos:
            l.append("NaN\tNaN\t" + annos[l[0].lower()])
        else:
            l.append("NaN\tNaN\tno annotation available.")
        uitFile.write("\t".join(l)+"\n")
    inFile.close()
    uitFile.close()

else:
    inFile  = open(options.result_file, "r")
    uitFile = open(options.output_file, "w")
    for l in inFile:
        l = l.rstrip().split("\t")
        if "genes" in l[0]:
            l = [x.replace(options.working_dir, "") for x in l]    # remove path from sample names
            l.append("\t".join([clusters["gene"]]))
        elif l[0].lower() in clusters:
            l.append(clusters[l[0].lower()])
        else:
            l.append("NaN\tNaN")
        uitFile.write("\t".join(l)+"\n")
    inFile.close()
    uitFile.close()
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
import numpy as np 
from optparse import OptionParser 

# get the command line arguments
parser = OptionParser(description="python script to create pvalue and log2(foldchange) filtered ")
parser.add_option('-i', '--input_file', 
                    type=str,
                    default="results/plotSelection.txt",
                    metavar="",
                    help = "path and name of of the input file, being the output file of the R-script DESeq2.R, default = results/plotSelection.txt")
parser.add_option('-f', '--helper_file', 
                    type=str,
                    default="result/helperFile.csv",
                    metavar="",
                    help = "path and name of the helper file, a tab delimited file containing one column samples and a column conditions, default = result/helperFile.csv")
parser.add_option('-o', '--output_file', 
                    type=str,
                    default="result/plotSelection.txt",
                    metavar="",
                    help = "path and name of the output file a tab delimimited file containing normalised reads of significantly differentially expressed genes of all the samples or averaged to the conditions, default = result/plotSelection.txt")
parser.add_option('-v', '--minimum_foldchange', 
                    type=float,
                    default=2,
                    metavar="",
                    help = "minimum log2(fold_change) in any combination of conditions; integer > 1, default = 2")
parser.add_option('-r', '--minimum_reads', 
                    type=int,
                    default=100,
                    metavar="", 
                    help = "minimum number of reads of all samples together; integer >= 0, default = 100")
parser.add_option('-p', '--maximum_pvalue', 
                    type=float,
                    default=0.05,
                    metavar="",
                    help = "maximum exepted adjusted pvalue; 0.0 to 1.0, default = 0.05")
parser.add_option('-a', '--average_samples', 
                    type=str,
                    default="yes",
                    metavar="",
                    help = "output needs to contain averages of conditions; yes or no, default = yes")

(options, args) = parser.parse_args()


# get a list of the conditions 
helper = open(options.helper_file, "r")
samples = {}
conditions = []
x=0
for l in helper:
    l = l.rstrip()
    if len(l) > 0:
        x += 1
        l = l.split(",")
        name = l[1]
        samples[x] = name
        if name not in conditions:
            conditions.append(name)
helper.close()

# function to average the samples within a condition
def getAverages(counts):
    global samples, conditions
    sets = {}
    averages = [counts[0]]
    for i in samples:
        if samples[i] in sets:
            sets[samples[i]].append(float(counts[i]))
        else:
            sets[samples[i]] = [float(counts[i])]
    for c in conditions:
        averages.append(str(np.mean(sets[c])))
    return(averages)

inputData = open(options.input_file, "r")
outAll    = open(options.output_file, "w")
totalBoth = 0
count     = 0
min_reads = options.minimum_reads
foldchange= options.minimum_foldchange
pvalue    = options.maximum_pvalue
avarage   = options.average_samples.upper()

for l in inputData:
    count += 1
    if l.startswith("genes") == False and len(l) > 5:
        l = l.replace("NA", "1")         # replace NA values in P-value to 1
        l = l.replace("#VALUE!", "0")    # replace NA values in foldchange to 0
        f = l.rstrip()                   # remove white space and line break at the end.
        f = f.split("\t")                # split string to list on tab
        if len(f) > x+1:
            fc = [float(f[i]) for i in range(x+2, len(f), 6)]       # list of faultchanges
            pv = [float(f[i]) for i in range(x+6, len(f), 6)]       # list of adjusted p-values
            counts = [float(f[i]) for i in range(1,x+1)]
            if sum(counts) > min_reads:             # total number of reads more then ~20 per sample
                for a1, a2 in zip(fc,pv):
                    if a1 > foldchange and a2 < pvalue and a2 == min(pv):
                        if avarage == "YES":
                            line = "\t".join(getAverages(f[:x+1]))
                        else:
                            line = "\t".join(f[:x+1])
                        outAll.write(line+"\n")
                        totalBoth += 1
                        break
                    elif a1 < -1*foldchange and a2 < pvalue and a2 == min(pv):
                        if avarage == "YES":
                            line = "\t".join(getAverages(f[:x+1]))
                        else:
                            line = "\t".join(f[:x+1])
                        outAll.write(line+"\n")
                        totalBoth += 1
                        break
        else:
            print("No fold changes and p-values were found. Try rerunning the DESeq2.R with a higher maximum fraction (to be adjusted in config.yaml)")
    else:
        if len(l) > 10:
            if avarage == "YES":
                line = "\t".join(conditions)
            else:
                line = "\t".join(l.split("\t")[1:x+1])
            outAll.write(l.split("\t")[0] + "\t" + line + "\n")

print(f"Total number of genes is: {str(count-1)}.")
print(f"Number of genes kept for plots is: {str(totalBoth)}.")

inputData.close() 
outAll.close()
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
library(optparse)

option_list = list(
  make_option(c("-i", "--input_file"), type="character", default="results/counts.txt", help="normalized and filtered counts tabulated file from filter_for_plots.py", metavar="character"),
  make_option(c("-m", "--method_of_clustering"), type="character", default="hierarchical", help="method of clustering chose between: hierarchical, 
                kmean or fuzzy_kmean
                default is hierarchical", metavar="character"),
  make_option(c("-n", "--opt_clust_number"), type="character", default="dynamic", help="determination of optimal number of clusters, 
                If --method_of_clustering=hierarchical the options are: fixed, height or dynamic.
                If--method_of_clustering=kmean or --method_of_clustering=fuzzy_kmean the options
                are: fixed, average_silhouette_width, calinsky_criterion, 
                gap_statistic or affinity_propogation.
                default is dynamic.", metavar="character"),
  make_option(c("-k", "--number_of_clusters"), type="integer", default=5, help="number of clusters to be used. choose an integer of two or more. 
                To be used if --det_hclust_number=fixed or --det_kclust_number=fixed.
                default is 5", metavar="integer"),
  make_option(c("-H", "--height_in_dendrogram"), type="double", default=1.5, help="height in dendrogram to use for determination of cluster number.
               default is 1.5", metavar="double"),
  make_option(c("-q", "--membership_min"), type="double", default=0.2, help="minimal membership value of fuzzy kmean cluster to be included in 
                the cluster. should be a value between 0.0 and 1.0. Only to 
                be used in combination with -m = fuzzy_kmean. Best to first run with a low value,
	        and adjust based on results if needed.
                default is o.2", metavar="double"),
  make_option(c("-r", "--correlation_min"), type="double", default=0.5, help="minimal correlation value to a kmean cluster to be included in 
                the cluster, if too low it will be excluded from all clusters!. should be a value
                between 0.0 and 1.0. Only to be used in combination with -m = kmean or -m = pam. 
                Best to first run with a low value, and adjust based on results if needed.
                default is 0.5", metavar="double"),
  make_option(c("-c", "--colour_of_heatmaps"), type="character", default=c("white","green","green4","violet","purple"), help="colors of the heatmaps. needs to be a list of two or more colors. 
                To see a list of all 657 colour names in R: colors()
                default is the colourblind friendly: 
                c(\"white\",\"green\",\"green4\",\"violet\",\"purple\")", metavar="character"),
  make_option(c("-p", "--plots_output_file"), type="character", default="results/plots.pdf", help="name of multipage pdf file where to output all the plots.", metavar="character"),
  make_option(c("-o", "--clusters_output_file"), type="character", default="results/plots.pdf", help="name of multipage pdf file where to output all the plots.", metavar="character")
  )
# parse the command-line arguments and pass them to a list called 'opt'
opt_parser = OptionParser(option_list=option_list , add_help_option = TRUE, description = "\nThis script produces multiple plots and clusters from a RNAseq
output from the filter_for_plots.py. 
plots include dendrograms of the samples and the genes, a heatmap, an elbow plot,
a heatmap with clusters specicied by colorbar, and finally plots and heatmaps of the
different clusters.
Because the 'correct' method for clustering RNAseq data is a matter of perspective; 
it is the one that allows the researcher to make the most out of her data. Gene expression
data is also full of noise which can make clustering tricky when using algorithms optimised
for chunkier data. With that in mind it's good to try several methods and compare them.
So for that multiple ways of clusting can be choosen also the number of clusters can be manually
choosen or calculated by a number of given methods. When making use of the snakemake pipeline,
it is essential to remove or rename the output file (plots.pdf), change the arguments you want to 
change in the config.yaml, and rerun the pipeline with 'snakemake --use-conda' (to be sure only
the last script reruns, start of with 'snakemake -np').", epilogue = "Pease feel free to mail me, [email protected], if you would like more info or have suggestions for optimization.\n\n");
opt = parse_args(opt_parser);

i <- opt$input_file
m <- opt$method_of_clustering
n <- opt$opt_clust_number
k <- opt$number_of_clusters
h <- opt$height_in_dendrogram
c <- opt$colour_of_heatmaps
q <- opt$membership_min
r <- opt$correlation_min
p <- opt$plots_output_file
o <- opt$clusters_output_file

######################################################################
##############          import needed packages         ###############
######################################################################


library(gplots)
library(dendextend)
library(dynamicTreeCut)
#library("pheatmap")
library(cluster)
library(vegan)
library(apcluster)
#library(ggplot2)
library(reshape2)
library(RColorBrewer)
library(dplyr)
library(colorRamps)
library(e1071)
library(tidyr)
library(ggplot2)

######################################################################
###############              read the data           #################
######################################################################

data <- read.delim(i, header=T, row.names="genes")
z <- as.matrix((data))

# scaling the data to avoid clustering based on expression level
scaledata <- t(scale(t(z))) # Centers and scales data.
scaledata <- scaledata[complete.cases(scaledata),]

# open pdf for the plots
pdf(file=p)

#mean/variance calculations and plot
z_var <- apply(z, 1, var)
z_mean <- apply(z, 1, mean)
plot(log2(z_mean), log2(z_var), pch='.')


##############################################################
############    hierarchical dendrogram      #################
##############################################################

# dendrogram of the samples
hc <- hclust(as.dist(1-cor(scaledata, method="spearman")), method="complete") # Clusters columns by Spearman correlation.
TreeC = as.dendrogram(hc, method="average")
plot(TreeC,
     main = "Sample Clustering",
     ylab = "Height")

# dendrogram of the genes
hr <- hclust(as.dist(1-cor(t(scaledata), method="pearson")), method="complete") # Cluster rows by Pearson correlation.
TreeR = as.dendrogram(hr, method="average")
plot(TreeR,
     leaflab = "none",
     main = "Gene Clustering",
     ylab = "Height")

#############################################################
##############        create Heatmap       ##################
#############################################################

# colors of the heatmap
my_palette <- colorRampPalette(c)(100)
#(c("magenta", "black", "green"))(n = 299)

# plot the heatmap
heatmap.2(z,
          Rowv=as.dendrogram(hr), 
          Colv=NA,
          col=my_palette,
          scale="row",
          margins = c(7, 7),
          cexCol = 0.7,
          labRow = F,
          dendrogram = "row",
          main = "Heatmap.2",
          trace = "none")


#################################################################
#######   Elbow plot ( the sum of squared error (SSE))    #######
#################################################################

wss <- (nrow(scaledata)-1)*sum(apply(scaledata,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(scaledata,
                                     centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")


##############################################################
##############        Do the clustering        ###############
##############################################################



if(m == "kmean" || m == "fuzzy_kmean" || m == "pam"){
  if(n == "average_silhouette_width"){
    # method 1: average silhouette width
    sil <- rep(0, 20)
    #repeat k-means for 1:20 and extract silhouette:
    for(i in 2:20){
      k1to20 <- kmeans(scaledata, centers = i, nstart = 25, iter.max = 20)
      ss <- silhouette(k1to20$cluster, dist(scaledata))
      sil[i] <- mean(ss[, 3])
    }
    # Plot the  average silhouette width
    plot(1:20, sil, type = "b", pch = 19, xlab = "Number of clusters k", ylab="Average silhouette width")
    abline(v = which.max(sil), lty = 2)

    k <- which.max(sil)
  }
  else if(n == "calinsky_criterion"){
    # method 2: Calinsky criterion
    fit <- cascadeKM(scaledata, 1, 20, iter = 100)
    plot(fit, sortg = TRUE, grpmts.plot = TRUE)
    calinski.best <- as.numeric(which.max(fit$results[2,]))
    cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")

    k <- calinski.best
  }
  else if(n == "gap_statistic"){
    # method 3: Gap statistic
    set.seed(13)
    gap <- clusGap(scaledata, kmeans, 20, B = 100, verbose = interactive())
    plot(gap, main = "Gap statistic")
    abline(v=which.max(gap$Tab[,3]), lty = 2)

    k <- which.max(gap$Tab[,3])
  }
  else if(n == "affinity_propogation"){
    # method 4: Affinity propogation
    d.apclus <- apcluster(negDistMat(r=2), scaledata)
    cat("affinity propogation optimal number of clusters:", length(d.apclus@clusters), "\n")
    #uncomment the next line for the heatmap, it takes a long time to run
    #heatmap(d.apclus,cexRow=0, cexCol=0)

    k <- length(d.apclus@clusters)
  }
  else if(n == "fixed"){
    k <- k
  }else{
    print("your choosen method of determining the optimal number of clusters is unclear.")
    print("When using --method_of_clustering=kmean or --method_of_clustering=fuzzy_kmean ")
    print("the options are:fixed, average_silhouette_width, calinsky_criterion,")
    print("gap_statistic or affinity_propogation.")
  }
  print(paste0("Dataset is clustered in ",k, " clusters "))
  if(m == "kmean"){
    set.seed(20)
    clusts <- kmeans(scaledata, centers=k, nstart = 1000, iter.max = 20)
    clusts <- clusts$cluster
  }
  else if(m == "fuzzy_kmean"){
    mestimate<- function(df){
      N <-  dim(df)[[1]]
      D <- dim(df)[[2]]
      m.sj <- 1 + (1418/N + 22.05)*D^(-2) + (12.33/N +0.243)*D^(-0.0406*log(N) - 0.1134)
      return(m.sj)
    }
    mes <- mestimate(scaledata)
    fcm_results <- cmeans(scaledata,centers=k,m=mes)
    fcm_plotting_df <- data.frame(scaledata)
    fcm_plotting_df$gene <- row.names(fcm_plotting_df)
    fcm_plotting_df$cluster <- fcm_results$cluster
    fcm_plotting_df$membership <- sapply(1:length(fcm_plotting_df$cluster),function(row){
      clust <- fcm_plotting_df$cluster[row]
      fcm_results$membership[row,clust]
    })
    clusts <- fcm_plotting_df$cluster
    names(clusts) <- fcm_plotting_df$gene
    # get the core data
    fcm_centroids <- fcm_results$centers
    fcm_centroids_df <- data.frame(fcm_centroids)
    fcm_centroids_df$cluster <- row.names(fcm_centroids_df)
    centroids_long <- tidyr::gather(fcm_centroids_df,"sample",'value', 1:ncol(scaledata))
    #start with the input data
    fcm_plotting_df <- data.frame(scaledata)
    #add genes
    fcm_plotting_df$gene <- row.names(fcm_plotting_df)
    #bind cluster assinment
    fcm_plotting_df$cluster <- fcm_results$cluster
    #fetch the membership for each gene/top scoring cluster
    fcm_plotting_df$membership <- sapply(1:length(fcm_plotting_df$cluster),function(row){
      clust <- fcm_plotting_df$cluster[row]
      fcm_results$membership[row,clust]
    })
    # filter out genes don't really belong to any of the clusters
    selection <- fcm_plotting_df %>% filter(membership > q)
  }
  else if(m == "pam"){
    pam.Pclusts <- pam(scaledata, k=k)
    clusts <- pam.Pclusts$'clustering'
  }
}else if(opt$method_of_clustering == "hierarchical"){
  if(n == "dynamic"){
    clusts <- cutreeDynamic(hr, distM = as.matrix(as.dist(1-cor(t(scaledata)))), method = "hybrid")
    names(clusts) <- rownames(scaledata)
    clust <- as.data.frame(clusts)
    colnames(clust) <- "cluster"
    k = length(unique(clusts))
    print(paste0("Dataset is clustered in ",k, " clusters "))
  }
  else if(n == "fixed"){
    clusts = cutree(hr, k=k)
    k = length(unique(clusts))
    print(paste0("Dataset is clustered in ",k, " clusters "))
  }
  else if(n == "height"){
    clusts = cutree(hr, h=h)
    k = length(unique(clusts))
    print(paste0("Dataset is clustered in ",k, " clusters "))
  }
  else{
    print("your choosen method of determining the optimal number of clusters is unclear.\
          When using --method_of_clustering=hierarchical the options are: fixed, height or dynamic.")
  } 
  }else{
    print("your choosen method of clustering is unclear. 
          It should be one of the following: hierarchical, kmean or fuzzy_kmean")
}


###################################################################
############        Heatmap with clusterbar          ##############
###################################################################

mycolhc <- rainbow(length(unique(clusts)), start=0.1, end=0.9)
mycolhc <- mycolhc[as.vector(clusts)]

hr <- hclust(as.dist(1-cor(t(scaledata), method="pearson")), method="complete")

heatmap.2(z,
          Rowv=as.dendrogram(hr), 
          Colv=NA,
          col=my_palette,
          scale="row",
          margins = c(7, 7),
          cexCol = 0.7,
          labRow = F,
          dendrogram = "row",
          main = paste0("Heatmap.2 with colour bar indicating the clusters (",k,")"),
          trace = "none",
          RowSideColors=mycolhc,
          key = FALSE)


###############################################################################
########   calculate the cluster 'cores' aka centroids and plot them   ########
###############################################################################

# get the cluster centroids (the average profile of expression for each of the clusters)
clust.centroid = function(i, dat, clusters) {
  ind = (clusters == i)
  colMeans(dat[ind,])
}
clusUniq <- unique(clusts)
kClustcentroids <- sapply(levels(factor(clusts)), clust.centroid, scaledata, clusts)
head(kClustcentroids)

Kmolten <- melt(kClustcentroids)
colnames(Kmolten) <- c('sample','cluster','value')

# In case of use of pam, the cluster centroids wil be the medoids (the expression profiles of the genes that fit the clusters best)
if (m == "pam"){
  pClustcentroids <- t(pam.Pclusts$"medoids")
  Kmolten <- melt(pClustcentroids)
  colnames(Kmolten) <- c('sample','cluster','value')
} 

p1 <- ggplot(Kmolten, aes(x=sample,y=value, group=cluster, colour=as.factor(cluster))) + 
  geom_point() + 
  geom_line() +
  xlab("Time") +
  ylab("Expression") +
  labs(title= "Cluster Expression of the samples",color = "Cluster")
p1

# in case of pam, change names of the clusters to numbers in stead of the medoids gene names
if (m == "pam"){
  colnames(pClustcentroids) <- clusUniq
  Kmolten <- melt(pClustcentroids)
  colnames(Kmolten) <- c('sample','cluster','value')
}
###########################################################################
#############      heatmaps and plots of the clusters      ################
###########################################################################

# open list to collect the correlation values in case of kmean, fuzzy_kmean and pam
cors <- c()

# loop though the clusters and create expression profile plots and heatmap for each of them
if(m == "fuzzy_kmean"){
  for(i in 1:k){
    print(i)
    cluster_plot_df <- dplyr::filter(selection, cluster == i) %>%
      dplyr::select(.,1:ncol(scaledata),membership,gene) %>%
      tidyr::gather(.,"sample",'value',1:ncol(scaledata)) 
    #order the dataframe by score
    cluster_plot_df <- cluster_plot_df[order(cluster_plot_df$membership),]
    #set the order by setting the factors using forcats
    cluster_plot_df$gene = forcats::fct_inorder(cluster_plot_df$gene)

    #subset the cores by cluster
    core <- dplyr::filter(centroids_long, cluster == i)

    g1 <-  ggplot(cluster_plot_df, aes(x=sample,y=value)) + 
      geom_line(aes(colour=membership, group=gene)) +
      scale_colour_gradientn(colours=c('blue1','red2')) +
      #this adds the core 
      geom_line(data=core, aes(sample,value, group=cluster), color="black",inherit.aes=FALSE) +
      geom_point(data=core, aes(sample,value, group=cluster), color="black",inherit.aes=FALSE) +
      xlab("Sample") +
      ylab("Expression") +
      labs(title= paste0("Cluster ",i," Expression of the Samples"),color = "Score")
    print(g1)

    group <- selection %>% filter(selection$cluster==i)
    rownames(group) <- group$gene
    group <- as.matrix(group[,1:ncol(scaledata)])
    if(nrow(group) > 1){
      hrc <- hclust(as.dist(1-cor(t(group), method="pearson")), method="complete")
      #pheatmap(group, color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(length(breaksList)), breaks = breaksList, cluster_cols = F)  #cellheight = 8
      heatmap.2(group,
                Rowv=as.dendrogram(hrc), 
                Colv=NA,
                col= magenta2green(100),
                labRow=rownames(group),
                scale="row",
                margins = c(7, 7),
                cexCol = 0.7,
                dendrogram = "row",
                main = paste0("Cluster ",i, " consisting ", nrow(group), " genes"),
                trace = "none",key = F)
    }
  }
}else if(m == "kmean" || m == "hierarchical" || m == "pam"){
  for(i in clusUniq){
    K2 <- (scaledata[clusts==i,])
    if(nrow(K2) > 1){
      #calculate the correlation with the core
      core <- Kmolten[Kmolten$cluster==i,]
      corscore <- function(x){cor(x,core$value)}
      score <- apply(K2, 1, corscore)
      # add the correlations to the list
      cors <- c(cors, score)
      #get the data frame into long format for plotting
      K2molten <- melt(K2)
      colnames(K2molten) <- c('gene','sample','value')
      #add the score
      K2molten <- merge(K2molten,score, by.x='gene',by.y='row.names', all.x=T)
      colnames(K2molten) <- c('gene','sample','value','score')
      #order the dataframe by score
      #to do this first create an ordering factor
      K2molten$order_factor <- 1:length(K2molten$gene)
      #order the dataframe by score
      K2molten <- K2molten[order(K2molten$score),]
      #set the order by setting the factors
      K2molten$order_factor <- factor(K2molten$order_factor , levels = K2molten$order_factor)

      # Everything on the same plot
      if(m == "kmean" || m =="pam"){
        p2 <- ggplot(K2molten, aes(x=sample,y=value)) + 
          geom_line(aes(colour=score, group=gene)) +
          scale_colour_gradientn(colours=c('blue1','red2')) +
          #this adds the core 
          geom_line(data=core, aes(sample,value, group=cluster), color="black",inherit.aes=FALSE) +
          geom_point(data=core, aes(sample,value, group=cluster), color="black",inherit.aes=FALSE) +
          xlab("Samples") +
          ylab("Expression") +
          labs(title= paste0("Cluster ",i, " consisting ", nrow(K2), " genes"),color = "Score")
      }
      else if(m == "hierarchical"){
        p2 <- ggplot(K2molten, aes(x=sample,y=value)) + 
          geom_line(color="grey", aes(color="grey", group=gene)) +
          #this adds the core 
          geom_line(data=core, aes(sample,value, group=cluster), color="blue",inherit.aes=FALSE) +
          geom_point(data=core, aes(sample,value, group=cluster), color="blue",inherit.aes=FALSE) +
          xlab("Samples") +
          ylab("Expression") +
          labs(title= paste0("Cluster ",i, " consisting ", nrow(K2), " genes"),color = "Score")
      }
      print(p2)
      hrc <- hclust(as.dist(1-cor(t(K2), method="pearson")), method="complete")
      #pheatmap(group, color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(length(breaksList)), breaks = breaksList, cluster_cols = F)  #cellheight = 8
      heatmap.2(K2,
                Rowv=as.dendrogram(hrc), 
                Colv=NA,
                col= my_palette,
                labRow=rownames(K2),
                scale="row",
                margins = c(7, 7),
                cexCol = 0.7,
                dendrogram = "row",
                main = paste0("Cluster ",i, " consisting ", nrow(K2), " genes"),
                trace = "none",key = F)
    }
  }
}
dev.off() 

#########################################################################################
#############       create output file containing the clusters          #################
#########################################################################################

# create data frame containing cluster and if available correlation values or memberships
if(m == "fuzzy_kmean"){
  # fuzzy_kmeans
  lijst <- data.frame(selection$gene, selection$cluster, selection$membership )
  colnames(lijst) <- c("gene", "clusters", "membership")
}else if(m == "kmean" || m == "pam"){
  # kmean and pam
  lijst <- merge(as.data.frame(clusts), as.data.frame(cors), by="row.names", sort=FALSE)
  colnames(lijst) <- c("gene", "clusters", "correlation")
  lijst <- lijst %>% filter(correlation > r)
}else{
  # hierarchical
  lijst <- data.frame(rownames(as.data.frame(clusts)), clusts)
  colnames(lijst) <- c("gene", "clusters")
}

# Write the dataframe as a tab-delimited file
write.table(lijst, file=o, sep = "\t",quote=F,row.names=F)
97
98
shell:
    "wget -O {output} {genome_url}"
107
108
shell:
    "wget -O {output} {transcriptome_gtf_url}"
SnakeMake From line 107 of master/Snakefile
129
130
131
132
133
134
135
136
137
138
139
140
141
run:
    if sample_is_single_end(params.sampleName):
        shell("fastp --thread {threads}  --html {output.html} \
        --qualified_quality_phred {params.qualified_quality_phred} \
        --in1 {input} --out1 {output} \
        2> {log}; \
        touch {output.fq2}")
    else:
        shell("fastp --thread {threads}  --html {output.html} \
        --qualified_quality_phred {params.qualified_quality_phred} \
        --detect_adapter_for_pe \
        --in1 {input[0]} --in2 {input[1]} --out1 {output.fq1} --out2 {output.fq2}; \
        2> {log}")
156
157
shell:
    "hisat2-build -p {threads} {input} {params} --quiet"
175
176
177
178
179
180
181
run:
    if sample_is_single_end(params.sampleName):
        shell("hisat2 -p {threads} --summary-file {output.sum} --met-file {output.met} -x {params.indexName} \
        -U {input} | samtools view -Sb -F 4 -o {output.bams}")
    else:
        shell("hisat2 -p {threads} --summary-file {output.sum} --met-file {output.met} -x {params.indexName} \
        -1 {input[0]} -2 {input[1]} | samtools view -Sb -F 4 -o {output.bams}")
197
198
shell:
    "featureCounts -O -t mRNA -g ID -F 'gtf' -a {input.gff} -o {output} {input.bams}"
218
219
shell:
    "Rscript scripts/DESeq2.R -c {input.counts} -s {input.samplefile} -o {output.result} -m {params.maxfraction} -f {output.helper}"
SnakeMake From line 218 of master/Snakefile
232
233
234
235
236
237
238
shell:
    "python scripts/DE_with_Function.py "
    "-a {params.annos} "
    "-c {input.clusts} "
    "-r {input.deseq} "
    "-o {output.final} "
    "-p {params.path}"
SnakeMake From line 232 of master/Snakefile
256
257
258
259
260
261
262
263
264
shell:
    "python scripts/filterForPlots.py "
    "-i {input.result} "
    "-f {input.helper} "
    "-o {output} "
    "-v {params.minimum_foldchange} "
    "-r {params.minimum_reads} "
    "-p {params.maximum_pvalue} "
    "-a {params.average_samples}"
SnakeMake From line 256 of master/Snakefile
281
282
283
284
285
286
287
288
shell:
    "Rscript scripts/plotscript.R "
    "-i {input} "
    "-m {params.method_of_clustering} "
    "-n {params.opt_clust_number} "
    "-k {params.number_of_clusters} "
    "-H {params.height_in_dendrogram} "
    "-q {params.membership_min} "
SnakeMake From line 281 of master/Snakefile
ShowHide 6 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/KoesGroup/Snakemake_hisat-DESeq
Name: snakemake_hisat-deseq
Version: v0.1.1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...