A Snakemake pipeline to analyse RNA-Seq reads

public 1yr ago Version: v0.1.1 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic
Lack of a description for a new keyword DESeq2 .

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

RNA_seq_Snakemake

A snakemake pipeline for the analysis of RNA-seq data that makes use of hisat2 and DESeq2 .

Aim

To align, count, normalize counts and compute gene differential expressions between conditions using paired-end Illumina RNA-seq data.

Description

This pipeline analyses the raw RNA-seq data and produces a file containing normalized counts, differential expression, numbers of the clusters the genes have been assigned to and functions of transcripts. The raw fastq files will be trimmed for adaptors and quality checked with trimmomatic. Next, the necessary genome sequence fastas will be downloaded to be used for the mapping of the trimmed reads using hisat2. With stringtie and a downloaded reference annotation a new annotation will be created. This new annotation will be used to obtain the raw counts and do a local blast to a reference transcriptome fasta containing predicted functions. The counts are normalized and differential expressions are calculated using DESeq2. significantly expressed genes are used to create heatmaps and plots both in total and clustered. finally the DESeq2, blast and clustering data are combined to get the final results table.

Prerequisites: what you should be able to do before using this Snakemake pipeline

Some command of the Unix Shell to connect to a remote server where you will execute the pipeline (e.g. SURF Lisa Cluster). You can find a good tutorial from the Software Carpentry Foundation here and another one from Berlin Bioinformatics here .
Some command of the Unix Shell to transfer datasets to and from a remote server (to transfer sequencing files and retrieve the results/). The Berlin Bioinformatics Unix begginer guide available [here] should be sufficient for that (check the wget and scp commands).
An understanding of the steps of a canonical RNA-Seq analysis (trimming, alignment, etc.). You can find some info here .

Content

Snakefile : a master file that contains the desired outputs and the rules to generate them from the input files.
config.yaml : the configuration files making the Snakefile adaptable to any input files, genome and parameter for the rules.
data/ : a folder containing samples.txt (sample descriptions) and subsetted paired-end fastq files used to test locally the pipeline. Generated using Seqtk : seqtk sample -s100 {inputfile(can be gzipped)} 250000 > {output(always gunzipped)} This folder should contain the fastq of the paired-end RNA-seq data, you want to run.
envs/ : a folder containing the environments needed for the conda package manager. If run with the --use-conda command, Snakemake will install the necessary softwares and packages using the conda environment files.
samples.tsv : a file containing information about the names, the paths and the conditions of the samples used as input for the pipeline. This file has to be adapted to your sample names before running the pipeline .

Usage

Download or clone the Github repository

You will need a local copy of the Snakemake_hisat-DESeq on your machine. You can either:

use git in the shell: git clone [email protected]:KoesGroup/Snakemake_hisat-DESeq.git
click on "Clone or download" and select download

Installing and activating a virtual environment

First, you need to create an environment where Snakemake and the python pandas package will be installed. To do that, we will use the conda package manager.

Create a virtual environment named rnaseq using the global_env.yaml file with the following command: conda env create --name rnaseq --file envs/global_env.yaml Then, activate this virtual environment with source activate rnaseq

The Snakefile will then take care of installing and loading the packages and softwares required by each step of the pipeline.

Configuration file

Make sure you have changed the parameters in the config.yaml file that specifies where to find the sample data file, the genomic and transcriptomic reference fasta files to use and the parameters for certains rules etc.
This file is used so the Snakefile does not need to be changed when locations or parameters need to be changed.

Snakemake execution

The Snakemake pipeline/workflow management system reads a master file (often called Snakefile ) to list the steps to be executed and defining their order. It has many rich features. Read more here .

Dry run

From the folder containing the Snakefile , use the command snakemake --use-conda -np to perform a dry run that prints out the rules and commands.

Real run

Simply type Snakemake --use-conda and provide the number of cores with --cores 10 for ten cores for instance.
For cluster execution, please refer to the Snakemake reference .

Main outputs

the RNA-Seq read alignment files *.bam
the fastqc report files *.html
the unscaled RNA-Seq read counts: counts.txt
the differential expression file results.tsv
the combined results file final.txt

Directed Acyclic Graph of jobs

dag

Code Snippets

library(DESeq2)
library(optparse)

# arguments to provide
option_list = list(
  make_option(c("-c", "--counts"), type="character", default="results/counts.txt", help="counts tabulated file from Feature Counts", metavar="character"),
  make_option(c("-s", "--samplefile"), type="character", default="data/samples2.tsv", help="sample files used to get conditions for DESEq2 model fit", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default="results/deseqNew.csv", help="where to place differential expression files", metavar="character"),
  make_option(c("-f", "--helperfile"), type="character", default="results/helperfile.csv", help="helper file needed for the creation of the clustering and the heatmaps", metavar="character"),
  make_option(c("-m", "--maxfraction"), type="double", default=1.0, help="maximum fraction of the total number of genes to be allowed to be differential between two conditions to be included (number between 0 and 1)", metavar="double")
) 

# parse the command-line arguments and pass them to a list called 'opt'
opt_parser = OptionParser(option_list=option_list);
opt = parse_args(opt_parser);

######## Import and wrangle counts produced by FeatureCounts #######
countdata <- read.table(opt$counts, header=TRUE, skip=1, row.names=1,stringsAsFactors = F, check.names = F)
countdata <- countdata[ ,6:ncol(countdata)] # Remove first five columns (chr, start, end, strand, length)

# Remove .bam or .sam from filenames
colnames(countdata) <- gsub("\\.[sb]am$", "", colnames(countdata))

# Convert to matrix
countdata <- as.matrix(countdata)

######### Extract experimental conditions from the file specifying the samples
samplefile = read.delim(file = opt$samplefile,header = T,stringsAsFactors = F)
col.names = colnames(samplefile)                                                # extract all column names
columns.to.discard = c("sample","fq1","fq2")                                    # column we don't need to extract all conditions
colsForConditions = col.names[! col.names %in% columns.to.discard]                     # only keeping the column of interest

# one condition
if (length(colsForConditions) == 1){
  condition <- factor(samplefile[,colsForConditions])
  # two conditions
} else if (length(colsForConditions == 2)){
  # two conditions --> make a third column that is the product of the two
  samplefile$conditions = paste(samplefile[,colsForConditions[1]],samplefile[,colsForConditions[2]],sep = ".")
  condition <- factor(x = samplefile[,"conditions"],levels = unique(samplefile[,"conditions"]))
} else if (length(conditions > 2)){
  print("too many conditions to compare, skipping")
}

# create helper file for downstream use in the pipeline.
helperFile <- as.data.frame(condition)
rownames(helperFile) <- samplefile$sample
colnames(helperFile) <- NULL
print(helperFile)
write.csv(helperFile, file = opt$helperfile, quote = F, col.names = F, row.names = T)

# Analysis with DESeq2 ----------------------------------------------------
# Create a coldata frame and instantiate the DESeqDataSet. See ?DESeqDataSetFromMatrix
coldata <- data.frame(row.names=colnames(countdata), condition)
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~condition)

# Run the DESeq pipeline
dds <- DESeq(dds)

# create dataframe containing normalized counts, to which the differential expression values will be added
resdata <- as.data.frame(counts(dds, normalized=TRUE))

## function for Volcano plot with "significant" genes labeled
volcanoplot <- function (res, lfcthresh=2, sigthresh=0.05, main="Volcano Plot", legendpos="bottomright", labelsig=TRUE, textcx=1, ...) {
  with(res, plot(log2FoldChange, -log10(padj), pch=20, main=main, ...))
  with(subset(res, padj<sigthresh ), points(log2FoldChange, -log10(padj), pch=20, col="red", ...))
  with(subset(res, abs(log2FoldChange)>lfcthresh), points(log2FoldChange, -log10(padj), pch=20, col="orange", ...))
  with(subset(res, padj<sigthresh & abs(log2FoldChange)>lfcthresh), points(log2FoldChange, -log10(padj), pch=20, col="green", ...))
  if (labelsig) {
    require(calibrate)
    with(subset(res, padj<sigthresh & abs(log2FoldChange)>lfcthresh), textxy(log2FoldChange, -log10(padj), labs=colnames(res), cex=textcx, ...))
  }
  legend(legendpos, xjust=1, yjust=1, legend=c(paste("FDR<",sigthresh,sep=""), paste("|LogFC|>",lfcthresh,sep=""), "both"), pch=20, col=c("red","orange","green"))
}

# open file that will be containing a vulcano plot for each of the combinations of conditions
#pdf(file="vulcanoplots.pdf")

# iterate through the list of conditions to create differential expression (DE) values for all possible pairs
x <- 1
for(i in levels(condition)){
  x <- x + 1
  if (x <= length(levels(condition))){
    for(j in x:length(levels(condition))){
      res <- results(dds, contrast=c("condition", i,  levels(condition)[j]))                      # i = first in pair, levels(condition)[j] is the second in pair.
      d <- paste(i, levels(condition)[j], sep="&")                                                # paste the two conditions in one character, to be used as the pair name
      resP <- as.data.frame(table(res$padj<0.05))                                                 # get number of DE values with P-value < 0.05
      if(resP[1,2]< opt$maxfraction*nrow(resdata)){                                               # only continue with the pair if it is less then the maximum fraction(set by user in commandline) differentially expressed 
        #volcanoplot(res, lfcthresh=1, sigthresh=0.05, textcx=.8, xlim=c(-10, 10), main=d)
        print(c(d,"Number of differentials is within accepted limit"))
        colnames(res) = paste(d,c(colnames(res)),sep = "-")                                       # paste the pair name to the column name
        resdata <- merge(as.data.frame(resdata), as.data.frame(res), by="row.names", sort=FALSE)  # merge the DE values to the matrix
        rownames(resdata) <- resdata$Row.names                                                    # redifine rownames as they have disapeared?? in the merging??
        resdata$Row.names <- NULL                                                                 # delete column containing the rownames, to avoid the same colname in the next iteration
      }
      else{
        print(c(d,"More differentials then allowed"))
      }
    }
  }
}
#dev.off()
# create first column containing the genenames.
resdata$genes <- rownames(resdata)
resdata <- merge(as.data.frame(resdata["genes"]), as.data.frame(resdata[,2:ncol(resdata)-1]), by="row.names", sort=FALSE)
resdata$Row.names <- NULL

# write the data to a file.
write.table(resdata, file=opt$outdir,sep = "\t",quote=F,row.names=F)

R DESeq2 optparse calibrate From line 2 of scripts/DESeq2.R

from optparse import OptionParser 
import glob
import os

# get the command line arguments
parser = OptionParser(description="Script that combines the normalized and differential expression data as outputted by DESeq2.R, the clusters as outputted by plotscript.R and the results from the blast against the reference proteome. Resulting in a combined tab-delimeted data file")
parser.add_option('-r', '--result_file', 
                    type=str,
                    default="results/result.csv",
                    metavar="",
                    help = "path and name of of the input file, being the output file of the R-script DESeq2.R, default = results/result.csv")
parser.add_option('-a', '--anno_file', 
                    type=str,
                    default="annotations/Petunia_axillaris_v1.6.2_proteins.mapman_annotation.txt",
                    metavar="",
                    help = "path and name of the stringtie transcriptome fasta file, a file containing the fastas of the de novo generated stringtie transcriptome, default = result/helperFile.csv")
parser.add_option('-c', '--cluster_file', 
                    type=str,
                    default="results/clusters.txt",
                    metavar="",
                    help = "path and name of the clusters file, a tab delimited file containing cluster numbers and depending on the method of clustering correlation values or membership values., default = result/helperFile.csv")
parser.add_option('-o', '--output_file', 
                    type=str,
                    default="results/final.txt",
                    metavar="",
                    help = "path and name of the output file. A tab delimimited file containing normalised reads, differential expressions and (adjusted)p-values, number of clusters the genes partisioned to and the hypothetical function as found by a blast. default = result/final.txt")
parser.add_option('-p', '--working_dir', 
                    type=str,
                    default="temp/mapped/",
                    metavar="",
                    help = "path to bam files, to be removed from the sample names in the header of the final output file. default = temp/mapped/")


(options, args) = parser.parse_args()

clusts = open(options.cluster_file, "r")
clusters = {}
for line in clusts:
    line = line.rstrip()
    line = line.split("\t")
    if len(line) == 2:
        line.append("NaN")
    clusters[line[0].lower()] = "\t".join(line[1:])

clusts.close()

# if there is an annotation file available add annotaions.
if options.anno_file in glob.glob(options.anno_file):
    if glob.glob(options.anno_file)[0].endswith(".gz"):
        os.system(f"gunzip {options.anno_file}")
        annoFile = ".".join(options.anno_file.split(".")[:-1])
    else:
        annoFile = options.anno_file
    fa = open(annoFile, "r")
    annos = {}
    count = 0
    for l in fa:
        l = l.split("\t")
        l = [x.strip("'") for x in l]
        if len(l)>2:
            if len(l[2])>1:
                if l[2].lower() not in annos:
                    annos[l[2].lower()] = ".".join(l[1].split(".")[:3]).replace(" ", "_")
                else:
                    annos[l[2].lower()] += (";" + ".".join(l[1].split(".")[:3]).replace(" ", "_"))
    fa.close()

    inFile  = open(options.result_file, "r")
    uitFile = open(options.output_file, "w")
    for l in inFile:
        l = l.rstrip().split("\t")
        if "genes" in l[0]:
            l = [x.replace(options.working_dir, "") for x in l]    # remove path from sample names
            l.append("\t".join([clusters["gene"], "annotation"]))
        elif l[0].lower() in clusters:
            if l[0].lower() in annos:
                l.append(clusters[l[0].lower()] + "\t" + annos[l[0].lower()])
            else:
                l.append(clusters[l[0].lower()] + "\tno annotation available.")
        elif l[0].lower() in annos:
            l.append("NaN\tNaN\t" + annos[l[0].lower()])
        else:
            l.append("NaN\tNaN\tno annotation available.")
        uitFile.write("\t".join(l)+"\n")
    inFile.close()
    uitFile.close()

else:
    inFile  = open(options.result_file, "r")
    uitFile = open(options.output_file, "w")
    for l in inFile:
        l = l.rstrip().split("\t")
        if "genes" in l[0]:
            l = [x.replace(options.working_dir, "") for x in l]    # remove path from sample names
            l.append("\t".join([clusters["gene"]]))
        elif l[0].lower() in clusters:
            l.append(clusters[l[0].lower()])
        else:
            l.append("NaN\tNaN")
        uitFile.write("\t".join(l)+"\n")
    inFile.close()
    uitFile.close()

Python BLAST StringTie From line 1 of scripts/DE_with_Function.py

import numpy as np 
from optparse import OptionParser 

# get the command line arguments
parser = OptionParser(description="python script to create pvalue and log2(foldchange) filtered ")
parser.add_option('-i', '--input_file', 
                    type=str,
                    default="results/plotSelection.txt",
                    metavar="",
                    help = "path and name of of the input file, being the output file of the R-script DESeq2.R, default = results/plotSelection.txt")
parser.add_option('-f', '--helper_file', 
                    type=str,
                    default="result/helperFile.csv",
                    metavar="",
                    help = "path and name of the helper file, a tab delimited file containing one column samples and a column conditions, default = result/helperFile.csv")
parser.add_option('-o', '--output_file', 
                    type=str,
                    default="result/plotSelection.txt",
                    metavar="",
                    help = "path and name of the output file a tab delimimited file containing normalised reads of significantly differentially expressed genes of all the samples or averaged to the conditions, default = result/plotSelection.txt")
parser.add_option('-v', '--minimum_foldchange', 
                    type=float,
                    default=2,
                    metavar="",
                    help = "minimum log2(fold_change) in any combination of conditions; integer > 1, default = 2")
parser.add_option('-r', '--minimum_reads', 
                    type=int,
                    default=100,
                    metavar="", 
                    help = "minimum number of reads of all samples together; integer >= 0, default = 100")
parser.add_option('-p', '--maximum_pvalue', 
                    type=float,
                    default=0.05,
                    metavar="",
                    help = "maximum exepted adjusted pvalue; 0.0 to 1.0, default = 0.05")
parser.add_option('-a', '--average_samples', 
                    type=str,
                    default="yes",
                    metavar="",
                    help = "output needs to contain averages of conditions; yes or no, default = yes")

(options, args) = parser.parse_args()


# get a list of the conditions 
helper = open(options.helper_file, "r")
samples = {}
conditions = []
x=0
for l in helper:
    l = l.rstrip()
    if len(l) > 0:
        x += 1
        l = l.split(",")
        name = l[1]
        samples[x] = name
        if name not in conditions:
            conditions.append(name)
helper.close()

# function to average the samples within a condition
def getAverages(counts):
    global samples, conditions
    sets = {}
    averages = [counts[0]]
    for i in samples:
        if samples[i] in sets:
            sets[samples[i]].append(float(counts[i]))
        else:
            sets[samples[i]] = [float(counts[i])]
    for c in conditions:
        averages.append(str(np.mean(sets[c])))
    return(averages)

inputData = open(options.input_file, "r")
outAll    = open(options.output_file, "w")
totalBoth = 0
count     = 0
min_reads = options.minimum_reads
foldchange= options.minimum_foldchange
pvalue    = options.maximum_pvalue
avarage   = options.average_samples.upper()

for l in inputData:
    count += 1
    if l.startswith("genes") == False and len(l) > 5:
        l = l.replace("NA", "1")         # replace NA values in P-value to 1
        l = l.replace("#VALUE!", "0")    # replace NA values in foldchange to 0
        f = l.rstrip()                   # remove white space and line break at the end.
        f = f.split("\t")                # split string to list on tab
        if len(f) > x+1:
            fc = [float(f[i]) for i in range(x+2, len(f), 6)]       # list of faultchanges
            pv = [float(f[i]) for i in range(x+6, len(f), 6)]       # list of adjusted p-values
            counts = [float(f[i]) for i in range(1,x+1)]
            if sum(counts) > min_reads:             # total number of reads more then ~20 per sample
                for a1, a2 in zip(fc,pv):
                    if a1 > foldchange and a2 < pvalue and a2 == min(pv):
                        if avarage == "YES":
                            line = "\t".join(getAverages(f[:x+1]))
                        else:
                            line = "\t".join(f[:x+1])
                        outAll.write(line+"\n")
                        totalBoth += 1
                        break
                    elif a1 < -1*foldchange and a2 < pvalue and a2 == min(pv):
                        if avarage == "YES":
                            line = "\t".join(getAverages(f[:x+1]))
                        else:
                            line = "\t".join(f[:x+1])
                        outAll.write(line+"\n")
                        totalBoth += 1
                        break
        else:
            print("No fold changes and p-values were found. Try rerunning the DESeq2.R with a higher maximum fraction (to be adjusted in config.yaml)")
    else:
        if len(l) > 10:
            if avarage == "YES":
                line = "\t".join(conditions)
            else:
                line = "\t".join(l.split("\t")[1:x+1])
            outAll.write(l.split("\t")[0] + "\t" + line + "\n")

print(f"Total number of genes is: {str(count-1)}.")
print(f"Number of genes kept for plots is: {str(totalBoth)}.")

inputData.close() 
outAll.close()

Python numpy From line 1 of scripts/filterForPlots.py

library(optparse)

option_list = list(
  make_option(c("-i", "--input_file"), type="character", default="results/counts.txt", help="normalized and filtered counts tabulated file from filter_for_plots.py", metavar="character"),
  make_option(c("-m", "--method_of_clustering"), type="character", default="hierarchical", help="method of clustering chose between: hierarchical, 
                kmean or fuzzy_kmean
                default is hierarchical", metavar="character"),
  make_option(c("-n", "--opt_clust_number"), type="character", default="dynamic", help="determination of optimal number of clusters, 
                If --method_of_clustering=hierarchical the options are: fixed, height or dynamic.
                If--method_of_clustering=kmean or --method_of_clustering=fuzzy_kmean the options
                are: fixed, average_silhouette_width, calinsky_criterion, 
                gap_statistic or affinity_propogation.
                default is dynamic.", metavar="character"),
  make_option(c("-k", "--number_of_clusters"), type="integer", default=5, help="number of clusters to be used. choose an integer of two or more. 
                To be used if --det_hclust_number=fixed or --det_kclust_number=fixed.
                default is 5", metavar="integer"),
  make_option(c("-H", "--height_in_dendrogram"), type="double", default=1.5, help="height in dendrogram to use for determination of cluster number.
               default is 1.5", metavar="double"),
  make_option(c("-q", "--membership_min"), type="double", default=0.2, help="minimal membership value of fuzzy kmean cluster to be included in 
                the cluster. should be a value between 0.0 and 1.0. Only to 
                be used in combination with -m = fuzzy_kmean. Best to first run with a low value,
	        and adjust based on results if needed.
                default is o.2", metavar="double"),
  make_option(c("-r", "--correlation_min"), type="double", default=0.5, help="minimal correlation value to a kmean cluster to be included in 
                the cluster, if too low it will be excluded from all clusters!. should be a value
                between 0.0 and 1.0. Only to be used in combination with -m = kmean or -m = pam. 
                Best to first run with a low value, and adjust based on results if needed.
                default is 0.5", metavar="double"),
  make_option(c("-c", "--colour_of_heatmaps"), type="character", default=c("white","green","green4","violet","purple"), help="colors of the heatmaps. needs to be a list of two or more colors. 
                To see a list of all 657 colour names in R: colors()
                default is the colourblind friendly: 
                c(\"white\",\"green\",\"green4\",\"violet\",\"purple\")", metavar="character"),
  make_option(c("-p", "--plots_output_file"), type="character", default="results/plots.pdf", help="name of multipage pdf file where to output all the plots.", metavar="character"),
  make_option(c("-o", "--clusters_output_file"), type="character", default="results/plots.pdf", help="name of multipage pdf file where to output all the plots.", metavar="character")
  )
# parse the command-line arguments and pass them to a list called 'opt'
opt_parser = OptionParser(option_list=option_list , add_help_option = TRUE, description = "\nThis script produces multiple plots and clusters from a RNAseq
output from the filter_for_plots.py. 
plots include dendrograms of the samples and the genes, a heatmap, an elbow plot,
a heatmap with clusters specicied by colorbar, and finally plots and heatmaps of the
different clusters.
Because the 'correct' method for clustering RNAseq data is a matter of perspective; 
it is the one that allows the researcher to make the most out of her data. Gene expression
data is also full of noise which can make clustering tricky when using algorithms optimised
for chunkier data. With that in mind it's good to try several methods and compare them.
So for that multiple ways of clusting can be choosen also the number of clusters can be manually
choosen or calculated by a number of given methods. When making use of the snakemake pipeline,
it is essential to remove or rename the output file (plots.pdf), change the arguments you want to 
change in the config.yaml, and rerun the pipeline with 'snakemake --use-conda' (to be sure only
the last script reruns, start of with 'snakemake -np').", epilogue = "Pease feel free to mail me, [email protected], if you would like more info or have suggestions for optimization.\n\n");
opt = parse_args(opt_parser);

i <- opt$input_file
m <- opt$method_of_clustering
n <- opt$opt_clust_number
k <- opt$number_of_clusters
h <- opt$height_in_dendrogram
c <- opt$colour_of_heatmaps
q <- opt$membership_min
r <- opt$correlation_min
p <- opt$plots_output_file
o <- opt$clusters_output_file

######################################################################
##############          import needed packages         ###############
######################################################################


library(gplots)
library(dendextend)
library(dynamicTreeCut)
#library("pheatmap")
library(cluster)
library(vegan)
library(apcluster)
#library(ggplot2)
library(reshape2)
library(RColorBrewer)
library(dplyr)
library(colorRamps)
library(e1071)
library(tidyr)
library(ggplot2)

######################################################################
###############              read the data           #################
######################################################################

data <- read.delim(i, header=T, row.names="genes")
z <- as.matrix((data))

# scaling the data to avoid clustering based on expression level
scaledata <- t(scale(t(z))) # Centers and scales data.
scaledata <- scaledata[complete.cases(scaledata),]

# open pdf for the plots
pdf(file=p)

#mean/variance calculations and plot
z_var <- apply(z, 1, var)
z_mean <- apply(z, 1, mean)
plot(log2(z_mean), log2(z_var), pch='.')


##############################################################
############    hierarchical dendrogram      #################
##############################################################

# dendrogram of the samples
hc <- hclust(as.dist(1-cor(scaledata, method="spearman")), method="complete") # Clusters columns by Spearman correlation.
TreeC = as.dendrogram(hc, method="average")
plot(TreeC,
     main = "Sample Clustering",
     ylab = "Height")

# dendrogram of the genes
hr <- hclust(as.dist(1-cor(t(scaledata), method="pearson")), method="complete") # Cluster rows by Pearson correlation.
TreeR = as.dendrogram(hr, method="average")
plot(TreeR,
     leaflab = "none",
     main = "Gene Clustering",
     ylab = "Height")

#############################################################
##############        create Heatmap       ##################
#############################################################

# colors of the heatmap
my_palette <- colorRampPalette(c)(100)
#(c("magenta", "black", "green"))(n = 299)

# plot the heatmap
heatmap.2(z,
          Rowv=as.dendrogram(hr), 
          Colv=NA,
          col=my_palette,
          scale="row",
          margins = c(7, 7),
          cexCol = 0.7,
          labRow = F,
          dendrogram = "row",
          main = "Heatmap.2",
          trace = "none")


#################################################################
#######   Elbow plot ( the sum of squared error (SSE))    #######
#################################################################

wss <- (nrow(scaledata)-1)*sum(apply(scaledata,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(scaledata,
                                     centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")


##############################################################
##############        Do the clustering        ###############
##############################################################



if(m == "kmean" || m == "fuzzy_kmean" || m == "pam"){
  if(n == "average_silhouette_width"){
    # method 1: average silhouette width
    sil <- rep(0, 20)
    #repeat k-means for 1:20 and extract silhouette:
    for(i in 2:20){
      k1to20 <- kmeans(scaledata, centers = i, nstart = 25, iter.max = 20)
      ss <- silhouette(k1to20$cluster, dist(scaledata))
      sil[i] <- mean(ss[, 3])
    }
    # Plot the  average silhouette width
    plot(1:20, sil, type = "b", pch = 19, xlab = "Number of clusters k", ylab="Average silhouette width")
    abline(v = which.max(sil), lty = 2)

    k <- which.max(sil)
  }
  else if(n == "calinsky_criterion"){
    # method 2: Calinsky criterion
    fit <- cascadeKM(scaledata, 1, 20, iter = 100)
    plot(fit, sortg = TRUE, grpmts.plot = TRUE)
    calinski.best <- as.numeric(which.max(fit$results[2,]))
    cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")

    k <- calinski.best
  }
  else if(n == "gap_statistic"){
    # method 3: Gap statistic
    set.seed(13)
    gap <- clusGap(scaledata, kmeans, 20, B = 100, verbose = interactive())
    plot(gap, main = "Gap statistic")
    abline(v=which.max(gap$Tab[,3]), lty = 2)

    k <- which.max(gap$Tab[,3])
  }
  else if(n == "affinity_propogation"){
    # method 4: Affinity propogation
    d.apclus <- apcluster(negDistMat(r=2), scaledata)
    cat("affinity propogation optimal number of clusters:", length(d.apclus@clusters), "\n")
    #uncomment the next line for the heatmap, it takes a long time to run
    #heatmap(d.apclus,cexRow=0, cexCol=0)

    k <- length(d.apclus@clusters)
  }
  else if(n == "fixed"){
    k <- k
  }else{
    print("your choosen method of determining the optimal number of clusters is unclear.")
    print("When using --method_of_clustering=kmean or --method_of_clustering=fuzzy_kmean ")
    print("the options are:fixed, average_silhouette_width, calinsky_criterion,")
    print("gap_statistic or affinity_propogation.")
  }
  print(paste0("Dataset is clustered in ",k, " clusters "))
  if(m == "kmean"){
    set.seed(20)
    clusts <- kmeans(scaledata, centers=k, nstart = 1000, iter.max = 20)
    clusts <- clusts$cluster
  }
  else if(m == "fuzzy_kmean"){
    mestimate<- function(df){
      N <-  dim(df)[[1]]
      D <- dim(df)[[2]]
      m.sj <- 1 + (1418/N + 22.05)*D^(-2) + (12.33/N +0.243)*D^(-0.0406*log(N) - 0.1134)
      return(m.sj)
    }
    mes <- mestimate(scaledata)
    fcm_results <- cmeans(scaledata,centers=k,m=mes)
    fcm_plotting_df <- data.frame(scaledata)
    fcm_plotting_df$gene <- row.names(fcm_plotting_df)
    fcm_plotting_df$cluster <- fcm_results$cluster
    fcm_plotting_df$membership <- sapply(1:length(fcm_plotting_df$cluster),function(row){
      clust <- fcm_plotting_df$cluster[row]
      fcm_results$membership[row,clust]
    })
    clusts <- fcm_plotting_df$cluster
    names(clusts) <- fcm_plotting_df$gene
    # get the core data
    fcm_centroids <- fcm_results$centers
    fcm_centroids_df <- data.frame(fcm_centroids)
    fcm_centroids_df$cluster <- row.names(fcm_centroids_df)
    centroids_long <- tidyr::gather(fcm_centroids_df,"sample",'value', 1:ncol(scaledata))
    #start with the input data
    fcm_plotting_df <- data.frame(scaledata)
    #add genes
    fcm_plotting_df$gene <- row.names(fcm_plotting_df)
    #bind cluster assinment
    fcm_plotting_df$cluster <- fcm_results$cluster
    #fetch the membership for each gene/top scoring cluster
    fcm_plotting_df$membership <- sapply(1:length(fcm_plotting_df$cluster),function(row){
      clust <- fcm_plotting_df$cluster[row]
      fcm_results$membership[row,clust]
    })
    # filter out genes don't really belong to any of the clusters
    selection <- fcm_plotting_df %>% filter(membership > q)
  }
  else if(m == "pam"){
    pam.Pclusts <- pam(scaledata, k=k)
    clusts <- pam.Pclusts$'clustering'
  }
}else if(opt$method_of_clustering == "hierarchical"){
  if(n == "dynamic"){
    clusts <- cutreeDynamic(hr, distM = as.matrix(as.dist(1-cor(t(scaledata)))), method = "hybrid")
    names(clusts) <- rownames(scaledata)
    clust <- as.data.frame(clusts)
    colnames(clust) <- "cluster"
    k = length(unique(clusts))
    print(paste0("Dataset is clustered in ",k, " clusters "))
  }
  else if(n == "fixed"){
    clusts = cutree(hr, k=k)
    k = length(unique(clusts))
    print(paste0("Dataset is clustered in ",k, " clusters "))
  }
  else if(n == "height"){
    clusts = cutree(hr, h=h)
    k = length(unique(clusts))
    print(paste0("Dataset is clustered in ",k, " clusters "))
  }
  else{
    print("your choosen method of determining the optimal number of clusters is unclear.\
          When using --method_of_clustering=hierarchical the options are: fixed, height or dynamic.")
  } 
  }else{
    print("your choosen method of clustering is unclear. 
          It should be one of the following: hierarchical, kmean or fuzzy_kmean")
}


###################################################################
############        Heatmap with clusterbar          ##############
###################################################################

mycolhc <- rainbow(length(unique(clusts)), start=0.1, end=0.9)
mycolhc <- mycolhc[as.vector(clusts)]

hr <- hclust(as.dist(1-cor(t(scaledata), method="pearson")), method="complete")

heatmap.2(z,
          Rowv=as.dendrogram(hr), 
          Colv=NA,
          col=my_palette,
          scale="row",
          margins = c(7, 7),
          cexCol = 0.7,
          labRow = F,
          dendrogram = "row",
          main = paste0("Heatmap.2 with colour bar indicating the clusters (",k,")"),
          trace = "none",
          RowSideColors=mycolhc,
          key = FALSE)


###############################################################################
########   calculate the cluster 'cores' aka centroids and plot them   ########
###############################################################################

# get the cluster centroids (the average profile of expression for each of the clusters)
clust.centroid = function(i, dat, clusters) {
  ind = (clusters == i)
  colMeans(dat[ind,])
}
clusUniq <- unique(clusts)
kClustcentroids <- sapply(levels(factor(clusts)), clust.centroid, scaledata, clusts)
head(kClustcentroids)

Kmolten <- melt(kClustcentroids)
colnames(Kmolten) <- c('sample','cluster','value')

# In case of use of pam, the cluster centroids wil be the medoids (the expression profiles of the genes that fit the clusters best)
if (m == "pam"){
  pClustcentroids <- t(pam.Pclusts$"medoids")
  Kmolten <- melt(pClustcentroids)
  colnames(Kmolten) <- c('sample','cluster','value')
} 

p1 <- ggplot(Kmolten, aes(x=sample,y=value, group=cluster, colour=as.factor(cluster))) + 
  geom_point() + 
  geom_line() +
  xlab("Time") +
  ylab("Expression") +
  labs(title= "Cluster Expression of the samples",color = "Cluster")
p1

# in case of pam, change names of the clusters to numbers in stead of the medoids gene names
if (m == "pam"){
  colnames(pClustcentroids) <- clusUniq
  Kmolten <- melt(pClustcentroids)
  colnames(Kmolten) <- c('sample','cluster','value')
}
###########################################################################
#############      heatmaps and plots of the clusters      ################
###########################################################################

# open list to collect the correlation values in case of kmean, fuzzy_kmean and pam
cors <- c()

# loop though the clusters and create expression profile plots and heatmap for each of them
if(m == "fuzzy_kmean"){
  for(i in 1:k){
    print(i)
    cluster_plot_df <- dplyr::filter(selection, cluster == i) %>%
      dplyr::select(.,1:ncol(scaledata),membership,gene) %>%
      tidyr::gather(.,"sample",'value',1:ncol(scaledata)) 
    #order the dataframe by score
    cluster_plot_df <- cluster_plot_df[order(cluster_plot_df$membership),]
    #set the order by setting the factors using forcats
    cluster_plot_df$gene = forcats::fct_inorder(cluster_plot_df$gene)

    #subset the cores by cluster
    core <- dplyr::filter(centroids_long, cluster == i)

    g1 <-  ggplot(cluster_plot_df, aes(x=sample,y=value)) + 
      geom_line(aes(colour=membership, group=gene)) +
      scale_colour_gradientn(colours=c('blue1','red2')) +
      #this adds the core 
      geom_line(data=core, aes(sample,value, group=cluster), color="black",inherit.aes=FALSE) +
      geom_point(data=core, aes(sample,value, group=cluster), color="black",inherit.aes=FALSE) +
      xlab("Sample") +
      ylab("Expression") +
      labs(title= paste0("Cluster ",i," Expression of the Samples"),color = "Score")
    print(g1)

    group <- selection %>% filter(selection$cluster==i)
    rownames(group) <- group$gene
    group <- as.matrix(group[,1:ncol(scaledata)])
    if(nrow(group) > 1){
      hrc <- hclust(as.dist(1-cor(t(group), method="pearson")), method="complete")
      #pheatmap(group, color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(length(breaksList)), breaks = breaksList, cluster_cols = F)  #cellheight = 8
      heatmap.2(group,
                Rowv=as.dendrogram(hrc), 
                Colv=NA,
                col= magenta2green(100),
                labRow=rownames(group),
                scale="row",
                margins = c(7, 7),
                cexCol = 0.7,
                dendrogram = "row",
                main = paste0("Cluster ",i, " consisting ", nrow(group), " genes"),
                trace = "none",key = F)
    }
  }
}else if(m == "kmean" || m == "hierarchical" || m == "pam"){
  for(i in clusUniq){
    K2 <- (scaledata[clusts==i,])
    if(nrow(K2) > 1){
      #calculate the correlation with the core
      core <- Kmolten[Kmolten$cluster==i,]
      corscore <- function(x){cor(x,core$value)}
      score <- apply(K2, 1, corscore)
      # add the correlations to the list
      cors <- c(cors, score)
      #get the data frame into long format for plotting
      K2molten <- melt(K2)
      colnames(K2molten) <- c('gene','sample','value')
      #add the score
      K2molten <- merge(K2molten,score, by.x='gene',by.y='row.names', all.x=T)
      colnames(K2molten) <- c('gene','sample','value','score')
      #order the dataframe by score
      #to do this first create an ordering factor
      K2molten$order_factor <- 1:length(K2molten$gene)
      #order the dataframe by score
      K2molten <- K2molten[order(K2molten$score),]
      #set the order by setting the factors
      K2molten$order_factor <- factor(K2molten$order_factor , levels = K2molten$order_factor)

      # Everything on the same plot
      if(m == "kmean" || m =="pam"){
        p2 <- ggplot(K2molten, aes(x=sample,y=value)) + 
          geom_line(aes(colour=score, group=gene)) +
          scale_colour_gradientn(colours=c('blue1','red2')) +
          #this adds the core 
          geom_line(data=core, aes(sample,value, group=cluster), color="black",inherit.aes=FALSE) +
          geom_point(data=core, aes(sample,value, group=cluster), color="black",inherit.aes=FALSE) +
          xlab("Samples") +
          ylab("Expression") +
          labs(title= paste0("Cluster ",i, " consisting ", nrow(K2), " genes"),color = "Score")
      }
      else if(m == "hierarchical"){
        p2 <- ggplot(K2molten, aes(x=sample,y=value)) + 
          geom_line(color="grey", aes(color="grey", group=gene)) +
          #this adds the core 
          geom_line(data=core, aes(sample,value, group=cluster), color="blue",inherit.aes=FALSE) +
          geom_point(data=core, aes(sample,value, group=cluster), color="blue",inherit.aes=FALSE) +
          xlab("Samples") +
          ylab("Expression") +
          labs(title= paste0("Cluster ",i, " consisting ", nrow(K2), " genes"),color = "Score")
      }
      print(p2)
      hrc <- hclust(as.dist(1-cor(t(K2), method="pearson")), method="complete")
      #pheatmap(group, color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(length(breaksList)), breaks = breaksList, cluster_cols = F)  #cellheight = 8
      heatmap.2(K2,
                Rowv=as.dendrogram(hrc), 
                Colv=NA,
                col= my_palette,
                labRow=rownames(K2),
                scale="row",
                margins = c(7, 7),
                cexCol = 0.7,
                dendrogram = "row",
                main = paste0("Cluster ",i, " consisting ", nrow(K2), " genes"),
                trace = "none",key = F)
    }
  }
}
dev.off() 

#########################################################################################
#############       create output file containing the clusters          #################
#########################################################################################

# create data frame containing cluster and if available correlation values or memberships
if(m == "fuzzy_kmean"){
  # fuzzy_kmeans
  lijst <- data.frame(selection$gene, selection$cluster, selection$membership )
  colnames(lijst) <- c("gene", "clusters", "membership")
}else if(m == "kmean" || m == "pam"){
  # kmean and pam
  lijst <- merge(as.data.frame(clusts), as.data.frame(cors), by="row.names", sort=FALSE)
  colnames(lijst) <- c("gene", "clusters", "correlation")
  lijst <- lijst %>% filter(correlation > r)
}else{
  # hierarchical
  lijst <- data.frame(rownames(as.data.frame(clusts)), clusts)
  colnames(lijst) <- c("gene", "clusters")
}

# Write the dataframe as a tab-delimited file
write.table(lijst, file=o, sep = "\t",quote=F,row.names=F)

R Snakemake ggplot2 dplyr tidyr optparse RColorBrewer reshape2 pheatmap gplots vegan e1071 cluster dendextend colorRamps apcluster dynamicTreeCut From line 6 of scripts/plotscript.R

shell:
    "wget -O {output} {genome_url}"

SnakeMake From line 97 of master/Snakefile

shell:
    "wget -O {output} {transcriptome_gtf_url}"

SnakeMake From line 107 of master/Snakefile

run:
    if sample_is_single_end(params.sampleName):
        shell("fastp --thread {threads}  --html {output.html} \
        --qualified_quality_phred {params.qualified_quality_phred} \
        --in1 {input} --out1 {output} \
        2> {log}; \
        touch {output.fq2}")
    else:
        shell("fastp --thread {threads}  --html {output.html} \
        --qualified_quality_phred {params.qualified_quality_phred} \
        --detect_adapter_for_pe \
        --in1 {input[0]} --in2 {input[1]} --out1 {output.fq1} --out2 {output.fq2}; \
        2> {log}")

SnakeMake fastp From line 129 of master/Snakefile

shell:
    "hisat2-build -p {threads} {input} {params} --quiet"

SnakeMake HISAT2 From line 156 of master/Snakefile

run:
    if sample_is_single_end(params.sampleName):
        shell("hisat2 -p {threads} --summary-file {output.sum} --met-file {output.met} -x {params.indexName} \
        -U {input} | samtools view -Sb -F 4 -o {output.bams}")
    else:
        shell("hisat2 -p {threads} --summary-file {output.sum} --met-file {output.met} -x {params.indexName} \
        -1 {input[0]} -2 {input[1]} | samtools view -Sb -F 4 -o {output.bams}")

SnakeMake SAMtools HISAT2 From line 175 of master/Snakefile

shell:
    "featureCounts -O -t mRNA -g ID -F 'gtf' -a {input.gff} -o {output} {input.bams}"

SnakeMake FeatureCounts From line 197 of master/Snakefile

shell:
    "Rscript scripts/DESeq2.R -c {input.counts} -s {input.samplefile} -o {output.result} -m {params.maxfraction} -f {output.helper}"

SnakeMake From line 218 of master/Snakefile

shell:
    "python scripts/DE_with_Function.py "
    "-a {params.annos} "
    "-c {input.clusts} "
    "-r {input.deseq} "
    "-o {output.final} "
    "-p {params.path}"

SnakeMake From line 232 of master/Snakefile

shell:
    "python scripts/filterForPlots.py "
    "-i {input.result} "
    "-f {input.helper} "
    "-o {output} "
    "-v {params.minimum_foldchange} "
    "-r {params.minimum_reads} "
    "-p {params.maximum_pvalue} "
    "-a {params.average_samples}"

SnakeMake From line 256 of master/Snakefile

shell:
    "Rscript scripts/plotscript.R "
    "-i {input} "
    "-m {params.method_of_clustering} "
    "-n {params.opt_clust_number} "
    "-k {params.number_of_clusters} "
    "-H {params.height_in_dendrogram} "
    "-q {params.membership_min} "