cluster_rnaseq: RNA-seq pipeline.



Introduction

cluster_rnaseq is a Snakemake pipeline that performs a comprehensive RNA-seq analysis, covering everything from the basic steps (QC, alignment, quantification) to the more advanced downstream analyses (differential expression).

The pipeline makes extensive use of Snakemake's integration with the conda package manager, to automatically take care of software requirements and dependencies.

We've built cluster_rnaseq for flexibility, with the aim of allowing the user to adjust the pipeline to different experiments through configuration parameters, including the choice of alignment and quantification tools. When in doubt, the default parameters offer a good starting configuration.

Workflow overview

This is a schema of the complete workflow. Certain parts may be skipped depending on input data and chosen configuration.

cluster_rnaseq workflow diagram

Authors

  • Daniel Cerdán-Vélez

  • María José Jiménez-Santos

  • Santiago García-Martín

Setup

The setup of the pipeline consists of modifying two configuration files, which set the desired parameters and the locations of the input/output files, plus two additional files: one indicating which samples enter the differential expression analysis (if it is to be carried out), and the other its design matrix. A general description of these files follows. See the Usage section for more details.

Configuration files

  • config.yaml : contains all pipeline parameters.

  • units.tsv : contains information on the samples to be analysed and their paths.

  • samples.tsv : indicates whether a sample is included in the differential expression analysis.

  • designmatrix.tsv : indicates the experimental conditions of each sample.

Input files

  • Raw data in gzip-compressed FASTQ files.

Usage

1. Set up the environment

cluster_rnaseq requires the conda package manager in order to work. Please install conda by following the bioconda installation instructions. In addition, it is essential to install Snakemake, following the steps in the Snakemake documentation.

2. Download cluster_rnaseq repository from GitHub.

Use the git clone command to create a local copy.

git clone https://github.com/cnio-bu/cluster_rnaseq.git

3. Configure the pipeline.

Before executing the pipeline, users must configure it according to their samples. To do this, they must fill in the following files:

TIP: different analyses can be run using just one cloned repository. This is achieved by changing the outdir and logdir in the configuration file. Different parameter values can also be used in the different analyses.

a. config.yaml

This is the pipeline configuration file, where you can tune all the available parameters to customise your RNA-seq analysis. It specifies the aligner and quantifier to be used, the necessary input and output file paths, and a resource management section for each rule.

The example file (template_config.yaml) features extensive inline documentation. Rename it to config.yaml to use it during your execution.
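
For orientation, the choices described above map to configuration keys roughly like the following sketch (key names here are illustrative, not authoritative; template_config.yaml documents the real structure):

aligner: star          # illustrative: star, hisat2 or salmon
quantifier: htseq      # illustrative: htseq or featureCounts
outdir: results/my_analysis
logdir: results/my_analysis/logs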

b. units.tsv

This file is used to configure the FASTQ input files.

An example file (template_units.tsv) is included in the repository. Rename it to units.tsv and edit its contents according to the following table:

Field name   Description
sample       Sample name (must match the sample name specified in samples.tsv).
lane         Lane identifier from which the samples will be concatenated.
fq1          Path to the FASTQ file for read 1.
fq2          Path to the FASTQ file for read 2.
md5_fq1      MD5 hash for the file in 'fq1'.
md5_fq2      MD5 hash for the file in 'fq2'.

The first three columns ("sample", "lane" and "fq1") are mandatory and define the sample name, lane and FASTQ path for each sample. The other columns are optional and should be filled in if the experiment is paired-end ("fq2") or if you have the MD5 hashes ("md5_fq1" and "md5_fq2").
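
For example, a paired-end sample sequenced on two lanes could be described as follows (paths and hashes are illustrative; columns are tab-separated):

sample   lane   fq1                             fq2                             md5_fq1      md5_fq2
sampleA  L001   data/sampleA_L001_R1.fastq.gz   data/sampleA_L001_R2.fastq.gz   0bcd1a23...  9f8e7d65...
sampleA  L002   data/sampleA_L002_R1.fastq.gz   data/sampleA_L002_R2.fastq.gz   1a2b3c4d...  5e6f7a8b...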

c. samples.tsv

This table contains the name of each sample and an indication of whether to include it in (diffexp = True) or exclude it from (diffexp = False) the differential expression analysis.

An example file (template_samples.tsv) is included in the repository. Rename it to samples.tsv and edit its contents.
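
For instance (contents are illustrative):

sample   diffexp
sampleA  True
sampleB  True
sampleC  False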

d. designmatrix.tsv

This is the table with the experimental conditions of each sample. The control condition must be preceded by an asterisk. Extra columns can be added as batch variables to correct for batch effects.

An example file (template_designmatrix.tsv) is included in the repository. Rename it to designmatrix.tsv and edit its contents before performing the differential expression.
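
For instance, for a two-condition experiment with an extra batch column (contents are illustrative; note the asterisk marking the control level):

sample   condition  batch
sampleA  *control   1
sampleB  treated    1
sampleC  treated    2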

4. Create the Conda environments.

To run the pipeline, the user first needs to create the conda environments, which will take a few minutes. This step is done automatically with the following command:

snakemake --use-conda --conda-create-envs-only --conda-frontend mamba

5. Run the pipeline.

Once the pipeline is configured and conda environments are created, the user just needs to run cluster_rnaseq.

snakemake --use-conda -j 32

The mandatory arguments are:

  • --use-conda : to install and use the conda environments.

  • -j : number of threads/jobs provided to snakemake.

Pipeline steps

Here you can find a general description of the main steps of the pipeline.

1. MD5 check and FASTQ quality control.

MD5 check

If MD5 hashes are included in units.tsv, the pipeline generates the hashes for your files and compares them with the ones in units.tsv, raising an error if they do not match.
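
The check is conceptually the same as computing the hash yourself and comparing it against the value recorded in units.tsv (the path below is illustrative):

md5sum data/sampleA_L001_R1.fastq.gz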

General QC

cluster_rnaseq implements FastQC to check the overall quality of the input FASTQ files.

Contamination

FastQ Screen can optionally be enabled in order to check the input FASTQ files for contaminants.

MultiQC report

cluster_rnaseq creates a quality control HTML report using MultiQC. This report includes information from FastQC, for both the original files and the complete samples after concatenation, as well as from the rest of the tools used in the execution.

2. Concatenation, downsampling & trimming.

Before starting the analysis, each sample's per-lane files are concatenated to obtain a single complete file per sample.

After this, the number of reads per file can optionally be standardized by downsampling the samples. The downsampling parameter of config.yaml must be True, and the desired number of reads for each sample must be indicated, as well as a seed to control the randomness of the analysis and allow it to be replicated.
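
In configuration terms this corresponds to something along these lines (key names are illustrative; see template_config.yaml for the real ones):

downsampling: True   # illustrative sketch, not the authoritative key names
reads: 20000000
seed: 42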

Finally, trimming is performed to remove the adapters used during sequencing. The bbduk tool, from the BBTools package, is used for this. By default, the file adapters.fa, which contains a set of common adapters, is passed to the tool.
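
Conceptually, the trimming step amounts to a bbduk call along these lines (the options shown are bbduk's standard adapter-trimming parameters, given for orientation only; the pipeline supplies its own values through the wrapper):

bbduk.sh in=sampleA.fastq.gz out=sampleA.trimmed.fastq.gz ref=adapters.fa ktrim=r k=23 mink=11 hdist=1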

3. Alignment.

There are three different methods to align the reads: STAR, Salmon or HISAT2. You can choose the desired method in config.yaml .

STAR and HISAT2 are aligners and require a parameter, defined in the configuration file:

  • The corresponding index path, whether it already exists or not

If the index does not exist, it can be generated through the pipeline (explained later) with:

  • An annotation file containing the features of interest (in GTF format, must match the target genome)

  • A FASTA file with the genome

STAR's alignment output is sorted by coordinate internally, which is necessary for the quantification. HISAT2's is not, so its output is processed with samtools sort to obtain an equivalent output.

Salmon is a pseudo-aligner that directly returns the counts for each of the samples, so none of the quantification tools in the next step will be used. It also requires a parameter, defined in the configuration file:

  • The corresponding index path, whether it already exists or not

If the index does not exist, it can be generated through the pipeline (explained later) with:

  • A FASTA file containing the reference transcriptome

  • The genome assembly

4. Quantification.

Two tools are available to quantify STAR and HISAT2 results: htseq-count and featureCounts. In the same way as for alignment, you can choose the desired method in config.yaml .

Before quantification, the BAM files with the alignments are indexed using samtools index, producing new .bam.bai files.

Both quantifiers use the BAM files generated by the aligners together with the .bam.bai indices generated before. In addition, they need the same GTF annotation file used by the aligner. Other parameters, such as strandedness, can be configured in the config.yaml file.

These quantification methods return a counts file for each sample. These files, like Salmon's, are transformed to generate the counts matrix, counts.tsv, from which the differential expression is performed.
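
The resulting counts.tsv is a plain gene-by-sample table, for example (sample names and values are illustrative):

geneID            sampleA  sampleB  sampleC
ENSG00000000003   523      610      98
ENSG00000000005   0        2        1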

5. Differential expression.

Differential expression is performed with DESeq2 and consists of two steps. The first step creates the DESeq2 object dds.rds that will be used to perform the differential expression, as well as a file with the normalized counts and another with the variance-stabilized values object, which will be used for plotting. It requires four inputs:

  • Counts matrix generated after quantification.

  • Design matrix file.

  • Design of the experiment, specified in config.yaml .

  • The samples.tsv file, indicating which samples should be taken into account.
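
The design itself is a standard R formula, written as a string in config.yaml and passed to DESeq2 via as.formula() (see deseq2_init.R below); for the example design matrix above, correcting for batch, it would look something like this (illustrative):

design: ~ batch + condition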

Once the DESeq2 object is created, the second step carries out the differential expression from the dds.rds object generated above. For each contrast it returns four files:

Two equivalent files with the default DESeq2 results:

  • (contrast)_diffexp.tsv , a plain-text file with the data separated by tabs.

  • (contrast)_diffexp.xlsx , a file in Excel format with the data formatted in a similar way to Nextpresso output.

And two equivalent files with the DESeq2 results where the log2 fold change estimates are shrunken using the lfcShrink() function from DESeq2. This shrinkage of the log2 fold change estimates toward zero is performed when the information for a gene is low (low counts and high dispersion values). These files contain the same columns as the default results, except for the 'stat' column, which is not present in the output with shrunken log2 fold change estimates:

  • (contrast)_lfcShrink_diffexp.tsv , a plain-text file with the data separated by tabs.

  • (contrast)_lfcShrink_diffexp.xlsx , a file in Excel format with the data formatted in a similar way to Nextpresso output.

6. Plots.

Once the DESeq2 object has been generated, different plots can be obtained for each of the contrasts:

  • PCA : Represents the distribution of the samples based on their first three principal components.

  • Distance : Calculates the distance between samples, using the Euclidean distance, to check if the samples belonging to the same condition have similarities.

  • Expression heatmap : From the normalized counts, creates a heatmap with the 25 most upregulated genes and the 25 most downregulated genes.

  • MA plot : Visualizes the differences between measurements by transforming the data onto M (log ratio) and A (mean average) scales.

7. Optional steps.

This pipeline has two additional options that, although they may be included in the ordinary execution, are also useful on their own.

7.1 Index generation

As mentioned in point 3, performing the alignments requires an index for the chosen aligner. This index can be created beforehand, in which case you simply indicate the path where it is located. If it does not exist, it can be generated automatically by adding to config.yaml the parameters indicated in point 3.

For STAR and HISAT2:

  • An annotation file containing the features of interest (in GTF format, must match the target genome)

  • A FASTA file with the genome

  • The output path where the index will be created

For Salmon:

  • A FASTA file containing the reference transcriptome

  • The genome assembly

  • The output path where the index will be created

7.2 Complete MultiQC report

Each execution of the pipeline returns a MultiQC report with the information from the aligner and the quantifier used in that run, in addition to the reports from the rest of the tools. However, it is also possible to generate a MultiQC report that contains the results of all the executed tools, allowing you to compare, for example, the results of different aligners.

Target rules

cluster_rnaseq features a shortcut system based on specific target rules related to the pipeline's steps. Each target calls an end-point rule which terminates the pipeline execution.

To use the shortcuts, you only need to run the pipeline as usual, but specifying the target.

snakemake --use-conda -j 32 target_rule

The available targets are:

  • all : executes the whole pipeline, from the original files to the plot creation. It is the same as not indicating any rule.

  • files_qc : checks MD5 hashes and performs quality control analysis with FastQC on the original files.

  • trimming : runs cluster_rnaseq up to and including the trimming step.

  • alignment : runs cluster_rnaseq up to and including the alignment step.

  • quantification : runs cluster_rnaseq until the counts matrix is obtained.

  • diffexp : runs cluster_rnaseq up to and including the differential expression step.

  • plots : runs cluster_rnaseq until the final plots are obtained.

  • index : creates an index for the aligner chosen in config.yaml .

  • multiqc_all : searches for any file in the results folder that can be added to a MultiQC report and creates the report.
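
For example, to stop once the counts matrix has been generated:

snakemake --use-conda -j 32 quantification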

Additionally, the user may use the Snakemake rule names as targets; these are available in the config.yaml file.

Cluster Usage

This pipeline is part of the tools developed by the CNIO Bioinformatics Unit. Although it can be launched on any machine, it is designed to be executed in environments with high computational capacity, such as the centre's cluster. The cluster uses Slurm as its job scheduler, so to launch the pipeline properly you must add the Slurm profile created for the Snakemake tools. The command is:

snakemake --use-conda --profile $SMK_PROFILE_SLURM -j 32

In addition, to save the user time and space, there are shared resources in the CNIO cluster that can be referenced from cluster_rnaseq, such as the aligner indices, which take time to create. The file template_config.yaml contains the paths to some of these resources, which are updated periodically.

Configuration of computation resources

The user can configure cluster_rnaseq to optimise the available computational resources and reduce the computation time. The optimization is achieved thanks to Snakemake's ability to parallelize jobs and to assign specific resources to each rule in a simple way. Resource configuration is done through the configuration file, which has a field called resources where the user can define the RAM usage, the number of threads available to a specific step (if the rule admits multithreading) and the maximum execution time. If the user does not provide a value for some of these fields, the pipeline will use the default values.
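
A sketch of what such a resources section can look like (rule and field names are illustrative; template_config.yaml shows the real structure):

resources:            # illustrative sketch, not the authoritative layout
  default:
    threads: 1
    mem_mb: 4096
    walltime: 60
  star_align:
    threads: 8
    mem_mb: 32768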

Code Snippets

The following snippets are excerpts from the pipeline's rule files and from the scripts and wrappers they invoke.

shell: 'salmon quant -i {input.salmon_index} -l {params.libtype} -r {input.se_reads} --validateMappings -o {params.quant_directory} --threads {threads}'
shell: 'salmon quant -i {input.salmon_index} -l {params.libtype} -1 {input.r1_reads} -2 {input.r2_reads} --validateMappings -o {params.quant_directory} --threads {threads}'
script:
    "../scripts/star.py"
script:
    "../scripts/star.py"
From line 104 of rules/align.smk
wrapper:
    "0.74.0/bio/hisat2/align"
From line 125 of rules/align.smk
shell:
    'samtools sort -@ {threads} -o {output.sortedCoord} {input.aligned}'
script:
    "../scripts/deseq2_init.R"
script:
    "../scripts/deseq2_diffexp.R"
shell: 
    """
    grep "^>" <(gunzip -c {input.genome}) | cut -d " "  -f 1 > {output.decoys}
    sed -i -e 's/>//g' {output.decoys}
    """
shell:
    'cat {input.transcriptome} {input.genome} > {output.gentrome}'
shell:
    'salmon index -t {input.gentrome} -d {input.decoys} -p {threads} -i {output} {params.gencode}'
wrapper:
    "0.74.0/bio/star/index"
wrapper:
    "0.74.0/bio/hisat2/index"
script: 
    "../scripts/pca.R"
script: 
    "../scripts/MAplot.R"
script: 
    "../scripts/distances.R"
script: 
    "../scripts/topbottomDEgenes.R"
shell: 'cat {input} > {output}'
shell: 'cat {input} > {output}'
wrapper:
    "v1.25.0/bio/bbtools/bbduk"
wrapper:
    "v1.25.0/bio/bbtools/bbduk"
wrapper:
    "0.74.0/bio/seqtk/subsample/se"
wrapper:
    "0.74.0/bio/seqtk/subsample/pe"
script:
    "../scripts/md5.py"
From line 90 of rules/qc.smk
wrapper:
    "0.74.0/bio/fastqc"
shell:"""
    fastq_screen --threads {threads} --get_genomes --outdir {params.outdir}/ &> {log}
"""
From line 139 of rules/qc.smk
wrapper:
    "v1.23.4/bio/fastq_screen"
From line 167 of rules/qc.smk
script:
    "../scripts/multiqc.py"
From line 188 of rules/qc.smk
wrapper:
    "0.74.0/bio/fastqc"
wrapper:
    "v1.23.4/bio/fastq_screen"
From line 236 of rules/qc.smk
script:
    "../scripts/multiqc.py"
From line 257 of rules/qc.smk
script:
    "../scripts/multiqc.py"
From line 280 of rules/qc.smk
shell:
    'samtools index -@ {threads} {input.aligned}'
shell: 'htseq-count {params.extra} {params.mode} {params.strandedness} {input.bam_file} {params.annotation} > {output.quant} 2> {log}'
script:
    "../scripts/htseq_count_matrix.R"
shell: 'featureCounts {params.args} -T {threads} {params.extra} {params.strandedness} -a {params.annotation} -o {output.quant} {input.bam_file} &> {log}'
script:
    "../scripts/fcounts_count_matrix.py"
script:
    '../scripts/salmon_matrix_from_quant.R'
scripts/deseq2_diffexp.R:
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

suppressMessages(library("DESeq2"))
suppressMessages(library("openxlsx"))
suppressMessages(library("org.Hs.eg.db"))
suppressMessages(library("AnnotationDbi"))

## PARALLELIZATION ##
parallel <- FALSE
if (snakemake@threads > 1) {
    suppressMessages(library("BiocParallel"))
    register(MulticoreParam(snakemake@threads))
    parallel <- TRUE
}

## SNAKEMAKE I/O ##
dds <- snakemake@input[["dds"]]

## SNAKEMAKE PARAMS ##
condition <- snakemake@params[["condition"]]
levels <- snakemake@params[["levels"]]

## CODE ##
# Get dds
dds <- readRDS(dds)

# Get results
res <- results(dds, contrast = c(condition, levels), alpha=0.05, parallel = parallel)

# Annotate the GeneSymbol and complete GeneName from the ENSEMBL Gene ID.
ensg_res <- rownames(res)
ensemblGene_DEA <- gsub("\\.[0-9]*$", "", ensg_res)

ensg_symbol <- as.data.frame(mapIds(org.Hs.eg.db, keys = ensemblGene_DEA,
                                    column = "SYMBOL", keytype = "ENSEMBL"))
colnames(ensg_symbol) <- "GeneSymbol"
ensg_genename <- as.data.frame(mapIds(org.Hs.eg.db, keys = ensemblGene_DEA,
                                      column = "GENENAME", keytype = "ENSEMBL"))
colnames(ensg_genename) <- "GeneName"
annot <- merge(ensg_symbol, ensg_genename, by = "row.names", all = TRUE)
rownames(res) <- ensemblGene_DEA
res <- merge(as.data.frame(res), annot, by.x = "row.names", by.y = 1, all = TRUE)
ENSG_res_list <- as.list(ensg_res)
names(ENSG_res_list) <- ensemblGene_DEA
rownames(res) <- ENSG_res_list[res$Row.names]
col_order <- c("GeneSymbol", "GeneName", "baseMean", "log2FoldChange",
               "lfcSE", "stat", "pvalue", "padj")
res <- res[, col_order]

# Sort by adjusted p-value
res <- res[order(res$padj, decreasing = FALSE), ]

# Save tsv
write.table(res, file = snakemake@output[["tsv"]], sep = "\t", quote = FALSE, 
            col.names = NA)

# Add Name column
res$EnsemblGeneID <- rownames(res)
res <- res[c("EnsemblGeneID", colnames(res)[1:(ncol(res)-1)])]

# Green and red styles for formatting excel
redStyle <- createStyle(fontColour = "#FF1F00", bgFill = "#F6F600")
redPlainStyle <- createStyle(fontColour = "#FF1F00")
greenStyle <- createStyle(fontColour = "#008000", bgFill = "#F6F600")
greenPlainStyle <- createStyle(fontColour = "#008000")
boldStyle <- createStyle(textDecoration = c("BOLD"))

# Excel file
wb <- createWorkbook()
sheet <- "Sheet1"
addWorksheet(wb, sheet)

# Legend 
legend <- t(data.frame(paste("Upregulated in", levels[1]),
                       paste("Downregulated in", levels[1]), "FDR=0.05"))
writeData(wb, sheet, legend[, 1])
addStyle(wb, sheet, cols = 1, rows = 1, 
         style = createStyle(fontColour = "#FF1F00", fgFill = "#F6F600"))
addStyle(wb, sheet, cols = 1, rows = 2, 
         style = createStyle(fontColour = "#008000", fgFill = "#F6F600"))
addStyle(wb, sheet, cols = 1, rows = 3, style = boldStyle)
invisible(sapply(1:3, function(i) mergeCells(wb, sheet, cols = 1:3, rows = i)))

# Reorder genes according to adjusted p-value
writeData(wb, sheet, res, startRow = 6)
addStyle(wb, sheet, cols = 1:ncol(res), rows = 6, style = boldStyle, 
         gridExpand = TRUE)
conditionalFormatting(wb, sheet, cols = 1:ncol(res), rows = 7:(nrow(res)+6),
                      rule = "AND($E7>0, $I7<0.05, NOT(ISBLANK($I7)))", style = redStyle)
conditionalFormatting(wb, sheet, cols = 1:ncol(res), rows = 7:(nrow(res)+6),
                      rule = "AND($E7>0, OR($I7>0.05, ISBLANK($I7)))", style = redPlainStyle)
conditionalFormatting(wb, sheet, cols = 1:ncol(res), rows = 7:(nrow(res)+6),
                      rule = "AND($E7<0, $I7<0.05, NOT(ISBLANK($I7)))", style = greenStyle)
conditionalFormatting(wb, sheet, cols = 1:ncol(res), rows = 7:(nrow(res)+6),
                      rule = "AND($E7<0, OR($I7>0.05, ISBLANK($I7)))", style = greenPlainStyle)
setColWidths(wb, sheet, 1:ncol(res), widths = 13)

# Save excel
saveWorkbook(wb, file = snakemake@output[["xlsx"]], overwrite = TRUE)



# GET SHRUNKEN LOG FOLD CHANGES.
coef <- paste0(c(condition, levels[1], "vs", levels[2]), collapse = "_")
res_shrink <- lfcShrink(dds, coef=coef, type="apeglm")

# Annotate the GeneSymbol and complete GeneName from the ENSEMBL Gene ID.
ensg_res <- rownames(res_shrink)
ensemblGene_DEA <- gsub("\\.[0-9]*$", "", ensg_res)

ensg_symbol <- as.data.frame(mapIds(org.Hs.eg.db, keys = ensemblGene_DEA,
                                    column = "SYMBOL", keytype = "ENSEMBL"))
colnames(ensg_symbol) <- "GeneSymbol"
ensg_genename <- as.data.frame(mapIds(org.Hs.eg.db, keys = ensemblGene_DEA,
                                      column = "GENENAME", keytype = "ENSEMBL"))
colnames(ensg_genename) <- "GeneName"
annot <- merge(ensg_symbol, ensg_genename, by = "row.names", all = TRUE)
rownames(res_shrink) <- ensemblGene_DEA
res_shrink <- merge(as.data.frame(res_shrink), annot, by.x = "row.names", by.y = 1, all = TRUE)
ENSG_res_list <- as.list(ensg_res)
names(ENSG_res_list) <- ensemblGene_DEA
rownames(res_shrink) <- ENSG_res_list[res_shrink$Row.names]
col_order <- c("GeneSymbol", "GeneName", "baseMean", "log2FoldChange",
               "lfcSE", "pvalue", "padj")
res_shrink <- res_shrink[, col_order]

# Sort by adjusted p-value
res_shrink <- res_shrink[order(res_shrink$padj, decreasing = FALSE), ]

# Save tsv
write.table(res_shrink, file = snakemake@output[["tsv_lfcShrink"]], sep = "\t", quote = FALSE, 
            col.names = NA)

# Add Name column
res_shrink$EnsemblGeneID <- rownames(res_shrink)
res_shrink <- res_shrink[c("EnsemblGeneID", colnames(res_shrink)[1:(ncol(res_shrink)-1)])]

# Green and red styles for formatting excel
redStyle <- createStyle(fontColour = "#FF1F00", bgFill = "#F6F600")
redPlainStyle <- createStyle(fontColour = "#FF1F00")
greenStyle <- createStyle(fontColour = "#008000", bgFill = "#F6F600")
greenPlainStyle <- createStyle(fontColour = "#008000")
boldStyle <- createStyle(textDecoration = c("BOLD"))

# Excel file
wb <- createWorkbook()
sheet <- "Sheet1"
addWorksheet(wb, sheet)

# Legend 
legend <- t(data.frame(paste("Upregulated in", levels[1]),
                       paste("Downregulated in", levels[1]), "FDR=0.05"))
writeData(wb, sheet, legend[, 1])
addStyle(wb, sheet, cols = 1, rows = 1, 
         style = createStyle(fontColour = "#FF1F00", fgFill = "#F6F600"))
addStyle(wb, sheet, cols = 1, rows = 2, 
         style = createStyle(fontColour = "#008000", fgFill = "#F6F600"))
addStyle(wb, sheet, cols = 1, rows = 3, style = boldStyle)
invisible(sapply(1:3, function(i) mergeCells(wb, sheet, cols = 1:3, rows = i)))

# Reorder genes according to adjusted p-value
writeData(wb, sheet, res_shrink, startRow = 6)
addStyle(wb, sheet, cols = 1:ncol(res_shrink), rows = 6, style = boldStyle, 
         gridExpand = TRUE)
conditionalFormatting(wb, sheet, cols = 1:ncol(res_shrink), rows = 7:(nrow(res_shrink)+6),
                      rule = "AND($E7>0, $H7<0.05, NOT(ISBLANK($H7)))", style = redStyle)
conditionalFormatting(wb, sheet, cols = 1:ncol(res_shrink), rows = 7:(nrow(res_shrink)+6),
                      rule = "AND($E7>0, OR($H7>0.05, ISBLANK($H7)))", style = redPlainStyle)
conditionalFormatting(wb, sheet, cols = 1:ncol(res_shrink), rows = 7:(nrow(res_shrink)+6),
                      rule = "AND($E7<0, $H7<0.05, NOT(ISBLANK($H7)))", style = greenStyle)
conditionalFormatting(wb, sheet, cols = 1:ncol(res_shrink), rows = 7:(nrow(res_shrink)+6),
                      rule = "AND($E7<0, OR($H7>0.05, ISBLANK($H7)))", style = greenPlainStyle)
setColWidths(wb, sheet, 1:ncol(res_shrink), widths = 13)

# Save excel
saveWorkbook(wb, file = snakemake@output[["xlsx_lfcShrink"]], overwrite = TRUE)
scripts/deseq2_init.R:
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

suppressMessages(library("DESeq2"))
suppressMessages(library("tidyr"))
suppressMessages(library("dplyr"))
suppressMessages(library("limma"))

## PARALLELIZATION ##
parallel <- FALSE
if (snakemake@threads > 1) {
    suppressMessages(library("BiocParallel"))
    register(MulticoreParam(snakemake@threads))
    parallel <- TRUE
}

## SNAKEMAKE I/O ##
counts <- snakemake@input[["counts"]]

## SNAKEMAKE PARAMS ##
samples <- snakemake@params[["samples"]]
designmatrix <- snakemake@params[["designmatrix"]]
ref <- snakemake@params[["ref_levels"]]
design <- snakemake@params[["design"]]
batch_variables <- snakemake@params[["batch"]]

## FUNCTION ##
factor_relevel <- function(x, reference) {
  relev <- relevel(as.factor(gsub("^\\*", "", as.character(x))),
                   ref = reference[cur_column()])
  return(relev)
}

## CODE ##
# Get counts
counts <- read.table(counts, header = TRUE, row.names = 1)

# Get samples for DEA
samples <- read.table(samples, header = TRUE, row.names = 1)
DEAsamples <- rownames(samples)[samples$diffexp == "True"]

# Get design matrix
designmatrix <- read.table(designmatrix, header = TRUE, row.names = 1)

# Subset the count matrix keeping only samples to be tested
counts <- counts %>% select(all_of(DEAsamples))

# Subset the design matrix keeping only samples to be tested
designmatrix <- designmatrix[colnames(counts), , drop = FALSE]

# Set names to reference levels for each column in design matrix
ref <- setNames(ref, colnames(designmatrix))

# Remove '*' prefix from design matrix cells, convert all columns to factors
# and relevel using ref
designmatrix <- designmatrix %>% 
  mutate(across(everything(), factor_relevel, reference = ref))

# DESeq2 from htseqCount output
dds <- DESeqDataSetFromMatrix(counts, colData = designmatrix, 
                              design = as.formula(design))

# Pre-filtering
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

# DEA
dds <- DESeq(dds, parallel = parallel)

# Normalized counts
norm_counts <- counts(dds, normalized = TRUE)

# Data transformation
vsd <- vst(dds, blind = FALSE)

# Save objects
saveRDS(dds, file = snakemake@output[["dds"]])
write.table(norm_counts, file = snakemake@output[["normalized_counts"]], 
            sep = "\t", quote = FALSE, col.names = NA)
saveRDS(vsd, file = snakemake@output[["vst"]][1])

# If there is a batch, perform batch correction
if (!is.null(batch_variables)) {
  ### colData
  coldata <- as.data.frame(colData(vsd))

  # Batch correct the vst
  batch <- coldata %>% unite("combined", all_of(batch_variables), sep = ":") %>%
    select(combined) %>% pull
  assay(vsd) <- removeBatchEffect(assay(vsd), batch = batch)
  saveRDS(vsd, file = snakemake@output[["vst"]][2])
}
scripts/distances.R:
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

suppressMessages(library("DESeq2"))
suppressMessages(library("dplyr"))
suppressMessages(library("scales"))
suppressMessages(library("RColorBrewer"))
suppressMessages(library("ComplexHeatmap"))
source("scripts/levelFunctions.R")

## SNAKEMAKE I/O ##
vsd <- snakemake@input[["vst"]]

## SNAKEMAKE PARAMS ##
designmatrix <- snakemake@params[["designmatrix"]]
levels <- snakemake@params[["levels"]]
ref <- snakemake@params[["ref_levels"]]

## CODE ##
# Get vst
vsd <- readRDS(vsd)

# Get design matrix
designmatrix <- read.table(designmatrix, header = TRUE, row.names = 1)

# colData
coldata <- as.data.frame(colData(vsd)) %>% select(colnames(designmatrix))

# Set names to reference levels for each column in coldata
ref <- setNames(ref, colnames(coldata))

# Convert all coldata columns to factors and relevel
coldata <- coldata %>% 
  mutate(across(everything(), factor_relevel, reference = ref))

# Assign a color to each level of each variable
levels_colors <- color_levels(coldata)

# Subset the sample names according to the specified levels
coldata <- coldata %>% filter(condition %in% levels)
samples <- rownames(coldata)

levels_colors <- setNames(lapply(names(levels_colors), function(y) {
  levels_colors[[y]][names(levels_colors[[y]]) %in% unique(coldata[[y]])]
}), names(levels_colors))

# Euclidean sample distances
sampleDist <- as.matrix(dist(t(assay(vsd[, samples])), method = "euclidean"))

# Colors
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)

# Heatmaps
pdf(snakemake@output[["pdf"]], width = 9, height = 7)
pheatmap(sampleDist, name = "Euclidean Distance", color = colors, 
         annotation_col = coldata, annotation_colors = levels_colors)
dev.off()

png(snakemake@output[["png"]], width = 9, height = 7, units = "in", res = 600)
pheatmap(sampleDist, name = "Euclidean Distance", color = colors,
         annotation_col = coldata, annotation_colors = levels_colors)
dev.off()
scripts/fcounts_count_matrix.py:
import pandas as pd
import numpy as np

samples = pd.read_table(snakemake.params.samples).set_index("sample", drop=False)

counts = [pd.read_table(f, index_col=0, usecols=[0, 6], header=None, skiprows=2)
          for f in snakemake.input]

for t, sample in zip(counts, samples.index):
    t.columns = [sample]

count_matrix = pd.concat(counts, axis=1)
count_matrix.index.name = "geneID"

# collapse technical replicates
count_matrix = count_matrix.groupby(count_matrix.columns, axis=1).sum()
count_matrix = count_matrix.apply(np.floor)
count_matrix.to_csv(snakemake.output[0], sep="\t")
scripts/htseq_count_matrix.R:
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

## SNAKEMAKE I/O ##
htseqcount <- snakemake@input[["quant"]]

## CODE ##
# Get counts
counts <- lapply(htseqcount, read.table, header = FALSE, row.names = 1)

# Merge counts
counts <- do.call("cbind", counts)

# Remove last rows
counts <- counts[!startsWith(rownames(counts), "__"), ]

# Add colnames
samples <- sapply(htseqcount, function(x) {
  sub(pattern = "(.*?)\\..*$", replacement = "\\1", basename(x))
})
colnames(counts) <- samples

# Save object
write.table(counts, file = snakemake@output[["counts"]], sep = "\t", 
            quote = FALSE, col.names = NA, row.names = TRUE)
scripts/MAplot.R:
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

suppressMessages(library("DESeq2"))
suppressMessages(library("ggpubr"))
suppressMessages(library("ggplot2"))
suppressMessages(library("patchwork"))

## PARALLELIZATION ##
parallel <- FALSE
if (snakemake@threads > 1) {
  suppressMessages(library("BiocParallel"))
  register(MulticoreParam(snakemake@threads))
  parallel <- TRUE
}

## SNAKEMAKE I/O ##
dds <- snakemake@input[["dds"]]

## SNAKEMAKE PARAMS ##
condition <- snakemake@params[["condition"]]
levels <- snakemake@params[["levels"]]

## CODE ##
# Get dds
dds <- readRDS(dds)

# Get results
res <- results(dds, contrast = c(condition, levels), alpha=0.05, 
               parallel = parallel)

# Log Fold Change shrinkage
coef <- paste0(c(condition, levels[1], "vs", levels[2]), collapse = "_")
res_apeglm <- lfcShrink(dds, coef = coef, type = "apeglm")
res_normal <- lfcShrink(dds, coef = coef, type = "normal")
res_ashr <- lfcShrink(dds, coef = coef, type = "ashr")

# MA plots
MA_apeglm <- ggmaplot(res_apeglm, fdr = 0.05, fc = 1.5, size = 0.7, top = 10,
                      main = "apeglm", legend = "top", font.main = "bold", 
                      font.legend = "bold", font.label = c("bold", 11), 
                      label.rectangle = TRUE, ggtheme = theme_minimal()) + 
  theme(plot.title = element_text(hjust = 0.5))
MA_normal <- ggmaplot(res_normal, fdr = 0.05, fc = 1.5, size = 0.7, top = 10,
                      main = "normal", legend = "top", font.main = "bold", 
                      font.legend = "bold", font.label = c("bold", 11), 
                      label.rectangle = TRUE, ggtheme = theme_minimal()) + 
  theme(plot.title = element_text(hjust = 0.5))
MA_ashr <- ggmaplot(res_ashr, fdr = 0.05, fc = 1.5, size = 0.7, top = 10,
                    main = "ashr", legend = "top", font.main = "bold", 
                    font.legend = "bold", font.label = c("bold", 11), 
                    label.rectangle = TRUE, ggtheme = theme_minimal()) + 
  theme(plot.title = element_text(hjust = 0.5))

# Make patchwork with the three plots
MA <- MA_apeglm + MA_normal + MA_ashr +
  plot_layout(widths = rep(5, 3))

# Save MA plots
ggsave(filename = snakemake@output[["pdf"]], plot = MA, width = 20)
ggsave(filename = snakemake@output[["png"]], plot = MA, dpi = 600, width = 20)
scripts/md5.py:
import hashlib
import os
import sys

import pandas as pd
import numpy as np

def get_md5(read):

    cur_dir   = os.getcwd()
    read_path = os.path.join(cur_dir, read)

    if os.path.isfile(read_path):
        with open(read_path, "rb") as file_name:
                content = file_name.read()
                aux_var = hashlib.md5()
                aux_var.update(content)
        return aux_var.hexdigest()
    else:
        return 'missing_file'


## SNAKEMAKE I/O ##
units = snakemake.input['units']

## SNAKEMAKE PARAMS ##
logdir = snakemake.params['logdir']


units  = pd.read_csv(units, sep='\t')

md5s    = units.set_index('fq1')['md5_fq1'].to_dict()
md5s.update(units.set_index('fq2')['md5_fq2'].to_dict())

md5_check = pd.DataFrame.from_dict(md5s, orient='index', columns=['md5'])
md5_check['fastq'] = md5_check.index
md5_check = md5_check.reindex(columns=['fastq', 'md5'])

md5_check = md5_check.dropna(subset=["fastq", "md5"])

md5_check['md5_fq_check'] = md5_check[md5_check['fastq'].notnull()]['fastq'].apply(get_md5)

if not md5_check.empty:
    md5_check['md5_fq_check'] = np.where(md5_check['md5_fq_check'].str.lower() == md5_check['md5'].str.lower(), 'Passed', 'Failed')
    md5_check[md5_check['md5'].notnull()].to_csv(f"{logdir}/md5.done.tsv", sep = '\t', index=False)

if (md5_check[md5_check['md5'].notnull()]['md5_fq_check'] == 'Failed').any():
    sys.exit(f"\nSome files did not pass the checksum veryfication.\nCheck the report at {logdir}/md5.done.tsv\nExiting...\n")
else:
    sys.exit(0)
scripts/multiqc.py:
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
      "multiqc"
      " {snakemake.params}"
      " --force"
      " -o {output_dir}"
      " -n {output_name}"
      " {input_dirs}"
      " {log}"
)
scripts/pca.R:
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

suppressMessages(library("DESeq2"))
suppressMessages(library("stringr"))
suppressMessages(library("dplyr"))
suppressMessages(library("tidyr"))
suppressMessages(library("scales"))
suppressMessages(library("ggplot2"))
suppressMessages(library("ggrepel"))
suppressMessages(library("patchwork"))
source("scripts/plotPCA.3.R")
source("scripts/levelFunctions.R")

## SNAKEMAKE I/O ##
vsd <- snakemake@input[["vst"]]

## SNAKEMAKE PARAMS ##
designmatrix <- snakemake@params[["designmatrix"]]
levels <- snakemake@params[["levels"]]
design <- snakemake@params[["design"]]
ref <- snakemake@params[["ref_levels"]]

## CODE ##
# Get vst
vsd <- readRDS(vsd)

# Get design matrix
designmatrix <- read.table(designmatrix, header = TRUE, row.names = 1)

# Get variables to plot by from design
variables <- unlist(str_split(design, pattern = "\\*|:|\\+"))
variables <- rev(str_trim(str_remove(variables, "~")))

# colData
coldata <- as.data.frame(colData(vsd)) %>% select(colnames(designmatrix))

# Set names to reference levels for each column in coldata
ref <- setNames(ref, colnames(coldata))

# Convert all coldata columns to factors and relevel
coldata <- coldata %>% 
  mutate(across(everything(), factor_relevel, reference = ref))

# Get variables' all combined levels
all_levels <- expand.grid(rev(lapply(coldata, levels))) %>% 
  unite("combined", all_of(variables), sep = ":") %>% select(combined) %>% pull

# Assign a color to each level
colors <- setNames(hue_pal()(length(all_levels)), all_levels)

# Subset the sample names according to the specified levels
coldata <- coldata %>% filter(get(variables[1]) %in% levels)
samples <- rownames(coldata)

# Keep only specified levels
levels <- coldata %>% unite("combined", all_of(variables), sep = ":") %>% 
  select(combined) %>% pull %>% unique

# Subset the colors according to the specified levels
colors <- colors[levels]

# Subset samples
vsd <- vsd[, samples]

# PCA plots for the first 3 principal components
pca12 <- plotPCA.3(vsd, intgroup = variables) + 
  scale_color_manual(values = colors, breaks = all_levels)
pca13 <- plotPCA.3(vsd, intgroup = variables, pc = c(1, 3)) + 
  scale_color_manual(values = colors, breaks = all_levels)
pca23 <- plotPCA.3(vsd, intgroup = variables, pc = c(2, 3)) + 
  scale_color_manual(values = colors, breaks = all_levels)

# Make patchwork with the three plots
pca <- pca12 + pca13 + pca23 + 
  plot_layout(guides = "collect", widths = rep(1, 3)) & 
  theme(legend.position = "bottom")

# Save PCA plot
ggsave(filename = snakemake@output[["pdf"]], plot = pca, width = 20)
ggsave(filename = snakemake@output[["png"]], plot = pca, dpi = 600, width = 20)
scripts/salmon_matrix_from_quant.R:
library('DESeq2')
library('readr')
library('tximeta')


## SNAKEMAKE I/O ##
metadata_cache      <- snakemake@output[['metadata_cache']]
tximeta_transcripts <- snakemake@output[['transcript_estimates']]
gene_level_matrix   <- snakemake@output[['gene_level_matrix']]

# NOTE: We don't really fetch the expanded list of quant files from the
# snakemake object since we don't really need it. It's there so that
# this rule does not get executed until all samples have been quantified.

## SNAKEMAKE PARAMS ##
salmon_quant_directory <- snakemake@params[['salmon_quant_directory']]
samples_files          <- snakemake@params[['samples']]

## SNAKEMAKE LOG ##
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")


# SET metadata cache folder
setTximetaBFC(metadata_cache)

samples           <- read.csv(samples_files, sep="\t", header=TRUE)
rownames(samples) <- samples$sample
quant_files   <- file.path(salmon_quant_directory, samples$sample, "quant.sf")

# tximeta looks for two columns: names for samples and files for paths
samples$files <- quant_files
samples$names <- samples$sample

se  <- tximeta(coldata = samples, type = 'salmon')
gse <- summarizeToGene(se)

## This may seem dumb but it let us use the DESeqDataSet wrapper
## to transform gene-level estimates to integer counts taking into 
## account the average transcript lengths from tximeta
## An alternative is to just round them up, as with tximport.

dds        <- DESeqDataSet(se=gse, design=~1) #blind design
raw_counts <- counts(dds, normalized=FALSE)


saveRDS(se, tximeta_transcripts)
write.table(raw_counts,file = gene_level_matrix, sep = "\t", 
            quote = FALSE, col.names = NA, row.names = TRUE)
scripts/star.py:
import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = (
    [snakemake.input.fq1]
    if isinstance(snakemake.input.fq1, str)
    else snakemake.input.fq1
)
fq2 = snakemake.input.get("fq2")
if fq2:
    fq2 = (
        [snakemake.input.fq2]
        if isinstance(snakemake.input.fq2, str)
        else snakemake.input.fq2
    )
    assert len(fq1) == len(
        fq2
    ), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str = " ".join([input_str_fq1, input_str_fq2])

if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand gunzip -c"
else:
    readcmd = ""

outprefix = os.path.dirname(snakemake.output[0]) + "/"

shell(
    "STAR "
    "{extra} "
    "--runThreadN {snakemake.threads} "
    "--genomeDir {snakemake.input.index} "
    "--readFilesIn {input_str} "
    "{readcmd} "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "{log}"
)
scripts/topbottomDEgenes.R:
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

suppressMessages(library("plyr"))
suppressMessages(library("dplyr"))
suppressMessages(library("tibble"))
suppressMessages(library("scales"))
suppressMessages(library("RColorBrewer"))
suppressMessages(library("ComplexHeatmap"))
source("scripts/levelFunctions.R")

## SNAKEMAKE I/O ##
norm_counts <- snakemake@input[["normalized_counts"]]
diffexp <- snakemake@input[["diffexp"]]

## SNAKEMAKE PARAMS ##
levels <- snakemake@params[["levels"]]
designmatrix <- snakemake@params[["designmatrix"]]
ref <- snakemake@params[["ref_levels"]]

## CODE ##
# Get normalized counts
norm_counts <- read.table(norm_counts)

# Get differential expression results
diffexp <- read.table(diffexp, header = TRUE, row.names = 1, sep = '\t', 
                      blank.lines.skip = FALSE, quote = "")
diffexp <- diffexp[c("baseMean", "log2FoldChange", "lfcSE", 
                   "stat", "pvalue", "padj")]
# Get design matrix
designmatrix <- read.table(designmatrix, header = TRUE, row.names = 1)

# Keep only significantly expressed genes
diffexp <- diffexp %>% rownames_to_column %>% filter(padj < 0.05)

# Subset the design matrix keeping only samples to be tested
designmatrix <- designmatrix[colnames(norm_counts), , drop = FALSE]

# Set names to reference levels for each column in design matrix
ref <- setNames(ref, colnames(designmatrix))

# Remove '*' prefix from design matrix cells, convert all columns to factors
# and relevel
designmatrix <- designmatrix %>% 
  mutate(across(everything(), factor_relevel, reference = ref))

# Assign a color to each level of each variable
levels_colors <- color_levels(designmatrix)

# Subset the sample names according to the specified levels
designmatrix <- designmatrix %>% filter(condition %in% levels) %>% 
  select(condition)
samples <- rownames(designmatrix)

levels_colors <- list(condition = levels_colors[["condition"]]
                      [names(levels_colors[["condition"]]) %in% levels])

# Get top 25 and bottom 25 DE genes
top <- diffexp %>% arrange(desc(log2FoldChange)) %>% 
  top_n(n = 25, wt = log2FoldChange) %>% select(rowname) %>% pull

bottom <- diffexp %>% arrange(log2FoldChange) %>% 
  top_n(n = -25, wt = log2FoldChange) %>% select(rowname) %>% pull

# Subset the normalized count matrix and scale by row
norm_counts <- t(apply(norm_counts[c(top, bottom), samples], 1, scale))
colnames(norm_counts) <- samples

# Specify the colors and breaks of the plot
colors <- colorRampPalette(c("blue", "white", "red"))(100)
break_limit <- round_any(max(max(norm_counts), abs(min(norm_counts))), 0.5,
                         ceiling)
breaks <- seq(-break_limit, break_limit, length.out = 101)

# Heatmaps
pdf(snakemake@output[["pdf"]], width = 9, height = 7)
pheatmap(norm_counts, name = "Gene Expression", color = colors, breaks = breaks,
         annotation_col = designmatrix, annotation_colors = levels_colors)
dev.off()

png(snakemake@output[["png"]], width = 9, height = 7, units = "in", res = 600)
pheatmap(norm_counts, name = "Gene Expression", color = colors, breaks = breaks,
         annotation_col = designmatrix, annotation_colors = levels_colors)
dev.off()
wrapper 0.74.0/bio/fastqc:
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)

    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
wrapper 0.74.0/bio/hisat2/align:
__author__ = "Wibowo Arindrarto"
__copyright__ = "Copyright 2016, Wibowo Arindrarto"
__email__ = "[email protected]"
__license__ = "BSD"


from snakemake.shell import shell

# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")
# Run log
log = snakemake.log_fmt_shell()

# Input file wrangling
reads = snakemake.input.get("reads")
if isinstance(reads, str):
    input_flags = "-U {0}".format(reads)
elif len(reads) == 1:
    input_flags = "-U {0}".format(reads[0])
elif len(reads) == 2:
    input_flags = "-1 {0} -2 {1}".format(*reads)
else:
    raise RuntimeError(
        "Reads parameter must contain at least 1 and at most 2" " input files."
    )

# Executed shell command
shell(
    "(hisat2 {extra} "
    "--threads {snakemake.threads} "
    " -x {snakemake.params.idx} {input_flags} "
    " | samtools view -Sbh -o {snakemake.output[0]} -) "
    " {log}"
)
wrapper 0.74.0/bio/hisat2/index:
__author__ = "Joël Simoneau"
__copyright__ = "Copyright 2019, Joël Simoneau"
__email__ = "[email protected]"
__license__ = "MIT"

import os
from snakemake.shell import shell

# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")

# Allowing for multiple FASTA files
fasta = snakemake.input.get("fasta")
assert fasta is not None, "input-> a FASTA-file or a sequence is required"
input_seq = ""
if not "." in fasta:
    input_seq += "-c "
input_seq += ",".join(fasta) if isinstance(fasta, list) else fasta

hisat_dir = snakemake.params.get("prefix", "")
if hisat_dir:
    os.makedirs(hisat_dir)

shell(
    "hisat2-build {extra} "
    "-p {snakemake.threads} "
    "{input_seq} "
    "{snakemake.params.prefix} "
    "{log}"
)
wrapper 0.74.0/bio/seqtk/subsample/pe:
__author__ = "Fabian Kilpert"
__copyright__ = "Copyright 2020, Fabian Kilpert"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


log = snakemake.log_fmt_shell()


shell(
    "( "
    "seqtk sample "
    "-s {snakemake.params.seed} "
    "{snakemake.input.f1} "
    "{snakemake.params.n} "
    "| pigz -9 -p {snakemake.threads} "
    "> {snakemake.output.f1} "
    "&& "
    "seqtk sample "
    "-s {snakemake.params.seed} "
    "{snakemake.input.f2} "
    "{snakemake.params.n} "
    "| pigz -9 -p {snakemake.threads} "
    "> {snakemake.output.f2} "
    ") {log} "
)
wrapper 0.74.0/bio/seqtk/subsample/se:
__author__ = "Fabian Kilpert"
__copyright__ = "Copyright 2020, Fabian Kilpert"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


log = snakemake.log_fmt_shell()


shell(
    "( "
    "seqtk sample "
    "-s {snakemake.params.seed} "
    "{snakemake.input} "
    "{snakemake.params.n} "
    "| pigz -9 -p {snakemake.threads} "
    "> {snakemake.output} "
    ") {log} "
)
wrapper 0.74.0/bio/star/index:
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake.utils import makedirs

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

extra = snakemake.params.get("extra", "")
sjdb_overhang = snakemake.params.get("sjdbOverhang", "100")

gtf = snakemake.input.get("gtf")
if gtf is not None:
    gtf = "--sjdbGTFfile " + gtf
    sjdb_overhang = "--sjdbOverhang " + sjdb_overhang
else:
    gtf = sjdb_overhang = ""

makedirs(snakemake.output)

shell(
    "STAR "  # Tool
    "--runMode genomeGenerate "  # Indexation mode
    "{extra} "  # Optional parameters
    "--runThreadN {snakemake.threads} "  # Number of threads
    "--genomeDir {snakemake.output} "  # Path to output
    "--genomeFastaFiles {snakemake.input.fasta} "  # Path to fasta files
    "{sjdb_overhang} "  # Read-len - 1
    "{gtf} "  # Highly recommended GTF
    "{log}"  # Logging
)
wrapper v1.23.4/bio/fastq_screen:
import os
import re
from snakemake.shell import shell
import tempfile

__author__ = "Ryan Dale"
__copyright__ = "Copyright 2016, Ryan Dale"
__email__ = "[email protected]"
__license__ = "MIT"

_config = snakemake.params["fastq_screen_config"]

subset = snakemake.params.get("subset", 100000)
aligner = snakemake.params.get("aligner", "bowtie2")
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell()

# snakemake.params.fastq_screen_config can be either a dict or a string. If
# string, interpret as a filename pointing to the fastq_screen config file.
# Otherwise, create a new tempfile out of the contents of the dict:
if isinstance(_config, dict):
    tmp = tempfile.NamedTemporaryFile(delete=False).name
    with open(tmp, "w") as fout:
        for label, indexes in _config["database"].items():
            for aligner, index in indexes.items():
                fout.write(
                    "\t".join(["DATABASE", label, index, aligner.upper()]) + "\n"
                )
        for aligner, path in _config["aligner_paths"].items():
            fout.write("\t".join([aligner.upper(), path]) + "\n")
    config_file = tmp
else:
    config_file = _config

# fastq_screen hard-codes filenames according to this prefix. We will send
# hard-coded output to a temp dir, and then move them later.
prefix = re.split(".fastq|.fq|.txt|.seq", os.path.basename(snakemake.input[0]))[0]

tempdir = tempfile.mkdtemp()

shell(
    "fastq_screen --outdir {tempdir} "
    "--force "
    "--aligner {aligner} "
    "--conf {config_file} "
    "--subset {subset} "
    "--threads {snakemake.threads} "
    "{extra} "
    "{snakemake.input[0]} "
    "{log}"
)

# Move output to the filenames specified by the rule
shell("mv {tempdir}/{prefix}_screen.txt {snakemake.output.txt}")
shell("mv {tempdir}/{prefix}_screen.png {snakemake.output.png}")

# Clean up temp
shell("rm -r {tempdir}")
if isinstance(_config, dict):
    shell("rm {tmp}")
wrapper v1.25.0/bio/bbtools/bbduk:
__author__ = "Filipe G. Vieira"
__copyright__ = "Copyright 2021, Filipe G. Vieira"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

java_opts = get_java_opts(snakemake)
extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)


n = len(snakemake.input.sample)
assert (
    n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."

if n == 1:
    reads = "in={}".format(snakemake.input.sample)
    trimmed = "out={}".format(snakemake.output.trimmed)
else:
    reads = "in={} in2={}".format(*snakemake.input.sample)
    trimmed = "out={} out2={}".format(*snakemake.output.trimmed)


singleton = snakemake.output.get("singleton", "")
if singleton:
    singleton = f"outs={singleton}"


discarded = snakemake.output.get("discarded", "")
if discarded:
    discarded = f"outm={discarded}"


stats = snakemake.output.get("stats", "")
if stats:
    stats = f"stats={stats}"


shell(
    "bbduk.sh {java_opts} t={snakemake.threads} "
    "{reads} "
    "{adapters} "
    "{extra} "
    "{trimmed} {singleton} {discarded} "
    "{stats} "
    "{log}"
)
