RNA-seq Alignment and Counting Pipeline


This pipeline provides a simple alignment workflow for RNA-seq data (without UMI). The fastq files are aligned to a reference genome with STAR and counted with htseq-count.

Prerequisites

  • Git

  • Conda

Input

The fastq.gz files must be gathered in a single directory. The path to this directory is specified in config.json. There is one fastq file per sample (or two if the data are paired-end).
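
As an illustration of this naming convention, the sketch below derives sample names from the files in the fastq directory. It is not part of the pipeline; fastq_path, is_paired_end, and pattern are placeholder values standing in for the FASTQ_PATH, IS_PAIRED_END, and PAIR_END_FILE_PATTERN options described in the Configuration section.

# Illustrative only: derive sample names from the fastq.gz files, assuming
# single-end naming (sample.fastq.gz) or paired-end naming with
# PAIR_END_FILE_PATTERN = "_" (sample_1.fastq.gz / sample_2.fastq.gz).
import glob
import os
import re

fastq_path = "fastq"          # hypothetical FASTQ_PATH
is_paired_end = True          # hypothetical IS_PAIRED_END
pattern = "_"                 # hypothetical PAIR_END_FILE_PATTERN

samples = set()
for path in glob.glob(os.path.join(fastq_path, "*.fastq.gz")):
    name = os.path.basename(path)[:-len(".fastq.gz")]
    if is_paired_end:
        name = re.sub(re.escape(pattern) + r"[12]$", "", name)
    samples.add(name)
print(sorted(samples))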

Installation

git clone https://github.com/DimitriMeistermann/rnaseq_align.git
cd rnaseq_align
conda env create -f virtualEnvs/rnaseq_align.yml

Configuration

  • IS_PAIRED_END : true or false. Is the data paired-end (forward and reverse reads)?

  • PRELOAD_GENOME : true or false (experimental). STAR can preload the genome into RAM so that subsequent jobs can reuse it. Currently this does not work on bird2cluster.

  • STAR_ALIGN_MULTIPLE_FILE : true or false. Align the fastqs in batches, then split the resulting BAM per sample. This is an alternative way to minimize the number of times the reference genome is loaded when PRELOAD_GENOME is not possible.

  • MAX_SIZE_PER_MULTIPLE_ALIGN : Size of the alignment batches (if STAR_ALIGN_MULTIPLE_FILE=true ). Example for 50 (meaning 50 gigabytes): with 10 samples of 10 GB each, the alignment is performed in two batches, i.e. two different STAR_ALIGN jobs. A batch is also closed every 1000 samples, even if MAX_SIZE_PER_MULTIPLE_ALIGN has not been reached (illustrated by the first sketch after this list).

  • READ_LENGTH : String. Number of nucleotides in each read. If left as an empty string, it is determined automatically by reading the first read of the first fastq (see the second sketch after this list). Warning: if cutadapt has already been applied to the fastqs, enter the read length manually (its value before trimming).

  • STAR_GENOME_THREADS : Number of cores used for genome indexing, genome preloading, and alignment when STAR_ALIGN_MULTIPLE_FILE is enabled.

  • THREAD_PER_SAMPLE : Number of cores used by most of the workflow jobs.

  • FASTA_REFERENCE : Path of the FASTA file of the reference genome.

  • GTF_REFERENCE : Path of the GTF file (feature annotation of the reference genome).

  • FASTQ_PATH : Directory that contains the input fastqs.

  • OUTPUT_PATH : Path where the results will be written.

  • PAIR_END_FILE_PATTERN : Characters between the sample name and the 1 or 2 suffix of paired-end data. Example: if forward reads are in sample_1.fastq.gz and reverse reads in sample_2.fastq.gz , then PAIR_END_FILE_PATTERN = "_"

  • FORWARD_ADAPTATOR : Forward sequencing adapter for cutadapt (adapter trimming). If this argument is an empty string, cutadapt is not used.

  • REVERSE_ADAPTATOR : Reverse sequencing adapter for cutadapt. Not used for single-end data.

  • GENOME_INDEX_RAM : RAM allowed for genome indexing (passed to STAR's --limitGenomeGenerateRAM, in bytes).

  • CUTADAPT_MIN_READ_LEN : Minimum read length after trimming. If a read is too short, it is removed from the fastq.

  • DEDUP_UMI : true or false. Deduplicate reads based on their UMI. The UMI must be stored at the end of the read name, separated by an underscore, e.g. @[readname]_[UMI] . See the documentation of umi_tools extract for details on how to add UMIs to read names when the sequencing method uses UMIs.

  • FEATURE_TYPE : Feature type of the GTF (third column) in which aligned reads are counted (example: exon, gene, ...).

  • FEATURE_ID : GTF attribute used as the row IDs of the count table (gene_name for gene symbols).
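
The batching rule for MAX_SIZE_PER_MULTIPLE_ALIGN can be illustrated with a short sketch. This is not the pipeline's code (the actual logic lives in the Snakefile); the make_batches helper and the sample sizes are assumptions used only to reproduce the worked example above.

# Illustrative sketch of the batching rule: samples are accumulated into a
# batch until adding one more would exceed the size limit (in GB) or the
# batch already holds 1000 samples.
def make_batches(sample_sizes_gb, max_size_gb=50, max_samples=1000):
    batches, current, current_size = [], [], 0.0
    for sample, size in sample_sizes_gb.items():
        if current and (current_size + size > max_size_gb or len(current) >= max_samples):
            batches.append(current)
            current, current_size = [], 0.0
        current.append(sample)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 10 samples of 10 GB with a 50 GB limit -> two batches of 5 samples each
sizes = {f"sample{i}": 10 for i in range(1, 11)}
print(make_batches(sizes))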

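A minimal sketch of the automatic read-length detection described for READ_LENGTH, assuming a directory of gzipped fastqs; the path is illustrative and this is not the pipeline's own code.

# Illustrative: read length taken from the first read of the first fastq.gz
import glob
import gzip

first_fastq = sorted(glob.glob("fastq/*.fastq.gz"))[0]   # hypothetical FASTQ_PATH
with gzip.open(first_fastq, "rt") as fh:
    fh.readline()                                  # skip the @header line
    read_length = len(fh.readline().rstrip("\n"))  # length of the first read
print(read_length)                                 # value to put in READ_LENGTH
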
Execution of the pipeline

Normal launch (with 8 cores), using the default config.json

conda activate rnaseq_align
snakemake -rp -j 8

With a specific config file

conda activate rnaseq_align
snakemake -rp -j 8 --configfile configFileSave/[yourConfig.json]

On an SGE grid engine (example: bird2cluster ).

You can find this command in clusterFiles/exeSnakemakeBird.sh . You can modify the job parameters by editing clusterFiles/jobscriptBird.sh .

conda activate rnaseq_align
snakemake -rp -j 20 --cluster 'qsub' --jobscript clusterFiles/jobscriptBird.sh --latency-wait 10 --max-jobs-per-second 1 --configfile configFileSave/configTest.json

Test files can be found at https://gitlab.univ-nantes.fr/E114424Z/rnaseq_align/-/tree/master/testFolder/fastq

Output data

  • log : log files from the executed jobs, in the form RULE_NAME[_sampleName].(err|out) . Additional logs may be saved depending on the tool used.

  • FASTQ_CUTADAPT : optional; if cutadapt is used, the trimmed fastqs are stored in this folder.

  • fastQC : Zip and HTML reports from the execution of FastQC.

  • BAM : this directory contains the alignment results as BAM files.

  • DEDUP_BAM : if applicable, this directory contains the UMI-deduplicated BAM files.

  • counts : this directory contains the per-sample count files produced by htseq-count.

  • results : this directory contains the raw count table, a table of alignment statistics, and the MultiQC report presenting different quality metrics for the data (see the loading sketch below).
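
As an example, the two tables written to results can be loaded as data frames. This is a minimal sketch assuming pandas is installed; the output_path value is illustrative, and the file and column names come from the countsTable.R script shown below.

# Illustrative loading of the pipeline's summary tables with pandas
import pandas as pd

output_path = "rnaseq_align_output"  # illustrative value of OUTPUT_PATH

# genes (FEATURE_ID) as rows, samples as columns
counts = pd.read_csv(f"{output_path}/results/rawCountsTable.tsv",
                     sep="\t", index_col=0)

# samples as rows; columns: aligned, no_feature, ambiguous, too_low_aQual,
# not_aligned, alignment_not_unique, total_read
align_stats = pd.read_csv(f"{output_path}/results/alignStatTable.tsv",
                          sep="\t", index_col=0)

print(counts.shape, align_stats.columns.tolist())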

Code Snippets

countsTable.R (builds the raw count table and the alignment-stat table from the htseq-count output):
args <- commandArgs(trailingOnly = TRUE)
options(stringsAsFactors = FALSE)

require(data.table)

#############
# FUNCTIONS #
#############
fastRead <- function(fileName, sep = '\t',row.names = 1,as.matrix=FALSE,stringsAsFactors=FALSE,...){
	require(data.table)
	dat <- as.data.frame(data.table::fread(fileName,stringsAsFactors=stringsAsFactors, sep = sep,...))
	if(!is.null(row.names)){
	  rownames(dat) <- dat[,row.names]
	  dat <- dat[,-row.names,drop=FALSE]
	}
	if(as.matrix) dat<-as.matrix(dat)
	return(dat)
}

fastWrite <- function(x, fileName = "default.tsv", headRow="Name",row.names=TRUE,col.names=TRUE, dec=".",sep="\t",...) {
	require(data.table)
	if(row.names){
		x=cbind(rownames(x),x)
		colnames(x)[1]<-headRow
	}
	fwrite(x=x,file=fileName,sep="\t", row.names = FALSE, col.names = col.names, quote = FALSE, dec=dec,...)
}


########
# MAIN #
########
# Build the count matrix: one column per sample, one row per gene

if (length(args)==0) stop("The fastq path argument is missing.\n", call.=FALSE)

fastq_path <- args[1]
files<-list.files(paste0(fastq_path,"/counts"))
samplesName<- substr(files,1,nchar(files)-7)
countsList<-list()
statList<-list()
for(i in 1:length(samplesName)){
	counts<-fastRead(paste0(fastq_path,"/counts/",files[i]),header=FALSE)
	countsList[[samplesName[i]]]<-counts[1:(nrow(counts)-5),,drop=FALSE]
	aligned<-colSums(countsList[[samplesName[i]]])
	stat<-rbind(aligned,counts[(nrow(counts)-4):nrow(counts),,drop=FALSE])
	rownames(stat)<-c("aligned","no_feature","ambiguous","too_low_aQual","not_aligned",
		"alignment_not_unique")
	statList[[samplesName[i]]]<-stat
}
countsTable<-do.call(cbind,countsList)
colnames(countsTable)<-samplesName

statTable<-data.frame(t(do.call(cbind,statList)))
rownames(statTable)<-samplesName
statTable$total_read<-rowSums(statTable)

fastWrite(countsTable,paste0(fastq_path,"/results/rawCountsTable.tsv"))
fastWrite(statTable,paste0(fastq_path,"/results/alignStatTable.tsv"))

main/Snakefile, lines 114-119 (STAR genome indexing):
shell: """
mkdir {output}
STAR --runThreadN {params.cpu} --runMode genomeGenerate --genomeDir {output} --genomeFastaFiles {input.fasta} \\
--outFileNamePrefix {OUTPUT_PATH}/log/STAR_INDEX. \\
--sjdbGTFfile {input.gtf} --sjdbOverhang {OVERHANG}  --limitGenomeGenerateRAM {params.ram} 1> {log.out} 2> {log.err}
"""

main/Snakefile, lines 129-131 (adapter trimming with cutadapt):
shell: """
cutadapt {CUTADAPT_PARAMETER} -j {params.cpu} --minimum-length {params.minLen}  {input} 2> {log.err} | gzip -c > {output}
"""	

main/Snakefile, lines 142-144 (FastQC):
shell: """
fastqc -o {params.outpath} {input} 1> {log.out} 2> {log.err}
"""

main/Snakefile, lines 160-170 (STAR genome preloading):
shell: """
STAR --genomeLoad Remove  --genomeDir {input.starIndexDir} --runThreadN {params.cpu} \\
--outFileNamePrefix {OUTPUT_PATH}/log/STAR_LOAD_GENOME. 1> {log.out} 2> {log.err}

STAR --genomeLoad LoadAndExit  --runThreadN {params.cpu}  \\
--genomeDir {input.starIndexDir} \\
--outFileNamePrefix {OUTPUT_PATH}/log/STAR_LOAD_GENOME.  1> {log.out} 2> {log.err}

touch {output}
rm {OUTPUT_PATH}/log/STAR_LOAD_GENOME.Aligned.out.sam
"""

main/Snakefile, lines 186-193 (fastq manifest for a batch alignment):
run:
	with open(str(output),"w") as writer:
		for sample in ALIGN_BATCHES[wildcards.alignBatch]:
			base_name = STAR_FASTQ_FOLDER+"/"+sample
			if IS_PAIRED_END:
				writer.write(base_name+PAIR_SUFFIX[0]+".fastq.gz\t"+base_name+PAIR_SUFFIX[1]+".fastq.gz\t"+sample+"\n")
			else:
				writer.write(base_name+".fastq.gz\t-\t"+sample+"\n")

main/Snakefile, lines 207-212 (STAR alignment of a batch):
shell: """
STAR --runThreadN {params.cpu} --genomeDir {input.starIndexDir} --readFilesCommand zcat \\
--readFilesManifest {input.manifest} --outSAMattributes NH HI AS nM RG \\
--outFileNamePrefix {OUTPUT_PATH}/log/STAR_ALIGN_{wildcards.alignBatch}. --outSAMtype BAM Unsorted \\
{STAR_LOAD_BEHAVIOR} 1> {log.out} 2> {log.err}
"""

main/Snakefile, lines 218-220 (split the batch BAM per sample):
shell: """			
samtools split -f '{OUTPUT_PATH}/log/%!.bam' -@{params.cpu} {input}
"""

main/Snakefile, lines 225-243 (check the split BAMs and move them to BAM/):
run:
	missingFile=[]
	for alignBatch in ALIGN_BATCHES:
		missingFileInBAM = False

		for sample in ALIGN_BATCHES[alignBatch]:
			bamFile = OUTPUT_PATH+"/log/"+sample+".bam"
			if not os.path.exists(bamFile):
				missingFileInBAM=True
				missingFile.append(bamFile)

		if missingFileInBAM: os.remove(OUTPUT_PATH+"/log/"+alignBatch+"_splitIsDone")

	if len(missingFile)>0:
		exit("BAM splitting step is a failure, the following BAM are missing: "+ " ".join(missingFile))

	#no error
	for sample in SAMPLES:
		os.rename(OUTPUT_PATH+"/log/"+sample+".bam", OUTPUT_PATH+"/BAM/"+sample+".bam")

main/Snakefile, lines 259-265 (STAR alignment of a single sample):
shell: """

STAR --runThreadN {params.cpu} --genomeDir {input.starIndexDir} --readFilesCommand zcat --readFilesIn {input.fastq} \\
--outFileNamePrefix {OUTPUT_PATH}/log/STAR_ALIGN_{wildcards.sample}. --outSAMtype BAM Unsorted \\
{STAR_LOAD_BEHAVIOR} 1> {log.out} 2> {log.err}

"""

main/Snakefile, lines 271-273 (move the BAM to its final location):
shell: """
	mv {input} {output}
"""

main/Snakefile, lines 286-289 (unload the STAR genome from RAM):
shell: """
STAR --genomeLoad Remove  --runThreadN {params.cpu}  --genomeDir {input.starIndexDir} \\
--outFileNamePrefix {OUTPUT_PATH}/log/STAR_UNLOAD_GENOME.  1> {log.out} 2> {log.err}
"""

main/Snakefile, lines 296-298 (BAM sorting):
shell: """
samtools sort -@{params.cpu} -o {output} {input} 
"""

main/Snakefile, lines 304-306 (BAM indexing):
shell: """
samtools index -@{params.cpu} {input}
"""

main/Snakefile, lines 320-322 (UMI deduplication with umi_tools):
shell: """
	umi_tools dedup --no-sort-output --stdin={input.bam} {params.paired} --log={log.out}  --output-stats={OUTPUT_PATH}/log/DEDUP_UMI_{wildcards.sample} | samtools sort -n -@{params.cpu} -o {output} 2> {log.err}
"""

main/Snakefile, lines 335-337 (read counting with htseq-count):
shell: """
htseq-count -f bam -t {params.featureType} -i {params.featureID} -s no {input} {params.gtf} 1> {output} 2> {log.err}
"""

main/Snakefile, line 348 (build the count tables with countsTable.R):
shell: "Rscript {WORKING_DIR}/countsTable.R {OUTPUT_PATH}  1> {log.out} 2> {log.err}"

main/Snakefile, lines 359-361 (MultiQC report):
shell: """
multiqc -f -e general_stats -e tophat -e bowtie2 {OUTPUT_PATH} -o {params.outpath}
"""
