This pipeline aims to provide a simple alignment tool for RNA-seq data (without UMI). The fastq files are aligned to a reference genome with STAR and counted with HTSeq-count.
Prerequisites
- Git
- Conda
Input
The fastq.gz files need to be gathered in a single directory. The path to this directory is specified in config.json. Provide one fastq file per sample (or two if the data are sequenced in paired-end mode).
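For example, with two paired-end samples (the sample names below are hypothetical), the input directory could look like this:

```
FASTQ_PATH/
├── sampleA_1.fastq.gz
├── sampleA_2.fastq.gz
├── sampleB_1.fastq.gz
└── sampleB_2.fastq.gz
```

Here the `_` separating the sample name from 1 / 2 corresponds to PAIR_END_FILE_PATTERN (see Configuration below).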
Installation
git clone https://github.com/DimitriMeistermann/rnaseq_align.git
cd rnaseq_align
conda env create -f virtualEnvs/rnaseq_align.yml
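As a quick sanity check (assuming the environment provides snakemake and STAR, which the workflow needs anyway), you can activate it and print the tool versions:
conda activate rnaseq_align
snakemake --version
STAR --version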
Configuration
- IS_PAIRED_END : true or false. Are the data paired-end (forward and reverse reads)?
- PRELOAD_GENOME : true or false (experimental). STAR can preload the genome into RAM so that the next jobs can reuse it. Currently this does not work on bird2cluster.
- STAR_ALIGN_MULTIPLE_FILE : true or false. Align fastqs in batches, then split the resulting BAM per sample. This is an alternative way to minimize the number of reference genome loads when PRELOAD_GENOME is not possible.
- MAX_SIZE_PER_MULTIPLE_ALIGN : size of the alignment batches, in gigabytes (used if STAR_ALIGN_MULTIPLE_FILE=true ). Example for 50: with 10 samples of 10 GB each, the alignment is performed in two batches, by two different STAR_ALIGN jobs. Alignment is also split every 1000 samples, even if MAX_SIZE_PER_MULTIPLE_ALIGN has not been reached.
- READ_LENGTH : string. Number of nucleotides per read; if left as an empty string, it is detected automatically by opening the first read of the first fastq. Warning: if cutadapt has already been applied to the fastqs, you have to enter manually the read length they had before trimming.
- STAR_GENOME_THREADS : number of cores used for genome indexing, genome preloading, and alignment when STAR_ALIGN_MULTIPLE_FILE is enabled.
- THREAD_PER_SAMPLE : number of cores used by most of the workflow jobs.
- FASTA_REFERENCE : path to the fasta file of the reference genome.
- GTF_REFERENCE : path to the GTF file (features of the reference genome).
- FASTQ_PATH : directory that contains the input fastqs.
- OUTPUT_PATH : path where the results will be written.
- PAIR_END_FILE_PATTERN : characters between the sample name and 1 or 2 for paired-end data. Example: if forward reads are in sample_1.fastq.gz and reverse reads in sample_2.fastq.gz , then PAIR_END_FILE_PATTERN = "_" .
- FORWARD_ADAPTATOR : forward sequencing adapter for cutadapt (adapter trimming). If this argument is set to an empty string, cutadapt will not be used.
- REVERSE_ADAPTATOR : reverse sequencing adapter for cutadapt. This argument is not used for single-end data.
- GENOME_INDEX_RAM : maximum RAM that genome indexing is allowed to use.
- CUTADAPT_MIN_READ_LEN : minimum read length after trimming. If a read is too short, it is removed from the fastq.
- DEDUP_UMI : true or false. Deduplicate reads based on their UMI. The UMI must be stored at the end of the read name, separated by an underscore, e.g. @[readname]_[UMI] . See the documentation of umi_tools extract for details on how to add UMIs to read names if the sequencing method uses them.
- FEATURE_TYPE : feature type of the GTF (third column) in which aligned reads are counted (example: exon, gene, ...).
- FEATURE_ID : GTF attribute used as the rows of the count table ( gene_name for gene symbols).
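A minimal sketch of a possible config.json is shown below. All paths and values are placeholders chosen for illustration, not defaults shipped with the workflow; check the key names against the config.json in the repository. The GENOME_INDEX_RAM value here assumes bytes (the unit used by STAR's --limitGenomeGenerateRAM).

```
{
    "IS_PAIRED_END": true,
    "PRELOAD_GENOME": false,
    "STAR_ALIGN_MULTIPLE_FILE": false,
    "MAX_SIZE_PER_MULTIPLE_ALIGN": 50,
    "READ_LENGTH": "",
    "STAR_GENOME_THREADS": 8,
    "THREAD_PER_SAMPLE": 4,
    "FASTA_REFERENCE": "/path/to/genome.fa",
    "GTF_REFERENCE": "/path/to/annotation.gtf",
    "FASTQ_PATH": "/path/to/fastq",
    "OUTPUT_PATH": "/path/to/output",
    "PAIR_END_FILE_PATTERN": "_",
    "FORWARD_ADAPTATOR": "",
    "REVERSE_ADAPTATOR": "",
    "GENOME_INDEX_RAM": 40000000000,
    "CUTADAPT_MIN_READ_LEN": 25,
    "DEDUP_UMI": false,
    "FEATURE_TYPE": "exon",
    "FEATURE_ID": "gene_name"
}
```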
Execution of the pipeline
Normal launch (with 8 cores), using the default config.json
conda activate rnaseq_align
snakemake -rp -j 8
With a specific config file
conda activate rnaseq_align
snakemake -rp -j 8 --configfile configFileSave/[yourConfig.json]
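Before launching a real run, a dry run can be used to check which jobs Snakemake would execute; -n is the standard Snakemake dry-run flag and no output files are created:
conda activate rnaseq_align
snakemake -n -rp --configfile configFileSave/[yourConfig.json]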
On an SGE grid engine (example: bird2cluster ).
You can find this code in clusterFiles/exeSnakemakeBird.sh . You can modify the job parameters by editing jobscriptBird.sh .
conda activate rnaseq_align
snakemake -rp -j 20 --cluster 'qsub' --jobscript clusterFiles/jobscriptBird.sh --latency-wait 10 --max-jobs-per-second 1 --configfile configFileSave/configTest.json
Test files can be found at https://gitlab.univ-nantes.fr/E114424Z/rnaseq_align/-/tree/master/testFolder/fastq
Output data
- log : log files from the executed jobs, in the form RULE_NAME[_sampleName].(err|out) . Additional logs may be saved depending on the tool that was used.
- FASTQ_CUTADAPT : optional; if cutadapt is used, the trimmed fastqs are stored in this folder.
- fastQC : zip and HTML results from the execution of FastQC.
- BAM : this directory contains the alignment results as BAM files.
- DEDUP_BAM : if applicable, this directory contains the BAM files deduplicated based on their UMI.
- counts : this directory contains the count files produced by htseq-count.
- results : this directory contains the raw counts table, a table of alignment stats, and the MultiQC reports that present different quality metrics of the data.
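As an illustration, with a hypothetical sample sampleA and both cutadapt and UMI deduplication enabled, OUTPUT_PATH could end up organized like this (file names inside log/, fastQC/ and counts/ are indicative only; rawCountsTable.tsv and alignStatTable.tsv are the names written by countsTable.R):

```
OUTPUT_PATH/
├── log/              # RULE_NAME[_sampleName].out / .err files
├── FASTQ_CUTADAPT/   # only if cutadapt is used
├── fastQC/
├── BAM/
│   └── sampleA.bam
├── DEDUP_BAM/        # only if DEDUP_UMI is true
│   └── sampleA.bam
├── counts/           # one htseq-count output per sample
└── results/
    ├── rawCountsTable.tsv
    ├── alignStatTable.tsv
    └── (MultiQC report)
```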
Code Snippets
```
args <- commandArgs(trailingOnly = TRUE)
options(stringsAsFactors = FALSE)
require(data.table)

#############
# FUNCTIONS #
#############

fastRead <- function(fileName, sep = '\t', row.names = 1, as.matrix = FALSE, stringsAsFactors = FALSE, ...){
    require(data.table)
    dat <- as.data.frame(data.table::fread(fileName, stringsAsFactors = stringsAsFactors, sep = sep, ...))
    if(!is.null(row.names)){
        rownames(dat) <- dat[, row.names]
        dat <- dat[, -row.names, drop = FALSE]
    }
    if(as.matrix) dat <- as.matrix(dat)
    return(dat)
}

fastWrite <- function(x, fileName = "default.tsv", headRow = "Name", row.names = TRUE, col.names = TRUE, dec = ".", sep = "\t", ...){
    require(data.table)
    if(row.names){
        x <- cbind(rownames(x), x)
        colnames(x)[1] <- headRow
    }
    fwrite(x = x, file = fileName, sep = "\t", row.names = FALSE, col.names = col.names, quote = FALSE, dec = dec, ...)
}

########
# MAIN #
########

# Build the count matrix: samples in columns, one gene per row
if (length(args) == 0) stop("The fastq path argument is missing.\n", call. = FALSE)

fastq_path <- args[1]

files <- list.files(paste0(fastq_path, "/counts"))
samplesName <- substr(files, 1, nchar(files) - 7)

countsList <- list()
statList <- list()
for(i in 1:length(samplesName)){
    counts <- fastRead(paste0(fastq_path, "/counts/", files[i]), header = FALSE)
    countsList[[samplesName[i]]] <- counts[1:(nrow(counts) - 5), , drop = FALSE]
    aligned <- colSums(countsList[[samplesName[i]]])
    stat <- rbind(aligned, counts[(nrow(counts) - 4):nrow(counts), , drop = FALSE])
    rownames(stat) <- c("aligned", "no_feature", "ambiguous", "too_low_aQual", "not_aligned", "alignment_not_unique")
    statList[[samplesName[i]]] <- stat
}

countsTable <- do.call(cbind, countsList)
colnames(countsTable) <- samplesName

statTable <- data.frame(t(do.call(cbind, statList)))
rownames(statTable) <- samplesName
statTable$total_read <- rowSums(statTable)

fastWrite(countsTable, paste0(fastq_path, "/results/rawCountsTable.tsv"))
fastWrite(statTable, paste0(fastq_path, "/results/alignStatTable.tsv"))
```
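Outside of the workflow, this script can also be run by hand; it takes the output directory as its only argument and expects the counts/ subfolder to be populated and the results/ subfolder to exist (the path below is a placeholder):

```
Rscript countsTable.R /path/to/OUTPUT_PATH
```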
```
# Build the STAR genome index from the reference fasta and GTF
shell: """
    mkdir {output}
    STAR --runThreadN {params.cpu} --runMode genomeGenerate --genomeDir {output} --genomeFastaFiles {input.fasta} \\
    --outFileNamePrefix {OUTPUT_PATH}/log/STAR_INDEX. \\
    --sjdbGTFfile {input.gtf} --sjdbOverhang {OVERHANG} --limitGenomeGenerateRAM {params.ram} 1> {log.out} 2> {log.err}
    """
```
```
# Trim adapters with cutadapt and recompress the trimmed reads
shell: """
    cutadapt {CUTADAPT_PARAMETER} -j {params.cpu} --minimum-length {params.minLen} {input} 2> {log.err} | gzip -c > {output}
    """
```
```
# Quality control of the fastq files with FastQC
shell: """
    fastqc -o {params.outpath} {input} 1> {log.out} 2> {log.err}
    """
```
```
# Clear any genome already loaded in shared memory, then preload the genome index for the alignment jobs
shell: """
    STAR --genomeLoad Remove --genomeDir {input.starIndexDir} --runThreadN {params.cpu} \\
    --outFileNamePrefix {OUTPUT_PATH}/log/STAR_LOAD_GENOME. 1> {log.out} 2> {log.err}
    STAR --genomeLoad LoadAndExit --runThreadN {params.cpu} \\
    --genomeDir {input.starIndexDir} \\
    --outFileNamePrefix {OUTPUT_PATH}/log/STAR_LOAD_GENOME. 1> {log.out} 2> {log.err}
    touch {output}
    rm {OUTPUT_PATH}/log/STAR_LOAD_GENOME.Aligned.out.sam
    """
```
```
# Write a tab-separated manifest (fastq path(s) + sample name) for one alignment batch
run:
    with open(str(output),"w") as writer:
        for sample in ALIGN_BATCHES[wildcards.alignBatch]:
            base_name = STAR_FASTQ_FOLDER+"/"+sample
            if IS_PAIRED_END:
                writer.write(base_name+PAIR_SUFFIX[0]+".fastq.gz\t"+base_name+PAIR_SUFFIX[1]+".fastq.gz\t"+sample+"\n")
            else:
                writer.write(base_name+".fastq.gz\t-\t"+sample+"\n")
```
```
# Align all fastqs of a batch with STAR, using the manifest (one read group per sample)
shell: """
    STAR --runThreadN {params.cpu} --genomeDir {input.starIndexDir} --readFilesCommand zcat \\
    --readFilesManifest {input.manifest} --outSAMattributes NH HI AS nM RG \\
    --outFileNamePrefix {OUTPUT_PATH}/log/STAR_ALIGN_{wildcards.alignBatch}. --outSAMtype BAM Unsorted \\
    {STAR_LOAD_BEHAVIOR} 1> {log.out} 2> {log.err}
    """
```
```
# Split the multi-sample BAM into one BAM per read group (sample)
shell: """
    samtools split -f '{OUTPUT_PATH}/log/%!.bam' -@{params.cpu} {input}
    """
```
```
# Check that every per-sample BAM was produced by the split, then move the BAMs to the BAM directory
run:
    missingFile=[]
    for alignBatch in ALIGN_BATCHES:
        missingFileInBAM = False
        for sample in ALIGN_BATCHES[alignBatch]:
            bamFile = OUTPUT_PATH+"/log/"+sample+".bam"
            if not os.path.exists(bamFile):
                missingFileInBAM=True
                missingFile.append(bamFile)
        if missingFileInBAM:
            os.remove(OUTPUT_PATH+"/log/"+alignBatch+"_splitIsDone")

    if len(missingFile)>0:
        exit("BAM splitting step is a failure, the following BAM are missing: "+" ".join(missingFile))

    # no error
    for sample in SAMPLES:
        os.rename(OUTPUT_PATH+"/log/"+sample+".bam", OUTPUT_PATH+"/BAM/"+sample+".bam")
```
```
# Align one sample with STAR
shell: """
    STAR --runThreadN {params.cpu} --genomeDir {input.starIndexDir} --readFilesCommand zcat --readFilesIn {input.fastq} \\
    --outFileNamePrefix {OUTPUT_PATH}/log/STAR_ALIGN_{wildcards.sample}. --outSAMtype BAM Unsorted \\
    {STAR_LOAD_BEHAVIOR} 1> {log.out} 2> {log.err}
    """
```
```
# Move the STAR output BAM to its destination
shell: """
    mv {input} {output}
    """
```
```
# Unload the preloaded STAR genome from shared memory
shell: """
    STAR --genomeLoad Remove --runThreadN {params.cpu} --genomeDir {input.starIndexDir} \\
    --outFileNamePrefix {OUTPUT_PATH}/log/STAR_UNLOAD_GENOME. 1> {log.out} 2> {log.err}
    """
```
```
# Sort the BAM with samtools
shell: """
    samtools sort -@{params.cpu} -o {output} {input}
    """
```
```
# Index the sorted BAM
shell: """
    samtools index -@{params.cpu} {input}
    """
```
```
# Deduplicate reads by UMI with umi_tools, then sort the result by read name
shell: """
    umi_tools dedup --no-sort-output --stdin={input.bam} k:{params.paired} --log={log.out} --output-stats={OUTPUT_PATH}/log/DEDUP_UMI_{wildcards.sample} | samtools sort -n -@{params.cpu} -o {output} 2> {log.err}
    """
```
```
# Count reads per feature with htseq-count
shell: """
    htseq-count -f bam -t {params.featureType} -i {params.featureID} -s no {input} {params.gtf} 1> {output} 2> {log.err}
    """
```
```
# Build the raw counts table and alignment stat table with countsTable.R
shell: "Rscript {WORKING_DIR}/countsTable.R {OUTPUT_PATH} 1> {log.out} 2> {log.err}"
```
```
# Aggregate quality reports with MultiQC
shell: """
    multiqc -f -e general_stats -e tophat -e bowtie2 {OUTPUT_PATH} -o {params.outpath}
    """
```
Support
- Future updates