RNA-seq Alignment and Counting Pipeline


This pipeline provides a simple alignment workflow for RNA-seq data (without UMI). The fastq files are aligned to a reference genome with STAR and counted with htseq-count.

Prerequisites

  • Git

  • Conda

Input

The fastq.gz files must be gathered in a single directory. The path to this directory is specified in config.json. There is one fastq file per sample (or two if the data are paired-end).
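
As an illustration of this naming convention, the sketch below derives sample names from the files in the fastq directory. It is not part of the pipeline; fastq_path, is_paired_end, and pattern are placeholder values standing in for the FASTQ_PATH, IS_PAIRED_END, and PAIR_END_FILE_PATTERN options described in the Configuration section.

# Illustrative only: derive sample names from the fastq.gz files, assuming
# single-end naming (sample.fastq.gz) or paired-end naming with
# PAIR_END_FILE_PATTERN = "_" (sample_1.fastq.gz / sample_2.fastq.gz).
import glob
import os
import re

fastq_path = "fastq"          # hypothetical FASTQ_PATH
is_paired_end = True          # hypothetical IS_PAIRED_END
pattern = "_"                 # hypothetical PAIR_END_FILE_PATTERN

samples = set()
for path in glob.glob(os.path.join(fastq_path, "*.fastq.gz")):
    name = os.path.basename(path)[:-len(".fastq.gz")]
    if is_paired_end:
        name = re.sub(re.escape(pattern) + r"[12]$", "", name)
    samples.add(name)
print(sorted(samples))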

Installation

git clone https://github.com/DimitriMeistermann/rnaseq_align.git
cd rnaseq_align
conda env create -f virtualEnvs/rnaseq_align.yml

Configuration

  • IS_PAIRED_END : true or false. Is the data paired-end (forward and reverse reads)?

  • PRELOAD_GENOME : true or false (experimental). STAR can preload the genome into RAM so that subsequent jobs can reuse it. Currently this does not work on bird2cluster.

  • STAR_ALIGN_MULTIPLE_FILE : true or false. Align the fastqs in batches, then split the resulting BAM per sample. This is an alternative way to minimize the number of times the reference genome is loaded when PRELOAD_GENOME is not possible.

  • MAX_SIZE_PER_MULTIPLE_ALIGN : Size of the alignment batches (if STAR_ALIGN_MULTIPLE_FILE=true ). Example for 50 (meaning 50 gigabytes): with 10 samples of 10 GB each, the alignment is performed in two batches, i.e. two different STAR_ALIGN jobs. A batch is also closed every 1000 samples, even if MAX_SIZE_PER_MULTIPLE_ALIGN has not been reached (illustrated by the first sketch after this list).

  • READ_LENGTH : String. Number of nucleotides in each read. If left as an empty string, it is determined automatically by reading the first read of the first fastq (see the second sketch after this list). Warning: if cutadapt has already been applied to the fastqs, enter the read length manually (its value before trimming).

  • STAR_GENOME_THREADS : Number of cores used for genome indexing, genome preloading, and alignment when STAR_ALIGN_MULTIPLE_FILE is enabled.

  • THREAD_PER_SAMPLE : Number of cores used by most of the workflow jobs.

  • FASTA_REFERENCE : Path of the FASTA file of the reference genome.

  • GTF_REFERENCE : Path of the GTF file (feature annotation of the reference genome).

  • FASTQ_PATH : Directory that contains the input fastqs.

  • OUTPUT_PATH : Path where the results will be written.

  • PAIR_END_FILE_PATTERN : Characters between the sample name and the 1 or 2 suffix of paired-end data. Example: if forward reads are in sample_1.fastq.gz and reverse reads in sample_2.fastq.gz , then PAIR_END_FILE_PATTERN = "_"

  • FORWARD_ADAPTATOR : Forward sequencing adapter for cutadapt (adapter trimming). If this argument is an empty string, cutadapt is not used.

  • REVERSE_ADAPTATOR : Reverse sequencing adapter for cutadapt. Not used for single-end data.

  • GENOME_INDEX_RAM : RAM allowed for genome indexing (passed to STAR's --limitGenomeGenerateRAM, in bytes).

  • CUTADAPT_MIN_READ_LEN : Minimum read length after trimming. If a read is too short, it is removed from the fastq.

  • DEDUP_UMI : true or false. Deduplicate reads based on their UMI. The UMI must be stored at the end of the read name, separated by an underscore, e.g. @[readname]_[UMI] . See the documentation of umi_tools extract for details on how to add UMIs to read names when the sequencing method uses UMIs.

  • FEATURE_TYPE : Feature type of the GTF (third column) in which aligned reads are counted (example: exon, gene, ...).

  • FEATURE_ID : GTF attribute used as the row IDs of the count table (gene_name for gene symbols).
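
The batching rule for MAX_SIZE_PER_MULTIPLE_ALIGN can be illustrated with a short sketch. This is not the pipeline's code (the actual logic lives in the Snakefile); the make_batches helper and the sample sizes are assumptions used only to reproduce the worked example above.

# Illustrative sketch of the batching rule: samples are accumulated into a
# batch until adding one more would exceed the size limit (in GB) or the
# batch already holds 1000 samples.
def make_batches(sample_sizes_gb, max_size_gb=50, max_samples=1000):
    batches, current, current_size = [], [], 0.0
    for sample, size in sample_sizes_gb.items():
        if current and (current_size + size > max_size_gb or len(current) >= max_samples):
            batches.append(current)
            current, current_size = [], 0.0
        current.append(sample)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 10 samples of 10 GB with a 50 GB limit -> two batches of 5 samples each
sizes = {f"sample{i}": 10 for i in range(1, 11)}
print(make_batches(sizes))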

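A minimal sketch of the automatic read-length detection described for READ_LENGTH, assuming a directory of gzipped fastqs; the path is illustrative and this is not the pipeline's own code.

# Illustrative: read length taken from the first read of the first fastq.gz
import glob
import gzip

first_fastq = sorted(glob.glob("fastq/*.fastq.gz"))[0]   # hypothetical FASTQ_PATH
with gzip.open(first_fastq, "rt") as fh:
    fh.readline()                                  # skip the @header line
    read_length = len(fh.readline().rstrip("\n"))  # length of the first read
print(read_length)                                 # value to put in READ_LENGTH
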
Execution of the pipeline

Normal launch (with 8 cores), using the default config.json

conda activate rnaseq_align
snakemake -rp -j 8

With a specific config file

conda activate rnaseq_align
snakemake -rp -j 8 --configfile configFileSave/[yourConfig.json]

On an SGE grid engine (example: bird2cluster ).

You can find this command in clusterFiles/exeSnakemakeBird.sh . You can modify the job parameters by editing clusterFiles/jobscriptBird.sh .

conda activate rnaseq_align
snakemake -rp -j 20 --cluster 'qsub' --jobscript clusterFiles/jobscriptBird.sh --latency-wait 10 --max-jobs-per-second 1 --configfile configFileSave/configTest.json

Test files can be found at https://gitlab.univ-nantes.fr/E114424Z/rnaseq_align/-/tree/master/testFolder/fastq

Output data

  • log : log files from the executed jobs, in the form RULE_NAME[_sampleName].(err|out) . Additional logs may be saved depending on the tool used.

  • FASTQ_CUTADAPT : optional; if cutadapt is used, the trimmed fastqs are stored in this folder.

  • fastQC : Zip and HTML reports from the execution of FastQC.

  • BAM : this directory contains the alignment results as BAM files.

  • DEDUP_BAM : if applicable, this directory contains the UMI-deduplicated BAM files.

  • counts : this directory contains the per-sample count files produced by htseq-count.

  • results : this directory contains the raw count table, a table of alignment statistics, and the MultiQC report presenting different quality metrics for the data (see the loading sketch below).
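
As an example, the two tables written to results can be loaded as data frames. This is a minimal sketch assuming pandas is installed; the output_path value is illustrative, and the file and column names come from the countsTable.R script shown below.

# Illustrative loading of the pipeline's summary tables with pandas
import pandas as pd

output_path = "rnaseq_align_output"  # illustrative value of OUTPUT_PATH

# genes (FEATURE_ID) as rows, samples as columns
counts = pd.read_csv(f"{output_path}/results/rawCountsTable.tsv",
                     sep="\t", index_col=0)

# samples as rows; columns: aligned, no_feature, ambiguous, too_low_aQual,
# not_aligned, alignment_not_unique, total_read
align_stats = pd.read_csv(f"{output_path}/results/alignStatTable.tsv",
                          sep="\t", index_col=0)

print(counts.shape, align_stats.columns.tolist())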

Code Snippets

countsTable.R (builds the raw count table and the alignment-stat table from the htseq-count output):
args <- commandArgs(trailingOnly = TRUE)
options(stringsAsFactors = FALSE)

require(data.table)

#############
# FUNCTIONS #
#############
fastRead <- function(fileName, sep = '\t',row.names = 1,as.matrix=FALSE,stringsAsFactors=FALSE,...){
	require(data.table)
	dat <- as.data.frame(data.table::fread(fileName,stringsAsFactors=stringsAsFactors, sep = sep,...))
	if(!is.null(row.names)){
	  rownames(dat) <- dat[,row.names]
	  dat <- dat[,-row.names,drop=FALSE]
	}
	if(as.matrix) dat<-as.matrix(dat)
	return(dat)
}

fastWrite <- function(x, fileName = "default.tsv", headRow="Name",row.names=TRUE,col.names=TRUE, dec=".",sep="\t",...) {
	require(data.table)
	if(row.names){
		x=cbind(rownames(x),x)
		colnames(x)[1]<-headRow
	}
	fwrite(x=x,file=fileName,sep="\t", row.names = FALSE, col.names = col.names, quote = FALSE, dec=dec,...)
}


########
# MAIN #
########
# Build the count matrix: one column per sample, one row per gene

if (length(args)==0) stop("The fastq path argument is missing.\n", call.=FALSE)

fastq_path <- args[1]
files<-list.files(paste0(fastq_path,"/counts"))
samplesName<- substr(files,1,nchar(files)-7)
countsList<-list()
statList<-list()
for(i in 1:length(samplesName)){
	counts<-fastRead(paste0(fastq_path,"/counts/",files[i]),header=FALSE)
	countsList[[samplesName[i]]]<-counts[1:(nrow(counts)-5),,drop=FALSE]
	aligned<-colSums(countsList[[samplesName[i]]])
	stat<-rbind(aligned,counts[(nrow(counts)-4):nrow(counts),,drop=FALSE])
	rownames(stat)<-c("aligned","no_feature","ambiguous","too_low_aQual","not_aligned",
		"alignment_not_unique")
	statList[[samplesName[i]]]<-stat
}
countsTable<-do.call(cbind,countsList)
colnames(countsTable)<-samplesName

statTable<-data.frame(t(do.call(cbind,statList)))
rownames(statTable)<-samplesName
statTable$total_read<-rowSums(statTable)

fastWrite(countsTable,paste0(fastq_path,"/results/rawCountsTable.tsv"))
fastWrite(statTable,paste0(fastq_path,"/results/alignStatTable.tsv"))

main/Snakefile, lines 114-119 (STAR genome indexing):
shell: """
mkdir {output}
STAR --runThreadN {params.cpu} --runMode genomeGenerate --genomeDir {output} --genomeFastaFiles {input.fasta} \\
--outFileNamePrefix {OUTPUT_PATH}/log/STAR_INDEX. \\
--sjdbGTFfile {input.gtf} --sjdbOverhang {OVERHANG}  --limitGenomeGenerateRAM {params.ram} 1> {log.out} 2> {log.err}
"""

main/Snakefile, lines 129-131 (adapter trimming with cutadapt):
shell: """
cutadapt {CUTADAPT_PARAMETER} -j {params.cpu} --minimum-length {params.minLen}  {input} 2> {log.err} | gzip -c > {output}
"""	

main/Snakefile, lines 142-144 (FastQC):
shell: """
fastqc -o {params.outpath} {input} 1> {log.out} 2> {log.err}
"""

main/Snakefile, lines 160-170 (STAR genome preloading):
shell: """
STAR --genomeLoad Remove  --genomeDir {input.starIndexDir} --runThreadN {params.cpu} \\
--outFileNamePrefix {OUTPUT_PATH}/log/STAR_LOAD_GENOME. 1> {log.out} 2> {log.err}

STAR --genomeLoad LoadAndExit  --runThreadN {params.cpu}  \\
--genomeDir {input.starIndexDir} \\
--outFileNamePrefix {OUTPUT_PATH}/log/STAR_LOAD_GENOME.  1> {log.out} 2> {log.err}

touch {output}
rm {OUTPUT_PATH}/log/STAR_LOAD_GENOME.Aligned.out.sam
"""

main/Snakefile, lines 186-193 (fastq manifest for a batch alignment):
run:
	with open(str(output),"w") as writer:
		for sample in ALIGN_BATCHES[wildcards.alignBatch]:
			base_name = STAR_FASTQ_FOLDER+"/"+sample
			if IS_PAIRED_END:
				writer.write(base_name+PAIR_SUFFIX[0]+".fastq.gz\t"+base_name+PAIR_SUFFIX[1]+".fastq.gz\t"+sample+"\n")
			else:
				writer.write(base_name+".fastq.gz\t-\t"+sample+"\n")

main/Snakefile, lines 207-212 (STAR alignment of a batch):
shell: """
STAR --runThreadN {params.cpu} --genomeDir {input.starIndexDir} --readFilesCommand zcat \\
--readFilesManifest {input.manifest} --outSAMattributes NH HI AS nM RG \\
--outFileNamePrefix {OUTPUT_PATH}/log/STAR_ALIGN_{wildcards.alignBatch}. --outSAMtype BAM Unsorted \\
{STAR_LOAD_BEHAVIOR} 1> {log.out} 2> {log.err}
"""

main/Snakefile, lines 218-220 (split the batch BAM per sample):
shell: """			
samtools split -f '{OUTPUT_PATH}/log/%!.bam' -@{params.cpu} {input}
"""

main/Snakefile, lines 225-243 (check the split BAMs and move them to BAM/):
run:
	missingFile=[]
	for alignBatch in ALIGN_BATCHES:
		missingFileInBAM = False

		for sample in ALIGN_BATCHES[alignBatch]:
			bamFile = OUTPUT_PATH+"/log/"+sample+".bam"
			if not os.path.exists(bamFile):
				missingFileInBAM=True
				missingFile.append(bamFile)

		if missingFileInBAM: os.remove(OUTPUT_PATH+"/log/"+alignBatch+"_splitIsDone")

	if len(missingFile)>0:
		exit("BAM splitting step is a failure, the following BAM are missing: "+ " ".join(missingFile))

	#no error
	for sample in SAMPLES:
		os.rename(OUTPUT_PATH+"/log/"+sample+".bam", OUTPUT_PATH+"/BAM/"+sample+".bam")

main/Snakefile, lines 259-265 (STAR alignment of a single sample):
shell: """

STAR --runThreadN {params.cpu} --genomeDir {input.starIndexDir} --readFilesCommand zcat --readFilesIn {input.fastq} \\
--outFileNamePrefix {OUTPUT_PATH}/log/STAR_ALIGN_{wildcards.sample}. --outSAMtype BAM Unsorted \\
{STAR_LOAD_BEHAVIOR} 1> {log.out} 2> {log.err}

"""

main/Snakefile, lines 271-273 (move the BAM to its final location):
shell: """
	mv {input} {output}
"""

main/Snakefile, lines 286-289 (unload the STAR genome from RAM):
shell: """
STAR --genomeLoad Remove  --runThreadN {params.cpu}  --genomeDir {input.starIndexDir} \\
--outFileNamePrefix {OUTPUT_PATH}/log/STAR_UNLOAD_GENOME.  1> {log.out} 2> {log.err}
"""

main/Snakefile, lines 296-298 (BAM sorting):
shell: """
samtools sort -@{params.cpu} -o {output} {input} 
"""

main/Snakefile, lines 304-306 (BAM indexing):
shell: """
samtools index -@{params.cpu} {input}
"""

main/Snakefile, lines 320-322 (UMI deduplication with umi_tools):
shell: """
	umi_tools dedup --no-sort-output --stdin={input.bam} {params.paired} --log={log.out}  --output-stats={OUTPUT_PATH}/log/DEDUP_UMI_{wildcards.sample} | samtools sort -n -@{params.cpu} -o {output} 2> {log.err}
"""

main/Snakefile, lines 335-337 (read counting with htseq-count):
shell: """
htseq-count -f bam -t {params.featureType} -i {params.featureID} -s no {input} {params.gtf} 1> {output} 2> {log.err}
"""

main/Snakefile, line 348 (build the count tables with countsTable.R):
shell: "Rscript {WORKING_DIR}/countsTable.R {OUTPUT_PATH}  1> {log.out} 2> {log.err}"

main/Snakefile, lines 359-361 (MultiQC report):
shell: """
multiqc -f -e general_stats -e tophat -e bowtie2 {OUTPUT_PATH} -o {params.outpath}
"""
