RNAseq pipeline based on Snakemake


citation

If you use this pipeline, please cite:

workflow of the pipeline

  • If inputs are fastq files:

The output folder will be suffixed by _fq

  • If inputs are bam files:

The output folder will be suffixed by _bam

You can specify whether the inputs are fastqs or bams in the config.yaml file.
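For orientation, the switch could look something like this in config.yaml (the key names here are hypothetical; check the shipped config.yaml for the real ones):

input_type: fastq   # hypothetical key: "fastq" or "bam"; decides the _fq / _bam output suffix
ref: mm10           # hypothetical key: mouse or human reference genome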

htseq-count or subread

You can choose to run htseq-count and/or featureCounts in the config.yaml file. featureCounts is much faster (with 10 threads, a ~10 GB mouse RNA-seq dataset finishes in about 10 minutes), while htseq-count may take around 5 hours.

The count results are very similar. htseq-count does some read-quality filtering by default, though; for featureCounts you have to ask for it in the arguments, and it is currently set to 0 (no filtering).
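For reference, the flags involved are standard options of the two tools, not anything specific to this pipeline: htseq-count's -a/--minaqual defaults to 10, while featureCounts' -Q defaults to 0.

## htseq-count silently drops reads below alignment quality 10 unless told otherwise
htseq-count -a 10 -m intersection-nonempty --stranded=no -f bam input.bam genes.gtf > counts.txt
## featureCounts counts every read unless you raise -Q above 0
featureCounts -Q 0 -T 5 -p -t exon -g gene_id -a genes.gtf -o counts.txt input.bam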

Importantly, featureCounts requires a name-sorted BAM file: http://seqanswers.com/forums/showthread.php?t=49687

That's why, in the STAR command, I specified --outSAMtype BAM Unsorted.

If the BAM file is coordinate-sorted, featureCounts will re-sort it, which can be time-consuming.

In the newest version of featureCounts (v1.5.2), if the BAM files are not sorted by name, featureCounts will sort them on the fly.

What I found is that even when I specified --outSAMtype BAM Unsorted, featureCounts sometimes still complains that the BAM files are not sorted by name. I suspect it has to do with multi-mapped reads.

https://groups.google.com/forum/#!topic/subread/pBjWETghQr0

When you count fragments ("-p") and turn on the "--donotsort" option, featureCounts will take two reads at a time from the input and treat them as the reads from the same pair. If there are, for example, three lines having the same read id (i.e. from the same fragment/pair), then the last line will be used together with the read after it for counting. This will result in incorrect fragment counting because reads were incorrectly paired.

I may change the STAR option to --outSAMtype BAM SortedByCoordinate to save myself from sorting by coordinate later.
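If the on-the-fly sorting ever misbehaves, a plain-samtools workaround (not part of the pipeline itself) is to name-sort explicitly before counting:

## name-sort STAR's output so featureCounts never has to pair reads itself
samtools sort -n -m 2G -@ 5 -o mysample.namesorted.bam mysampleAligned.out.bam
featureCounts -T 5 -p -t exon -g gene_id -a genes.gtf -o counts.txt mysample.namesorted.bam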

naming conventions of the files

The prefix before .bam or .fastq.gz will be used to name the output files.

1-Kdm2a-RasG12D-1079_S43_L007_R1_001.fastq.gz

1-Kdm2a-RasG12D-1079_S43_L007_R2_001.fastq.gz

  • If inputs are fastqs, R1 and R2 in the file names will be used to determine the forward and reverse reads (see the sketch below).
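As an illustration of how such a prefix can be peeled off in the shell (the pipeline's actual parsing lives in fastq2json.py):

## strip the R1 suffix to recover the sample prefix
fq=1-Kdm2a-RasG12D-1079_S43_L007_R1_001.fastq.gz
echo ${fq%_R1_001.fastq.gz}   # -> 1-Kdm2a-RasG12D-1079_S43_L007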

How to distribute workflows

Read the Snakemake documentation on distributing and reproducing workflows.

ssh shark.mdanderson.org
# start a screen session
screen
# make a folder, name it yourself, I named it workdir
mkdir /rsch2/genomic_med/krai/workdir/
cd /rsch2/genomic_med/krai/workdir/
git clone https://gitlab.com/tangming2005/STAR_htseq_RNAseq_pipeline
cd STAR_htseq_RNAseq_pipeline
## go to shark branch
git checkout shark
## edit the config.yaml file as needed, e.g. set mouse or human for the reference genome
nano config.yaml
## skip this if on Shark; Samir has py351 set up for you (see STEPS below)
conda create -n snakemake python=3 snakemake
source activate snakemake

STEPS

on nautilus

There is a python3 environment already set up; just do:

source activate py351

Create the samples.json file by feeding a fastq folder or a bam folder to one of the scripts below. This folder should contain all the samples.

Please use the full path for the folder that contains your fastq folders.

python3 fastq2json.py --fastq_dir /path/to/the/fastq/

python3 bam2json.py --bam_dir /path/to/the/bam/

e.g.

raw_rnaseq/
├── Sample_1-Kdm2a-RasG12D-1079
│ ├── 1-Kdm2a-RasG12D-1079_S43_L007_R1_001.fastq.gz
│ └── 1-Kdm2a-RasG12D-1079_S43_L007_R2_001.fastq.gz
├── Sample_2-Kdm2a-RasG12D-1031
│ ├── 2-Kdm2a-RasG12D-1031_S44_L007_R1_001.fastq.gz
│ └── 2-Kdm2a-RasG12D-1031_S44_L007_R2_001.fastq.gz
├── Sample_3-Mll4-RasG12D-1527
│ ├── 3-Mll4-RasG12D-1527_S45_L007_R1_001.fastq.gz
│ └── 3-Mll4-RasG12D-1527_S45_L007_R2_001.fastq.gz
├── Sample_4-Mll4-RasG12D-1576
│ ├── 4-Mll4-RasG12D-1576_S46_L007_R1_001.fastq.gz
│ └── 4-Mll4-RasG12D-1576_S46_L007_R2_001.fastq.gz
├── Sample_5-RasG12D-1475
│ ├── 5-RasG12D-1475_S47_L007_R1_001.fastq.gz
│ └── 5-RasG12D-1475_S47_L007_R2_001.fastq.gz
└── Sample_6--RasG12D-1508
 ├── 6--RasG12D-1508_S48_L007_R1_001.fastq.gz
 └── 6--RasG12D-1508_S48_L007_R2_001.fastq.gz

python3 fastq2json.py --fastq_dir /rsch2/genomic_med/krai/raw_rnaseq/

Check the file information in the json file:

less -S samples.json
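The general shape of samples.json is sample names keyed to their fastq (or bam) paths, roughly like the sketch below (a hypothetical layout; the script's exact schema may differ):

{
    "1-Kdm2a-RasG12D-1079_S43_L007": {
        "R1": ["/rsch2/genomic_med/krai/raw_rnaseq/Sample_1-Kdm2a-RasG12D-1079/1-Kdm2a-RasG12D-1079_S43_L007_R1_001.fastq.gz"],
        "R2": ["/rsch2/genomic_med/krai/raw_rnaseq/Sample_1-Kdm2a-RasG12D-1079/1-Kdm2a-RasG12D-1079_S43_L007_R2_001.fastq.gz"]
    }
}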

dry run to test

## dry run
snakemake -np

If there are no errors, proceed below.

Using DRMAA

job control through DRMAA

DRMAA is only supported on Shark.

module load drmaa
./pyflow-drmaa-RNAseq.sh

With DRMAA, you can use Control+C to stop the current run.

Dependent jobs are submitted one by one; if a job fails, the pipeline stops. Good for initial testing.
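A wrapper like pyflow-drmaa-RNAseq.sh typically boils down to a snakemake call along these lines (a sketch with assumed resource flags, not the script's verbatim contents):

snakemake --drmaa " -n {threads}" --jobs 100 --latency-wait 120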

submit all jobs to the cluster

./pyflow-RNAseq.sh

All jobs will be submitted to the cluster queue at once. This is useful if you expect most of your jobs to succeed and want them sitting in the queue to gain priority.
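Likewise, pyflow-RNAseq.sh is presumably a thin wrapper around per-rule cluster submission; on an LSF cluster such as Shark, a sketch would be:

snakemake --cluster "bsub -n {threads} -o bsub.out -e bsub.err" --jobs 100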

job control

To kill all of your pending jobs you can use the command:

bkill `bjobs -u krai | grep PEND | cut -f1 -d" "`

## display detailed information for all of your pending jobs
bjobs -pl
## display only pending and suspended jobs
bjobs -ps
## display all jobs of all users
bjobs -u all -a
## display recently finished jobs submitted by mtang1 to the short queue and executed on host apple
bjobs -d -q short -m apple -u mtang1

rerun some of the jobs

# specify the name of a rule; all files associated with that rule will be rerun, e.g. rerun the htseq count rule:
./pyflow-RNAseq.sh -R htseq_fq
# process until htseq_fq
./pyflow-RNAseq.sh -R --until htseq_fq
# process until htseq_fq, force rerun of all files related to this step
./pyflow-RNAseq.sh -FR --until htseq_fq
## rerun one sample: just specify the name of the target file
./pyflow-RNAseq.sh -R mysampleAligned.out.bam

checking results after the run finishes

snakemake --summary | sort -k1,1 | less -S
# or use --detailed-summary, which also shows the commands used to generate each output and the inputs used
snakemake --detailed-summary | sort -k1,1 > snakemake_run_summary.txt

clean the folders

I use echo first to see what will be removed; drop the echo to actually remove everything.

find . -maxdepth 1 -type d -name "[0-9]*" | xargs echo rm -rf

Snakemake does not trigger re-runs if I add additional input files. What can I do?

Snakemake has a kind of “lazy” policy about added input files: if their modification date is older than that of the output files, nothing is re-run. One reason is that what to do cannot be inferred just from the input and output files; additional information about the last run would need to be stored. Since behaviour would be inconsistent between cases where that information is available and where it is not, this functionality is behind an extra switch. To trigger updates for jobs with changed input files, you can use the command line argument --list-input-changes in the following way:

snakemake -n -R `snakemake --list-input-changes`

How do I trigger re-runs for rules with updated code or parameters?

snakemake -n -R `snakemake --list-params-changes`

and

snakemake -n -R `snakemake --list-code-changes`

Acknowledgements

Thanks to Samir Amin for providing the STAR alignment parameters.

Code Snippets

Snakefile lines 71-109 (STAR alignment):
shell:
	"""
	STAR --runMode alignReads \
	--runThreadN 5 \
	--genomeDir {STARINDEX} \
	--genomeLoad NoSharedMemory \
	--readFilesIn {input.r1} {input.r2} \
	--readFilesCommand zcat \
	--twopassMode Basic \
	--runRNGseed 777 \
	--outFilterType Normal \
	--outFilterMultimapNmax 20 \
	--outFilterMismatchNmax 10 \
	--outFilterMultimapScoreRange 1 \
	--outFilterMatchNminOverLread 0.33 \
	--outFilterScoreMinOverLread 0.33 \
	--outReadsUnmapped None \
	--alignIntronMin 20 \
	--alignIntronMax 500000 \
	--alignMatesGapMax 1000000 \
	--alignSJoverhangMin 8 \
	--alignSJstitchMismatchNmax 5 -1 5 5 \
	--sjdbScore 2 \
	--alignSJDBoverhangMin 1 \
	--sjdbOverhang 100 \
	--chimSegmentMin 20 \
	--chimJunctionOverhangMin 20 \
	--chimSegmentReadGapMax 3 \
	--quantMode GeneCounts \
	--outMultimapperOrder Random \
	--outSAMstrandField intronMotif \
	--outSAMattributes All \
	--outSAMunmapped Within KeepPairs \
	--outSAMtype BAM Unsorted \
	--limitBAMsortRAM 30000000000 \
	--outSAMmode Full \
	--outSAMheaderHD @HD VN:1.4 \
	--outFileNamePrefix {params.outprefix} 2> {log}
	"""
Snakefile lines 119-124 (htseq-count):
shell:
	"""
	source activate root
	htseq-count -m intersection-nonempty --stranded=no --idattr gene_id -r name -f bam {input} {MYGTF} > {output} 2> {log}

	"""
Snakefile lines 134-139 (featureCounts):
shell:
	"""
# -p for paired-end data: count fragments rather than reads
	featureCounts -T 5 -p -t exon -g gene_id -a {MYGTF} -o {output} {input} 2> {log}

	"""
Snakefile lines 148-152 (samtools sort):
shell:
	"""
	samtools sort -m 2G -@ 5 -T {output}.tmp -o {output} {input} 2> {log}

	"""
Snakefile lines 162-165 (samtools index):
shell:
	"""
	samtools index {input}
	"""
Snakefile lines 176-181 (bamCoverage):
shell:
	"""
	source activate root
	bamCoverage -b {input[0]} --skipNonCoveredRegions --normalizeUsing RPKM --binSize 20 --smoothLength 100 -p 5  -o {output} 2> {log}

	"""
Snakefile lines 191-196 (htseq-count):
shell:
	"""
	source activate root
	htseq-count -m intersection-nonempty --stranded=no --idattr gene_id -r name -f bam {input} {MYGTF} > {output} 2> {log}

	"""
Snakefile lines 205-210 (featureCounts):
shell:
	"""
# -p for paired-end data: count fragments rather than reads
	featureCounts -T 5 -p -t exon -g gene_id -a {MYGTF} -o {output} {input} 2> {log}

	"""
Snakefile lines 220-225 (bamCoverage):
shell:
	"""
	source activate root
	bamCoverage -b {input} --skipNonCoveredRegions --normalizeUsing RPKM --binSize 20 --smoothLength 100 -p 5  -o {output} 2> {log}

	"""


URL: https://github.com/crazyhottommy/pyflow-RNAseq
Name: pyflow-rnaseq
Version: v1.0.0
License: MIT License
