Snakemake pipeline to perform a high-performance batch RNA-seq quantification

public public 1yr ago 0 bookmarks

Introduction

The Snakemake pipeline presented here allow to align a batch of RNA-seq paired-end samples accounting for genomic features.

It has been thought to be runned in a High-performance computing environment, but it could be adapted depending on the computer capability.

It allows both GENCODE and RefSeq genome reference annotations.

Sequentially it executes:

  • Cutadapt : To find and removes adapters. It uses the cutadapt_pe mode.

  • STAR : Spliced and referenced transcripts aligner.

  • Samtools : To interact with the sequencing data.

  • featureCounts : To count reads to genomic features such as genes or exons.

QSplice could also be runned after the pipeline in order to quantify the splice junctions coverage per transcript.

Installation

Run the silent installation of Miniconda in case you don't have this software in your Linux Environment

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3

Once you have installed Miniconda/Anaconda, create a Python 3.7 environment. Then, install snakemake in your conda environment:

conda install -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake

If Slurm has been installed in the computing environment, install with:

sbatch -o log.txt -e err.txt -J smk-install -c 2 --mem=2G -t 200 --wrap "mamba create -c conda-forge -c bioconda -y -n snakemake snakemake"

Usage

To execute the pipeline, first the user must prepare their samples.

git clone git@gitlab.com:fpozoc/appris_rnaseq.git
cd appris_rnaseq

Then, edit config and workflow as needed:

vim config/config.yaml

sample config.yaml , in which user must decide:

  • To use GENCODE or RefSeq annotation GTF files.

  • Which is the appropriate reference genome for this analysis. Please, read the Heng Li reference and this Biostar answer before taking a final decision.

  • Which cutadapt and STAR parameters desire to select

  • Which RNA-seq samples user wants to align. In this case we are using E-MTAB-2836 .

annotations:
 GRCh38:
 g34:
 url: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.primary_assembly.annotation.gtf.gz # GENCODE 34 GRCh38 gtf annotation file
 enabled: True # Enabled to be runned
 rs109:
 url: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GRCh38_major_release_seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz 
 enabled: True
 GRCh37:
 g19:
 url: XXXX # GENCODE 19 GRCh37 gtf annotation file
 enabled: False
genomes:
 GRCh38:
 url: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GRCh38_major_release_seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz 
 GRCh37: 
 url: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
params:
 cutadapt_pe: "option1"
 STAR: "option2"
samples:
 E-MTAB-XXXX: # proyect
 ERRXXXXX: # sample
 "1": # replicate
 r1: ERR315325_1.fastq.gz # FastQ read 1
 r2: ERR315325_2.fastq.gz # FastQ read 2

Finally, execute workflow, deploy software dependencies via conda:

snakemake -n --use-conda

If Slurm has been installed in the computing environment:

snakemake -n --use-conda --profile slurm

Directory structure

├── config
│ └── config.yaml
├── envs
│ └── environment.yaml
├── log
├── out
├── README.md
├── seq
└── Snakefile

Author information and license

Fernando Pozo ( @fpozoca – fpozoc@cnio.es) and Tomás Di Domenico.

Project initially forked from here .

Release History

  • 1.0.0.

Contributing

  1. Fork it ( https://gitlab.com/fpozoc/appris_rnaseq.git )

  2. Create your feature branch ( git checkout -b feature/fooBar )

  3. Commit your changes ( git commit -am 'Add some fooBar' )

  4. Push to the branch ( git push origin feature/fooBar )

  5. Create a new Pull Request

Code Snippets

36
37
38
shell:"""
    curl {params.url} | gunzip > {output}
"""
49
50
51
52
shell:"""
    curl {params.url} | gunzip > {output.main}
    grep -P '\tCDS\t' {output.main} > {output.cds}
"""
63
64
65
66
shell:"""
    samtools faidx {input.fa} 
    cut -f1,2 {input.fa}.fai > {output}
"""
79
80
shell:
    "cat {input.r} > {output.merged} 2> {log.stderr}"
104
105
shell:
   "cutadapt -j {threads} {config[params][cutadapt_pe]} -o {output.r1} -p {output.r2} {input.r1} {input.r2} > {log.stdout} 2> {log.stderr}"
150
151
152
153
shell:         
    """
    STAR --sjdbGTFfile {input.annfile} --genomeDir {params.genome_idx} --outFileNamePrefix {params.prefix}/ --readFilesIn {input.r1} {input.r2} --runThreadN {threads} {config[params][STAR]} > {log.stdout} 2> {log.stderr}
"""
168
169
wrapper:
    "0.30.0/bio/samtools/sort"
SnakeMake From line 168 of master/Snakefile
178
179
wrapper:
    "0.30.0/bio/samtools/index"
SnakeMake From line 178 of master/Snakefile
195
196
197
shell:"""
    featureCounts -O -M --fraction -T {threads} -t CDS -g gene_id -a {input.ann} -o {output} {input.bam} > {log.stdout} 2> {log.stderr}
"""
215
216
217
shell:"""
    rsem-prepare-reference --gtf {input.gtf} --num-threads {threads} {input.fasta} {params.genome_dir} > {log.stdout} 2> {log.stderr}
"""
236
237
238
shell:"""
    rsem-calculate-expression --num-threads {threads} --bam {input.bam} --no-bam-output --paired-end {params.genome_idx} {params.prefix} > {log.stdout} 2> {log.stderr}
""" 
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


import os
from snakemake.shell import shell


prefix = os.path.splitext(snakemake.output[0])[0]

shell(
    "samtools sort {snakemake.params} -@ {snakemake.threads} -o {snakemake.output[0]} "
    "-T {prefix} {snakemake.input[0]}")
ShowHide 7 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/fpozoc/appris_rnaseq
Name: appris_rnaseq
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...