Snakemake pipeline to perform a high-performance batch RNA-seq quantification

APPRIS RNA-Seq pipeline

Introduction

The Snakemake pipeline presented here aligns a batch of paired-end RNA-seq samples, taking genomic features into account.

It was designed to run in a high-performance computing (HPC) environment, but it can be adapted to the capabilities of the machine at hand.

It supports both GENCODE and RefSeq genome reference annotations.

It runs the following tools in sequence:

Cutadapt : Finds and removes adapters. The pipeline uses the cutadapt_pe mode.
STAR : Spliced Transcripts Alignment to a Reference; aligns the reads to the reference genome.
Samtools : Handles the sequence and alignment files (indexing and sorting).
featureCounts : Counts reads assigned to genomic features such as genes or exons.

QSplice can also be run after the pipeline to quantify splice-junction coverage per transcript.

Installation

If Miniconda is not already available in your Linux environment, run its silent installation:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3

Once Miniconda/Anaconda is installed, create a Python 3.7 environment and install Snakemake in it:

conda install -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake

If Slurm is available in the computing environment, submit the installation as a job instead:

sbatch -o log.txt -e err.txt -J smk-install -c 2 --mem=2G -t 200 --wrap "mamba create -c conda-forge -c bioconda -y -n snakemake snakemake"

Usage

To execute the pipeline, first clone the repository and prepare your samples:

git clone git@gitlab.com:fpozoc/appris_rnaseq.git
cd appris_rnaseq

Then, edit the configuration and workflow as needed:

vim config/config.yaml

A sample config.yaml is shown below. In it, the user must decide:

Whether to use GENCODE or RefSeq annotation GTF files.
Which reference genome is appropriate for the analysis. Please read the Heng Li reference and this Biostar answer before making a final decision.
Which cutadapt and STAR parameters to use.
Which RNA-seq samples to align. This example uses E-MTAB-2836.

annotations:
  GRCh38:
    g34:
      url: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.primary_assembly.annotation.gtf.gz # GENCODE 34 GRCh38 gtf annotation file
      enabled: True # Enabled to be run
    rs109:
      url: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GRCh38_major_release_seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz
      enabled: True
  GRCh37:
    g19:
      url: XXXX # GENCODE 19 GRCh37 gtf annotation file
      enabled: False
genomes:
  GRCh38:
    url: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GRCh38_major_release_seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
  GRCh37:
    url: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
params:
  cutadapt_pe: "option1"
  STAR: "option2"
samples:
  E-MTAB-XXXX: # project
    ERRXXXXX: # sample
      "1": # replicate
        r1: ERR315325_1.fastq.gz # FastQ read 1
        r2: ERR315325_2.fastq.gz # FastQ read 2
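
The samples block nests project, sample, replicate and read files, in that order. As a minimal sketch (not the pipeline's own code), the snippet below walks an equivalent Python dictionary and yields one record per replicate. The name iter_replicates and the inline SAMPLES dictionary are illustrative assumptions; in the workflow the dictionary would be loaded from config/config.yaml.

```python
from typing import Dict, Iterator, Tuple

# Illustrative stand-in for the `samples` block of config/config.yaml.
SAMPLES: Dict[str, dict] = {
    "E-MTAB-XXXX": {              # project
        "ERRXXXXX": {             # sample
            "1": {                # replicate
                "r1": "ERR315325_1.fastq.gz",
                "r2": "ERR315325_2.fastq.gz",
            },
        },
    },
}


def iter_replicates(samples: Dict[str, dict]) -> Iterator[Tuple[str, str, str, str, str]]:
    """Yield (project, sample, replicate, r1, r2) for every replicate."""
    for project, project_samples in samples.items():
        for sample, replicates in project_samples.items():
            for replicate, reads in replicates.items():
                yield project, sample, replicate, reads["r1"], reads["r2"]


if __name__ == "__main__":
    for record in iter_replicates(SAMPLES):
        print("\t".join(record))
```

Flattening the hierarchy this way makes it easy to check that every replicate has both r1 and r2 before launching a run.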

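Finally, run the workflow, deploying software dependencies via conda. The -n flag performs a dry run that only lists the planned jobs; omit it to actually execute them: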

snakemake -n --use-conda

If Slurm is available in the computing environment:

snakemake -n --use-conda --profile slurm

Directory structure

├── config
│ └── config.yaml
├── envs
│ └── environment.yaml
├── log
├── out
├── README.md
├── seq
└── Snakefile

Author information and license

Fernando Pozo ( @fpozoca – fpozoc@cnio.es) and Tomás Di Domenico.

Project initially forked from here.

Release History

1.0.0.

Contributing

Fork it (https://gitlab.com/fpozoc/appris_rnaseq.git)
Create your feature branch (git checkout -b feature/fooBar)
Commit your changes (git commit -am 'Add some fooBar')
Push to the branch (git push origin feature/fooBar)
Create a new Pull Request

Code Snippets

shell:"""
    curl {params.url} | gunzip > {output}
"""

SnakeMake From line 36 of master/Snakefile

shell:"""
    curl {params.url} | gunzip > {output.main}
    grep -P '\tCDS\t' {output.main} > {output.cds}
"""

SnakeMake From line 49 of master/Snakefile
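
For readers less familiar with grep -P, the CDS filter above can be approximated in Python. This is a sketch under assumptions: the function name filter_cds and the example records are made up for illustration, and the check is slightly stricter than the grep pattern, since it tests the feature column (column 3) rather than matching a tab-delimited CDS anywhere in the line.

```python
from typing import Iterable, List


def filter_cds(gtf_lines: Iterable[str]) -> List[str]:
    """Keep GTF records whose feature column (column 3) is exactly 'CDS'."""
    kept = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # comment/header lines carry no feature column
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "CDS":
            kept.append(line)
    return kept


# Synthetic records, for illustration only.
EXAMPLE_GTF = [
    "##description: synthetic example\n",
    'chrA\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "geneA";\n',
    'chrA\tsrc\tCDS\t150\t300\t.\t+\t0\tgene_id "geneA";\n',
    'chrA\tsrc\texon\t100\t400\t.\t+\t.\tgene_id "geneA";\n',
]

if __name__ == "__main__":
    for record in filter_cds(EXAMPLE_GTF):
        print(record, end="")
```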

shell:"""
    samtools faidx {input.fa} 
    cut -f1,2 {input.fa}.fai > {output}
"""

SnakeMake SAMtools From line 63 of master/Snakefile
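
The cut -f1,2 step above keeps the first two columns of the .fai index, which hold the sequence name and its length. The sketch below does the same in Python; chrom_sizes_from_fai and the example index text are illustrative assumptions, not part of the pipeline.

```python
from typing import List, Tuple


def chrom_sizes_from_fai(fai_text: str) -> List[Tuple[str, int]]:
    """Return (sequence name, length) pairs from the text of a .fai index."""
    sizes = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        name, length = line.split("\t")[:2]
        sizes.append((name, int(length)))
    return sizes


# Synthetic index with the five .fai columns:
# name, length, offset, bases per line, bytes per line.
EXAMPLE_FAI = "chrA\t1000\t6\t60\t61\nchrB\t500\t1029\t60\t61\n"

if __name__ == "__main__":
    for name, length in chrom_sizes_from_fai(EXAMPLE_FAI):
        print(f"{name}\t{length}")
```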

shell:
    "cat {input.r} > {output.merged} 2> {log.stderr}"

SnakeMake From line 79 of master/Snakefile
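
Plain cat is enough to merge read files when they are gzip-compressed, because concatenated gzip members form a valid gzip stream. The sketch below demonstrates this with in-memory data. It assumes the merged inputs are .fastq.gz files, as in the sample configuration; the function name and the reads are made up for illustration.

```python
import gzip
from typing import Iterable


def merge_like_cat(chunks: Iterable[bytes]) -> bytes:
    """Byte-wise concatenation, equivalent to `cat a b > merged`."""
    return b"".join(chunks)


# Two synthetic single-read FASTQ files, compressed separately.
LANE_1 = b"@read1\nACGT\n+\nIIII\n"
LANE_2 = b"@read2\nTTGA\n+\nIIII\n"

if __name__ == "__main__":
    merged = merge_like_cat([gzip.compress(LANE_1), gzip.compress(LANE_2)])
    # Decompressing the concatenation yields both files back to back.
    print(gzip.decompress(merged).decode(), end="")
```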

shell:
   "cutadapt -j {threads} {config[params][cutadapt_pe]} -o {output.r1} -p {output.r2} {input.r1} {input.r2} > {log.stdout} 2> {log.stderr}"

SnakeMake Cutadapt From line 104 of master/Snakefile
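
The placeholder {config[params][cutadapt_pe]} in the command above pulls the cutadapt options straight from config.yaml. Snakemake does its own formatting of shell strings, but the placeholder syntax follows Python's str.format conventions, so the substitution can be approximated with plain str.format. The sketch below is an approximation under that assumption; render_cutadapt and the file names are hypothetical.

```python
from types import SimpleNamespace

# Same template as the rule above, without the log redirections.
TEMPLATE = (
    "cutadapt -j {threads} {config[params][cutadapt_pe]} "
    "-o {output.r1} -p {output.r2} {input.r1} {input.r2}"
)


def render_cutadapt(config: dict, threads: int, inputs, outputs) -> str:
    """Fill the template; bracketed keys are unquoted dictionary lookups."""
    return TEMPLATE.format(config=config, threads=threads, input=inputs, output=outputs)


if __name__ == "__main__":
    config = {"params": {"cutadapt_pe": "option1", "STAR": "option2"}}
    inputs = SimpleNamespace(r1="raw_1.fastq.gz", r2="raw_2.fastq.gz")
    outputs = SimpleNamespace(r1="trimmed_1.fastq.gz", r2="trimmed_2.fastq.gz")
    print(render_cutadapt(config, 4, inputs, outputs))
```

Because the options are inserted verbatim, whatever string is stored under params.cutadapt_pe must already be a valid fragment of a cutadapt command line.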

shell:
    """
    STAR --sjdbGTFfile {input.annfile} --genomeDir {params.genome_idx} --outFileNamePrefix {params.prefix}/ --readFilesIn {input.r1} {input.r2} --runThreadN {threads} {config[params][STAR]} > {log.stdout} 2> {log.stderr}
    """

SnakeMake STAR From line 150 of master/Snakefile

wrapper:
    "0.30.0/bio/samtools/sort"

SnakeMake From line 168 of master/Snakefile
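
The samtools sort wrapper referenced here (its source is listed further down this page) passes -T a prefix for temporary files, derived by stripping the last extension from the output path with os.path.splitext. A small sketch of that derivation follows; the function name sort_tmp_prefix and the example paths are illustrative.

```python
import os


def sort_tmp_prefix(output_bam: str) -> str:
    """Drop only the final extension, as os.path.splitext does in the wrapper."""
    return os.path.splitext(output_bam)[0]


if __name__ == "__main__":
    for path in ("out/sample.bam", "out/sample.sorted.bam"):
        print(path, "->", sort_tmp_prefix(path))
```

Only the final extension is removed, so an output named out/sample.sorted.bam produces temporary files starting with out/sample.sorted.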

wrapper:
    "0.30.0/bio/samtools/index"

SnakeMake From line 178 of master/Snakefile

shell:"""
    featureCounts -O -M --fraction -T {threads} -t CDS -g gene_id -a {input.ann} -o {output} {input.bam} > {log.stdout} 2> {log.stderr}
"""

SnakeMake FeatureCounts From line 195 of master/Snakefile
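
With -O, -M and --fraction, featureCounts can report fractional counts, so downstream code should not assume integers. The sketch below reads a single-sample featureCounts table (a leading comment line, then a tab-separated header starting with Geneid) into a dictionary. The function name read_counts and the example table are assumptions made for illustration; real output has one count column per BAM file.

```python
import csv
import io
from typing import Dict


def read_counts(table_text: str) -> Dict[str, float]:
    """Map Geneid to the count in the last column of a single-sample table."""
    # Skip the '#' comment line that precedes the header.
    rows = (line for line in io.StringIO(table_text) if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    count_column = reader.fieldnames[-1]  # last column holds the BAM counts
    return {row["Geneid"]: float(row[count_column]) for row in reader}


# Synthetic table, for illustration only.
EXAMPLE_TABLE = (
    "# Program:featureCounts (synthetic example)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample.bam\n"
    "geneA\tchrA\t150\t300\t+\t151\t12.50\n"
    "geneB\tchrA\t900\t1200\t-\t301\t3.00\n"
)

if __name__ == "__main__":
    for gene, count in read_counts(EXAMPLE_TABLE).items():
        print(gene, count)
```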

shell:"""
    rsem-prepare-reference --gtf {input.gtf} --num-threads {threads} {input.fasta} {params.genome_dir} > {log.stdout} 2> {log.stderr}
"""

SnakeMake RSEM From line 215 of master/Snakefile

shell:"""
    rsem-calculate-expression --num-threads {threads} --bam {input.bam} --no-bam-output --paired-end {params.genome_idx} {params.prefix} > {log.stdout} 2> {log.stderr}
""" 

SnakeMake RSEM From line 236 of master/Snakefile

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")

Python Snakemake SAMtools From line 1 of index/wrapper.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"


import os
from snakemake.shell import shell


prefix = os.path.splitext(snakemake.output[0])[0]

shell(
    "samtools sort {snakemake.params} -@ {snakemake.threads} -o {snakemake.output[0]} "
    "-T {prefix} {snakemake.input[0]}")