Pipeline for analyzing short-read bulk RNAseq data from Illumina sequencing runs.
Overview
Bulk RNAseq reads are analyzed in the following steps:

- Reads are trimmed of adapter sequences and low-quality bases using Trimmomatic.
- Trimmed reads are mapped to a reference genome using STAR.
- Reads per feature are quantified using HTSeq (see the example invocation after this list).
- Differential expression analysis is performed using DESeq2.
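The exact flags live in the workflow rules, but the quantification step boils down to an htseq-count call along these lines. This is a sketch only: the BAM name follows STAR's default naming convention, and the strandedness setting is an assumption that must match your library prep.

```bash
# Hypothetical htseq-count invocation; the pipeline's actual flags are defined
# in its workflow rules. The output name follows the *_counts.tsv pattern that
# the downstream deseq2.R script expects.
htseq-count --format bam --order pos --stranded no \
    --type exon --idattr gene_id \
    sample.Aligned.sortedByCoord.out.bam annotation.gtf > sample_counts.tsv
```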
How to run
Setup
The pipeline runs using Snakemake, so first install Snakemake. I use a standalone snakemake environment installed with mamba in the following way:

```bash
module load mamba  # on MSI
mamba create -c conda-forge -c bioconda -n snakemake snakemake
```
To run the pipeline on MSI, first run `source activate snakemake`. This activates the snakemake environment. All other software dependencies are loaded within the snakemake pipeline as standalone conda environments; see the yaml files under `workflow/envs`.
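For orientation, those environment files follow the standard conda format. A minimal sketch (the tool and pinned version here are illustrative, not copied from the repo):

```yaml
# Hypothetical workflow/envs/trimmomatic.yaml; the pinned version is an example.
channels:
  - conda-forge
  - bioconda
dependencies:
  - trimmomatic=0.39
```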
Next, update the files under `config/` to match your experiment. The config.yaml file should be updated to specify file paths to the raw reads. Other parameters can be adjusted for the trimming step and for STAR, including the reference species of choice. Finally, specify the statistical model to use with DESeq2; this should be a combination of one or more variables in your metadata. Read the DESeq2 vignette for more information.
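As a rough sketch of what that configuration covers, a config.yaml might look like the following. The key names here are illustrative assumptions; check the repo's config.yaml for the real ones.

```yaml
# Illustrative sketch only; key names are assumptions, not the repo's actual keys.
reads_dir: "data/raw"              # where the raw fastq.gz files live
samples: "config/samples.csv"      # sample metadata sheet
trimmomatic:
  other: "ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36"
deseq2:
  model: "~ condition"             # e.g. "~ batch + condition" for multiple variables
```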
Also update the .csv file under `config/` to list your samples. Each column represents metadata related to your experiment and can be modified as needed. The one column that must be present in the .csv file is "sample_id", and its name must not be changed. The sample_id values are the names you gave to the sequencing core, not the full fastq.gz file names returned after sequencing is complete. All other columns can be named to match your experimental design; be sure to include the variables you want to use in your differential expression analysis.
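For example, a minimal samples.csv could look like this, where condition and batch are placeholder metadata columns standing in for your own experimental variables:

```csv
sample_id,condition,batch
ctrl_rep1,control,A
ctrl_rep2,control,B
treat_rep1,treated,A
treat_rep2,treated,B
```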
A note on reference genomes
STAR requires an indexed reference genome to run correctly. If your analysis uses a standard reference genome from Ensembl, this pipeline can create it for you by automatically downloading the fasta and gtf files for the species of interest and building a STAR index from those files. To do this, uncomment the lines in the config.yaml file under the "parameters for 'normal' genome indices" section and edit them to specify the species name and genome build of choice.
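Once uncommented, that section would carry something along these lines (key names and values are illustrative assumptions, not the repo's actual keys):

```yaml
# Hypothetical "normal" genome index settings; check config.yaml for real key names.
species: "homo_sapiens"
build: "GRCh38"
release: "110"
```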
If your analysis requires a more complex genome (such as a host + virus hybrid genome, applicable in experiments where a viral infection is being used), the index will need to be created manually. First, follow the script within custom_ref.sh to create the index. Then go to the config.yaml file and uncomment the lines under the "Parameters for custom genome indices" section. Finally, update the `index` file path in the config.yaml file to point to the index you created.
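Under the assumption that custom_ref.sh concatenates the host and viral references before indexing, the gist of building such an index looks roughly like this (a sketch; follow custom_ref.sh for the authoritative steps):

```bash
# Sketch only: combine host and viral references, then build a STAR index.
cat host.fa virus.fa > hybrid.fa
cat host.gtf virus.gtf > hybrid.gtf

STAR --runMode genomeGenerate \
    --genomeDir hybrid_index/ \
    --genomeFastaFiles hybrid.fa \
    --sjdbGTFfile hybrid.gtf \
    --sjdbOverhang 99 \
    --runThreadN 8
```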
Running the pipeline
To run the full pipeline from your command line, run `snakemake --cores 32 --use-conda`. Alternatively, submit the `pipeline.sh` bash script to the MSI slurm scheduler by entering the workflow directory and running `sbatch pipeline.sh` from the terminal. The #SBATCH directives in the pipeline.sh script are suggested options and should be edited to fit the needs of your analysis (large datasets could require more memory or time, etc.).
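For reference, a pipeline.sh along these lines is what the submission step assumes; the resource values are illustrative and should be tuned to your dataset:

```bash
#!/bin/bash
# Illustrative header; the #SBATCH values in the repo's pipeline.sh may differ.
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=64G

source activate snakemake
snakemake --cores 32 --use-conda
```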
If you have multiple reference species to map to, the best approach at present is to run the pipeline twice. For the first run, list only the samples in the samples.csv file that should be mapped to the first species, and make sure your `config.yaml` file contains the correct species under the references section. Then rewrite your samples.csv file to list only the remaining samples that should be mapped to the second reference species, update the `config.yaml` file to contain the correct reference species, and run the pipeline again.
Output
This pipeline runs through the steps needed to convert the HTSeq count files into a DESeq dataset (i.e., running through this step). The model used by DESeq2 is specified by the user in the `config.yaml` file. The DESeq dataset is saved as an .rds object, which needs to be read into R for further analysis.
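A typical next step in R uses the standard DESeq2 calls; the .rds path below is a placeholder for wherever your run writes its output:

```r
library(DESeq2)

# Load the dataset produced by the pipeline (placeholder path).
dds <- readRDS("results/deseq2/dds.rds")

# Run the standard DESeq2 workflow and inspect the results.
dds <- DESeq(dds)
res <- results(dds)
summary(res)
```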
Code Snippets
```
script:
    "../scripts/deseq2.R"
```
```
shell:
    '''
    trimmomatic PE -threads {params.threads} \
        {input.r1} {input.r2} \
        {output.r1_paired} {output.r1_unpaired} {output.r2_paired} {output.r2_unpaired} \
        {params.other} &> {log}
    '''
```
```r
library('DESeq2')
library('dplyr')

# Read in the directory where the htseq output files are located
directory <- snakemake@params[["indir"]]

# Read in the model for deseq2, as designated in the config.yaml file
model <- snakemake@params[["model"]]

# Construct a dataframe that connects each htseq count file name with the
# metadata in the config file
meta <- read.csv(snakemake@params[["metadata"]], header = TRUE) %>%
    # Add a column that indicates what the corresponding htseq file is called
    mutate(filename = paste(sample_id, "_counts.tsv", sep = "")) %>%
    # The next two lines reorder the dataframe so that the sample_id column is
    # first, followed by the newly created filename column, then all the others
    select(filename, everything()) %>%
    select(sample_id, everything())

dds <- DESeqDataSetFromHTSeqCount(sampleTable = meta,
                                  directory = directory,
                                  design = as.formula(model))

# Save the DESeq data set to an R file
saveRDS(dds, file = snakemake@output[[1]])
```
Support
- Future updates