Pipeline for analyzing short-read bulk RNAseq data from Illumina sequencing runs.
Overview
Bulk RNAseq reads are analyzed in the following steps:

- Reads are trimmed of adapter sequences and low-quality bases using Trimmomatic.
- Trimmed reads are mapped to a reference genome using STAR.
- Reads per feature are quantified using HTSeq (see the example invocation after this list).
- Differential expression analysis is performed using DESeq2.
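The exact flags live in the workflow rules, but the quantification step boils down to an htseq-count call along these lines. This is a sketch only: the BAM name follows STAR's default naming convention, and the strandedness setting is an assumption that must match your library prep.

```bash
# Hypothetical htseq-count invocation; the pipeline's actual flags are defined
# in its workflow rules. The output name follows the *_counts.tsv pattern that
# the downstream deseq2.R script expects.
htseq-count --format bam --order pos --stranded no \
    --type exon --idattr gene_id \
    sample.Aligned.sortedByCoord.out.bam annotation.gtf > sample_counts.tsv
```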
How to run
Setup
The pipeline runs using Snakemake, so first install Snakemake. I use a standalone snakemake environment installed with mamba in the following way:

```bash
module load mamba  # on MSI
mamba create -c conda-forge -c bioconda -n snakemake snakemake
```
To run the pipeline on MSI, first run `source activate snakemake`. This activates the snakemake environment. All other software dependencies are loaded within the snakemake pipeline as standalone conda environments; see the yaml files under `workflow/envs`.
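For orientation, those environment files follow the standard conda format. A minimal sketch (the tool and pinned version here are illustrative, not copied from the repo):

```yaml
# Hypothetical workflow/envs/trimmomatic.yaml; the pinned version is an example.
channels:
  - conda-forge
  - bioconda
dependencies:
  - trimmomatic=0.39
```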
Next, update the files under `config/` to match your experiment. The config.yaml file should be updated to specify file paths to the raw reads. Other parameters can be adjusted for the trimming step and for STAR, including the reference species of choice. Finally, specify the statistical model to use with DESeq2; this should be a combination of one or more variables in your metadata. Read the DESeq2 vignette for more information.
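As a rough sketch of what that configuration covers, a config.yaml might look like the following. The key names here are illustrative assumptions; check the repo's config.yaml for the real ones.

```yaml
# Illustrative sketch only; key names are assumptions, not the repo's actual keys.
reads_dir: "data/raw"              # where the raw fastq.gz files live
samples: "config/samples.csv"      # sample metadata sheet
trimmomatic:
  other: "ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36"
deseq2:
  model: "~ condition"             # e.g. "~ batch + condition" for multiple variables
```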
Also update the .csv file under `config/` to list your samples. Each column represents metadata related to your experiment and can be modified as needed. The one column that must be present in the .csv file is "sample_id", and its name must not be changed. The sample_id values are the names you gave to the sequencing core, not the full fastq.gz file names returned after sequencing is complete. All other columns can be named to match your experimental design; be sure to include the variables you want to use in your differential expression analysis.
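For example, a minimal samples.csv could look like this, where condition and batch are placeholder metadata columns standing in for your own experimental variables:

```csv
sample_id,condition,batch
ctrl_rep1,control,A
ctrl_rep2,control,B
treat_rep1,treated,A
treat_rep2,treated,B
```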
A note on reference genomes
STAR requires an indexed reference genome to run correctly. If your analysis uses a standard reference genome from Ensembl, this pipeline can create it for you by automatically downloading the fasta and gtf files for the species of interest and building a STAR index from those files. To do this, uncomment the lines in the config.yaml file under the "parameters for 'normal' genome indices" section and edit them to specify the species name and genome build of choice.
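Once uncommented, that section would carry something along these lines (key names and values are illustrative assumptions, not the repo's actual keys):

```yaml
# Hypothetical "normal" genome index settings; check config.yaml for real key names.
species: "homo_sapiens"
build: "GRCh38"
release: "110"
```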
If your analysis requires a more complex genome (such as a host + virus hybrid genome, applicable in experiments where a viral infection is being used), the index will need to be created manually. First, follow the script within custom_ref.sh to create the index. Then go to the config.yaml file and uncomment the lines under the "Parameters for custom genome indices" section. Finally, update the `index` file path in the config.yaml file to point to the index you created.
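Under the assumption that custom_ref.sh concatenates the host and viral references before indexing, the gist of building such an index looks roughly like this (a sketch; follow custom_ref.sh for the authoritative steps):

```bash
# Sketch only: combine host and viral references, then build a STAR index.
cat host.fa virus.fa > hybrid.fa
cat host.gtf virus.gtf > hybrid.gtf

STAR --runMode genomeGenerate \
    --genomeDir hybrid_index/ \
    --genomeFastaFiles hybrid.fa \
    --sjdbGTFfile hybrid.gtf \
    --sjdbOverhang 99 \
    --runThreadN 8
```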
Running the pipeline
To run the full pipeline from your command line, run `snakemake --cores 32 --use-conda`. Alternatively, submit the `pipeline.sh` bash script to the MSI slurm scheduler by entering the workflow directory and running `sbatch pipeline.sh` from the terminal. The #SBATCH directives in the pipeline.sh script are suggested options and should be edited to fit the needs of your analysis (large datasets could require more memory or time, etc.).
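For reference, a pipeline.sh along these lines is what the submission step assumes; the resource values are illustrative and should be tuned to your dataset:

```bash
#!/bin/bash
# Illustrative header; the #SBATCH values in the repo's pipeline.sh may differ.
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=64G

source activate snakemake
snakemake --cores 32 --use-conda
```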
If you have multiple reference species to map to, the best approach at present is to run the pipeline twice. For the first run, list only the samples in the samples.csv file that should be mapped to the first species, and make sure your `config.yaml` file contains the correct species under the references section. Then rewrite your samples.csv file to list only the remaining samples that should be mapped to the second reference species, update the `config.yaml` file to contain the correct reference species, and run the pipeline again.
Output
This pipeline runs through the steps needed to convert the HTSeq count files into a DESeq dataset (i.e., running through this step). The model used by DESeq2 is specified by the user in the `config.yaml` file. The DESeq dataset is saved as an .rds object, which needs to be read into R for further analysis.
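A typical next step in R uses the standard DESeq2 calls; the .rds path below is a placeholder for wherever your run writes its output:

```r
library(DESeq2)

# Load the dataset produced by the pipeline (placeholder path).
dds <- readRDS("results/deseq2/dds.rds")

# Run the standard DESeq2 workflow and inspect the results.
dds <- DESeq(dds)
res <- results(dds)
summary(res)
```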
Code Snippets
```
script:
    "../scripts/deseq2.R"
```
```
shell:
    '''
    trimmomatic PE -threads {params.threads} \
        {input.r1} {input.r2} \
        {output.r1_paired} {output.r1_unpaired} {output.r2_paired} {output.r2_unpaired} \
        {params.other} &> {log}
    '''
```
```r
library('DESeq2')
library('dplyr')

# Read in the directory where the htseq output files are located
directory <- snakemake@params[["indir"]]

# Read in the model for deseq2, as designated in the config.yaml file
model <- snakemake@params[["model"]]

# Construct a dataframe that connects each htseq count file name with the
# metadata in the config file
meta <- read.csv(snakemake@params[["metadata"]], header = TRUE) %>%
    # Add a column that indicates what the corresponding htseq file is called
    mutate(filename = paste(sample_id, "_counts.tsv", sep = "")) %>%
    # The next two lines reorder the dataframe so that the sample_id column is
    # first, followed by the newly created filename column, then all the others
    select(filename, everything()) %>%
    select(sample_id, everything())

dds <- DESeqDataSetFromHTSeqCount(sampleTable = meta,
                                  directory = directory,
                                  design = as.formula(model))

# Save the DESeq data set to an R file
saveRDS(dds, file = snakemake@output[[1]])
```
Support
- Future updates