Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
The repository contains a Snakemake workflow to analyze data from SARS-CoV-2 infected samples from humans. The workflow analyzes bulk RNA-seq data from a study by Katsura et al. (https://www.cell.com/cell-stem-cell/pdf/S1934-5909(20)30499-9.pdf). Katsura et al. collected bulk RNA-seq from a sample of SARS-CoV-2 infected human lung cells (alveolar type 2 or AT2 cells) 48 hours post inoculation alongwith RNA-seq from a control (uninfected) sample. Each sample had 3 replicates. The data was deposited by the authors at the gene expression omnibus (GEO) database (GSE152586) and can be accessed at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152586.
Note: The current version of the pipeline has only the Data Download, Raw Data Quality Check, Adapter Trimming, Trimmed Data Quality Check, Index reference genomes and Map trimmed reads sections complete. Work is still in progress for other sections (quantification and differential gene expression) of the pipeline
Table of Contents
About The Project
In this project, I have implemented a Snakemake workflow to analyze bulk RNA-seq data from SARS CoV-2 infected samples. The intent is have the pipeline automated end-to-end to allow the user to run the analysis without any interruptions. This workflow outlines a very basic approach to analyze RNA-seq data and can be modified by any user to add more rules and convert into a more sophisticated workflow. On an added note, I have not included steps for filtering the data for ribosomal RNAs. I assume that mapping the reads to human reference genome and then collecting the data only for mapped reads and assembling the reads into transcripts should result in high confidence mRNA expression data.
For this work, I have used the human and SARS Cov-2 reference genomes available at NCBI.
-
Human : https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz
-
SAR-CoV-2: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz
Built With
- Python and Snakemake
Getting Started
You will need a Linux machine (Ubuntu 18.04 or similar) to run this workflow. I have run my analysis on a Ubuntu 18.04 instance using the Amazon Web Services (AWS) Elastic Cloud Compute (EC2) services (https://aws.amazon.com/ec2/?ec2-whats-new.sort-by=item.additionalFields.postDateTime&ec2-whats-new.sort-order=desc). If you already have a good configuration Linux machine (e.g., 12 cores, 32 GB memory, ~1 TB diskspace), you should be good to go. Otherwise, if you have a similar configuration Windows machine and have Windows 10 installed, then you can install windows subsystem linux (WSL) to run the analysis (https://docs.microsoft.com/en-us/learn/modules/get-started-with-windows-subsystem-for-linux/).
Prerequisites
Ubuntu 18.04 or similar machine with at least 30 GB memory and 1 TB diskspace. Must have BASH installed.
Installation
Install from scratch
-
Clone the repo
git clone https://github.com/sambitmishra0628/RNA-Seq-1
-
Install miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh chmod +x Miniconda3-latest-Linux-x86_64.sh ./Miniconda3-latest-Linux-x86_64.sh
Follow the instructions to complete the installation of miniconda.
-
Create a conda environment
conda create -n rna_seq_env python=3.9 conda activate rna_seq_env
-
Install mamba
conda install -c conda-forge mamba
-
Install snakemake
mamba install -c conda-forge -c bioconda snakemake
Install using environment file
You can use the environment file provided in this repository to create a similar environment as in the above steps. Follow steps 1 and 2 as given above and then use the following command.
conda create -n rna_seq_env -f ngs_env.yml
Usage
-
Activate the conda environment (if not already activated)
conda activate rna_seq_env
-
Clone the repository
git clone https://github.com/sambitmishra0628/RNA-Seq-1
-
Change the DATADIR in the
RNA-Seq-1/config.yml
to point to a location in your machine -
Do a dry run of Snakemake
snakemake -n -p --use-conda --configfile config.yml --cores 4
Note that I have set the number of cores to 4. You can increase that number depending on the number of cores available in your machine minus 1
-
Run the workflow
snakemake -p --use-conda --configfile config.yml --cores 4
-
Run the following command to view the dag
snakemake --dag | dot -Tpng > dag.png
-
Tip. If you notice that the workflow is running all over again after a certain rule/rules failed despite the files from previously executed rules being present, then it is likely due to a time stamp issue. Follow the following steps:
-
Under
rule all
, comment out the rules that fail in theSnakefile
-
Run the command
snakemake --touch -p --use-conda --configfile config.yml --cores 4
This will reset all the time stamps incase a changed time stamp is the issue. Then do a dry run and you should see only the rules for which the files are missing shown to be executed.
License
Distributed under the MIT License. See
LICENSE
for more information.
Contact
Sambit Mishra - [email protected]
Project Link: https://github.com/sambitmishra0628/RNA-Seq-1
Acknowledgements
Code Snippets
22 23 24 25 26 27 28 29 30 31 32 33 34 35 | shell: """ cd {output.outdir} wget {params.hg_url} wget {params.h_gff_url} wget {params.cv_url} wget {params.cv_gff_url} gunzip {params.hg_file}.gz gunzip {params.h_gff_file}.gz gunzip {params.cv_g_file}.gz gunzip {params.cv_gff_file}.gz """ |
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 | run: # Create sample directory sample_dir = config["DATADIR"] + "/samples/" if not os.path.isdir(sample_dir): os.mkdir(sample_dir) # Download control files print ("***Downloading control files ... ***") for srr,file_id in zip (config["SRR_CONTROL"],config["CONTROL"]): fw_file = config["src_url"] + srr + "/" + file_id + "_R1.fq.gz.1" rev_file = config["src_url"] + srr + "/" + file_id + "_R2.fq.gz.1" cmd1 = "wget " + fw_file + " -P " + sample_dir cmd2 = "wget " + rev_file + " -P " + sample_dir print (cmd1) print (cmd2) os.system(cmd1) os.system(cmd2) # Download treatment files print ("***Downloading treatment files ... ***") for srr,file_id in zip (config["SRR_TREATMENT"],config["TREATMENT"]): fw_file = config["src_url"] + srr + "/" + file_id + "_R1.fq.gz.1" rev_file = config["src_url"] + srr + "/" + file_id + "_R2.fq.gz.1" cmd1 = "wget " + fw_file + " -P " + sample_dir cmd2 = "wget " + rev_file + " -P " + sample_dir print (cmd1) print (cmd2) os.system(cmd1) os.system(cmd2) # Rename the files from *.fq.gz.1 to *.fq.gz curr_dir = os.getcwd() os.chdir(sample_dir) all_files = glob.glob("*.gz.1") for file_i in all_files: new_file = file_i.replace('.gz.1', '.gz') cmd = "mv " + file_i + " " + new_file os.system(cmd) |
21 22 23 24 25 26 27 28 29 | shell: """ gffread -T {input.human_gff} -o {params.human_gtf} gffread -T {input.sars2_gff} -o {params.sars2_gtf} STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir {output.out1} --genomeFastaFiles {input.human_genome} \ --sjdbGTFfile {params.human_gtf} --outFileNamePrefix {output.out1}/{params.prefix1} --sjdbOverhang 100 \ --limitGenomeGenerateRAM 25000000000 --genomeChrBinNbits=16 --genomeSAsparseD 3 STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir {output.out2} --genomeFastaFiles {input.sars2_genome} --sjdbGTFfile {params.sars2_gtf} --outFileNamePrefix {output.out2}/{params.prefix2} --sjdbOverhang 100 """ |
18 19 20 21 | shell: """ STAR --runThreadN {threads} --genomeDir {params.gd} --readFilesIn {input.fw} {input.rev} --outFileNamePrefix {params.outdir}{wildcards.sample_id} --sjdbOverhang 100 --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx --quantMode GeneCounts """ |
14 15 16 17 18 19 | shell: """ mkdir -p {output.out} fastqc -o {output.out} -t {threads} {input.ctr} &> {output.out}/fastqc_raw_ctr.log fastqc -o {output.out} -t {threads} {input.trt} &> {output.out}/fastqc_raw_trt.log """ |
28 29 | shell: "multiqc {input.inputdir} -o {output.outdir}" |
43 44 45 46 | shell: """ cutadapt -q 20 -m 100 -j {threads} -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o {output.fw} -p {output.rev} {input.fw} {input.rev} &> {log} """ |
60 61 62 63 64 65 | shell: """ mkdir -p {output.out} fastqc -o {output.out} -t {threads} {input.ctr} &> {output.out}/fastqc_trimmed_ctr.log fastqc -o {output.out} -t {threads} {input.trt} &> {output.out}/fastqc_trimmed_trt.log """ |
75 76 | shell: "multiqc {input.inputdir} -o {output.outdir}" |
15 16 17 18 | shell: """ rsem-prepare-reference --gtf {ref_dir}/{gtf_file} -trusted-sources BestRefSeq,Curated\ Genomic {params.ref_dir}/{params.genome_file} {output.outdir}/human_refseq """ |
33 34 35 36 | shell: """ STAR --runThreadN {threads} --genomeDir {params.gd} --readFilesIn {input.fw} {input.rev} --outFileNamePrefix {params.outdir}{wildcards.sample_id} --sjdbOverhang 100 --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx --quantMode GeneCounts """ |
Support
- Future updates