Analysis of RNA-seq data from Covid19 samples

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

The repository contains a Snakemake workflow to analyze data from SARS-CoV-2 infected samples from humans. The workflow analyzes bulk RNA-seq data from a study by Katsura et al. (https://www.cell.com/cell-stem-cell/pdf/S1934-5909(20)30499-9.pdf). Katsura et al. collected bulk RNA-seq from a sample of SARS-CoV-2 infected human lung cells (alveolar type 2 or AT2 cells) 48 hours post inoculation alongwith RNA-seq from a control (uninfected) sample. Each sample had 3 replicates. The data was deposited by the authors at the gene expression omnibus (GEO) database (GSE152586) and can be accessed at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152586.

Note: The current version of the pipeline has only the Data Download, Raw Data Quality Check, Adapter Trimming, Trimmed Data Quality Check, Index reference genomes and Map trimmed reads sections complete. Work is still in progress for other sections (quantification and differential gene expression) of the pipeline

About The Project
- Built With
Getting Started
- Prerequisites
- Installation
Usage
License
Contact
Acknowledgements

About The Project

In this project, I have implemented a Snakemake workflow to analyze bulk RNA-seq data from SARS CoV-2 infected samples. The intent is have the pipeline automated end-to-end to allow the user to run the analysis without any interruptions. This workflow outlines a very basic approach to analyze RNA-seq data and can be modified by any user to add more rules and convert into a more sophisticated workflow. On an added note, I have not included steps for filtering the data for ribosomal RNAs. I assume that mapping the reads to human reference genome and then collecting the data only for mapped reads and assembling the reads into transcripts should result in high confidence mRNA expression data.

For this work, I have used the human and SARS Cov-2 reference genomes available at NCBI.

Human : https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz
SAR-CoV-2: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz

Built With

Python and Snakemake

Getting Started

You will need a Linux machine (Ubuntu 18.04 or similar) to run this workflow. I have run my analysis on a Ubuntu 18.04 instance using the Amazon Web Services (AWS) Elastic Cloud Compute (EC2) services (https://aws.amazon.com/ec2/?ec2-whats-new.sort-by=item.additionalFields.postDateTime&ec2-whats-new.sort-order=desc). If you already have a good configuration Linux machine (e.g., 12 cores, 32 GB memory, ~1 TB diskspace), you should be good to go. Otherwise, if you have a similar configuration Windows machine and have Windows 10 installed, then you can install windows subsystem linux (WSL) to run the analysis (https://docs.microsoft.com/en-us/learn/modules/get-started-with-windows-subsystem-for-linux/).

Prerequisites

Ubuntu 18.04 or similar machine with at least 30 GB memory and 1 TB diskspace. Must have BASH installed.

Installation

Install from scratch

Clone the repo

git clone https://github.com/sambitmishra0628/RNA-Seq-1

Install miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh

Follow the instructions to complete the installation of miniconda.

Create a conda environment

conda create -n rna_seq_env python=3.9
conda activate rna_seq_env

Install mamba
```
conda install -c conda-forge mamba
```

Install snakemake

mamba install -c conda-forge -c bioconda snakemake

Install using environment file

You can use the environment file provided in this repository to create a similar environment as in the above steps. Follow steps 1 and 2 as given above and then use the following command.

conda create -n rna_seq_env -f ngs_env.yml

Usage

Activate the conda environment (if not already activated)
```
conda activate rna_seq_env
```

Clone the repository

git clone https://github.com/sambitmishra0628/RNA-Seq-1

Change the DATADIR in the RNA-Seq-1/config.yml to point to a location in your machine
Do a dry run of Snakemake
```
snakemake -n -p --use-conda --configfile config.yml --cores 4
```
Note that I have set the number of cores to 4. You can increase that number depending on the number of cores available in your machine minus 1

Run the workflow

snakemake -p --use-conda --configfile config.yml --cores 4

Run the following command to view the dag
```
snakemake --dag | dot -Tpng > dag.png
```
Tip. If you notice that the workflow is running all over again after a certain rule/rules failed despite the files from previously executed rules being present, then it is likely due to a time stamp issue. Follow the following steps:

Under rule all , comment out the rules that fail in the Snakefile
Run the command snakemake --touch -p --use-conda --configfile config.yml --cores 4

This will reset all the time stamps incase a changed time stamp is the issue. Then do a dry run and you should see only the rules for which the files are missing shown to be executed.

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Sambit Mishra - [email protected]

Project Link: https://github.com/sambitmishra0628/RNA-Seq-1

Acknowledgements

Code Snippets

shell:
    """
    cd {output.outdir}
    wget {params.hg_url}
    wget {params.h_gff_url}
    wget {params.cv_url}
    wget {params.cv_gff_url}

    gunzip {params.hg_file}.gz
    gunzip {params.h_gff_file}.gz
    gunzip {params.cv_g_file}.gz
    gunzip {params.cv_gff_file}.gz

    """

SnakeMake From line 22 of rules/fetch_references.smk

run:
    # Create sample directory
    sample_dir = config["DATADIR"] + "/samples/"
    if not os.path.isdir(sample_dir):
        os.mkdir(sample_dir)        
    # Download control files
    print ("***Downloading control files ... ***")
    for srr,file_id in zip (config["SRR_CONTROL"],config["CONTROL"]):
        fw_file = config["src_url"] + srr + "/" + file_id + "_R1.fq.gz.1"
        rev_file = config["src_url"] + srr + "/" + file_id + "_R2.fq.gz.1"
        cmd1 = "wget " + fw_file + " -P " + sample_dir
        cmd2 = "wget " + rev_file + " -P " + sample_dir
        print (cmd1)
        print (cmd2)
        os.system(cmd1)
        os.system(cmd2)

    # Download treatment files
    print ("***Downloading treatment files ... ***")    
    for srr,file_id in zip (config["SRR_TREATMENT"],config["TREATMENT"]):
        fw_file = config["src_url"] + srr + "/" + file_id + "_R1.fq.gz.1"
        rev_file = config["src_url"] + srr + "/" + file_id + "_R2.fq.gz.1"
        cmd1 = "wget " + fw_file + " -P " + sample_dir
        cmd2 = "wget " + rev_file + " -P " + sample_dir
        print (cmd1)
        print (cmd2)
        os.system(cmd1)
        os.system(cmd2)            

    # Rename the files from *.fq.gz.1 to *.fq.gz
    curr_dir = os.getcwd()
    os.chdir(sample_dir)
    all_files = glob.glob("*.gz.1")
    for file_i in all_files:
        new_file = file_i.replace('.gz.1', '.gz')
        cmd = "mv " + file_i + " " + new_file
        os.system(cmd)

SnakeMake From line 14 of rules/fetch_samples.smk

shell:
    """
    gffread -T {input.human_gff} -o {params.human_gtf}
    gffread -T {input.sars2_gff} -o {params.sars2_gtf}
    STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir {output.out1} --genomeFastaFiles {input.human_genome} \
    --sjdbGTFfile {params.human_gtf} --outFileNamePrefix {output.out1}/{params.prefix1} --sjdbOverhang 100 \
    --limitGenomeGenerateRAM 25000000000 --genomeChrBinNbits=16 --genomeSAsparseD 3
    STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir {output.out2} --genomeFastaFiles {input.sars2_genome} --sjdbGTFfile {params.sars2_gtf} --outFileNamePrefix {output.out2}/{params.prefix2} --sjdbOverhang 100
    """    

SnakeMake STAR gffread From line 21 of rules/index_genomes.smk

shell:
    """
    STAR --runThreadN {threads} --genomeDir {params.gd} --readFilesIn {input.fw} {input.rev} --outFileNamePrefix {params.outdir}{wildcards.sample_id} --sjdbOverhang 100 --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx --quantMode GeneCounts
    """

SnakeMake STAR From line 18 of rules/map_reads.smk

shell:
    """
        mkdir -p {output.out}
        fastqc -o {output.out} -t {threads} {input.ctr} &> {output.out}/fastqc_raw_ctr.log
        fastqc -o {output.out} -t {threads} {input.trt} &> {output.out}/fastqc_raw_trt.log   
    """    

SnakeMake FastQC From line 14 of rules/quality_control.smk

shell:
    "multiqc {input.inputdir} -o {output.outdir}"

SnakeMake MultiQC From line 28 of rules/quality_control.smk

shell:
    """
    cutadapt -q 20  -m 100 -j {threads} -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o {output.fw} -p {output.rev} {input.fw} {input.rev} &> {log}
    """

SnakeMake Cutadapt From line 43 of rules/quality_control.smk

shell:
    """
    mkdir -p {output.out}
    fastqc -o {output.out} -t {threads} {input.ctr} &> {output.out}/fastqc_trimmed_ctr.log
    fastqc -o {output.out} -t {threads} {input.trt} &> {output.out}/fastqc_trimmed_trt.log
    """ 

SnakeMake FastQC From line 60 of rules/quality_control.smk

shell:
    "multiqc {input.inputdir} -o {output.outdir}"

SnakeMake MultiQC From line 75 of rules/quality_control.smk

shell:
    """
    rsem-prepare-reference --gtf {ref_dir}/{gtf_file} -trusted-sources BestRefSeq,Curated\ Genomic {params.ref_dir}/{params.genome_file} {output.outdir}/human_refseq
    """

SnakeMake RSEM From line 15 of rules/quantification.smk

shell:
    """
    STAR --runThreadN {threads} --genomeDir {params.gd} --readFilesIn {input.fw} {input.rev} --outFileNamePrefix {params.outdir}{wildcards.sample_id} --sjdbOverhang 100 --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx --quantMode GeneCounts
    """