Analysis of RNA-seq data from Covid19 samples

public public 1yr ago 0 bookmarks

The repository contains a Snakemake workflow to analyze data from SARS-CoV-2 infected samples from humans. The workflow analyzes bulk RNA-seq data from a study by Katsura et al. (https://www.cell.com/cell-stem-cell/pdf/S1934-5909(20)30499-9.pdf). Katsura et al. collected bulk RNA-seq from a sample of SARS-CoV-2 infected human lung cells (alveolar type 2 or AT2 cells) 48 hours post inoculation alongwith RNA-seq from a control (uninfected) sample. Each sample had 3 replicates. The data was deposited by the authors at the gene expression omnibus (GEO) database (GSE152586) and can be accessed at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152586.

Note: The current version of the pipeline has only the Data Download, Raw Data Quality Check, Adapter Trimming, Trimmed Data Quality Check, Index reference genomes and Map trimmed reads sections complete. Work is still in progress for other sections (quantification and differential gene expression) of the pipeline

Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
  4. License
  5. Contact
  6. Acknowledgements

About The Project

In this project, I have implemented a Snakemake workflow to analyze bulk RNA-seq data from SARS CoV-2 infected samples. The intent is have the pipeline automated end-to-end to allow the user to run the analysis without any interruptions. This workflow outlines a very basic approach to analyze RNA-seq data and can be modified by any user to add more rules and convert into a more sophisticated workflow. On an added note, I have not included steps for filtering the data for ribosomal RNAs. I assume that mapping the reads to human reference genome and then collecting the data only for mapped reads and assembling the reads into transcripts should result in high confidence mRNA expression data.

For this work, I have used the human and SARS Cov-2 reference genomes available at NCBI.

  • Human : https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz

  • SAR-CoV-2: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz

Built With

  • Python and Snakemake

Getting Started

You will need a Linux machine (Ubuntu 18.04 or similar) to run this workflow. I have run my analysis on a Ubuntu 18.04 instance using the Amazon Web Services (AWS) Elastic Cloud Compute (EC2) services (https://aws.amazon.com/ec2/?ec2-whats-new.sort-by=item.additionalFields.postDateTime&ec2-whats-new.sort-order=desc). If you already have a good configuration Linux machine (e.g., 12 cores, 32 GB memory, ~1 TB diskspace), you should be good to go. Otherwise, if you have a similar configuration Windows machine and have Windows 10 installed, then you can install windows subsystem linux (WSL) to run the analysis (https://docs.microsoft.com/en-us/learn/modules/get-started-with-windows-subsystem-for-linux/).

Prerequisites

Ubuntu 18.04 or similar machine with at least 30 GB memory and 1 TB diskspace. Must have BASH installed.

Installation

Install from scratch

  1. Clone the repo

    git clone https://github.com/sambitmishra0628/RNA-Seq-1
    
  2. Install miniconda

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    chmod +x Miniconda3-latest-Linux-x86_64.sh
    ./Miniconda3-latest-Linux-x86_64.sh
    

    Follow the instructions to complete the installation of miniconda.

  3. Create a conda environment

    conda create -n rna_seq_env python=3.9
    conda activate rna_seq_env
    
  4. Install mamba

    conda install -c conda-forge mamba
    
  5. Install snakemake

    mamba install -c conda-forge -c bioconda snakemake
    

    Install using environment file

You can use the environment file provided in this repository to create a similar environment as in the above steps. Follow steps 1 and 2 as given above and then use the following command.

conda create -n rna_seq_env -f ngs_env.yml

Usage

  1. Activate the conda environment (if not already activated)

    conda activate rna_seq_env
    
  2. Clone the repository

    git clone https://github.com/sambitmishra0628/RNA-Seq-1
    
  3. Change the DATADIR in the RNA-Seq-1/config.yml to point to a location in your machine

  4. Do a dry run of Snakemake

    snakemake -n -p --use-conda --configfile config.yml --cores 4
    

    Note that I have set the number of cores to 4. You can increase that number depending on the number of cores available in your machine minus 1

  5. Run the workflow

    snakemake -p --use-conda --configfile config.yml --cores 4
    
  6. Run the following command to view the dag

    snakemake --dag | dot -Tpng > dag.png
    

    dag_image

  7. Tip. If you notice that the workflow is running all over again after a certain rule/rules failed despite the files from previously executed rules being present, then it is likely due to a time stamp issue. Follow the following steps:

  • Under rule all , comment out the rules that fail in the Snakefile

  • Run the command snakemake --touch -p --use-conda --configfile config.yml --cores 4

This will reset all the time stamps incase a changed time stamp is the issue. Then do a dry run and you should see only the rules for which the files are missing shown to be executed.

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Sambit Mishra - [email protected]

Project Link: https://github.com/sambitmishra0628/RNA-Seq-1

Acknowledgements

Code Snippets

22
23
24
25
26
27
28
29
30
31
32
33
34
35
shell:
    """
    cd {output.outdir}
    wget {params.hg_url}
    wget {params.h_gff_url}
    wget {params.cv_url}
    wget {params.cv_gff_url}

    gunzip {params.hg_file}.gz
    gunzip {params.h_gff_file}.gz
    gunzip {params.cv_g_file}.gz
    gunzip {params.cv_gff_file}.gz

    """
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
run:
    # Create sample directory
    sample_dir = config["DATADIR"] + "/samples/"
    if not os.path.isdir(sample_dir):
        os.mkdir(sample_dir)        
    # Download control files
    print ("***Downloading control files ... ***")
    for srr,file_id in zip (config["SRR_CONTROL"],config["CONTROL"]):
        fw_file = config["src_url"] + srr + "/" + file_id + "_R1.fq.gz.1"
        rev_file = config["src_url"] + srr + "/" + file_id + "_R2.fq.gz.1"
        cmd1 = "wget " + fw_file + " -P " + sample_dir
        cmd2 = "wget " + rev_file + " -P " + sample_dir
        print (cmd1)
        print (cmd2)
        os.system(cmd1)
        os.system(cmd2)

    # Download treatment files
    print ("***Downloading treatment files ... ***")    
    for srr,file_id in zip (config["SRR_TREATMENT"],config["TREATMENT"]):
        fw_file = config["src_url"] + srr + "/" + file_id + "_R1.fq.gz.1"
        rev_file = config["src_url"] + srr + "/" + file_id + "_R2.fq.gz.1"
        cmd1 = "wget " + fw_file + " -P " + sample_dir
        cmd2 = "wget " + rev_file + " -P " + sample_dir
        print (cmd1)
        print (cmd2)
        os.system(cmd1)
        os.system(cmd2)            

    # Rename the files from *.fq.gz.1 to *.fq.gz
    curr_dir = os.getcwd()
    os.chdir(sample_dir)
    all_files = glob.glob("*.gz.1")
    for file_i in all_files:
        new_file = file_i.replace('.gz.1', '.gz')
        cmd = "mv " + file_i + " " + new_file
        os.system(cmd)
21
22
23
24
25
26
27
28
29
shell:
    """
    gffread -T {input.human_gff} -o {params.human_gtf}
    gffread -T {input.sars2_gff} -o {params.sars2_gtf}
    STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir {output.out1} --genomeFastaFiles {input.human_genome} \
    --sjdbGTFfile {params.human_gtf} --outFileNamePrefix {output.out1}/{params.prefix1} --sjdbOverhang 100 \
    --limitGenomeGenerateRAM 25000000000 --genomeChrBinNbits=16 --genomeSAsparseD 3
    STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir {output.out2} --genomeFastaFiles {input.sars2_genome} --sjdbGTFfile {params.sars2_gtf} --outFileNamePrefix {output.out2}/{params.prefix2} --sjdbOverhang 100
    """    
18
19
20
21
shell:
    """
    STAR --runThreadN {threads} --genomeDir {params.gd} --readFilesIn {input.fw} {input.rev} --outFileNamePrefix {params.outdir}{wildcards.sample_id} --sjdbOverhang 100 --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx --quantMode GeneCounts
    """
14
15
16
17
18
19
shell:
    """
        mkdir -p {output.out}
        fastqc -o {output.out} -t {threads} {input.ctr} &> {output.out}/fastqc_raw_ctr.log
        fastqc -o {output.out} -t {threads} {input.trt} &> {output.out}/fastqc_raw_trt.log   
    """    
28
29
shell:
    "multiqc {input.inputdir} -o {output.outdir}"
43
44
45
46
shell:
    """
    cutadapt -q 20  -m 100 -j {threads} -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o {output.fw} -p {output.rev} {input.fw} {input.rev} &> {log}
    """
60
61
62
63
64
65
shell:
    """
    mkdir -p {output.out}
    fastqc -o {output.out} -t {threads} {input.ctr} &> {output.out}/fastqc_trimmed_ctr.log
    fastqc -o {output.out} -t {threads} {input.trt} &> {output.out}/fastqc_trimmed_trt.log
    """ 
75
76
shell:
    "multiqc {input.inputdir} -o {output.outdir}"
15
16
17
18
shell:
    """
    rsem-prepare-reference --gtf {ref_dir}/{gtf_file} -trusted-sources BestRefSeq,Curated\ Genomic {params.ref_dir}/{params.genome_file} {output.outdir}/human_refseq
    """
33
34
35
36
shell:
    """
    STAR --runThreadN {threads} --genomeDir {params.gd} --readFilesIn {input.fw} {input.rev} --outFileNamePrefix {params.outdir}{wildcards.sample_id} --sjdbOverhang 100 --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx --quantMode GeneCounts
    """
ShowHide 6 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/sambitmishra0628/RNA-Seq-1
Name: rna-seq-1
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...