Differential Gene Expression Analysis Pipeline

Version: 2 (public, 1yr ago)

Use STAR or HISAT2 to align reads to the genome, and featureCounts to count the aligned reads. Use Salmon to map reads to the transcriptome (cDNA) and quantify them. Generate a QC report with MultiQC. Perform differential expression analysis with DESeq2.

Usage:

Create an output directory

mkdir /path/to/output

Generate the initial configuration file

python ./setup.py --init

Create environments

python ./setup.py -s --conda-create-envs-only

Start the process

python ./setup.py -s
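
The usage steps above can also be driven from a small Python wrapper. This is only a sketch: the flags are copied from the commands above, while the function names `setup_commands` and `prepare_output_dir` are hypothetical and not part of the repository.

```python
import sys
from pathlib import Path


def setup_commands(python=sys.executable, script="./setup.py"):
    """Return the three setup.py invocations from the usage steps, in order."""
    return [
        [python, script, "--init"],                          # generate the initial configuration file
        [python, script, "-s", "--conda-create-envs-only"],  # create environments
        [python, script, "-s"],                              # start the process
    ]


def prepare_output_dir(path):
    """Create the output directory (like `mkdir -p`, so an existing directory is not an error)."""
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)
    return out


if __name__ == "__main__":
    # Print the commands only; to execute them, pass each argv to
    # subprocess.run(argv, check=True) from the repository root.
    for argv in setup_commands():
        print(" ".join(argv))
```

Keeping each command as an argument list avoids shell quoting issues if the script path contains spaces.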

This workflow is based on pigx_rnaseq (https://github.com/BIMSBbioinfo/pigx_rnaseq).

Code Snippets

shell:
    "{STAR_EXEC_INDEX} --runMode genomeGenerate --runThreadN {threads} --genomeDir {params.star_index_dir} --genomeFastaFiles {input} --sjdbGTFfile {GTF_FILE} >> {log} 2>&1"

SnakeMake From line 16 of rules/align.smk

shell:
    "{HISAT2_BUILD_EXEC} -p {threads} --large-index {input} {params.index_directory}/{GENOME_BUILD}_index >> {log} 2>&1"

SnakeMake From line 35 of rules/align.smk

shell:
    "{SALMON_INDEX_EXEC} -t {input} -i {params.salmon_index_dir} -p {threads} >> {log} 2>&1"

SnakeMake From line 54 of rules/align.smk

shell:
    "{STAR_EXEC_MAP} --runThreadN {threads} --genomeDir {params.index_dir} --readFilesIn {input.reads} --readFilesCommand '{GUNZIP_EXEC} -c' --outSAMtype BAM SortedByCoordinate --outFileNamePrefix {params.output_prefix} >> {log} 2>&1"

SnakeMake From line 79 of rules/align.smk

shell:
    """
    {HISAT2_EXEC} -x {params.index_dir}/{GENOME_BUILD}_index -p {threads} -q -S {params.samfile} {params.args} >> {log[0]} 2>&1
    {SAMTOOLS_EXEC} view -bh {params.samfile} | {SAMTOOLS_EXEC} sort -o {output} >> {log[1]} 2>&1
    rm {params.samfile}
    """

SnakeMake From line 102 of rules/align.smk
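
The shell strings in these rules mix three kinds of placeholders: global executables such as `{SAMTOOLS_EXEC}`, attribute lookups such as `{params.samfile}`, and indexed lookups such as `{log[1]}`. The sketch below approximates how they resolve using plain `str.format`. Snakemake uses its own formatter, which differs in details (for example, lists are joined with spaces), and all file names here are made up for illustration.

```python
from types import SimpleNamespace

# Second command of the HISAT2 rule above, unchanged.
template = (
    "{SAMTOOLS_EXEC} view -bh {params.samfile} | "
    "{SAMTOOLS_EXEC} sort -o {output} >> {log[1]} 2>&1"
)

# Hypothetical values standing in for what Snakemake would supply.
command = template.format(
    SAMTOOLS_EXEC="samtools",
    params=SimpleNamespace(samfile="sample1.sam"),
    output="sample1.sorted.bam",
    log=["hisat2.log", "samtools.log"],
)

print(command)
```

Because the rule declares two log files, the alignment step writes to `{log[0]}` and the sorting step to `{log[1]}`, which keeps the output of the two tools separate.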

shell:
    "{SAMTOOLS_EXEC} index {input} {output}"

SnakeMake From line 119 of rules/align.smk

script:
    "../scripts/salmon_quant.py"

SnakeMake From line 146 of rules/align.smk

shell:
    """
    {BAMCOVERAGE_EXEC} -b {input.bam} -o {output[0]} --filterRNAstrand forward >> {log[0]} 2>&1
    {BAMCOVERAGE_EXEC} -b {input.bam} -o {output[1]} --filterRNAstrand reverse >> {log[1]} 2>&1
    {BAMCOVERAGE_EXEC} -b {input.bam} -o {output[2]} >> {log[2]} 2>&1
    """

SnakeMake From line 17 of rules/bigwig.smk

shell:
    "{RSCRIPT_EXEC} {SCRIPTS_DIR}/counts_matrix_from_SALMON.R {SALMON_DIR} {COUNTS_DIR} {input.colDataFile} >> {log} 2>&1"

SnakeMake From line 17 of rules/diffexp.smk

script:
    "../scripts/featureCounts.py"

SnakeMake From line 65 of rules/diffexp.smk

shell:
    "{RSCRIPT_EXEC} {params.script} {params.mapped_dir} {output} >> {log} 2>&1"

SnakeMake From line 83 of rules/diffexp.smk

shell:
    "{RSCRIPT_EXEC} {params.script} {input.counts_file} {input.colDataFile} {params.outdir} >> {log} 2>&1"

SnakeMake From line 106 of rules/diffexp.smk

shell:
    "{RSCRIPT_EXEC} {params.reportR} --prefix='{wildcards.analysis}' --reportFile={params.reportRmd} --countDataFile={input.counts} --colDataFile={input.coldata} --gtfFile={GTF_FILE} --caseSampleGroups='{params.case}' --controlSampleGroups='{params.control}' --covariates='{params.covariates}' --workdir={params.outdir} --organism='{ORGANISM}' >> {log} 2>&1"

SnakeMake From line 129 of rules/diffexp.smk

shell:
    "{RSCRIPT_EXEC} {params.reportR} --prefix='{wildcards.analysis}.salmon.transcripts' --reportFile={params.reportRmd} --countDataFile={input.counts} --colDataFile={input.coldata} --gtfFile={GTF_FILE} --caseSampleGroups='{params.case}' --controlSampleGroups='{params.control}' --covariates='{params.covariates}' --workdir={params.outdir} --organism='{ORGANISM}' >> {log} 2>&1"

SnakeMake From line 151 of rules/diffexp.smk

shell:
    "{RSCRIPT_EXEC} {params.reportR} --prefix='{wildcards.analysis}.salmon.genes' --reportFile={params.reportRmd} --countDataFile={input.counts} --colDataFile={input.coldata} --gtfFile={GTF_FILE} --caseSampleGroups='{params.case}' --controlSampleGroups='{params.control}' --covariates='{params.covariates}' --workdir={params.outdir} --organism='{ORGANISM}' >> {log} 2>&1"

SnakeMake From line 173 of rules/diffexp.smk
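
The three report rules above pass the same options and differ only in their inputs and in the `--prefix` value: the analysis name alone, or the analysis name followed by `.salmon.transcripts` or `.salmon.genes`. A minimal sketch of that pattern follows. The helper `report_args` and the example values are hypothetical, and the option names are copied from the rules.

```python
def report_args(analysis, variant="", **options):
    """Build the report script arguments for one analysis.

    variant is "" for the first rule, or "salmon.transcripts" / "salmon.genes"
    for the Salmon-based rules.
    """
    prefix = "{}.{}".format(analysis, variant) if variant else analysis
    args = ["--prefix={}".format(prefix)]
    # Remaining options (for example countDataFile, colDataFile, gtfFile) are
    # passed through unchanged as --name=value, in the order given.
    args += ["--{}={}".format(name, value) for name, value in options.items()]
    return args


for variant in ("", "salmon.transcripts", "salmon.genes"):
    print(report_args("case_vs_control", variant, organism="hsapiens"))
```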

shell:
    "{MULTIQC_EXEC} -f -o {MULTIQC_DIR} {OUTPUT_DIR} >> {log} 2>&1"

SnakeMake From line 13 of rules/qc.smk

shell:
    "{FASTP_EXEC} --in1 {input[0]} --in2 {input[1]} --out1 {output.r1} --out2 {output.r2} -h {output.html} -j {output.json} >> {log} 2>&1"

SnakeMake From line 14 of rules/trim.smk

shell:
    "{FASTP_EXEC} --in1 {input[0]} --out1 {output.r} -h {output.html} -j {output.json} >> {log} 2>&1"

SnakeMake From line 30 of rules/trim.smk

import sys
import tempfile
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# extra = snakemake.params.get("extra", "")

# optional input files and directories
strand = snakemake.params.get("strandedness", 0)
if int(strand) not in [0, 1, 2]:
    # sys.exit with a message prints it to stderr and exits with status 1
    sys.exit("Acceptable strandedness options are 0 (unstranded), 1 (forward), or 2 (reverse).")

annotation_file_type = snakemake.params.get("annotation_file_type", "")
if annotation_file_type:
    annotation_file_type = f"-F {annotation_file_type}"

group_feature_by = snakemake.params.get("group_by", "")
if group_feature_by:
    group_feature_by = f"-g {group_feature_by}"

feature = snakemake.params.get("feature", "")
if feature:
    feature = f"-t {feature}"


singleEnd = snakemake.params.get("single_end", False)
if singleEnd:
    isPair = ""
else:
    isPair = "-p"


with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "featureCounts"
        " -T {snakemake.threads}"
        " -s {strand}"
        " -a {snakemake.input.annotation}"
        " {isPair}"
        " {annotation_file_type}"
        " {group_feature_by}"
        " {feature}"
        " --tmpDir {tmpdir}"
        " -o {snakemake.output[0]}"
        " {snakemake.input.samples}"
        " {log}"
    )

Python Snakemake FeatureCounts From line 8 of scripts/featureCounts.py
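
To see what command line the script above ends up running, the same option handling can be written as a standalone function without the `snakemake` object. This is a sketch under stated assumptions: the function name, the default values, and the example file names are hypothetical, while the flags are the ones used in the script.

```python
import shlex


def featurecounts_command(annotation, bam_files, output, threads=1, strand=0,
                          single_end=False, annotation_file_type="",
                          group_by="", feature="", tmpdir="/tmp"):
    """Assemble a featureCounts command line from optional settings."""
    if int(strand) not in (0, 1, 2):
        raise ValueError("strandedness must be 0, 1, or 2")
    argv = ["featureCounts", "-T", str(threads), "-s", str(int(strand)), "-a", annotation]
    if not single_end:
        argv.append("-p")  # paired-end input
    if annotation_file_type:
        argv += ["-F", annotation_file_type]
    if group_by:
        argv += ["-g", group_by]
    if feature:
        argv += ["-t", feature]
    argv += ["--tmpDir", tmpdir, "-o", output, *bam_files]
    # shlex.join quotes any argument that needs it (Python 3.8+).
    return shlex.join(argv)


print(featurecounts_command("genes.gtf", ["a.bam", "b.bam"], "counts.txt",
                            threads=4, strand=2, group_by="gene_id", feature="exon"))
```

Empty optional settings are simply left out of the command, which mirrors how the script leaves the corresponding placeholder blank.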

from os import path

from snakemake.shell import shell

reads = snakemake.input.reads
GTF_FILE = snakemake.input.gtf_file

outfolder = snakemake.params.get("outfolder", "")
index_dir = snakemake.params.get("index_dir", "")
SALMON_QUANT_EXEC = snakemake.params.get("SALMON_QUANT_EXEC", "")

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# One read file means single-end, two mean paired-end; anything else is an error.
if len(reads) == 1:
    COMMAND = "{SALMON_QUANT_EXEC} -i {index_dir} -l A -p {snakemake.threads} -r {reads} -o {outfolder} --seqBias --gcBias -g {GTF_FILE} {log}"
elif len(reads) == 2:
    COMMAND = "{SALMON_QUANT_EXEC} -i {index_dir} -l A -p {snakemake.threads} -1 {reads[0]} -2 {reads[1]} -o {outfolder} --seqBias --gcBias -g {GTF_FILE} {log}"
else:
    raise ValueError("Expected 1 (single-end) or 2 (paired-end) read files, got {}".format(len(reads)))

shell(COMMAND)

Python Snakemake From line 8 of scripts/salmon_quant.py
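
The single-end versus paired-end branching in the script above can be sketched as a plain function that returns the command line. The function name and example paths are hypothetical, and `salmon quant` is assumed to be what `SALMON_QUANT_EXEC` resolves to. The remaining flags are copied from the script.

```python
import shlex


def salmon_quant_command(reads, index_dir, outfolder, gtf_file, threads=1):
    """Assemble a Salmon quantification command for one sample."""
    if len(reads) == 1:
        read_args = ["-r", reads[0]]                   # single-end
    elif len(reads) == 2:
        read_args = ["-1", reads[0], "-2", reads[1]]   # paired-end
    else:
        raise ValueError("expected 1 or 2 read files, got {}".format(len(reads)))
    argv = ["salmon", "quant", "-i", index_dir, "-l", "A", "-p", str(threads),
            *read_args, "-o", outfolder, "--seqBias", "--gcBias", "-g", gtf_file]
    return shlex.join(argv)


print(salmon_quant_command(["s1_R1.fastq.gz", "s1_R2.fastq.gz"],
                           "salmon_index", "salmon_out/s1", "genes.gtf", threads=8))
```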

import os
import csv
import yaml
import argparse
from glob import glob

def read_sample_sheet(path):
    with open(path, 'r') as fp:
        rows = list(csv.reader(fp, delimiter=','))
        header, rows = rows[0], rows[1:]
        sample_sheet = [dict(zip(header, row)) for row in rows]
    return sample_sheet

def read_config_file(path):
    with open(path, 'rt') as infile:
        config = yaml.safe_load(infile)  # yaml.load() without a Loader argument fails on PyYAML >= 6
    return config

def validate_config(config):
    # Check that all locations exist
    for loc in config['locations']:
        if loc != 'output-dir' and not (os.path.isdir(config['locations'][loc]) or os.path.isfile(config['locations'][loc])):
            raise Exception("ERROR: The following necessary directory/file does not exist: {} ({})".format(config['locations'][loc], loc))

    sample_sheet = read_sample_sheet(config['locations']['sample-sheet'])

    # Check if the required fields are found in the sample sheet
    required_fields = set(['name', 'reads', 'reads2', 'sample_type'])
    not_found = required_fields.difference(set(sample_sheet[0].keys()))
    if len(not_found) > 0:
        raise Exception("ERROR: Required field(s) {} could not be found in the sample sheet file '{}'".format(not_found, config['locations']['sample-sheet']))

    # Check that requested analyses make sense
    if 'DEanalyses' in config:
        for analysis in config['DEanalyses']:
            for group in config['DEanalyses'][analysis]['case_sample_groups'].split(',') + config['DEanalyses'][analysis]['control_sample_groups'].split(','):
                group = group.strip() #remove any leading/trailing whitespaces in the sample group names
                if not any(row['sample_type'] == group for row in sample_sheet):
                    raise Exception('ERROR: no samples in sample sheet have sample type {}, specified in analysis {}.'.format(group, analysis))

    # Check that read files exist and that each sample name is unique
    samples = {}

    for row in sample_sheet:
        sample = row['name']
        if sample in samples:
            raise Exception('ERROR: name "{}" is not unique. Replace it with a unique name in the sample_sheet.'.format(sample))
        else:
            samples[sample] = 1

        filenames = [row['reads'], row['reads2']] if row['reads2'] else [row['reads']]
        for filename in filenames:
            fullpath = os.path.join(config['locations']['reads-dir'], filename)
            if not os.path.isfile(fullpath):
                raise Exception('ERROR: missing reads file: {}'.format(fullpath))



if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--config-file', required=True, help='Path of configuration file [settings.yaml]')
    parser.add_argument('-s', '--sample-sheet-file', required=True, help='Path of sample sheet [sample_sheet.csv]')
    args = parser.parse_args()

    config = read_config_file(args.config_file)
    config['locations']['sample-sheet'] = args.sample_sheet_file
    validate_config(config)

Python PyYAML From line 1 of scripts/validate_input.py
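
The validation script above expects a comma-separated sample sheet with the columns `name`, `reads`, `reads2` and `sample_type`, where an empty `reads2` marks a single-end sample. The sketch below shows a sample sheet that would pass the field check, and mirrors the required-field and sample-group checks in memory. The sample names, file names and group names are invented for illustration.

```python
import csv
import io

# Hypothetical sample sheet; case_1 has an empty reads2 column (single-end).
SAMPLE_SHEET = """\
name,reads,reads2,sample_type
ctrl_1,ctrl_1_R1.fastq.gz,ctrl_1_R2.fastq.gz,control
case_1,case_1_R1.fastq.gz,,treated
"""

REQUIRED_FIELDS = {"name", "reads", "reads2", "sample_type"}


def parse_sample_sheet(text):
    """Parse CSV text into one dict per sample, keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def missing_fields(rows):
    """Return the required columns that the sample sheet lacks."""
    return REQUIRED_FIELDS - set(rows[0].keys())


def unknown_groups(rows, groups):
    """Return sample groups (comma-separated string) with no matching sample_type."""
    known = {row["sample_type"] for row in rows}
    return [g.strip() for g in groups.split(",") if g.strip() not in known]


rows = parse_sample_sheet(SAMPLE_SHEET)
print(missing_fields(rows))
print(unknown_groups(rows, "control, knockout"))
```

A group named in a differential expression analysis but absent from the `sample_type` column is reported here as a list entry, whereas the script raises an exception.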

run:
  for key in sorted(targets.keys()):
    print('{}:\n  {}'.format(key, targets[key]['description']))

SnakeMake From line 41 of workflow/Snakefile
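
The rule above prints every entry of a `targets` dictionary in sorted order, with the target's description indented on the following line. A minimal sketch of that behaviour, using a made-up `targets` dictionary (the real target names and descriptions are defined in the Snakefile and are not shown here):

```python
def format_targets(targets):
    """Render each target name followed by its indented description."""
    return "\n".join(
        "{}:\n  {}".format(key, targets[key]["description"])
        for key in sorted(targets.keys())
    )


# Hypothetical targets, for illustration only.
example_targets = {
    "final-report": {"description": "Produce all reports."},
    "counts": {"description": "Produce count matrices only."},
}

print(format_targets(example_targets))
```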

shell: "{RSCRIPT_EXEC} {SCRIPTS_DIR}/translate_sample_sheet_for_report.R {input}"

SnakeMake From line 75 of workflow/Snakefile

Created: 1yr ago

Updated: 1yr ago

Maintainers: public

URL: https://github.com/nnlrl/rna-seq-snakemake

Name: rna-seq-snakemake

Version: 2

License: None

Keywords:

FeatureCounts Snakemake PyYAML Gene expression
