Differential Gene Expression Analysis Pipeline

public public 1yr ago Version: 2 0 bookmarks
  • Here are the instructions translated into English:

    Use STAR or HISAT2 for DNA alignment and count using featureCounts. Use Salmon for cDNA alignment and counting. Generate a report using MultiQC. Perform differential analysis using DESeq2.

    Usage:

    Create an output directory

    mkdir /path/to/output

    Generate the initial configuration file

    python ./setup.py --init

    Create environments

    python ./setup.py -s --conda-create-envs-only

    Start the process

    python ./setup.py -s

    This script is based on https://github.com/BIMSBbioinfo/pigx_rnaseq .

Code Snippets

16
17
shell:
    "{STAR_EXEC_INDEX} --runMode genomeGenerate --runThreadN {threads} --genomeDir {params.star_index_dir} --genomeFastaFiles {input} --sjdbGTFfile {GTF_FILE} >> {log} 2>&1"
35
36
shell:
    "{HISAT2_BUILD_EXEC} -p {threads} --large-index {input} {params.index_directory}/{GENOME_BUILD}_index >> {log} 2>&1"
54
55
shell:
    "{SALMON_INDEX_EXEC} -t {input} -i {params.salmon_index_dir} -p {threads} >> {log} 2>&1"
79
80
shell:
    "{STAR_EXEC_MAP} --runThreadN {threads} --genomeDir {params.index_dir} --readFilesIn {input.reads} --readFilesCommand '{GUNZIP_EXEC} -c' --outSAMtype BAM SortedByCoordinate --outFileNamePrefix {params.output_prefix} >> {log} 2>&1"
102
103
104
105
106
107
shell:
    """
    {HISAT2_EXEC} -x {params.index_dir}/{GENOME_BUILD}_index -p {threads} -q -S {params.samfile} {params.args} >> {log[0]} 2>&1
    {SAMTOOLS_EXEC} view -bh {params.samfile} | {SAMTOOLS_EXEC} sort -o {output} >> {log[1]} 2>&1
    rm {params.samfile}
    """
SnakeMake From line 102 of rules/align.smk
119
120
shell:
    "{SAMTOOLS_EXEC} index {input} {output}"
SnakeMake From line 119 of rules/align.smk
146
147
script:
    "../scripts/salmon_quant.py"
SnakeMake From line 146 of rules/align.smk
17
18
19
20
21
22
shell:
    """
    {BAMCOVERAGE_EXEC} -b {input.bam} -o {output[0]} --filterRNAstrand forward >> {log[0]} 2>&1
    {BAMCOVERAGE_EXEC} -b {input.bam} -o {output[1]} --filterRNAstrand reverse >> {log[1]} 2>&1
    {BAMCOVERAGE_EXEC} -b {input.bam} -o {output[2]} >> {log[2]} 2>&1
    """
17
18
shell:
    "{RSCRIPT_EXEC} {SCRIPTS_DIR}/counts_matrix_from_SALMON.R {SALMON_DIR} {COUNTS_DIR} {input.colDataFile} >> {log} 2>&1"
65
66
script:
    "../scripts/featureCounts.py"
83
84
shell:
    "{RSCRIPT_EXEC} {params.script} {params.mapped_dir} {output} >> {log} 2>&1"
106
107
shell:
    "{RSCRIPT_EXEC} {params.script} {input.counts_file} {input.colDataFile} {params.outdir} >> {log} 2>&1"
129
130
shell:
    "{RSCRIPT_EXEC} {params.reportR} --prefix='{wildcards.analysis}' --reportFile={params.reportRmd} --countDataFile={input.counts} --colDataFile={input.coldata} --gtfFile={GTF_FILE} --caseSampleGroups='{params.case}' --controlSampleGroups='{params.control}' --covariates='{params.covariates}'  --workdir={params.outdir} --organism='{ORGANISM}'  >> {log} 2>&1"
151
152
shell:
    "{RSCRIPT_EXEC} {params.reportR} --prefix='{wildcards.analysis}.salmon.transcripts' --reportFile={params.reportRmd} --countDataFile={input.counts} --colDataFile={input.coldata} --gtfFile={GTF_FILE} --caseSampleGroups='{params.case}' --controlSampleGroups='{params.control}' --covariates='{params.covariates}' --workdir={params.outdir} --organism='{ORGANISM}' >> {log} 2>&1"
173
174
shell:
    "{RSCRIPT_EXEC} {params.reportR} --prefix='{wildcards.analysis}.salmon.genes' --reportFile={params.reportRmd} --countDataFile={input.counts} --colDataFile={input.coldata} --gtfFile={GTF_FILE} --caseSampleGroups='{params.case}' --controlSampleGroups='{params.control}' --covariates='{params.covariates}' --workdir={params.outdir} --organism='{ORGANISM}' >> {log} 2>&1"
13
14
shell:
    "{MULTIQC_EXEC} -f -o {MULTIQC_DIR} {OUTPUT_DIR} >> {log} 2>&1"
SnakeMake From line 13 of rules/qc.smk
14
15
shell:
    "{FASTP_EXEC} --in1 {input[0]} --in2 {input[1]} --out1 {output.r1} --out2 {output.r2} -h {output.html} -j {output.json} >> {log} 2>&1"
SnakeMake From line 14 of rules/trim.smk
30
31
shell:
    "{FASTP_EXEC} --in1 {input[0]} --out1 {output.r} -h {output.html} -j {output.json} >> {log} 2>&1 "
SnakeMake From line 30 of rules/trim.smk
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
import tempfile
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# extra = snakemake.params.get("extra", "")

# optional input files and directories
strand = snakemake.params.get("strandedness", 0)
if int(strand) not in [0, 1, 2]:
    print("Acceptable strandedness options are '0(unspecific), 1(forward), or 2(reverse).")
    exit(1)

annotation_file_type = snakemake.params.get("annotation_file_type", "")
if annotation_file_type:
    annotation_file_type = f"-F {annotation_file_type}"

group_feature_by = snakemake.params.get("group_by", "")
if group_feature_by:
    group_feature_by = f"-g {group_feature_by}"

feature = snakemake.params.get("feature", "")
if feature:
    feature = f"-t {feature}"


singleEnd = snakemake.params.get("single_end", False)
if singleEnd:
    isPair = ""
else:
    isPair = "-p"


with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "featureCounts"
        " -T {snakemake.threads}"
        " -s {strand}"
        " -a {snakemake.input.annotation}"
        " {isPair}"
        " {annotation_file_type}"
        " {group_feature_by}"
        " {feature}"
        " --tmpDir {tmpdir}"
        " -o {snakemake.output[0]}"
        " {snakemake.input.samples}"
        " {log}"
    )
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
from os import path

from snakemake.shell import shell

reads = snakemake.input.reads
GTF_FILE = snakemake.input.gtf_file

outfolder = snakemake.params.get("outfolder", "")
index_dir = snakemake.params.get("index_dir", "")
SALMON_QUANT_EXEC = snakemake.params.get("SALMON_QUANT_EXEC", "")

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

if(len(snakemake.input.reads) == 1):
    COMMAND = "{SALMON_QUANT_EXEC} -i {index_dir} -l A -p {snakemake.threads} -r {reads} -o {outfolder} --seqBias --gcBias -g {GTF_FILE} {log}"
elif(len(snakemake.input.reads) == 2):
    COMMAND = "{SALMON_QUANT_EXEC} -i {index_dir} -l A -p {snakemake.threads} -1 {reads[0]} -2 {reads[1]} -o {outfolder} --seqBias --gcBias -g {GTF_FILE} {log}"

shell(COMMAND)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
import os
import csv
import yaml
import argparse
from glob import glob

def read_sample_sheet(path):
    with open(path, 'r') as fp:
        rows =  [row for row in csv.reader(fp, delimiter=',')]
        header = rows[0]; rows = rows[1:]
        sample_sheet = [dict(zip(header, row)) for row in rows]
    return sample_sheet

def read_config_file(path):
    with open(path, 'rt') as infile:
        config = yaml.load(infile)
    return config

def validate_config(config):
    # Check that all locations exist
    for loc in config['locations']:
        if (not loc == 'output-dir') and (not (os.path.isdir(config['locations'][loc]) or os.path.isfile(config['locations'][loc]))):
            raise Exception("ERROR: The following necessary directory/file does not exist: {} ({})".format(config['locations'][loc], loc))

    sample_sheet = read_sample_sheet(config['locations']['sample-sheet'])

    # Check if the required fields are found in the sample sheet
    required_fields = set(['name', 'reads', 'reads2', 'sample_type'])
    not_found = required_fields.difference(set(sample_sheet[0].keys()))
    if len(not_found) > 0:
        raise Exception("ERROR: Required field(s) {} could not be found in the sample sheet file '{}'".format(not_found, config['locations']['sample-sheet']))

    # Check that requested analyses make sense
    if 'DEanalyses' in config:
        for analysis in config['DEanalyses']:
            for group in config['DEanalyses'][analysis]['case_sample_groups'] .split(',') + config['DEanalyses'][analysis]['control_sample_groups'].split(','):
                group = group.strip() #remove any leading/trailing whitespaces in the sample group names
                if not any(row['sample_type'] == group for row in sample_sheet):
                    raise Exception('ERROR: no samples in sample sheet have sample type {}, specified in analysis {}.'.format(group, analysis))

    # Check that reads files exist; sample names are unique to each row; 
    samples = {}        

    for row in sample_sheet:
        sample = row['name']
        if sample in samples:
            raise Exception('ERROR: name "{}" is not unique. Replace it with a unique name in the sample_sheet.'.format(sample))
        else:
            samples[sample] = 1

        filenames = [row['reads'], row['reads2']] if row['reads2'] else [row['reads']]
        for filename in filenames:
            fullpath = os.path.join(config['locations']['reads-dir'], filename)
            if not os.path.isfile(fullpath):
                raise Exception('ERROR: missing reads file: {}'.format(fullpath))



if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--config-file', required=True, help='Path of configuration file [settings.yaml]')
    parser.add_argument('-s', '--sample-sheet-file', required=True, help='Path of sample sheet [sample_sheet.csv]')
    args = parser.parse_args()

    config = read_config_file(args.config_file)
    config['locations']['sample-sheet'] = args.sample_sheet_file
    validate_config(config)
41
42
43
run:
  for key in sorted(targets.keys()):
    print('{}:\n  {}'.format(key, targets[key]['description']))
75
shell: "{RSCRIPT_EXEC} {SCRIPTS_DIR}/translate_sample_sheet_for_report.R {input}"
ShowHide 21 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/nnlrl/rna-seq-snakemake
Name: rna-seq-snakemake
Version: 2
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Other Versions:
Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...