RNA-seq workflow for Snakemake based on STAR and featureCounts.

public public 1yr ago Version: 0.1 0 bookmarks

Snakemake-exome

|Snakemake| |Wercker|

This is a Snakemake workflow for generating gene expression counts from RNA-sequencing data using STAR and featureCounts (from the subread package). The workflow is designed to handle both single-end and paired-end sequencing data, as well as sequencing data from multiple lanes. Processing of patient-derived xenograft (PDX) samples is also supported, by using disambiguate to separate graft/host sequence reads.

If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this repository and, if available, its DOI (see above).

.. |Snakemake| image:: https://img.shields.io/badge/snakemake-≥3.13.3-brightgreen.svg :target: https://snakemake.bitbucket.io

.. |Wercker| image:: https://app.wercker.com/status/9c3bfb4aa4dbffc027b7a0fcfc00cc57/s/develop :target: https://app.wercker.com/project/byKey/9c3bfb4aa4dbffc027b7a0fcfc00cc57

Overview

The standard (non-PDX) workflow essentially performs the following steps:

  • Cutadapt is used to trim the input reads for adapters and/or poor-quality base calls.

  • The trimmed reads are aligned to the reference genome using STAR.

  • The resulting alignments are sorted and indexed using sambamba.

  • featureCounts is used to generate gene expression counts.

  • The (per sample) counts are merged into a single count file.

  • The merged counts are normalized for differences in sequencing depth (using DESeq's median-of-ratios approach) and log-transformed.

This results in the following dependency graph:

.. image:: https://jrderuiter.github.io/snakemake-rnaseq/_images/dag.svg

The PDX workflow is a slightly modified version of the standard workflow, which aligns the reads to two reference genome (the host and graft reference genomes) and uses disambiguate_ to remove sequences originating from the host organism. See the documentation for more details.

Documentation

Documentation is available at: http://jrderuiter.github.io/snakemake-rnaseq.

License

This software is released under the MIT license.

.. _disambiguate: https://github.com/AstraZeneca-NGS/disambiguate

Code Snippets

47
48
wrapper:
    "0.17.0/bio/star/align"
65
66
wrapper:
    "0.17.0/bio/star/align"
78
79
wrapper:
    "0.17.0/bio/sambamba/sort"
101
102
wrapper:
    "0.17.0/bio/samtools/merge"
119
120
wrapper:
    "0.17.0/bio/ngs-disambiguate"
132
133
wrapper:
    "0.17.0/bio/sambamba/sort"
141
142
wrapper:
    "0.17.0/bio/samtools/index"
159
160
wrapper:
    "0.17.0/bio/star/align"
172
173
wrapper:
    "0.17.0/bio/sambamba/sort"
195
196
wrapper:
    "0.17.0/bio/samtools/merge"
204
205
wrapper:
    "0.17.0/bio/samtools/index"
59
60
61
62
63
64
65
66
67
    wrapper:
        "file://" + path.join(workflow.basedir, "wrappers/subread/feature_counts")


rule merge_counts:
    input:
        expand("counts/per_sample/{sample}.txt", sample=get_samples())
    output:
        "counts/merged.txt"
68
69
70
71
72
73
74
75
76
77
78
79
run:
    # Merge count files.
    frames = (pd.read_csv(fp, sep="\t", skiprows=1,
                    index_col=list(range(6)))
        for fp in input)
    merged = pd.concat(frames, axis=1)

    # Extract sample names.
    merged = merged.rename(
        columns=lambda c: path.splitext(path.basename(c))[0])

    merged.to_csv(output[0], sep="\t", index=True)
87
88
89
90
run:
    counts = pd.read_csv(input[0], sep="\t", index_col=list(range(6)))
    norm_counts = np.log2(normalize_counts(counts) + 1)
    norm_counts.to_csv(output[0], sep="\t", index=True)
14
15
wrapper:
    "0.17.0/bio/cutadapt/pe"
27
28
wrapper:
    "0.17.0/bio/cutadapt/se"
63
64
shell:
    "cp {input} {output}"
34
35
wrapper:
    "0.17.0/bio/multiqc"
46
47
wrapper:
    "0.17.0/bio/fastqc"
55
56
wrapper:
    "0.17.0/bio/samtools/stats"
SnakeMake From line 55 of rules/qc.smk
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "cutadapt"
    " {snakemake.params}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "cutadapt"
    " {snakemake.params}"
    " -o {snakemake.output.fastq}"
    " {snakemake.input[0]}"
    " > {snakemake.output.qc} {log}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)

    split_ind = 2 if base.endswith(".gz") else 1
    base = ".".join(base.split(".")[:-split_ind])

    return base


# Run fastqc.
output_dir = path.dirname(snakemake.output.html)

shell("fastqc {snakemake.params} --quiet "
      "--outdir {output_dir} {snakemake.input[0]}")

# Move outputs into proper position.
output_base = basename_without_ext(snakemake.input[0])
html_path = path.join(output_dir, output_base + "_fastqc.html")
zip_path = path.join(output_dir, output_base + "_fastqc.zip")

if snakemake.output.html != html_path:
    shell("mv {html_path} {snakemake.output.html}")

if snakemake.output.zip != zip_path:
    shell("mv {zip_path} {snakemake.output.zip}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


# Extract arguments.
prefix = snakemake.params.get("prefix", None)
extra = snakemake.params.get("extra", "")

output_dir = path.dirname(snakemake.output.a_ambiguous)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# If prefix is not given, we use the summary path to derive the most
# probable sample name (as the summary path is least likely to contain)
# additional suffixes. This is better than using a random id as prefix,
# the prefix is also used as the sample name in the summary file.
if prefix is None:
    prefix = path.splitext(path.basename(snakemake.output.summary))[0]

# Run command.
shell(
    "ngs_disambiguate"
    " {extra}"
    " -o {output_dir}"
    " -s {prefix}"
    " -a {snakemake.params.algorithm}"
    " {snakemake.input.a}"
    " {snakemake.input.b}")

# Move outputs into expected positions.
output_base = path.join(output_dir, prefix)

output_map = {
    output_base + ".ambiguousSpeciesA.bam":
        snakemake.output.a_ambiguous,
    output_base + ".ambiguousSpeciesB.bam":
        snakemake.output.b_ambiguous,
    output_base + ".disambiguatedSpeciesA.bam":
        snakemake.output.a_disambiguated,
    output_base + ".disambiguatedSpeciesB.bam":
        snakemake.output.b_disambiguated,
    output_base + "_summary.txt":
        snakemake.output.summary
}

for src, dest in output_map.items():
    if src != dest:
        shell('mv {src} {dest}')
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

shell(
    "sambamba sort {snakemake.params} -t {snakemake.threads} "
    "-o {snakemake.output[0]} {snakemake.input[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools merge --threads {snakemake.threads} {snakemake.params} "
      "{snakemake.output[0]} {snakemake.input}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
region = snakemake.params.get("region", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


shell("samtools stats {extra} {snakemake.input}"
      " {region} > {snakemake.output} {log}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

n = len(snakemake.input.sample)
assert n == 1 or n == 2, "input->sample must have 1 (single-end) or 2 (paired-end) elements."

if snakemake.input.sample[0].endswith(".gz"):
    readcmd = "--readFilesCommand zcat"
else:
    readcmd = ""


outprefix = os.path.dirname(snakemake.output[0]) + "/"


shell(
    "STAR "
    "{snakemake.params.extra} "
    "--runThreadN {snakemake.threads} "
    "--genomeDir {snakemake.params.index} "
    "--readFilesIn {snakemake.input.sample} "
    "{readcmd} "
    "--outSAMtype BAM Unsorted "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "{log}")
ShowHide 24 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/jrderuiter/snakemake-rnaseq
Name: snakemake-rnaseq
Version: 0.1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...