A Snakemake workflow to process paired-end RNA-Seq data.

public public 1yr ago 0 bookmarks

A Snakemake workflow to process paired-end RNA-Seq data.

Steps:

The workflow consists following steps:

  • Quality control of the raw and/or trimmed data (FastQC, MultiQC)

  • Adapter trimming w/ trim_galore (Optional)

  • Contamination check and decontamination (Optional)

  • Alignment to the reference genome (hisat2, STAR)

  • Alignment quality control with RSeQC, QualiMap

  • Transcript/gene quantification (StringTie, featureCounts, RSEM)

  • Alignment-free transcript quantification (kallisto/salmon)

Future additions:

  • Differential gene expression analysis (deseq2)

  • Machine learning-based mapping uncertainty analysis (GeneQC)

dag

Code Snippets

17
18
19
20
shell:
    """
    fastqc --outdir {params.prefix} -t {threads} -f fastq {input}
    """
 9
10
wrapper:
    "0.35.0/bio/multiqc"
13
14
wrapper:
    "0.35.2/bio/trim_galore/pe"
30
31
32
33
shell:
    """
    fastqc -t {threads} {input.r1} {input.r2} -q -f fastq -o {params.prefix}
    """
43
44
wrapper:
    "0.49.0/bio/multiqc"
54
55
wrapper:
    "0.49.0/bio/multiqc"
11
12
13
14
shell:
    '''
    bwa mem -t {threads} {input.rrna} {input.r1} {input.r2} > {output}
    '''
24
25
26
27
shell:
    '''
    samtools view -@ {threads} -bS -o {output} {input.sam}
    '''
37
38
39
40
shell:
    '''
    samtools flagstat {input.bam} --threads {threads} > {output}
    '''
50
51
52
53
shell:
    '''
    samtools stats {input.bam} > {output}
    '''
64
65
wrapper:
    "0.48.0/bio/multiqc"
18
19
20
21
shell:
    """
    bbsplit.sh -Xmx120g threads={threads} in1={input.r1} in2={input.r2} ref_rrna={input.rrna} path={params.index} basename={params.out_rrna}/out_%.fq outu1={output.out1} outu2={output.out2} refstats={output.stats} 2> {log}
    """
37
38
39
40
shell:
    """
    fastqc -t {threads} {input.r1} {input.r2} -q -f fastq -o {params.prefix}
    """
50
51
wrapper:
    "0.47.0/bio/multiqc"
13
14
15
16
shell:
    '''
    bwa mem -t {threads} {input.rrna} {input.r1} {input.r2} > {output}
    '''
26
27
28
29
shell:
    '''
    samtools view -@ {threads} -bS -o {output} {input.sam}
    '''
39
40
41
42
shell:
    '''
    samtools stats {input.bam} -@ {threads} > {output}
    '''
52
53
wrapper:
    "0.48.0/bio/multiqc"
8
9
shell:
    "gffread {input.anno} -T -o {output.gtf}"
22
23
wrapper:
    "0.49.0/bio/star/index"
41
42
wrapper:
    "0.49.0/bio/star/align"
49
50
51
52
shell:
    """
    cat {input} | awk '($5 > 0 && $7 > 2 && $6==0)' | cut -f1-6 | sort | uniq > {output}
    """
73
74
wrapper:
    "0.49.0/bio/star/align"
11
12
script:
    "../scripts/gtf2bed.py"
38
39
40
shell:
    "junction_annotation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} "
    "> {log[0]} 2>&1"
51
52
shell:
    "bam_stat.py -i {input} > {output} 2> {log}"
65
66
shell:
    "infer_experiment.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
78
79
shell:
    "read_distribution.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
92
93
shell:
    "read_GC.py -i {input} -o {params.prefix} > {log} 2>&1"
126
127
wrapper:
    "0.49.0/bio/multiqc"
11
12
13
14
shell:
    '''
    samtools sort -@ {threads} -n -o {output} -T {params.prfx} {input.bam}
    '''
26
27
28
29
shell:
    '''
    qualimap rnaseq -bam {input.sorted_bam} -gtf {params.gtf} --outdir {params.outdir} --sorted --paired
    '''
38
39
wrapper:
    "0.49.0/bio/multiqc"
12
13
14
15
16
17
shell:
    """
    rsem-prepare-reference \
    -p {threads} \
    --gff3 {input.gff} {input.fasta} {params.ref_name}
    """
31
32
33
34
35
36
37
38
shell:
    """
    rsem-calculate-expression \
    --no-qualities \
    -p {threads} \
    --strandedness reverse \
    --alignments --paired-end {input.bam} {params.genomedir} {params.prefix}
    """
47
48
wrapper:
    "0.49.0/bio/multiqc"
13
14
15
16
shell:
    """
    htseq-count -m intersection-nonempty --stranded=reverse --idattr gene_id -r pos -f bam {input} {params.gtf} > {output} 2> {log}
    """
29
30
31
32
shell:
    """
    multiqc -m htseq {params.prefix} --filename {output} 2> {log}
    """
12
13
14
15
16
17
shell:
    """
    grep "^>" {input.ref} | cut -d " " -f 1 > {output.decoy}
    sed -i.bak -e 's/>//g' {output.decoy}
    cat {input.tcp} {input.ref} > {output.gent}
    """
30
31
32
33
shell:
    """
    salmon index -p {threads} -t {input.gent} -d {input.decoy} -i {output} &> {log}
    """
49
50
51
52
shell:
    """
    salmon quant -i {input.index} -l A -1 {input.r1} -2 {input.r2} -o {output} --validateMappings --gcBias --seqBias --writeUnmappedNames --writeMappings={output.mappings} -p {threads} --numBootstraps 100
    """
61
62
wrapper:
    "0.49.0/bio/multiqc"
73
74
75
76
shell:
    '''
    gffread -w {output} -g {input.ref} {input.gtf}
    '''
91
92
93
94
shell:
    '''
    salmon quant -t {input.tcp} -l A -a {input.bam} -o {output} --gcBias --seqBias --writeUnmappedNames -p {threads} -g {input.gtf} --numBootstraps 100
    '''
106
107
wrapper:
    "0.47.0/bio/multiqc"
11
12
shell:
    "kallisto index -i {output} {input} 2> {log}"
29
30
31
shell:
    "kallisto quant --threads {threads} -i {input.idx} -o {output} --gtf {input.gtf} "
    "{params.extra} {input.r1} {input.r2} 2> {log}"
40
41
wrapper:
    "0.49.0/bio/multiqc"
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
import gffutils

db = gffutils.create_db(snakemake.input[0],
                        dbfn=snakemake.output.db,
                        force=True,
                        keep_order=True,
                        merge_strategy='merge',
                        sort_attribute_values=True,
                        disable_infer_genes=True,
                        disable_infer_transcripts=True)

with open(snakemake.output.bed, 'w') as outfileobj:
    for tx in db.features_of_type('transcript', order_by='start'):
        bed = [s.strip() for s in db.bed12(tx).split('\t')]
        bed[3] = tx.id
        outfileobj.write('{}\n'.format('\t'.join(bed)))
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
__author__ = "Kerrin Mendler"
__copyright__ = "Copyright 2018, Kerrin Mendler"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
import os.path


log = snakemake.log_fmt_shell()

# Check that two input files were supplied
n = len(snakemake.input)
assert n == 2, "Input must contain 2 files. Given: %r." % n

# Don't run with `--fastqc` flag
if "--fastqc" in snakemake.params.get("extra", ""):
    raise ValueError("The trim_galore Snakemake wrapper cannot "
                       "be run with the `--fastqc` flag. Please "
                       "remove the flag from extra params. "
                       "You can use the fastqc Snakemake wrapper on "
                       "the input and output files instead.")

# Check that four output files were supplied
m = len(snakemake.output)
assert m == 4, "Output must contain 4 files. Given: %r." % m

# Check that all output files are in the same directory
out_dir = os.path.dirname(snakemake.output[0])
for file_path in snakemake.output[1:]:
    assert out_dir == os.path.dirname(file_path), \
        "trim_galore can only output files to a single directory." \
        " Please indicate only one directory for the output files."

shell(
    "(trim_galore"
    " {snakemake.params.extra}"
    " --paired"
    " -o {out_dir}"
    " {snakemake.input})"
    " {log}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = (
    [snakemake.input.fq1]
    if isinstance(snakemake.input.fq1, str)
    else snakemake.input.fq1
)
fq2 = snakemake.input.get("fq2")
if fq2:
    fq2 = (
        [snakemake.input.fq2]
        if isinstance(snakemake.input.fq2, str)
        else snakemake.input.fq2
    )
    assert len(fq1) == len(
        fq2
    ), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str = " ".join([input_str_fq1, input_str_fq2])

if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand zcat"
else:
    readcmd = ""

outprefix = os.path.dirname(snakemake.output[0]) + "/"

shell(
    "STAR "
    "{extra} "
    "--runThreadN {snakemake.threads} "
    "--genomeDir {snakemake.params.index} "
    "--readFilesIn {input_str} "
    "{readcmd} "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "{log}"
)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake.utils import makedirs

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

extra = snakemake.params.get("extra", "")
sjdb_overhang = snakemake.params.get("sjdbOverhang", "100")

gtf = snakemake.input.get("gtf")
if gtf is not None:
    gtf = "--sjdbGTFfile " + gtf
    sjdb_overhang = "--sjdbOverhang " + sjdb_overhang
else:
    gtf = sjdb_overhang = ""

makedirs(snakemake.output)

shell(
    "STAR "  # Tool
    "--runMode genomeGenerate "  # Indexation mode
    "{extra} "  # Optional parameters
    "--runThreadN {snakemake.threads} "  # Number of threads
    "--genomeDir {snakemake.output} "  # Path to output
    "--genomeFastaFiles {snakemake.input.fasta} "  # Path to fasta files
    "{sjdb_overhang} "  # Read-len - 1
    "{gtf} "  # Highly recommended GTF
    "{log}"  # Logging
)
ShowHide 41 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/akcorut/SnakeRNASeq
Name: snakernaseq
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...