muChip Analysis Workflow: Trackhub Creation and Sample Visualization


A Snakemake workflow for analyzing muChip experiments. It creates a trackhub with called peaks and domains, as well as a fragment pileup and a control track for each sample. It also creates average, TSS, and heat plots for each sample.

Usage

If you use this workflow in a paper, please give credit to the authors by citing the URL of this (original) repository and, if available, its DOI.

Step 1: Obtain a copy of this workflow

  1. Create a new GitHub repository using this workflow as a template.

  2. Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

Step 2: Configure workflow

Configure the workflow according to your needs by editing the files in the config/ folder: adjust config.yaml to configure the workflow execution, and samples.tsv to specify your sample setup. See config/samples.tsv for an example using local files and config/public_samples.tsv for an example using SRA accessions.
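For illustration, a minimal samples.tsv might look like the sketch below. The column names shown (sample, species, celltype, condition) are inferred from the wildcards used in the rules; treat the bundled example files as the authoritative format:

sample   species  celltype  condition
SampleA  hg38     HeLa      treated
SampleB  hg38     HeLa      control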

Step 3: Get the read files

Move any local read files into the results/reads folder. File names should end with _R1_001.fastq.gz (_R2_001.fastq.gz for the second read of paired-end samples).
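For example, a hypothetical paired-end sample named SampleA would be placed as:

results/reads/SampleA_R1_001.fastq.gz
results/reads/SampleA_R2_001.fastq.gz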

Step 4: Install Snakemake

Install Snakemake using conda:

conda create -c bioconda -c conda-forge -n snakemake snakemake

Step 5: Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores, or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100
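Scheduler-specific resource requests can be embedded in the --cluster string; for example (the qsub flags shown are illustrative and depend on your site's setup):

snakemake --use-conda --jobs 100 --cluster "qsub -l nodes=1:ppn={threads}"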
 

Step 6: Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

This report can, for example, be forwarded to your collaborators.

A trackhub folder is created in results/trackhub/, which can be used to view the tracks at https://genome-euro.ucsc.edu/cgi-bin/hgHubConnect
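Note that UCSC loads hubs over HTTP(S), so results/trackhub/ must be reachable by URL from UCSC's servers. A minimal sketch for serving it yourself (assuming the generated hub.txt sits at the top of the folder and the host is publicly reachable):

cd results/trackhub
python3 -m http.server 8000
# then register http://<your-host>:8000/hub.txt on the hub connect page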

Figures are placed in results/plots/joined/.

Code Snippets

shell:
    "wget https://hgdownload.soe.ucsc.edu/goldenPath/{wildcards.species}/database/chromInfo.txt.gz -O {output}"
shell:
    "z%s {input} > {output}" % chromosome_grep

rule download_genes:
    output:
        "results/{species}/data/refGene.txt.gz"
    shell:
        "wget https://hgdownload.soe.ucsc.edu/goldenPath/{wildcards.species}/database/refGene.txt.gz -O {output}"
shell:
    """z%s {input} | awk '{{OFS="\t"}}{{print $3, $5, $6, ".", ".", $4}}' | uniq > {output}""" % chromosome_grep
shell:
    """awk '{{OFS="\t"}}{{if ($6=="+") {{print $1,$2,$2+1,".",".","+"}} else {{print $1, $3-1, $3,".",".","-"}}}}' {input} | uniq > {output}"""
shell:
    "wget https://genome-idx.s3.amazonaws.com/bt/{wildcards.species}.zip -O {output} -o {log}"
shell:
    "unzip {input} -d {wildcards.path}"
wrapper:
    "0.67.0/bio/fastqc"
shell:
    "macs2 callpeak -t {input.treatment} -c {input.control} -g {params.gs} --bdg --broad --outdir results/{wildcards.species}/se_broadpeakcalling -n {wildcards.celltype}_{wildcards.condition} > {log}"
shell:
    "macs2 callpeak -t {input.treatment} -c {input.control} -g {params.gs} --bdg --broad --outdir results/{wildcards.species}/pe_broadpeakcalling -n {wildcards.celltype}_{wildcards.condition} -f BAMPE > {log}"
shell:
    "bedtools merge -d 5000 -i {input} > {output}"
wrapper:
    "0.67.0/bio/trimmomatic/pe"
wrapper:
    "0.49.0/bio/bwa/mem"
wrapper:
    "0.67.0/bio/bowtie2/align"
wrapper:
    "0.67.0/bio/samtools/sort"
wrapper:
    "0.64.0/bio/picard/markduplicates"
wrapper:
    "0.50.4/bio/samtools/view"
shell:
    """samtools view -h {input} | awk '($9>=0 ? $9 : -$9) > 50 || $1 ~ /^@/' | samtools view -bS - > {output}"""
shell:
    "samtools collate -Ou {input} | bedtools bamtobed -i - | chiptools pairbed | bedtools sort > {output}"
shell:
    "awk '{{if (($3-$2)>50) print}}' {input} > {output}"
wrapper:
    "0.64.0/bio/bedtools/genomecov"
shell:
    "bdgplot heat {input.bedgraph} {input.regions} -o {output[0]} -od {output[1]}"
shell:
    "bdgplot tss {input.bedgraph} {input.regions} -o {output[0]} -od {output[1]}"
shell:
    "bdgplot average {input.bedgraph} {input.regions} -o {output[0]} -od {output[1]}"
shell:
    "bdgtools joinfigs {wildcards.plottype} {input} -o {output} --name {wildcards.comparisongroup}"
wrapper:
    "0.67.0/bio/trimmomatic/se"
wrapper:
    "0.49.0/bio/bwa/mem"
wrapper:
    "0.67.0/bio/bowtie2/align"
wrapper:
    "0.67.0/bio/samtools/sort"
wrapper:
    "0.50.4/bio/samtools/view"
wrapper:
    "0.64.0/bio/picard/markduplicates"
shell:
    "bedtools bamtobed -i {input} | gzip > {output}"
shell:
    "zcat {input} | bedtools sort | gzip > {output}"
shell:
    "prefetch {wildcards.sra} -o {output}"
(from line 11 of rules/sra.smk)
shell:
    "fastq-dump --split-files --gzip {input} -O results/dumped_fastq/"
(from line 22 of rules/sra.smk)
shell:
    "fastq-dump --gzip {input} -O results/dumped_fastq/"
(from line 32 of rules/sra.smk)
shell:
    "mv {input} {output}"
(from line 40 of rules/sra.smk)
shell:
    "mv {input} {output}"
(from line 48 of rules/sra.smk)
run:
    text = "\n".join(
        f"genome {species}\ntrackDb {species}/trackDb.txt\n"
        for species in set(samples["species"]))
    open(output[0], "w").write(text)
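With samples on, say, hg38 and mm10, the emitted genomes.txt reads:

genome hg38
trackDb hg38/trackDb.txt

genome mm10
trackDb mm10/trackDb.txt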
run:
    name = config.get("name", "mu-chip")
    mail = config.get("mail", "[email protected]")
    open(output[0], "w").write(f"""hub {name}
script:
    "../scripts/trackhub.py"
shell:
    "LC_COLLATE=C sort -k1,1 -k2,2n {input} -T {params.tmp} > {output.bdg}"
wrapper:
    "0.50.3/bio/ucsc/bedGraphToBigWig"
shell:
    """awk '{{OFS="\t"}}{{$5=($5>1000?1000:$5);print}}' {input} > {output}"""
shell:
    "bedToBigBed -type=bed6+3 {input.peaks} {input.sizes} {output}"
shell:
    "bedToBigBed {input.domains} {input.sizes} {output}"
from pathlib import PurePath
import colorsys

import trackhub

def get_color(i, n_colors):
    h = i / float(n_colors)
    s = 0.5
    color = [255 * v for v in colorsys.hsv_to_rgb(h, s, 0.5)]
    return ",".join(str(int(c)) for c in color)

def histone_track(name, color="200,0,0"):
    composite = trackhub.CompositeTrack(
        name=name+"_composite",
        short_label=name,
        tracktype='bigWig',
        visibility='full',
        color=color
    )
    signal_view = trackhub.ViewTrack(
        name=name+"_signal",
        view='signal',
        visibility='full',
        tracktype='bigWig',
        short_label=name+'_signal',
        autoScale="on"
    )
    regions_view = trackhub.ViewTrack(
        name=name+'_region',
        view='regions',
        visibility='dense',
        tracktype='bigBed',
        short_label=name+'_regions')

    composite.add_view(signal_view)
    composite.add_view(regions_view)

    for signal_type in ["treat_pileup", "control_lambda"]:
        track = trackhub.Track(
            tracktype='bigWig',
            name=name+"_"+signal_type,
            url="%s_%s.bw" % (name, signal_type),
            short_label=signal_type,
            autoScale="on"
        )
        signal_view.add_tracks(track)

    for region_type in ["domains"]:
        track = trackhub.Track(
            name=name+"_"+region_type,
            url="%s_%s.bb" %(name, region_type),
            short_label=region_type,
            tracktype='bigBed')
        regions_view.add_tracks(track)
    return composite

names = [PurePath(name).stem.replace("_domains", "") for name in snakemake.input.domains]
colors = [get_color(i, len(names)) for i in range(len(names))]
hub, genomes_file, genome, trackdb = trackhub.default_hub(
    hub_name="testing",
    genome="hg38",
    email="[email protected]")
trackdb.add_tracks([histone_track(*pair) for pair in zip(names, colors)])
open(snakemake.output[0], "w").write(str(trackdb))
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "[email protected], [email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


# Extract arguments.
extra = snakemake.params.get("extra", "")

sort = snakemake.params.get("sort", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or " "2 (paired-end) elements")

if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))

# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":

    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -"

elif sort == "samtools":

    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -"

    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"

    prefix = path.splitext(snakemake.output[0])[0]
    sort_extra += " -T " + prefix + ".tmp"

elif sort == "picard":

    # Sort alignments using picard SortSam.
    pipe_cmd = (
        "picard SortSam {sort_extra} INPUT=/dev/stdin"
        " OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}"
    )

else:
    raise ValueError("Unexpected value for params.sort ({})".format(sort))

shell(
    "(bwa mem"
    " -t {snakemake.threads}"
    " {extra}"
    " {snakemake.params.index}"
    " {snakemake.input.reads}"
    " | " + pipe_cmd + ") {log}"
)
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")

shell(
    "bedGraphToBigWig {extra}"
    " {snakemake.input.bedGraph} {snakemake.input.chromsizes}"
    " {snakemake.output} {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools view {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]}")
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

import os
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

genome = ""
input_file = ""

if (os.path.splitext(snakemake.input[0])[-1]) == ".bam":
    input_file = "-ibam " + snakemake.input[0]

if len(snakemake.input) > 1:
    if (os.path.splitext(snakemake.input[0])[-1]) == ".bed":
        input_file = "-i " + snakemake.input.get("bed")
        genome = "-g " + snakemake.input.get("ref")

shell(
    "(genomeCoverageBed"
    " {snakemake.params}"
    " {input_file}"
    " {genome}"
    " > {snakemake.output[0]}) {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "picard MarkDuplicates {snakemake.params} INPUT={snakemake.input} "
    "OUTPUT={snakemake.output.bam} METRICS_FILE={snakemake.output.metrics} "
    "{log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

n = len(snakemake.input.sample)
assert (
    n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."

if n == 1:
    reads = "-U {}".format(*snakemake.input.sample)
else:
    reads = "-1 {} -2 {}".format(*snakemake.input.sample)

shell(
    "(bowtie2 --threads {snakemake.threads} {snakemake.params.extra} "
    "-x {snakemake.params.index} {reads} "
    "| samtools view -Sbh -o {snakemake.output[0]} -) {log}"
)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)

    split_ind = 2 if base.endswith(".fastq.gz") else 1
    base = ".".join(base.split(".")[:-split_ind])

    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} --quiet -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log:q}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell


prefix = os.path.splitext(snakemake.output[0])[0]

# Samtools takes additional threads through its option -@
# One thread for samtools
# Other threads are *additional* threads passed to the argument -@
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools sort {snakemake.params} {threads} -o {snakemake.output[0]} "
    "-T {prefix} {snakemake.input[0]}"
)
__author__ = "Johannes Köster, Jorge Langa"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


# Distribute available threads between trimmomatic itself and any potential pigz instances
def distribute_threads(input_files, output_files, available_threads):
    gzipped_input_files = sum(1 for file in input_files if file.endswith(".gz"))
    gzipped_output_files = sum(1 for file in output_files if file.endswith(".gz"))
    potential_threads_per_process = available_threads // (
        1 + gzipped_input_files + gzipped_output_files
    )
    if potential_threads_per_process > 0:
        # decompressing pigz creates at most 4 threads
        pigz_input_threads = (
            min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0
        )
        pigz_output_threads = (
            (available_threads - pigz_input_threads * gzipped_input_files)
            // (1 + gzipped_output_files)
            if gzipped_output_files != 0
            else 0
        )
        trimmomatic_threads = (
            available_threads
            - pigz_input_threads * gzipped_input_files
            - pigz_output_threads * gzipped_output_files
        )
    else:
        # not enough threads for pigz
        pigz_input_threads = 0
        pigz_output_threads = 0
        trimmomatic_threads = available_threads
    return trimmomatic_threads, pigz_input_threads, pigz_output_threads


def compose_input_gz(filename, threads):
    if filename.endswith(".gz") and threads > 0:
        return "<(pigz -p {threads} --decompress --stdout {filename})".format(
            threads=threads, filename=filename
        )
    return filename


def compose_output_gz(filename, threads, compression_level):
    if filename.endswith(".gz") and threads > 0:
        return ">(pigz -p {threads} {compression_level} > {filename})".format(
            threads=threads, compression_level=compression_level, filename=filename
        )
    return filename


extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
compression_level = snakemake.params.get("compression_level", "-5")
trimmer = " ".join(snakemake.params.trimmer)

# Distribute threads
input_files = [snakemake.input.r1, snakemake.input.r2]
output_files = [
    snakemake.output.r1,
    snakemake.output.r1_unpaired,
    snakemake.output.r2,
    snakemake.output.r2_unpaired,
]

trimmomatic_threads, input_threads, output_threads = distribute_threads(
    input_files, output_files, snakemake.threads
)

input_r1, input_r2 = [
    compose_input_gz(filename, input_threads) for filename in input_files
]

output_r1, output_r1_unp, output_r2, output_r2_unp = [
    compose_output_gz(filename, output_threads, compression_level)
    for filename in output_files
]

shell(
    "trimmomatic PE -threads {trimmomatic_threads} {extra} "
    "{input_r1} {input_r2} "
    "{output_r1} {output_r1_unp} "
    "{output_r2} {output_r2_unp} "
    "{trimmer} "
    "{log}"
)
__author__ = "Johannes Köster, Jorge Langa"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


# Distribute available threads between trimmomatic itself and any potential pigz instances
def distribute_threads(input_file, output_file, available_threads):
    gzipped_input_files = 1 if input_file.endswith(".gz") else 0
    gzipped_output_files = 1 if output_file.endswith(".gz") else 0
    potential_threads_per_process = available_threads // (
        1 + gzipped_input_files + gzipped_output_files
    )
    if potential_threads_per_process > 0:
        # decompressing pigz creates at most 4 threads
        pigz_input_threads = (
            min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0
        )
        pigz_output_threads = (
            (available_threads - pigz_input_threads * gzipped_input_files)
            // (1 + gzipped_output_files)
            if gzipped_output_files != 0
            else 0
        )
        trimmomatic_threads = (
            available_threads
            - pigz_input_threads * gzipped_input_files
            - pigz_output_threads * gzipped_output_files
        )
    else:
        # not enough threads for pigz
        pigz_input_threads = 0
        pigz_output_threads = 0
        trimmomatic_threads = available_threads
    return trimmomatic_threads, pigz_input_threads, pigz_output_threads


def compose_input_gz(filename, threads):
    if filename.endswith(".gz") and threads > 0:
        return "<(pigz -p {threads} --decompress --stdout {filename})".format(
            threads=threads, filename=filename
        )
    return filename


def compose_output_gz(filename, threads, compression_level):
    if filename.endswith(".gz") and threads > 0:
        return ">(pigz -p {threads} {compression_level} > {filename})".format(
            threads=threads, compression_level=compression_level, filename=filename
        )
    return filename


extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
compression_level = snakemake.params.get("compression_level", "-5")
trimmer = " ".join(snakemake.params.trimmer)

# Distribute threads
trimmomatic_threads, input_threads, output_threads = distribute_threads(
    snakemake.input[0], snakemake.output[0], snakemake.threads
)

# Collect files
input = compose_input_gz(snakemake.input[0], input_threads)
output = compose_output_gz(snakemake.output[0], output_threads, compression_level)

shell(
    "trimmomatic SE -threads {trimmomatic_threads} {extra} {input} {output} {trimmer} {log}"
)


Name: mu-chip
URL: https://github.com/knutdrand/mu-chip
Version: 1
License: MIT License