muChip Analysis Workflow: Trackhub Creation and Sample Visualization


A Snakemake workflow for analyzing muChip experiments. It creates a trackhub with called peaks and domains, as well as a fragment pileup and a control track for each sample. It also creates average, TSS, and heat plots for each sample.

Usage

If you use this workflow in a paper, please give credit to the authors by citing the URL of this (original) repository and, if available, its DOI.

Step 1: Obtain a copy of this workflow

  1. Create a new GitHub repository using this workflow as a template.

  2. Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

Step 2: Configure workflow

Configure the workflow according to your needs by editing the files in the config/ folder: adjust config.yaml to configure the workflow execution, and samples.tsv to specify your sample setup. See config/samples.tsv for an example using local files and config/public_samples.tsv for an example using SRA accessions.
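For illustration, a minimal samples.tsv might look like the sketch below. The column names shown (sample, species, celltype, condition) are inferred from the wildcards used in the rules; treat the bundled example files as the authoritative format:

sample   species  celltype  condition
SampleA  hg38     HeLa      treated
SampleB  hg38     HeLa      control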

Step 3: Get the read files

Move any local read files into the results/reads folder. File names should end with _R1_001.fastq.gz (_R2_001.fastq.gz for the second read of paired-end samples).
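For example, a hypothetical paired-end sample named SampleA would be placed as:

results/reads/SampleA_R1_001.fastq.gz
results/reads/SampleA_R2_001.fastq.gz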

Step 4: Install Snakemake

Install Snakemake using conda:

conda create -c bioconda -c conda-forge -n snakemake snakemake

Step 5: Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores, or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100
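Scheduler-specific resource requests can be embedded in the --cluster string; for example (the qsub flags shown are illustrative and depend on your site's setup):

snakemake --use-conda --jobs 100 --cluster "qsub -l nodes=1:ppn={threads}"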
 

Step 6: Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

This report can, for example, be forwarded to your collaborators.

A trackhub folder is created in results/trackhub/, which can be used to view the tracks at https://genome-euro.ucsc.edu/cgi-bin/hgHubConnect
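Note that UCSC loads hubs over HTTP(S), so results/trackhub/ must be reachable by URL from UCSC's servers. A minimal sketch for serving it yourself (assuming the generated hub.txt sits at the top of the folder and the host is publicly reachable):

cd results/trackhub
python3 -m http.server 8000
# then register http://<your-host>:8000/hub.txt on the hub connect page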

Figures are placed in results/plots/joined/.

Code Snippets

shell:
    "wget https://hgdownload.soe.ucsc.edu/goldenPath/{wildcards.species}/database/chromInfo.txt.gz -O {output}"
shell:
    "z%s {input} > {output}" % chromosome_grep

rule download_genes:
    output:
        "results/{species}/data/refGene.txt.gz"
    shell:
        "wget https://hgdownload.soe.ucsc.edu/goldenPath/{wildcards.species}/database/refGene.txt.gz -O {output}"
shell:
    """z%s {input} | awk '{{OFS="\t"}}{{print $3, $5, $6, ".", ".", $4}}' | uniq > {output}""" % chromosome_grep
shell:
    """awk '{{OFS="\t"}}{{if ($6=="+") {{print $1,$2,$2+1,".",".","+"}} else {{print $1, $3-1, $3,".",".","-"}}}}' {input} | uniq > {output}"""
shell:
    "wget https://genome-idx.s3.amazonaws.com/bt/{wildcards.species}.zip -O {output} -o {log}"
shell:
    "unzip {input} -d {wildcards.path}"
wrapper:
    "0.67.0/bio/fastqc"
shell:
    "macs2 callpeak -t {input.treatment} -c {input.control} -g {params.gs} --bdg --broad --outdir results/{wildcards.species}/se_broadpeakcalling -n {wildcards.celltype}_{wildcards.condition} > {log}"
shell:
    "macs2 callpeak -t {input.treatment} -c {input.control} -g {params.gs} --bdg --broad --outdir results/{wildcards.species}/pe_broadpeakcalling -n {wildcards.celltype}_{wildcards.condition} -f BAMPE > {log}"
shell:
    "bedtools merge -d 5000 -i {input} > {output}"
wrapper:
    "0.67.0/bio/trimmomatic/pe"
wrapper:
    "0.49.0/bio/bwa/mem"
wrapper:
    "0.67.0/bio/bowtie2/align"
wrapper:
    "0.67.0/bio/samtools/sort"
wrapper:
    "0.64.0/bio/picard/markduplicates"
wrapper:
    "0.50.4/bio/samtools/view"
shell:
    """samtools view -h {input} | awk '($9>=0 ? $9 : -$9) > 50 || $1 ~ /^@/' | samtools view -bS - > {output}"""
shell:
    "samtools collate -Ou {input} | bedtools bamtobed -i - | chiptools pairbed | bedtools sort > {output}"
shell:
    "awk '{{if (($3-$2)>50) print}}' {input} > {output}"
wrapper:
    "0.64.0/bio/bedtools/genomecov"
shell:
    "bdgplot heat {input.bedgraph} {input.regions} -o {output[0]} -od {output[1]}"
shell:
    "bdgplot tss {input.bedgraph} {input.regions} -o {output[0]} -od {output[1]}"
shell:
    "bdgplot average {input.bedgraph} {input.regions} -o {output[0]} -od {output[1]}"
shell:
    "bdgtools joinfigs {wildcards.plottype} {input} -o {output} --name {wildcards.comparisongroup}"
wrapper:
    "0.67.0/bio/trimmomatic/se"
wrapper:
    "0.49.0/bio/bwa/mem"
wrapper:
    "0.67.0/bio/bowtie2/align"
wrapper:
    "0.67.0/bio/samtools/sort"
wrapper:
    "0.50.4/bio/samtools/view"
wrapper:
    "0.64.0/bio/picard/markduplicates"
shell:
    "bedtools bamtobed -i {input} | gzip > {output}"
shell:
    "zcat {input} | bedtools sort | gzip > {output}"
shell:
    "prefetch {wildcards.sra} -o {output}"
(from line 11 of rules/sra.smk)
shell:
    "fastq-dump --split-files --gzip {input} -O results/dumped_fastq/"
(from line 22 of rules/sra.smk)
shell:
    "fastq-dump --gzip {input} -O results/dumped_fastq/"
(from line 32 of rules/sra.smk)
shell:
    "mv {input} {output}"
(from line 40 of rules/sra.smk)
shell:
    "mv {input} {output}"
(from line 48 of rules/sra.smk)
run:
    text = "\n".join(
        f"genome {species}\ntrackDb {species}/trackDb.txt\n"
        for species in set(samples["species"]))
    open(output[0], "w").write(text)
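With samples on, say, hg38 and mm10, the emitted genomes.txt reads:

genome hg38
trackDb hg38/trackDb.txt

genome mm10
trackDb mm10/trackDb.txt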
run:
    name = config.get("name", "mu-chip")
    mail = config.get("mail", "[email protected]")
    open(output[0], "w").write(f"""hub {name}
script:
    "../scripts/trackhub.py"
shell:
    "LC_COLLATE=C sort -k1,1 -k2,2n {input} -T {params.tmp} > {output.bdg}"
wrapper:
    "0.50.3/bio/ucsc/bedGraphToBigWig"
shell:
    """awk '{{OFS="\t"}}{{$5=($5>1000?1000:$5);print}}' {input} > {output}"""
shell:
    "bedToBigBed -type=bed6+3 {input.peaks} {input.sizes} {output}"
shell:
    "bedToBigBed {input.domains} {input.sizes} {output}"
from pathlib import PurePath
import colorsys

import trackhub

def get_color(i, n_colors):
    h = i / float(n_colors)
    s = 0.5
    color = [255 * v for v in colorsys.hsv_to_rgb(h, s, 0.5)]
    return ",".join(str(int(c)) for c in color)

def histone_track(name, color="200,0,0"):
    composite = trackhub.CompositeTrack(
        name=name+"_composite",
        short_label=name,
        tracktype='bigWig',
        visibility='full',
        color=color
    )
    signal_view = trackhub.ViewTrack(
        name=name+"_signal",
        view='signal',
        visibility='full',
        tracktype='bigWig',
        short_label=name+'_signal',
        autoScale="on"
    )
    regions_view = trackhub.ViewTrack(
        name=name+'_region',
        view='regions',
        visibility='dense',
        tracktype='bigBed',
        short_label=name+'_regions')

    composite.add_view(signal_view)
    composite.add_view(regions_view)

    for signal_type in ["treat_pileup", "control_lambda"]:
        track = trackhub.Track(
            tracktype='bigWig',
            name=name+"_"+signal_type,
            url="%s_%s.bw" % (name, signal_type),
            short_label=signal_type,
            autoScale="on"
        )
        signal_view.add_tracks(track)

    for region_type in ["domains"]:
        track = trackhub.Track(
            name=name+"_"+region_type,
            url="%s_%s.bb" %(name, region_type),
            short_label=region_type,
            tracktype='bigBed')
        regions_view.add_tracks(track)
    return composite

names = [PurePath(name).stem.replace("_domains", "") for name in snakemake.input.domains]
colors = [get_color(i, len(names)) for i in range(len(names))]
hub, genomes_file, genome, trackdb = trackhub.default_hub(
    hub_name="testing",
    genome="hg38",
    email="[email protected]")
trackdb.add_tracks([histone_track(*pair) for pair in zip(names, colors)])
open(snakemake.output[0], "w").write(str(trackdb))
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "[email protected], [email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


# Extract arguments.
extra = snakemake.params.get("extra", "")

sort = snakemake.params.get("sort", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or " "2 (paired-end) elements")

if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))

# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":

    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -"

elif sort == "samtools":

    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -"

    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"

    prefix = path.splitext(snakemake.output[0])[0]
    sort_extra += " -T " + prefix + ".tmp"

elif sort == "picard":

    # Sort alignments using picard SortSam.
    pipe_cmd = (
        "picard SortSam {sort_extra} INPUT=/dev/stdin"
        " OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}"
    )

else:
    raise ValueError("Unexpected value for params.sort ({})".format(sort))

shell(
    "(bwa mem"
    " -t {snakemake.threads}"
    " {extra}"
    " {snakemake.params.index}"
    " {snakemake.input.reads}"
    " | " + pipe_cmd + ") {log}"
)
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")

shell(
    "bedGraphToBigWig {extra}"
    " {snakemake.input.bedGraph} {snakemake.input.chromsizes}"
    " {snakemake.output} {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools view {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]}")
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

import os
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

genome = ""
input_file = ""

if (os.path.splitext(snakemake.input[0])[-1]) == ".bam":
    input_file = "-ibam " + snakemake.input[0]

if len(snakemake.input) > 1:
    if (os.path.splitext(snakemake.input[0])[-1]) == ".bed":
        input_file = "-i " + snakemake.input.get("bed")
        genome = "-g " + snakemake.input.get("ref")

shell(
    "(genomeCoverageBed"
    " {snakemake.params}"
    " {input_file}"
    " {genome}"
    " > {snakemake.output[0]}) {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "picard MarkDuplicates {snakemake.params} INPUT={snakemake.input} "
    "OUTPUT={snakemake.output.bam} METRICS_FILE={snakemake.output.metrics} "
    "{log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

n = len(snakemake.input.sample)
assert (
    n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."

if n == 1:
    reads = "-U {}".format(*snakemake.input.sample)
else:
    reads = "-1 {} -2 {}".format(*snakemake.input.sample)

shell(
    "(bowtie2 --threads {snakemake.threads} {snakemake.params.extra} "
    "-x {snakemake.params.index} {reads} "
    "| samtools view -Sbh -o {snakemake.output[0]} -) {log}"
)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)

    split_ind = 2 if base.endswith(".fastq.gz") else 1
    base = ".".join(base.split(".")[:-split_ind])

    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} --quiet -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log:q}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell


prefix = os.path.splitext(snakemake.output[0])[0]

# Samtools takes additional threads through its option -@
# One thread for samtools
# Other threads are *additional* threads passed to the argument -@
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools sort {snakemake.params} {threads} -o {snakemake.output[0]} "
    "-T {prefix} {snakemake.input[0]}"
)
__author__ = "Johannes Köster, Jorge Langa"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


# Distribute available threads between trimmomatic itself and any potential pigz instances
def distribute_threads(input_files, output_files, available_threads):
    gzipped_input_files = sum(1 for file in input_files if file.endswith(".gz"))
    gzipped_output_files = sum(1 for file in output_files if file.endswith(".gz"))
    potential_threads_per_process = available_threads // (
        1 + gzipped_input_files + gzipped_output_files
    )
    if potential_threads_per_process > 0:
        # decompressing pigz creates at most 4 threads
        pigz_input_threads = (
            min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0
        )
        pigz_output_threads = (
            (available_threads - pigz_input_threads * gzipped_input_files)
            // (1 + gzipped_output_files)
            if gzipped_output_files != 0
            else 0
        )
        trimmomatic_threads = (
            available_threads
            - pigz_input_threads * gzipped_input_files
            - pigz_output_threads * gzipped_output_files
        )
    else:
        # not enough threads for pigz
        pigz_input_threads = 0
        pigz_output_threads = 0
        trimmomatic_threads = available_threads
    return trimmomatic_threads, pigz_input_threads, pigz_output_threads


def compose_input_gz(filename, threads):
    if filename.endswith(".gz") and threads > 0:
        return "<(pigz -p {threads} --decompress --stdout {filename})".format(
            threads=threads, filename=filename
        )
    return filename


def compose_output_gz(filename, threads, compression_level):
    if filename.endswith(".gz") and threads > 0:
        return ">(pigz -p {threads} {compression_level} > {filename})".format(
            threads=threads, compression_level=compression_level, filename=filename
        )
    return filename


extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
compression_level = snakemake.params.get("compression_level", "-5")
trimmer = " ".join(snakemake.params.trimmer)

# Distribute threads
input_files = [snakemake.input.r1, snakemake.input.r2]
output_files = [
    snakemake.output.r1,
    snakemake.output.r1_unpaired,
    snakemake.output.r2,
    snakemake.output.r2_unpaired,
]

trimmomatic_threads, input_threads, output_threads = distribute_threads(
    input_files, output_files, snakemake.threads
)

input_r1, input_r2 = [
    compose_input_gz(filename, input_threads) for filename in input_files
]

output_r1, output_r1_unp, output_r2, output_r2_unp = [
    compose_output_gz(filename, output_threads, compression_level)
    for filename in output_files
]

shell(
    "trimmomatic PE -threads {trimmomatic_threads} {extra} "
    "{input_r1} {input_r2} "
    "{output_r1} {output_r1_unp} "
    "{output_r2} {output_r2_unp} "
    "{trimmer} "
    "{log}"
)
__author__ = "Johannes Köster, Jorge Langa"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


# Distribute available threads between trimmomatic itself and any potential pigz instances
def distribute_threads(input_file, output_file, available_threads):
    gzipped_input_files = 1 if input_file.endswith(".gz") else 0
    gzipped_output_files = 1 if output_file.endswith(".gz") else 0
    potential_threads_per_process = available_threads // (
        1 + gzipped_input_files + gzipped_output_files
    )
    if potential_threads_per_process > 0:
        # decompressing pigz creates at most 4 threads
        pigz_input_threads = (
            min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0
        )
        pigz_output_threads = (
            (available_threads - pigz_input_threads * gzipped_input_files)
            // (1 + gzipped_output_files)
            if gzipped_output_files != 0
            else 0
        )
        trimmomatic_threads = (
            available_threads
            - pigz_input_threads * gzipped_input_files
            - pigz_output_threads * gzipped_output_files
        )
    else:
        # not enough threads for pigz
        pigz_input_threads = 0
        pigz_output_threads = 0
        trimmomatic_threads = available_threads
    return trimmomatic_threads, pigz_input_threads, pigz_output_threads


def compose_input_gz(filename, threads):
    if filename.endswith(".gz") and threads > 0:
        return "<(pigz -p {threads} --decompress --stdout {filename})".format(
            threads=threads, filename=filename
        )
    return filename


def compose_output_gz(filename, threads, compression_level):
    if filename.endswith(".gz") and threads > 0:
        return ">(pigz -p {threads} {compression_level} > {filename})".format(
            threads=threads, compression_level=compression_level, filename=filename
        )
    return filename


extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
compression_level = snakemake.params.get("compression_level", "-5")
trimmer = " ".join(snakemake.params.trimmer)

# Distribute threads
trimmomatic_threads, input_threads, output_threads = distribute_threads(
    snakemake.input[0], snakemake.output[0], snakemake.threads
)

# Collect files
input = compose_input_gz(snakemake.input[0], input_threads)
output = compose_output_gz(snakemake.output[0], output_threads, compression_level)

shell(
    "trimmomatic SE -threads {trimmomatic_threads} {extra} {input} {output} {trimmer} {log}"
)


Name: mu-chip
URL: https://github.com/knutdrand/mu-chip
Version: 1
License: MIT License