ChIP-seq peak-calling, QC and differential analysis pipeline (Snakemake port of the Nextflow pipeline at https://nf-co.re/chipseq).


This workflow performs peak calling, quality control and differential peak analysis for ChIP-seq data. The code is organized into the respective folders, i.e. scripts, rules, and envs. The entry point of the workflow is defined in the Snakefile and the main configuration in the config.yaml file.

Authors

  • Antonie Vietor (@AntonieV)

Usage

If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).

Step 1: Obtain a copy of this workflow

  1. Create a new GitHub repository using this workflow as a template.

  2. Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

Step 2: Configure workflow

Configure the workflow according to your needs by editing the files in the config/ folder. Adjust config.yaml to configure the workflow execution, and samples.tsv to specify your sample setup.
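For orientation, a minimal samples.tsv could look like the sketch below (columns are tab-separated in the actual file). The sample, group and control columns are the ones read by the workflow's helper scripts (see col_mod_featurecounts.py under Code Snippets); the sample names are purely illustrative, and the schema shipped in the config/ folder is authoritative. A sample with an empty control field is treated as its own control:

sample      group   control
WT_REP1     WT      WT_INPUT
WT_REP2     WT      WT_INPUT
WT_INPUT    WT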

Step 3: Install Snakemake

Install Snakemake using conda:

conda create -c bioconda -c conda-forge -n snakemake snakemake

For installation details, see the instructions in the Snakemake documentation.

Step 4: Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

or

snakemake --use-conda --drmaa --jobs 100

If you not only want to fix the software stack but also the underlying OS, use

snakemake --use-conda --use-singularity

in combination with any of the modes above. See the Snakemake documentation for further details.

Step 5: Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here.

Step 6: Commit changes

Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:

git commit -a
git push

Step 7: Obtain updates from upstream

Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.

  1. Once, register the upstream repository in your local copy: git remote add -f upstream git@github.com:snakemake-workflows/chipseq.git, or git remote add -f upstream https://github.com/snakemake-workflows/chipseq.git if you have not set up SSH keys.

  2. Update the upstream version: git fetch upstream.

  3. Create a diff with the current version: git diff HEAD upstream/master workflow > upstream-changes.diff.

  4. Investigate the changes: vim upstream-changes.diff.

  5. Apply the modified diff via: git apply upstream-changes.diff.

  6. Carefully check whether you need to update the config files: git diff HEAD upstream/master config. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.

Step 8: Contribute back

In case you have also changed or added steps, please consider contributing them back to the original repository:

  1. Fork the original repo to a personal or lab account.

  2. Clone the fork to your local system, to a different place than where you ran your analysis.

  3. Copy the modified files from your analysis to the clone of your fork, e.g., cp -r workflow path/to/fork . Make sure to not accidentally copy config file contents or sample sheets. Instead, manually update the example config files if necessary.

  4. Commit and push your changes to your fork.

  5. Create a pull request against the original repository.

Testing

Test cases are in the subfolder .test. They are automatically executed via continuous integration with GitHub Actions.

Code Snippets

__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

region = snakemake.params.get("region")
region_param = ""

if region:
    region_param = ' -region "' + region + '"'

shell(
    "(bamtools filter"
    " -in {snakemake.input[0]}"
    " -out {snakemake.output[0]}"
    + region_param
    + " -script {snakemake.params.json}) {log}"
)
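A minimal rule sketch showing how this wrapper might be invoked. The paths and the JSON filter file are hypothetical; the wrapper version matches the one referenced in the rule fragments further down:

rule bamtools_filter_json:
    input:
        "results/bam/{sample}.sorted.bam"
    output:
        "results/bam/{sample}.filtered.bam"
    params:
        json="config/bamtools_filtering_rules.json",
        region="chr1"  # optional; omit to filter the whole file
    log:
        "logs/bamtools_filter/{sample}.log"
    wrapper:
        "0.64.0/bio/bamtools/filter_json"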
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

import os
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

genome = ""
input_file = ""

if os.path.splitext(snakemake.input[0])[-1] == ".bam":
    input_file = "-ibam " + snakemake.input[0]

if len(snakemake.input) > 1:
    if os.path.splitext(snakemake.input[0])[-1] == ".bed":
        input_file = "-i " + snakemake.input.get("bed")
        genome = "-g " + snakemake.input.get("ref")

shell(
    "(genomeCoverageBed"
    " {snakemake.params}"
    " {input_file}"
    " {genome}"
    " > {snakemake.output[0]}) {log}"
)
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2016, Patrik Smeds"
__email__ = "[email protected]"
__license__ = "MIT"

from os import path

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided!")
elif len(snakemake.input) > 1:
    raise ValueError("Only one reference genome can be given as input!")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", "")

if len(prefix) > 0:
    prefix = "-p " + prefix

# Construction algorithm that will be used to build the database, default is bwtsw
construction_algorithm = snakemake.params.get("algorithm", "")

if len(construction_algorithm) != 0:
    construction_algorithm = "-a " + construction_algorithm

shell(
    "bwa index {prefix} {construction_algorithm} {snakemake.input[0]} {log}"
)
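A hedged usage sketch for this wrapper; the genome path is illustrative, and the wrapper version is the one used by rules/ref.smk below:

rule bwa_index:
    input:
        "resources/genome.fasta"
    output:
        multiext("resources/genome.fasta", ".amb", ".ann", ".bwt", ".pac", ".sa")
    params:
        algorithm="bwtsw"
    log:
        "logs/bwa_index.log"
    wrapper:
        "0.64.0/bio/bwa/index"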
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "[email protected], [email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


# Extract arguments.
extra = snakemake.params.get("extra", "")

sort = snakemake.params.get("sort", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or " "2 (paired-end) elements")

if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))

# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":

    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view -Sbh -o {snakemake.output[0]} -"

elif sort == "samtools":

    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {sort_extra} -o {snakemake.output[0]} -"

    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"

    prefix = path.splitext(snakemake.output[0])[0]
    sort_extra += " -T " + prefix + ".tmp"

elif sort == "picard":

    # Sort alignments using picard SortSam.
    pipe_cmd = (
        "picard SortSam {sort_extra} INPUT=/dev/stdin"
        " OUTPUT={snakemake.output[0]} SORT_ORDER={sort_order}"
    )

else:
    raise ValueError("Unexpected value for params.sort ({})".format(sort))

shell(
    "(bwa mem"
    " -t {snakemake.threads}"
    " {extra}"
    " {snakemake.params.index}"
    " {snakemake.input.reads}"
    " | " + pipe_cmd + ") {log}"
)
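A sketch of a rule using this wrapper, with illustrative paths; note that params.index must point at the prefix of a previously built BWA index:

rule bwa_mem:
    input:
        reads=["results/trimmed/{sample}.fastq.gz"]
    output:
        "results/mapped/{sample}.sorted.bam"
    params:
        index="resources/genome.fasta",
        sort="samtools",
        sort_order="coordinate"
    threads: 8
    log:
        "logs/bwa_mem/{sample}.log"
    wrapper:
        "0.64.0/bio/bwa/mem"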
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


n = len(snakemake.input)
assert n == 2, "Input must contain 2 (paired-end) elements."

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "cutadapt"
    " {snakemake.params.adapters}"
    " {snakemake.params.others}"
    " -o {snakemake.output.fastq1}"
    " -p {snakemake.output.fastq2}"
    " -j {snakemake.threads}"
    " {snakemake.input}"
    " > {snakemake.output.qc} {log}"
)
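A possible invocation; the adapter sequences and paths are placeholders, not recommendations:

rule cutadapt_pe:
    input:
        ["reads/{sample}_R1.fastq.gz", "reads/{sample}_R2.fastq.gz"]
    output:
        fastq1="results/trimmed/{sample}_R1.fastq.gz",
        fastq2="results/trimmed/{sample}_R2.fastq.gz",
        qc="results/trimmed/{sample}.pe.qc.txt"
    params:
        adapters="-a AGATCGGAAGAGC -A AGATCGGAAGAGC",
        others="--minimum-length 1 -q 20"
    threads: 4
    log:
        "logs/cutadapt/{sample}.log"
    wrapper:
        "0.64.0/bio/cutadapt/pe"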
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "cutadapt"
    " {snakemake.params}"
    " -j {snakemake.threads}"
    " -o {snakemake.output.fastq}"
    " {snakemake.input[0]}"
    " > {snakemake.output.qc} {log}"
)
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

out_tab = snakemake.output.get("matrix_tab")
out_bed = snakemake.output.get("matrix_bed")

optional_output = ""

if out_tab:
    optional_output += " --outFileNameMatrix {out_tab} ".format(out_tab=out_tab)

if out_bed:
    optional_output += " --outFileSortedRegions {out_bed} ".format(out_bed=out_bed)

shell(
    "(computeMatrix "
    "{snakemake.params.command} "
    "{snakemake.params.extra} "
    "-R {snakemake.input.bed} "
    "-S {snakemake.input.bigwig} "
    "-o {snakemake.output.matrix_gz} "
    "{optional_output}) {log}"
)
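A usage sketch; the BED file, bigWig and matrix paths are hypothetical, and the optional matrix_tab output triggers --outFileNameMatrix as implemented above:

rule compute_matrix:
    input:
        bed="resources/genome.annotated.bed",
        bigwig="results/big_wig/{sample}.bigWig"
    output:
        matrix_gz="results/deeptools/{sample}.matrix.gz",
        matrix_tab="results/deeptools/{sample}.matrix.tab"
    params:
        command="scale-regions",
        extra="--regionBodyLength 1000"
    log:
        "logs/deeptools/compute_matrix_{sample}.log"
    wrapper:
        "0.64.0/bio/deeptools/computematrix"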
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

out_region = snakemake.output.get("regions")
out_matrix = snakemake.output.get("heatmap_matrix")

optional_output = ""

if out_region:
    optional_output += " --outFileSortedRegions {out_region} ".format(
        out_region=out_region
    )

if out_matrix:
    optional_output += " --outFileNameMatrix {out_matrix} ".format(
        out_matrix=out_matrix
    )

shell(
    "(plotHeatmap "
    "-m {snakemake.input[0]} "
    "-o {snakemake.output.heatmap_img} "
    "{optional_output} "
    "{snakemake.params}) {log}"
)
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

out_region = snakemake.output.get("regions")
out_data = snakemake.output.get("data")

optional_output = ""

if out_region:
    optional_output += " --outFileSortedRegions {out_region} ".format(
        out_region=out_region
    )

if out_data:
    optional_output += " --outFileNameData {out_data} ".format(out_data=out_data)

shell(
    "(plotProfile "
    "-m {snakemake.input[0]} "
    "-o {snakemake.output.plot_img} "
    "{optional_output} "
    "{snakemake.params}) {log}"
)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
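A minimal sketch; in the real workflow the input would collect all QC files, represented here by a hypothetical pair of FastQC archives:

rule multiqc:
    input:
        expand("results/qc/fastqc/{sample}_fastqc.zip", sample=["A", "B"])
    output:
        "results/qc/multiqc/multiqc.html"
    log:
        "logs/multiqc.log"
    wrapper:
        "0.64.0/bio/multiqc"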
__author__ = "David Laehnemann, Antonie Vietor"
__copyright__ = "Copyright 2020, David Laehnemann, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

import sys
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

res = snakemake.resources.get("mem_gb", 3)
if not res:
    res = 3

exts_to_prog = {
    ".alignment_summary_metrics": "CollectAlignmentSummaryMetrics",
    ".insert_size_metrics": "CollectInsertSizeMetrics",
    ".insert_size_histogram.pdf": "CollectInsertSizeMetrics",
    ".quality_distribution_metrics": "QualityScoreDistribution",
    ".quality_distribution.pdf": "QualityScoreDistribution",
    ".quality_by_cycle_metrics": "MeanQualityByCycle",
    ".quality_by_cycle.pdf": "MeanQualityByCycle",
    ".base_distribution_by_cycle_metrics": "CollectBaseDistributionByCycle",
    ".base_distribution_by_cycle.pdf": "CollectBaseDistributionByCycle",
    ".gc_bias.detail_metrics": "CollectGcBiasMetrics",
    ".gc_bias.summary_metrics": "CollectGcBiasMetrics",
    ".gc_bias.pdf": "CollectGcBiasMetrics",
    ".rna_metrics": "RnaSeqMetrics",
    ".bait_bias_detail_metrics": "CollectSequencingArtifactMetrics",
    ".bait_bias_summary_metrics": "CollectSequencingArtifactMetrics",
    ".error_summary_metrics": "CollectSequencingArtifactMetrics",
    ".pre_adapter_detail_metrics": "CollectSequencingArtifactMetrics",
    ".pre_adapter_summary_metrics": "CollectSequencingArtifactMetrics",
    ".quality_yield_metrics": "CollectQualityYieldMetrics",
}
progs = set()

for file in snakemake.output:
    matched = False
    for ext in exts_to_prog:
        if file.endswith(ext):
            progs.add(exts_to_prog[ext])
            matched = True
    if not matched:
        sys.exit(
            "Unknown type of metrics file requested, for possible metrics files, see https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/picard/collectmultiplemetrics.html"
        )

programs = " PROGRAM=" + " PROGRAM=".join(progs)

out = str(snakemake.wildcards.sample)  # as default
output_file = str(snakemake.output[0])
for ext in exts_to_prog:
    if output_file.endswith(ext):
        out = output_file[: -len(ext)]
        break

shell(
    "(picard -Xmx{res}g CollectMultipleMetrics "
    "I={snakemake.input.bam} "
    "O={out} "
    "R={snakemake.input.ref} "
    "{snakemake.params}{programs}) {log}"
)
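A hedged sketch of a matching rule; the requested metrics programs are selected purely through the output file extensions, as implemented above, and all paths are illustrative:

rule collect_multiple_metrics:
    input:
        bam="results/bam/{sample}.sorted.bam",
        ref="resources/genome.fasta"
    output:
        "results/qc/picard/{sample}.alignment_summary_metrics",
        "results/qc/picard/{sample}.insert_size_metrics",
        "results/qc/picard/{sample}.insert_size_histogram.pdf"
    params:
        "VALIDATION_STRINGENCY=LENIENT"
    resources:
        mem_gb=3
    log:
        "logs/picard/{sample}.multiple_metrics.log"
    wrapper:
        "0.64.0/bio/picard/collectmultiplemetrics"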
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "picard MarkDuplicates {snakemake.params} INPUT={snakemake.input} "
    "OUTPUT={snakemake.output.bam} METRICS_FILE={snakemake.output.metrics} "
    "{log}"
)
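A possible rule using this wrapper (paths illustrative):

rule mark_duplicates:
    input:
        "results/merged/{sample}.bam"
    output:
        bam="results/dedup/{sample}.bam",
        metrics="results/dedup/{sample}.metrics.txt"
    params:
        "REMOVE_DUPLICATES=true"
    log:
        "logs/picard/mark_duplicates/{sample}.log"
    wrapper:
        "0.64.0/bio/picard/markduplicates"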
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


inputs = " ".join("INPUT={}".format(in_) for in_ in snakemake.input)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "picard"
    " MergeSamFiles"
    " {snakemake.params}"
    " {inputs}"
    " OUTPUT={snakemake.output[0]}"
    " {log}"
)
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

import os
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

params = ""
# Pass the -bam flag for BAM input (unless the file name already contains "-bam").
if os.path.splitext(snakemake.input[0])[-1] == ".bam":
    if "-bam" not in snakemake.input[0]:
        params = "-bam "

shell(
    "(preseq lc_extrap {params} {snakemake.params} {snakemake.input[0]} -output {snakemake.output[0]}) {log}"
)
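A sketch of a matching rule; -v (verbose) stands in for whatever lc_extrap options a real run would need, and the paths are hypothetical:

rule preseq_lc_extrap:
    input:
        "results/bam/{sample}.sorted.bam"
    output:
        "results/preseq/{sample}.lc_extrap"
    params:
        "-v"
    log:
        "logs/preseq/{sample}.log"
    wrapper:
        "0.64.0/bio/preseq/lc_extrap"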
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import subprocess
import sys
from snakemake.shell import shell

species = snakemake.params.species.lower()
release = int(snakemake.params.release)
fmt = snakemake.params.fmt
build = snakemake.params.build
flavor = snakemake.params.get("flavor", "")

branch = ""
if release >= 81 and build == "GRCh37":
    # use the special grch37 branch for new releases
    branch = "grch37/"

if flavor:
    flavor += "."

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

suffix = ""
if fmt == "gtf":
    suffix = "gtf.gz"
elif fmt == "gff3":
    suffix = "gff3.gz"

url = "ftp://ftp.ensembl.org/pub/{branch}release-{release}/{fmt}/{species}/{species_cap}.{build}.{release}.{flavor}{suffix}".format(
    release=release,
    build=build,
    species=species,
    fmt=fmt,
    species_cap=species.capitalize(),
    suffix=suffix,
    flavor=flavor,
    branch=branch,
)

try:
    shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}")
except subprocess.CalledProcessError as e:
    if snakemake.log:
        sys.stderr = open(snakemake.log[0], "a")
    print(
        "Unable to download annotation data from Ensembl. "
        "Did you check that this combination of species, build, and release is actually provided?",
        file=sys.stderr,
    )
    exit(1)
__author__ = "Michael Chambers"
__copyright__ = "Copyright 2019, Michael Chambers"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools faidx {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]}")
__author__ = "Christopher Preusch"
__copyright__ = "Copyright 2017, Christopher Preusch"
__email__ = "cpreusch[at]ust.hk"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools flagstat {snakemake.input[0]} > {snakemake.output[0]}")
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell("samtools idxstats {snakemake.input.bam} > {snakemake.output[0]} {log}")
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell


prefix = os.path.splitext(snakemake.output[0])[0]

# Samtools takes additional threads through its option -@
# One thread for samtools
# Other threads are *additional* threads passed to the argument -@
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools sort {snakemake.params} {threads} -o {snakemake.output[0]} "
    "-T {prefix} {snakemake.input[0]}"
)
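A usage sketch (memory setting and paths are illustrative); remember that the wrapper passes threads - 1 to samtools via -@:

rule samtools_sort:
    input:
        "results/filtered/{sample}.bam"
    output:
        "results/sorted/{sample}.sorted.bam"
    params:
        "-m 4G"
    threads: 8
    wrapper:
        "0.64.0/bio/samtools/sort"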
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
region = snakemake.params.get("region", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


shell("samtools stats {extra} {snakemake.input} {region} > {snakemake.output} {log}")
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools view {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]}")
__author__ = "Roman Chernyatchik"
__copyright__ = "Copyright (c) 2019 JetBrains"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")

shell(
    "bedGraphToBigWig {extra}"
    " {snakemake.input.bedGraph} {snakemake.input.chromsizes}"
    " {snakemake.output} {log}"
)
__author__ = "Jan Forster"
__copyright__ = "Copyright 2019, Jan Forster"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

## Extract arguments
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "(bedtools intersect"
    " {extra}"
    " -a {snakemake.input.left}"
    " -b {snakemake.input.right}"
    " > {snakemake.output})"
    " {log}"
)
__author__ = "Jan Forster, Felix Mölder"
__copyright__ = "Copyright 2019, Jan Forster"
__email__ = "[email protected], [email protected]"
__license__ = "MIT"

from snakemake.shell import shell

## Extract arguments
extra = snakemake.params.get("extra", "")

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
if len(snakemake.input) > 1:
    if all(f.endswith(".gz") for f in snakemake.input):
        cat = "zcat"
    elif all(not f.endswith(".gz") for f in snakemake.input):
        cat = "cat"
    else:
        raise ValueError("Input files must be all compressed or uncompressed.")
    shell(
        "({cat} {snakemake.input} | "
        "sort -k1,1 -k2,2n | "
        "bedtools merge {extra} "
        "-i stdin > {snakemake.output}) "
        " {log}"
    )
else:
    shell(
        "( bedtools merge"
        " {extra}"
        " -i {snakemake.input}"
        " > {snakemake.output})"
        " {log}"
    )
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell
import re

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

jsd_sample = snakemake.input.get("jsd_sample")
out_counts = snakemake.output.get("counts")
out_metrics = snakemake.output.get("qc_metrics")
optional_output = ""
jsd = ""

if jsd_sample:
    jsd += " --JSDsample {jsd} ".format(jsd=jsd_sample)

if out_counts:
    optional_output += " --outRawCounts {out_counts} ".format(out_counts=out_counts)

if out_metrics:
    optional_output += " --outQualityMetrics {metrics} ".format(metrics=out_metrics)

shell(
    "(plotFingerprint "
    "-b {snakemake.input.bam_files} "
    "-o {snakemake.output.fingerprint} "
    "{optional_output} "
    "--numberOfProcessors {snakemake.threads} "
    "{jsd} "
    "{snakemake.params}) {log}"
)
# ToDo: remove the 'NA' string replacement when fixed in deepTools, see:
# https://github.com/deeptools/deepTools/pull/999
regex_passes = 2

# Only post-process the quality metrics file if it was requested as output.
if out_metrics:
    with open(out_metrics, "rt") as f:
        metrics = f.read()
        for i in range(regex_passes):
            metrics = re.sub("\tNA(\t|\n)", "\tnan\\1", metrics)

    with open(out_metrics, "wt") as f:
        f.write(metrics)
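A sketch of how this wrapper might be called for one treatment/control pair; the paths and wildcard names are hypothetical, and the wrapper version matches the rule fragment below:

rule plot_fingerprint:
    input:
        bam_files=[
            "results/bam/{sample}.sorted.bam",
            "results/bam/{control}.sorted.bam",
        ],
        jsd_sample="results/bam/{control}.sorted.bam"
    output:
        fingerprint="results/qc/fingerprint/{sample}-{control}.pdf",
        counts="results/qc/fingerprint/{sample}-{control}.counts.tab",
        qc_metrics="results/qc/fingerprint/{sample}-{control}.metrics.tsv"
    params:
        "--skipZeros"
    threads: 8
    log:
        "logs/deeptools/fingerprint_{sample}-{control}.log"
    wrapper:
        "0.66.0/bio/deeptools/plotfingerprint"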
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2019, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import subprocess as sp
import sys
from itertools import product
from snakemake.shell import shell

species = snakemake.params.species.lower()
release = int(snakemake.params.release)
build = snakemake.params.build

branch = ""
if release >= 81 and build == "GRCh37":
    # use the special grch37 branch for new releases
    branch = "grch37/"

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

spec = ("{build}" if int(release) > 75 else "{build}.{release}").format(
    build=build, release=release
)

suffixes = ""
datatype = snakemake.params.get("datatype", "")
chromosome = snakemake.params.get("chromosome", "")
if datatype == "dna":
    if chromosome:
        suffixes = ["dna.chromosome.{}.fa.gz".format(chromosome)]
    else:
        suffixes = ["dna.primary_assembly.fa.gz", "dna.toplevel.fa.gz"]
elif datatype == "cdna":
    suffixes = ["cdna.all.fa.gz"]
elif datatype == "cds":
    suffixes = ["cds.all.fa.gz"]
elif datatype == "ncrna":
    suffixes = ["ncrna.fa.gz"]
elif datatype == "pep":
    suffixes = ["pep.all.fa.gz"]
else:
    raise ValueError("invalid datatype, must be one of dna, cdna, cds, ncrna, pep")

if chromosome:
    if not datatype == "dna":
        raise ValueError(
            "invalid datatype, to select a single chromosome the datatype must be dna"
        )

success = False
for suffix in suffixes:
    url = "ftp://ftp.ensembl.org/pub/{branch}release-{release}/fasta/{species}/{datatype}/{species_cap}.{spec}.{suffix}".format(
        release=release,
        species=species,
        datatype=datatype,
        spec=spec.format(build=build, release=release),
        suffix=suffix,
        species_cap=species.capitalize(),
        branch=branch,
    )

    try:
        shell("curl -sSf {url} > /dev/null 2> /dev/null")
    except sp.CalledProcessError:
        continue

    shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}")
    success = True
    break

if not success:
    print(
        "Unable to download requested sequence data from Ensembl. "
        "Did you check that this combination of species, build, and release is actually provided?",
        file=sys.stderr,
    )
    exit(1)
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "(bedtools complement"
    " {extra}"
    " -i {snakemake.input.in_file}"
    " -g {snakemake.input.genome}"
    " > {snakemake.output[0]})"
    " {log}"
)
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
genome = snakemake.input.get("genome", "")
faidx = snakemake.input.get("faidx", "")

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

if genome:
    extra += " -g {}".format(genome)
elif faidx:
    extra += " -faidx {}".format(faidx)

shell(
    "(bedtools sort"
    " {extra}"
    " -i {snakemake.input.in_file}"
    " > {snakemake.output[0]})"
    " {log}"
)
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell
import os
import sys

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

genome = snakemake.input.get("genome", "")
extra = snakemake.params.get("extra", "")
motif_files = snakemake.input.get("motif_files", "")
matrix = snakemake.output.get("matrix", "")

if genome == "":
    genome = "none"

# optional files
opt_files = {
    "gtf": "-gtf",
    "gene": "-gene",
    "motif_files": "-m",
    "filter_motiv": "-fm",
    "center": "-center",
    "nearest_peak": "-p",
    "tag": "-d",
    "vcf": "-vcf",
    "bed_graph": "-bedGraph",
    "wig": "-wig",
    "map": "-map",
    "cmp_genome": "-cmpGenome",
    "cmp_Liftover": "-cmpLiftover",
    "advanced_annotation": "-ann",
    "mfasta": "-mfasta",
    "mbed": "-mbed",
    "mlogic": "-mlogic",
}

requires_motives = False
for i in opt_files:
    file = None
    if i == "mfasta" or i == "mbed" or i == "mlogic":
        file = snakemake.output.get(i, "")
        if file:
            requires_motives = True
    else:
        file = snakemake.input.get(i, "")
    if file:
        extra += " {flag} {file}".format(flag=opt_files[i], file=file)

if requires_motives and motif_files == "":
    sys.exit(
        "The optional output files require motif_file(s) as input. For more information please see http://homer.ucsd.edu/homer/ngs/annotation.html."
    )

# optional matrix output files:
if matrix:
    if motif_files == "":
        sys.exit(
            "The matrix output files require motif_file(s) as input. For more information please see http://homer.ucsd.edu/homer/ngs/annotation.html."
        )
    ext = ".count.matrix.txt"
    matrix_out = [i for i in snakemake.output if i.endswith(ext)][0]
    matrix_name = os.path.basename(matrix_out[: -len(ext)])
    extra += " -matrix {}".format(matrix_name)

shell(
    "(annotatePeaks.pl"
    " {snakemake.params.mode}"
    " {snakemake.input.peaks}"
    " {genome}"
    " {extra}"
    " -cpu {snakemake.threads}"
    " > {snakemake.output.annotations})"
    " {log}"
)
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

import os
import sys
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

in_contr = snakemake.input.get("control")
params = "{}".format(snakemake.params)
opt_input = ""
out_dir = ""

ext = "_peaks.xls"
out_file = [o for o in snakemake.output if o.endswith(ext)][0]
out_name = os.path.basename(out_file[: -len(ext)])
out_dir = os.path.dirname(out_file)

if in_contr:
    opt_input = "-c {contr}".format(contr=in_contr)

if out_dir:
    out_dir = "--outdir {dir}".format(dir=out_dir)

if any(out.endswith(("_peaks.narrowPeak", "_summits.bed")) for out in snakemake.output):
    if any(
        out.endswith(("_peaks.broadPeak", "_peaks.gappedPeak"))
        for out in snakemake.output
    ):
        sys.exit(
            "Output files with _peaks.narrowPeak and/or _summits.bed extensions cannot be created together with _peaks.broadPeak and/or _peaks.gappedPeak extended output files.\n"
            "For usable extensions please see https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/macs2/callpeak.html.\n"
        )
    else:
        if " --broad" in params:
            sys.exit(
                "If --broad option in params is given, the _peaks.narrowPeak and _summits.bed files will not be created. \n"
                "Remove --broad option from params if these files are needed.\n"
            )

if any(
    out.endswith(("_peaks.broadPeak", "_peaks.gappedPeak")) for out in snakemake.output
):
    if "--broad " not in params and not params.endswith("--broad"):
        params += " --broad "

if any(
    out.endswith(("_treat_pileup.bdg", "_control_lambda.bdg"))
    for out in snakemake.output
):
    if all(p not in params for p in ["--bdg", "-B"]):
        params += " --bdg "
else:
    if any(p in params for p in ["--bdg", "-B"]):
        sys.exit(
            "If --bdg or -B option in params is given, the _control_lambda.bdg and _treat_pileup.bdg extended files must be specified in output. \n"
        )

shell(
    "(macs2 callpeak "
    "-t {snakemake.input.treatment} "
    "{opt_input} "
    "{out_dir} "
    "-n {out_name} "
    "{params}) {log}"
)
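A narrow-peak invocation sketch; the genome size and q-value are placeholders. Per the checks above, requesting _peaks.narrowPeak and _summits.bed output is incompatible with --broad:

rule macs2_callpeak:
    input:
        treatment="results/bam/{sample}.sorted.bam",
        control="results/bam/{control}.sorted.bam"
    output:
        multiext(
            "results/macs2/{sample}-{control}",
            "_peaks.xls",
            "_peaks.narrowPeak",
            "_summits.bed",
        )
    params:
        "-f BAM -g hs -q 0.05"
    log:
        "logs/macs2/{sample}-{control}.log"
    wrapper:
        "0.68.0/bio/macs2/callpeak"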
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)

    split_ind = 2 if base.endswith(".fastq.gz") else 1
    base = ".".join(base.split(".")[:-split_ind])

    return base


# Run fastqc; since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} --quiet -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log:q}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
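A minimal rule sketch for this wrapper (paths illustrative); the wrapper moves the FastQC results from its temp dir to the declared html and zip outputs:

rule fastqc:
    input:
        "reads/{sample}.fastq.gz"
    output:
        html="results/qc/fastqc/{sample}.html",
        zip="results/qc/fastqc/{sample}_fastqc.zip"
    log:
        "logs/fastqc/{sample}.log"
    wrapper:
        "0.72.0/bio/fastqc"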
__author__ = "Johannes Köster, Derek Croote"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

import os
import tempfile
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

outdir = os.path.dirname(snakemake.output[0])
if outdir:
    outdir = "--outdir {}".format(outdir)

extra = snakemake.params.get("extra", "")

with tempfile.TemporaryDirectory() as tmp:
    shell(
        "fasterq-dump --temp {tmp} --threads {snakemake.threads} "
        "{extra} {outdir} {snakemake.wildcards.accession} {log}"
    )
__author__ = "Antonie Vietor"
__copyright__ = "Copyright 2020, Antonie Vietor"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")

# optional input files and directories
fasta = snakemake.input.get("fasta", "")
chr_names = snakemake.input.get("chr_names", "")
tmp_dir = snakemake.params.get("tmp_dir", "")
r_path = snakemake.params.get("r_path", "")

if fasta:
    extra += " -G {}".format(fasta)
if chr_names:
    extra += " -A {}".format(chr_names)
if tmp_dir:
    extra += " --tmpDir {}".format(tmp_dir)
if r_path:
    extra += " --Rpath {}".format(r_path)

shell(
    "(featureCounts"
    " {extra}"
    " -T {snakemake.threads}"
    " -J"
    " -a {snakemake.input.annotation}"
    " -o {snakemake.output[0]}"
    " {snakemake.input.sam})"
    " {log}"
)
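One way such a rule could look; the SAF annotation (-F SAF) mirrors how consensus peaks are quantified in the rule fragments below, and all paths are hypothetical. featureCounts itself creates the .summary and (via -J) .jcounts files alongside the main output:

rule featurecounts:
    input:
        sam="results/bam/{sample}.sorted.bam",
        annotation="results/consensus/consensus_peaks.saf"
    output:
        multiext(
            "results/featurecounts/{sample}",
            ".featureCounts",
            ".featureCounts.summary",
            ".featureCounts.jcounts",
        )
    params:
        extra="-F SAF"
    threads: 4
    log:
        "logs/featurecounts/{sample}.log"
    wrapper:
        "0.73.0/bio/subread/featurecounts"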
wrapper:
    "0.66.0/bio/bedtools/merge"

wrapper:
    "0.66.0/bio/bedtools/merge"

script:
    "../scripts/macs2_merged_expand.py"

shell:
    "gawk -v FS='\t' -v OFS='\t' 'FNR > 1 {{ print $1, $2, $3, $4, \"0\", \"+\"}}' {input} > {output} 2> {log}"

shell:
    "(echo -e 'GeneID\tChr\tStart\tEnd\tStrand' > {output} && "
    " gawk -v FS='\t' -v OFS='\t' 'FNR > 1 {{ print $4, $1, $2, $3,  \" + \" }}' {input} >> {output}) "
    " 2> {log}"

shell:
    "Rscript ../workflow/scripts/plot_peak_intersect.R -i {input} -o {output} 2> {log}"

shell:
    "find {input} -type f -name '*.consensus_{wildcards.peak}-peaks.boolean.bed' -exec echo -e 'results/IGV/consensus/{wildcards.antibody}/\"{{}}\"\t0,0,0' \; > {output} 2> {log}"

wrapper:
    "0.68.0/bio/homer/annotatePeaks"

shell:
    "cut -f2- {input} | gawk 'NR==1; NR > 1 {{print $0 | \"sort -T '.' -k1,1 -k2,2n\"}}' | cut -f6- > {output}"

shell:
    "paste {input.bool} {input.trim} > {output}"

wrapper:
    "0.73.0/bio/subread/featurecounts"

script:
    "../scripts/col_mod_featurecounts.py"

script:
    "../scripts/featurecounts_deseq2.R"

shell:
    "find {input} -type f -name '*.consensus_{wildcards.peak}-peaks.deseq2.FDR_0.05.results.bed' -exec echo -e 'results/IGV/consensus/{wildcards.antibody}/deseq2/\"{{}}\"\t255,0,0' \; > {output} 2> {log}"

wrapper:
    "0.64.0/bio/cutadapt/pe"

wrapper:
    "0.64.0/bio/cutadapt/se"

wrapper:
    "0.64.0/bio/samtools/view"

wrapper:
    "0.64.0/bio/bamtools/filter_json"

wrapper:
    "0.64.0/bio/samtools/sort"

shell:
    "../workflow/scripts/rm_orphan_pe_bam.py {input} {output.bam} {params} 2> {log}"

wrapper:
    "0.64.0/bio/samtools/sort"

shell:
    "ln -sr {input} {output}"

wrapper:
    "0.64.0/bio/bwa/mem"

wrapper:
    "0.64.0/bio/picard/mergesamfiles"

wrapper:
    "0.64.0/bio/picard/markduplicates"

wrapper:
    "0.66.0/bio/deeptools/plotfingerprint"

wrapper:
    "0.68.0/bio/macs2/callpeak"

wrapper:
    "0.68.0/bio/macs2/callpeak"

shell:
    "cat {input.peaks} | "
    " wc -l | "
    " gawk -v OFS='\t' '{{print \"{wildcards.sample}-{wildcards.control}_{wildcards.peak}_peaks\", $1}}' "
    " > {output} 2> {log}"

script:
    "../scripts/plot_peaks_count_macs2.R"

wrapper:
    "0.66.0/bio/bedtools/intersect"

shell:
    "grep 'mapped (' {input.flagstats} | "
    " gawk -v a=$(gawk -F '\t' '{{sum += $NF}} END {{print sum}}' < {input.intersect}) "
    " -v OFS='\t' "
    " '{{print \"{wildcards.sample}-{wildcards.control}_{wildcards.peak}_peaks\", a/$1}}' "
    " > {output} 2> {log}"

script:
    "../scripts/plot_frip_score.R"

shell:
    "find {input} -type f -name '*_peaks.{wildcards.peak}Peak' -exec echo -e 'results/IGV/macs2_callpeak/{wildcards.peak}/\"{{}}\"\t0,0,178' \; > {output} 2> {log}"

wrapper:
    "0.68.0/bio/homer/annotatePeaks"

shell:
    "Rscript ../workflow/scripts/plot_macs_qc.R -i {params.input} -s {params.sample_control_combinations} -o {output.plot} -p {output.summmary} 2> {log}"

shell:
    "Rscript ../workflow/scripts/plot_homer_annotatepeaks.R -i {params.input} -s {params.sample_control_combinations} -o {output.plot} -p {output.summmary} 2> {log}"

script:
    "../scripts/plot_annotatepeaks_summary_homer.R"

wrapper:
    "0.64.0/bio/preseq/lc_extrap"

wrapper:
    "0.64.0/bio/picard/collectmultiplemetrics"

wrapper:
    "0.64.0/bio/bedtools/genomecov"

shell:
    "sort -k1,1 -k2,2n {input} > {output} 2> {log}"

wrapper:
    "0.64.0/bio/ucsc/bedGraphToBigWig"

shell:
    "find {input} -type f -name '*.bigWig' -exec echo -e 'results/IGV/big_wig/\"{{}}\"\t0,0,178' \; > {output} 2> {log}"

wrapper:
    "0.64.0/bio/deeptools/computematrix"

wrapper:
    "0.64.0/bio/deeptools/plotprofile"

wrapper:
    "0.64.0/bio/deeptools/plotheatmap"

shell:
    "( Rscript -e \"library(caTools); source('../workflow/scripts/run_spp.R')\" "
    "  -c={input} -savp={output.plot} -savd={output.r_data} "
    "  -out={output.res_phantom} -p={threads} 2>&1 ) >{log}"

script:
    "../scripts/phantompeak_correlation.R"

shell:
    "( gawk -v OFS='\t' '{{print $1, $9}}' {input.data} | cat {input.nsc_header} - > {output.nsc} && "
    "  gawk -v OFS='\t' '{{print $1, $10}}' {input.data} | cat {input.rsc_header} - > {output.rsc} 2>&1 ) >{log}"

wrapper:
    "0.72.0/bio/fastqc"

wrapper:
    "0.64.0/bio/multiqc"

# rules/ref.smk, line 13
wrapper:
    "0.67.0/bio/reference/ensembl-sequence"

# rules/ref.smk, line 28
wrapper:
    "0.64.0/bio/reference/ensembl-annotation"

# rules/ref.smk, line 42
wrapper:
    "0.72.0/bio/sra-tools/fasterq-dump"

# rules/ref.smk, line 53
wrapper:
    "0.72.0/bio/sra-tools/fasterq-dump"

# rules/ref.smk, line 65
shell:
    "../workflow/scripts/gtf2bed {input} > {output} 2> {log}"

# rules/ref.smk, line 76
wrapper:
    "0.64.0/bio/samtools/faidx"

# rules/ref.smk, line 89
wrapper:
    "0.64.0/bio/bwa/index"

# rules/ref.smk, line 99
shell:
    "cut -f 1,2 {input.genome} > {output} 2> {log}"

# rules/ref.smk, line 111
script:
    "../scripts/generate_igenomes.py"

# rules/ref.smk, line 127
script:
    "../scripts/generate_blacklist.py"

# rules/ref.smk, line 139
wrapper:
    "0.68.0/bio/bedtools/sort"

# rules/ref.smk, line 152
wrapper:
    "0.68.0/bio/bedtools/complement"

# rules/ref.smk, line 167
script:
    "../scripts/get_gsize.py"

wrapper:
    "0.64.0/bio/samtools/flagstat"

wrapper:
    "0.64.0/bio/samtools/idxstats"

wrapper:
    "0.64.0/bio/samtools/stats"

wrapper:
    "0.64.0/bio/samtools/index"
import sys
import pandas as pd
import os.path

sys.stderr = open(snakemake.log[0], "w")

samples = pd.read_csv(snakemake.input.get("samples_file"), sep="\t")


def get_group_control_combination(bam_path):
    sample = os.path.basename(bam_path.split(".sorted.bam")[0])
    sample_row = samples.loc[samples["sample"] == sample]
    group = sample_row["group"].iloc[0]
    control = sample_row["control"].iloc[0]
    if pd.isnull(control):
        control = sample
    return "{}_{}_{}".format(group, control, sample)


def modify_header(old_header):
    return [get_group_control_combination(i) if i in snakemake.input["bam"] else i for i in old_header]


f_counts_tab = pd.read_csv(snakemake.input["featurecounts"], sep="\t", skiprows=1)
header = list(f_counts_tab.columns.values)
header_mod = modify_header(header)
f_counts_frame = pd.DataFrame(f_counts_tab.values)
f_counts_frame.columns = header_mod
f_counts_frame.to_csv(snakemake.output[0], sep='\t', index=False)
    ## - FIRST SIX COLUMNS OF FEATURECOUNTS_FILE SHOULD BE INTERVAL INFO. REMAINDER OF COLUMNS SHOULD BE SAMPLES-SPECIFIC COUNTS.
    ## - SAMPLE NAMES HAVE TO END IN "_R1" REPRESENTING REPLICATE ID. LAST 3 CHARACTERS OF SAMPLE NAME WILL BE TRIMMED TO OBTAIN GROUP ID FOR DESEQ2 COMPARISONS.
    ## - BAM_SUFFIX IS PORTION OF FILENAME AFTER SAMPLE NAME IN FEATURECOUNTS COLUMN SAMPLE NAMES E.G. ".rmDup.bam" if "DRUG_R1.rmDup.bam"
    ## - PACKAGES BELOW NEED TO BE AVAILABLE TO LOAD WHEN RUNNING R

################################################
################################################
## LOAD LIBRARIES                             ##
################################################
################################################

#library(optparse)
library(DESeq2)
library(vsn)
library(ggplot2)
library(RColorBrewer)
library(pheatmap)
library(lattice)
library(BiocParallel)

################################################
################################################
## PARSE COMMAND-LINE PARAMETERS              ##
################################################
################################################

#option_list <- list(make_option(c("-i", "--featurecount_file"), type="character", default=NULL, help="Feature count file generated by the SubRead featureCounts command.", metavar="path"),
#                    make_option(c("-b", "--bam_suffix"), type="character", default=NULL, help="Portion of filename after sample name in featurecount file header e.g. '.rmDup.bam' if 'DRUG_R1.rmDup.bam'", metavar="string"),
#                    make_option(c("-o", "--outdir"), type="character", default='./', help="Output directory", metavar="path"),
#                    make_option(c("-p", "--outprefix"), type="character", default='differential', help="Output prefix", metavar="string"),
#                    make_option(c("-s", "--outsuffix"), type="character", default='', help="Output suffix for comparison-level results", metavar="string"),
#                    make_option(c("-v", "--vst"), type="logical", default=FALSE, help="Run vst transform instead of rlog", metavar="boolean"),
#                    make_option(c("-c", "--cores"), type="integer", default=1, help="Number of cores", metavar="integer"))
#
#opt_parser <- OptionParser(option_list=option_list)
#opt <- parse_args(opt_parser)
#
#if (is.null(opt$featurecount_file)){
#    print_help(opt_parser)
#    stop("Please provide featurecount file.", call.=FALSE)
#}
#if (is.null(opt$bam_suffix)){
#    print_help(opt_parser)
#    stop("Please provide bam suffix in header of featurecount file.", call.=FALSE)
#}

################################################
################################################
## READ IN COUNTS FILE                        ##
################################################
################################################

featurecount_file <- snakemake@input[[1]]   # AVI: adapted to snakemake
# AVI: suffix and prefix are already removed in rule featurecounts_modified_colnames
#bam_suffix <- ".bam"                       # AVI
#if (snakemake@params[["singleend"]]) {    # AVI
#    bam_suffix <- ".mLb.clN.sorted.bam"    # AVI
#}

count.table <- read.delim(file=featurecount_file,header=TRUE)  # AVI: removed 'skip=1', this was already done in rule featurecounts_modified_colnames
#colnames(count.table) <- gsub(bam_suffix,"",colnames(count.table))  # AVI
#colnames(count.table) <- as.character(lapply(colnames(count.table), function (x) tail(strsplit(x,'.',fixed=TRUE)[[1]],1)))  # AVI
rownames(count.table) <- count.table$Geneid
interval.table <- count.table[,1:6]
count.table <- count.table[,7:ncol(count.table),drop=FALSE]

################################################
################################################
## RUN DESEQ2                                 ##
################################################
################################################

# AVI: this is handled by snakemake
#if (file.exists(opt$outdir) == FALSE) {
#    dir.create(opt$outdir,recursive=TRUE)
#}
#setwd(opt$outdir)

samples.vec <- sort(colnames(count.table))
groups <- sub("_[^_]+$", "", samples.vec)
print(unique(groups))
if (length(unique(groups)) == 1) {
    quit(save = "no", status = 0, runLast = FALSE)
}

DDSFile <- snakemake@output[["dds"]]  # AVI: adapted to snakemake
if (file.exists(DDSFile) == FALSE) {
    counts <- count.table[,samples.vec,drop=FALSE]
    print(head(counts))
    coldata <- data.frame(row.names=colnames(counts),condition=groups)
    print(head(coldata))

    # AVI: set threads limit to prevent the "'bplapply' receive data failed" error
    # see also https://github.com/kdkorthauer/dmrseq/issues/7
    threads <- floor(snakemake@threads[[1]] * 0.75)

    dds <- DESeqDataSetFromMatrix(countData = round(counts), colData = coldata, design = ~ condition)
    dds <- DESeq(dds, parallel=TRUE, BPPARAM=MulticoreParam(ifelse(threads>0, threads, 1)))  # AVI: set threads limit

    if (!snakemake@params[["vst"]]) {
        rld <- rlog(dds)
    } else {
        rld <- vst(dds)
    }
    save(dds,rld,file=DDSFile)
}

#################################################
#################################################
### PLOT QC                                    ##
#################################################
#################################################

PlotPCAFile <- snakemake@output[["plot_pca"]]  # AVI: adapted to snakemake
PlotHeatmapFile <- snakemake@output[["plot_heatmap"]]  # AVI: adapted to snakemake
if (file.exists(PlotPCAFile) == FALSE) {
    # pdf(file=PlotFile,onefile=TRUE,width=7,height=7)  # AVI: split into separate pdf files

    ## PCA
    pdf(file=PlotPCAFile,onefile=TRUE,width=7,height=7)  # AVI: added to create separate pdf files
    pca.data <- DESeq2::plotPCA(rld,intgroup=c("condition"),returnData=TRUE)
    percentVar <- round(100 * attr(pca.data, "percentVar"))
    plot <- ggplot(pca.data, aes(PC1, PC2, color=condition)) +
            geom_point(size=3) +
            xlab(paste0("PC1: ",percentVar[1],"% variance")) +
            ylab(paste0("PC2: ",percentVar[2],"% variance")) +
            theme(panel.grid.major = element_blank(),
                  panel.grid.minor = element_blank(),
                  panel.background = element_blank(),
                  panel.border = element_rect(colour = "black", fill=NA, size=1))
    print(plot)
    dev.off()  # AVI: added to create separate pdf files
}

    ## WRITE PC1 vs PC2 VALUES TO FILE
if (file.exists(PlotHeatmapFile) == FALSE) {   # AVI: added for splitted pdf files
    pca.vals <- pca.data[,1:2]
    colnames(pca.vals) <- paste(colnames(pca.vals),paste(percentVar,'% variance',sep=""), sep=": ")
    pca.vals <- cbind(sample = rownames(pca.vals), pca.vals)
    write.table(pca.vals,file=snakemake@output[["pca_data"]],row.names=FALSE,col.names=TRUE,sep="\t",quote=TRUE) # AVI: adapted to snakemake

    ## SAMPLE CORRELATION HEATMAP
    pdf(file=PlotHeatmapFile,onefile=TRUE,width=7,height=7)  # AVI: added to create separate pdf files
    sampleDists <- dist(t(assay(rld)))
    sampleDistMatrix <- as.matrix(sampleDists)
    colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
    pheatmap(sampleDistMatrix,clustering_distance_rows=sampleDists,clustering_distance_cols=sampleDists,col=colors)
    dev.off()  # AVI: added to create separate pdf files

    ## WRITE SAMPLE DISTANCES TO FILE  # AVI: adapted to snakemake
    write.table(cbind(sample = rownames(sampleDistMatrix), sampleDistMatrix),file=snakemake@output[["dist_data"]],row.names=FALSE,col.names=TRUE,sep="\t",quote=FALSE)


}

#################################################
#################################################
### SAVE SIZE FACTORS                          ##
#################################################
#################################################

#SizeFactorsDir <- "sizeFactors/"
#if (file.exists(SizeFactorsDir) == FALSE) {
#    dir.create(SizeFactorsDir,recursive=TRUE)
#}

NormFactorsFile <- snakemake@output[["size_factors_rdata"]]  # AVI: adapted to snakemake
if (file.exists(NormFactorsFile) == FALSE) {
    normFactors <- sizeFactors(dds)
    save(normFactors,file=NormFactorsFile)

    for (name in names(sizeFactors(dds))) {
        sizeFactorFile <- snakemake@output[["size_factors_res"]]  # AVI: adapted to snakemake
        if (file.exists(sizeFactorFile) == FALSE) {
            write(as.numeric(sizeFactors(dds)[name]),file=sizeFactorFile)
        }
    }
}

#################################################
#################################################
### WRITE LOG FILE                             ##
#################################################
#################################################

LogFile <- snakemake@log[[1]]  # AVI: adapted to snakemake
if (file.exists(LogFile) == FALSE) {
    cat("\nSamples =",samples.vec,"\n\n",file=LogFile,append=TRUE,sep=', ')
    cat("Groups =",groups,"\n\n",file=LogFile,append=TRUE,sep=', ')
    cat("Dimensions of count matrix =",dim(counts),"\n\n",file=LogFile,append=TRUE,sep=' ')
    cat("\n",file=LogFile,append=TRUE,sep='')
}

#################################################
#################################################
### LOOP THROUGH COMPARISONS                   ##
#################################################
#################################################

ResultsFile <- snakemake@output[["results"]] # AVI: adapted to snakemake
if (file.exists(ResultsFile) == FALSE) {

    raw.counts <- counts(dds,normalized=FALSE)
    colnames(raw.counts) <- paste(colnames(raw.counts),'raw',sep='.')
    pseudo.counts <- counts(dds,normalized=TRUE)
    colnames(pseudo.counts) <- paste(colnames(pseudo.counts),'pseudo',sep='.')

    deseq2_results_list <- list()
    comparisons <- combn(unique(groups),2)
    for (idx in 1:ncol(comparisons)) {

        control.group <- comparisons[1,idx]
        treat.group <- comparisons[2,idx]
        CompPrefix <- paste(control.group,treat.group,sep="vs")
        cat("Saving results for ",CompPrefix," ...\n",sep="")

        # AVI: this is handled by snakemake
        #CompOutDir <- paste(CompPrefix,'/',sep="")
        #if (file.exists(CompOutDir) == FALSE) {
        #    dir.create(CompOutDir,recursive=TRUE)
        #}

        control.samples <- samples.vec[which(groups == control.group)]
        treat.samples <- samples.vec[which(groups == treat.group)]
        comp.samples <- c(control.samples,treat.samples)

        comp.results <- results(dds,contrast=c("condition",c(control.group,treat.group)))
        comp.df <- as.data.frame(comp.results)
        comp.table <- cbind(interval.table, as.data.frame(comp.df), raw.counts[,paste(comp.samples,'raw',sep='.')], pseudo.counts[,paste(comp.samples,'pseudo',sep='.')])

        ## WRITE RESULTS FILE
        CompResultsFile <- snakemake@output[["results"]]
        write.table(comp.table, file=CompResultsFile, col.names=TRUE, row.names=FALSE, sep='\t', quote=FALSE)

        ## FILTER RESULTS BY FDR & LOGFC AND WRITE RESULTS FILE
        # pdf(file=snakemake@output[["deseq2_plots"]],width=10,height=8) # AVI: split into separate pdf files
        if (length(comp.samples) > 2) {
            for (MIN_FDR in c(0.01,0.05)) {

                ## SUBSET RESULTS BY FDR
                pass.fdr.table <- subset(comp.table, padj < MIN_FDR)
                pass.fdr.up.table <- subset(comp.table, padj < MIN_FDR & log2FoldChange > 0)
                pass.fdr.down.table <- subset(comp.table, padj < MIN_FDR & log2FoldChange < 0)

                ## SUBSET RESULTS BY FDR AND LOGFC
                pass.fdr.logFC.table <- subset(comp.table, padj < MIN_FDR & abs(log2FoldChange) >= 1)
                pass.fdr.logFC.up.table <- subset(comp.table, padj < MIN_FDR & abs(log2FoldChange) >= 1 & log2FoldChange > 0)
                pass.fdr.logFC.down.table <- subset(comp.table, padj < MIN_FDR & abs(log2FoldChange) >= 1 & log2FoldChange < 0)

                ## WRITE RESULTS FILE
                if (MIN_FDR == 0.01) {
                    CompResultsFile <- snakemake@output[["FDR_1_perc_res"]]
                    CompBEDFile <- snakemake@output[["FDR_1_perc_bed"]]
                    MAplotFile <- snakemake@output[["plot_FDR_1_perc_MA"]]  # AVI: added to create separate pdf files
                    VolcanoPlotFile <- snakemake@output[["plot_FDR_1_perc_volcano"]]  # AVI: added to create separate pdf files
                }
                if (MIN_FDR == 0.05) {
                    CompResultsFile <- snakemake@output[["FDR_5_perc_res"]]
                    CompBEDFile <- snakemake@output[["FDR_5_perc_bed"]]
                    MAplotFile <- snakemake@output[["plot_FDR_5_perc_MA"]]  # AVI: added to create separate pdf files
                    VolcanoPlotFile <- snakemake@output[["plot_FDR_5_perc_volcano"]]  # AVI: added to create separate pdf files
                }

                write.table(pass.fdr.table, file=CompResultsFile, col.names=TRUE, row.names=FALSE, sep='\t', quote=FALSE)
                write.table(pass.fdr.table[,c("Chr","Start","End","Geneid","log2FoldChange","Strand")], file=CompBEDFile, col.names=FALSE, row.names=FALSE, sep='\t', quote=FALSE)

                ## MA PLOT & VOLCANO PLOT
                pdf(file=MAplotFile,width=10,height=8)  # AVI: added to create separate pdf files
                DESeq2::plotMA(comp.results, main=paste("MA plot FDR <= ",MIN_FDR,sep=""), ylim=c(-2,2),alpha=MIN_FDR)
                dev.off()  # AVI: added to create separate pdf files

                pdf(file=VolcanoPlotFile,width=10,height=8)  # AVI: added to create separate pdf files
                plot(comp.table$log2FoldChange, -1*log10(comp.table$padj), col=ifelse(comp.table$padj<=MIN_FDR, "red", "black"), xlab="logFC", ylab="-1*log10(FDR)", main=paste("Volcano plot FDR <=",MIN_FDR,sep=" "), pch=20)
                dev.off()  # AVI: added to create separate pdf files

                ## ADD COUNTS TO LOGFILE
                cat(CompPrefix," genes with FDR <= ",MIN_FDR,": ",nrow(pass.fdr.table)," (up=",nrow(pass.fdr.up.table),", down=",nrow(pass.fdr.down.table),")","\n",file=LogFile,append=TRUE,sep="")
                cat(CompPrefix," genes with FDR <= ",MIN_FDR," & FC > 2: ",nrow(pass.fdr.logFC.table)," (up=",nrow(pass.fdr.logFC.up.table),", down=",nrow(pass.fdr.logFC.down.table),")","\n",file=LogFile,append=TRUE,sep="")

            }
            cat("\n",file=LogFile,append=TRUE,sep="")
        # AVI: abort with an informative message if there are too few samples
        } else {
            stop("At least 3 samples treated with the same antibody are needed for the FDR & log2FC filtering.")
        }

        ## SAMPLE CORRELATION HEATMAP
        pdf(file=snakemake@output[["plot_sample_corr_heatmap"]],width=10,height=8)  # AVI: added to create separate pdf files
        rld.subset <- assay(rld)[,comp.samples]
        sampleDists <- dist(t(rld.subset))
        sampleDistMatrix <- as.matrix(sampleDists)
        colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
        pheatmap(sampleDistMatrix,clustering_distance_rows=sampleDists,clustering_distance_cols=sampleDists,col=colors)
        dev.off()  # AVI: added to create separate pdf files

        ## SCATTER PLOT FOR RLOG COUNTS
        pdf(file=snakemake@output[["plot_scatter"]],width=10,height=8)  # AVI: added to create separate pdf files
        combs <- combn(comp.samples,2,simplify=FALSE)
        clabels <- sapply(combs,function(x){paste(x,collapse=' & ')})
        plotdat <- data.frame(x=unlist(lapply(combs, function(x){rld.subset[, x[1] ]})),y=unlist(lapply(combs, function(y){rld.subset[, y[2] ]})),comp=rep(clabels, each=nrow(rld.subset)))
        plot <- xyplot(y~x|comp,plotdat,
                       panel=function(...){
                           panel.xyplot(...)
                           panel.abline(0,1,col="red")
                       },
                       par.strip.text=list(cex=0.5))
        print(plot)
        dev.off()  # AVI: added to create separate pdf files

        colnames(comp.df) <- paste(CompPrefix,".",colnames(comp.df),sep="")
        deseq2_results_list[[idx]] <- comp.df

    }

    ## WRITE RESULTS FROM ALL COMPARISONS TO FILE
    deseq2_results_table <- cbind(interval.table,do.call(cbind, deseq2_results_list),raw.counts,pseudo.counts)
    write.table(deseq2_results_table, file=ResultsFile, col.names=TRUE, row.names=FALSE, sep='\t', quote=FALSE)

}

#################################################
#################################################
### R SESSION INFO                             ##
#################################################
#################################################

cat(unlist(sessionInfo()),file=LogFile,append=TRUE,sep='')
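
The loop above enumerates every unordered pair of group labels via combn(unique(groups), 2) and names each comparison <control>vs<treatment>; note that the abs(log2FoldChange) >= 1 filter corresponds to a two-fold change, which is what the log messages report. For orientation, a minimal Python sketch of the same pair enumeration (the group labels are made up):

# Minimal sketch of R's combn(unique(groups), 2) over hypothetical group labels.
from itertools import combinations

groups = ["WT_IP", "KO_IP", "WT_INPUT"]            # illustrative labels
for control_group, treat_group in combinations(dict.fromkeys(groups), 2):
    print(f"{control_group}vs{treat_group}")       # mirrors CompPrefix above
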
import yaml
from smart_open import open  # transparently opens local paths and remote URLs

# download blacklist and trim it for a specific chromosome


def copy_blacklist(igenomes, blacklist_path):
    with open(igenomes) as fin:
        with open(blacklist_path, 'w') as fout:
            for line in fin:
                fout.write(line)


def get_blacklist_from_igenomes(igenomes, blacklist_path):
    with open(igenomes) as f:
        igenomes = yaml.load(f, Loader=yaml.FullLoader)
        if "blacklist" in igenomes["params"]["genomes"][build]:
            blacklist_link = igenomes["params"]["genomes"][build]["blacklist"]
            with open(blacklist_link) as fin:
                with open(blacklist_path, 'w') as fout:
                    for line in fin:
                        if line.startswith("chr"):
                            line = line.replace("chr", "", 1)
                        if chromosome:
                            if line.startswith("{}\t".format(chromosome)):
                                fout.write(line)
                        else:
                            fout.write(line)
        else:
            open(blacklist_path, 'a').close()


igenomes = snakemake.input[0]
blacklist_path = snakemake.output.get("blacklist_path", "")

build = snakemake.params.get("build", "")
chromosome = snakemake.params.get("chromosome", "")
blacklist = snakemake.params.get("blacklist", "")

if blacklist:
    copy_blacklist(igenomes, blacklist_path)
else:
    get_blacklist_from_igenomes(igenomes, blacklist_path)
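
The trimming logic strips one leading "chr" and, when a chromosome is configured, keeps only rows for that chromosome. A standalone sketch of the same filter on made-up BED lines:

# Standalone sketch of the blacklist trimming above (input lines are made up).
chromosome = "21"
lines = ["chr21\t9411193\t9411610\n", "chr22\t16847814\t16862400\n"]
for line in lines:
    if line.startswith("chr"):
        line = line.replace("chr", "", 1)
    if line.startswith("{}\t".format(chromosome)):
        print(line, end="")   # keeps only the chr21 interval, renamed to "21"
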
import sys
import yaml
from smart_open import open  # transparently opens local paths and remote URLs

# download and parse igenomes file


def parse_to_yaml(igenomes):
    params = {"=": ":", " { ": ": {", "\"": "\'", " ": "", "fasta:": "\'fasta\':", "bwa:": "\'bwa\':",
              "bowtie2:": "\'bowtie2\':", "star:": "\'star\':", "bismark:": "\'bismark\':", "gtf:": "\'gtf\':",
              "bed12:": "\'bed12\':", "readme:": "\'readme\':", "mito_name:": "\'mito_name\':",
              "macs_gsize:": "\'macs_gsize\':", "blacklist:": "\'blacklist\':", "\'\'": "\', \'",
              "params:": "\'params\':", "genomes:": "\'genomes\':", ":": " : ", "{": " { ", "}": " } ",
              "} \'": "}, \'"}
    for i in params:
        igenomes = igenomes.replace(i, params[i])
    return igenomes


def add_links(igenomes):
    return igenomes.replace(
        "$ { baseDir } /", "https://raw.githubusercontent.com/nf-core/chipseq/1.2.2/"
    ).replace(
        "$ { params.igenomes_base } /", "s3://ngi-igenomes/igenomes/"
    )


igenomes_path = snakemake.output[0]
igenomes_release = snakemake.params.get("igenomes_release", "")
blacklist = snakemake.params.get("blacklist", "")

if igenomes_release:
    igenomes_link = "https://raw.githubusercontent.com/nf-core/chipseq/{version}/conf/igenomes.config".format(
        version=igenomes_release
    )
else:
    sys.exit("The igenomes_release to use must be specified in the config.yaml file. "
             "Please see https://github.com/nf-core/chipseq/releases for available releases. ")

# remove the Groovy comment header
with open(igenomes_link) as fin:
    with open(igenomes_path, 'w') as fout:
        for line in fin:
            if not line.strip().startswith(('/*', '*', '//')):
                fout.write(line)

# parsing igenomes file to yaml format
with open(igenomes_path) as f:
    igenomes = yaml.load(add_links(parse_to_yaml(yaml.load(f, Loader=yaml.FullLoader))), Loader=yaml.FullLoader)
with open(igenomes_path, 'w') as f:
    yaml.dump(igenomes, f)
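
add_links() simply rewrites the two Nextflow interpolation patterns that survive parse_to_yaml(), which has padded the braces with spaces by that point. A standalone illustration on a made-up fragment:

# Standalone illustration of the add_links() rewriting; the fragment is made up.
fragment = "'readme' : '$ { baseDir } /docs/output.md'"
fragment = fragment.replace(
    "$ { baseDir } /", "https://raw.githubusercontent.com/nf-core/chipseq/1.2.2/"
).replace(
    "$ { params.igenomes_base } /", "s3://ngi-igenomes/igenomes/"
)
print(fragment)  # -> 'readme' : 'https://raw.githubusercontent.com/nf-core/chipseq/1.2.2/docs/output.md'
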
import yaml

def get_gsize_from_igenomes(igenomes, build):
    if build:
        with open(igenomes) as f:
            igenomes = yaml.load(f, Loader=yaml.FullLoader)
        if igenomes:
            genomes = igenomes["params"]["genomes"]
            # guard against builds missing from the igenomes config
            if build in genomes and "macs_gsize" in genomes[build]:
                return genomes[build]["macs_gsize"]
    return ""


igenomes = snakemake.input[0]
gsize_out = snakemake.output[0]

config_gsize = snakemake.params.get("extra", "")
build = snakemake.params.get("build", "")

if config_gsize:
    with open(gsize_out, 'w') as f:
        f.write("-g {}".format(config_gsize))
else:
    with open(gsize_out, 'w') as f:
        macs_gsize = get_gsize_from_igenomes(igenomes, build)
        if macs_gsize:
            f.write("-g {}".format(macs_gsize))
        else:
            f.write("")
import os
import errno
import argparse
import sys  # AVI: added to create log files

sys.stderr = open(snakemake.log[0], "w")  # AVI


############################################
############################################
## PARSE ARGUMENTS
############################################
############################################

# Description = 'Add sample boolean files and aggregate columns from merged MACS narrow or broad peak file.'
# Epilog = """Example usage: python macs2_merged_expand.py <MERGED_INTERVAL_FILE> <SAMPLE_NAME_LIST> <OUTFILE> --is_narrow_peak --min_replicates 1"""
#
# argParser = argparse.ArgumentParser(description=Description, epilog=Epilog)
#
# ## REQUIRED PARAMETERS
# argParser.add_argument('MERGED_INTERVAL_FILE', help="Merged MACS2 interval file created using linux sort and mergeBed.")
# argParser.add_argument('SAMPLE_NAME_LIST', help="Comma-separated list of sample names as named in individual MACS2 broadPeak/narrowPeak output file e.g. SAMPLE_R1 for SAMPLE_R1_peak_1.")
# argParser.add_argument('OUTFILE', help="Full path to output directory.")
#
# ## OPTIONAL PARAMETERS
# argParser.add_argument('-in', '--is_narrow_peak', dest="IS_NARROW_PEAK", help="Whether merged interval file was generated from narrow or broad peak files (default: False).",action='store_true')
# argParser.add_argument('-mr', '--min_replicates', type=int, dest="MIN_REPLICATES", default=1, help="Minumum number of replicates per sample required to contribute to merged peak (default: 1).")
# args = argParser.parse_args()

############################################
############################################
## HELPER FUNCTIONS
############################################
############################################

def makedir(path):
    if not len(path) == 0:
        try:
            os.makedirs(path)
        except OSError as exception:
            if exception.errno != errno.EEXIST:
                raise


############################################
############################################
## MAIN FUNCTION
############################################
############################################

## MergedIntervalTxtFile is file created using commands below:
## 1) broadPeak
## sort -k1,1 -k2,2n <MACS_BROADPEAK_FILES_LIST> | mergeBed -c 2,3,4,5,6,7,8,9 -o collapse,collapse,collapse,collapse,collapse,collapse,collapse,collapse > merged_peaks.txt
## 2) narrowPeak
## sort -k1,1 -k2,2n <MACS_NARROWPEAK_FILE_LIST> | mergeBed -c 2,3,4,5,6,7,8,9,10 -o collapse,collapse,collapse,collapse,collapse,collapse,collapse,collapse,collapse > merged_peaks.txt
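## An illustrative merged line for the narrowPeak case (fields are tab-separated;
## columns: chr, start, end, then collapsed per-peak starts, ends, names, scores,
## strands, fold-changes, -log10(pval), -log10(qval), summits):
## 1  100  500  100,180  420,500  WT_IP_R1_peak_1,WT_IP_R2_peak_7  250,180  .,.  5.2,4.8  12.1,9.7  8.3,6.4  260,310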

def macs2_merged_expand(MergedIntervalTxtFile, SampleNameList, OutFile, isNarrow=False, minReplicates=1):
    makedir(os.path.dirname(OutFile))

    combFreqDict = {}
    totalOutIntervals = 0
    SampleNameList = sorted(SampleNameList)
    fin = open(MergedIntervalTxtFile, 'r')
    fout = open(OutFile, 'w')
    oFields = (['chr', 'start', 'end', 'interval_id', 'num_peaks', 'num_samples']
               + [x + '.bool' for x in SampleNameList]
               + [x + '.fc' for x in SampleNameList]
               + [x + '.qval' for x in SampleNameList]
               + [x + '.pval' for x in SampleNameList]
               + [x + '.start' for x in SampleNameList]
               + [x + '.end' for x in SampleNameList])
    if isNarrow:
        oFields += [x + '.summit' for x in SampleNameList]
    fout.write('\t'.join(oFields) + '\n')
    while True:
        line = fin.readline()
        if line:
            lspl = line.strip().split('\t')

            chromID = lspl[0]
            mstart = int(lspl[1])
            mend = int(lspl[2])
            starts = [int(x) for x in lspl[3].split(',')]
            ends = [int(x) for x in lspl[4].split(',')]
            names = lspl[5].split(',')
            fcs = [float(x) for x in lspl[8].split(',')]
            pvals = [float(x) for x in lspl[9].split(',')]
            qvals = [float(x) for x in lspl[10].split(',')]
            summits = []
            if isNarrow:
                summits = [int(x) for x in lspl[11].split(',')]

            ## GROUP SAMPLES BY REMOVING TRAILING *_R*
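            ## e.g. peak name "WT_IP_R1_peak_3" -> sID "WT_IP_R1" -> gID "WT_IP"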
            groupDict = {}
            for sID in ['_'.join(x.split('_')[:-2]) for x in names]:
                gID = '_'.join(sID.split('_')[:-1])
                if gID not in groupDict:
                    groupDict[gID] = []
                if sID not in groupDict[gID]:
                    groupDict[gID].append(sID)

            ## GET SAMPLES THAT PASS REPLICATE THRESHOLD
            passRepThreshList = []
            for gID, sIDs in groupDict.items():
                if len(sIDs) >= minReplicates:
                    passRepThreshList += sIDs

            ## GET VALUES FROM INDIVIDUAL PEAK SETS
            fcDict = {}
            qvalDict = {}
            pvalDict = {}
            startDict = {}
            endDict = {}
            summitDict = {}
            for idx in range(len(names)):
                sample = '_'.join(names[idx].split('_')[:-2])
                if sample in passRepThreshList:
                    if sample not in fcDict:
                        fcDict[sample] = []
                    fcDict[sample].append(str(fcs[idx]))
                    if sample not in qvalDict:
                        qvalDict[sample] = []
                    qvalDict[sample].append(str(qvals[idx]))
                    if sample not in pvalDict:
                        pvalDict[sample] = []
                    pvalDict[sample].append(str(pvals[idx]))
                    if sample not in startDict:
                        startDict[sample] = []
                    startDict[sample].append(str(starts[idx]))
                    if sample not in endDict:
                        endDict[sample] = []
                    endDict[sample].append(str(ends[idx]))
                    if isNarrow:
                        if sample not in summitDict:
                            summitDict[sample] = []
                        summitDict[sample].append(str(summits[idx]))

            samples = sorted(fcDict.keys())
            if samples != []:
                numSamples = len(samples)
                boolList = ['TRUE' if x in samples else 'FALSE' for x in SampleNameList]
                fcList = [';'.join(fcDict[x]) if x in samples else 'NA' for x in SampleNameList]
                qvalList = [';'.join(qvalDict[x]) if x in samples else 'NA' for x in SampleNameList]
                pvalList = [';'.join(pvalDict[x]) if x in samples else 'NA' for x in SampleNameList]
                startList = [';'.join(startDict[x]) if x in samples else 'NA' for x in SampleNameList]
                endList = [';'.join(endDict[x]) if x in samples else 'NA' for x in SampleNameList]
                oList = [str(x) for x in [chromID, mstart, mend, 'Interval_' + str(totalOutIntervals + 1), len(names),
                                          numSamples] + boolList + fcList + qvalList + pvalList + startList + endList]
                if isNarrow:
                    oList += [';'.join(summitDict[x]) if x in samples else 'NA' for x in SampleNameList]
                fout.write('\t'.join(oList) + '\n')

                tsamples = tuple(sorted(samples))
                if tsamples not in combFreqDict:
                    combFreqDict[tsamples] = 0
                combFreqDict[tsamples] += 1
                totalOutIntervals += 1

        else:
            fin.close()
            fout.close()
            break

    ## WRITE FILE FOR INTERVAL INTERSECT ACROSS SAMPLES.
    ## COMPATIBLE WITH UPSETR PACKAGE.
    fout = open(OutFile[:-4] + '.intersect.txt', 'w')
    combFreqItems = sorted([(combFreqDict[x], x) for x in combFreqDict.keys()], reverse=True)
    for count, samples in combFreqItems:
        fout.write('%s\t%s\n' % ('&'.join(samples), count))
    fout.close()


############################################
############################################
## RUN FUNCTION
############################################
############################################

# AVI: arguments adapted to snakemake
macs2_merged_expand(MergedIntervalTxtFile=snakemake.input[0],
                    SampleNameList=list(snakemake.params.get("sample_control_peak")), OutFile=snakemake.output.get("bool_txt"),
                    isNarrow=snakemake.params.get("narrow_param"),
                    minReplicates=int(snakemake.params.get("min_reps_consensus")))
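
Outside of Snakemake the function can also be driven directly; a hypothetical call (file names and sample list are illustrative, not pipeline defaults):

# Hypothetical direct invocation of macs2_merged_expand().
macs2_merged_expand(MergedIntervalTxtFile="merged_peaks.txt",
                    SampleNameList=["WT_IP_R1", "WT_IP_R2", "KO_IP_R1"],
                    OutFile="consensus/boolean.txt",
                    isNarrow=True,
                    minReplicates=1)
# also writes consensus/boolean.intersect.txt, consumed by the UpSetR script below
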
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")
system(paste0("cp ", snakemake@input[["header"]], " ", snakemake@output[[1]]))
load(snakemake@input[["data"]])
write.table(crosscorr['cross.correlation'], file=snakemake@output[[1]], sep=',', quote=FALSE,
row.names=FALSE, col.names=FALSE, append=TRUE)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("tidyverse")

homer_data <- read_tsv(snakemake@input[[1]])
homer_data <- homer_data %>% gather(`exon`, `Intergenic`, `intron`, `promoter-TSS`, `TTS`, key="sequence_element", value="counts")

peaks_sum <- ggplot(homer_data, aes(x = counts, y = sample, fill = sequence_element)) +
  geom_bar(position="fill", stat="Identity") +
  theme_minimal() +
  labs(x="", y="Peak count") +
  theme(legend.position = "right") +
  guides(fill=guide_legend("sequence element")) +
  ggtitle("Peak to feature proportion")

ggsave(snakemake@output[[1]], peaks_sum)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("tidyverse")

data <- lapply(snakemake@input, read.table, header=F, stringsAsFactors = F)
frip_scores <- as_tibble(do.call(rbind, data))
names(frip_scores) <- c("sample_control", "frip")

frip <- ggplot(frip_scores, aes(x = sample_control, y = frip, fill = sample_control)) +
  geom_bar(stat="Identity", color="black") +
  theme_minimal() +
  labs(x="", y="FRiP score") +
  theme(legend.position = "right") +
  guides(fill=guide_legend("samples with controls")) +
  ggtitle("FRiP score")

ggsave(snakemake@output[[1]], frip)
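
Each snakemake@input entry here is expected to be a single-row, two-column table: a sample/control identifier and its FRiP value. A hypothetical example of one such input file being written upstream (path and value are illustrative):

# Hypothetical single-row FRiP file as consumed by the R script above.
with open("results/frip/WT_IP_R1.WT_INPUT.tsv", "w") as f:  # illustrative path
    f.write("WT_IP_R1.WT_INPUT\t0.31\n")                    # sample_control <TAB> frip
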
library(optparse)
library(ggplot2)
library(reshape2)
library(scales)

################################################
################################################
## PARSE COMMAND-LINE PARAMETERS              ##
################################################
################################################

option_list <- list(make_option(c("-i", "--homer_files"), type="character", default=NULL, help="Comma-separated list of homer annotated text files.", metavar="path"),
                    make_option(c("-s", "--sample_ids"), type="character", default=NULL, help="Comma-separated list of sample ids associated with homer annotated text files. Must be unique and in same order as homer files input.", metavar="string"),
                    make_option(c("-o", "--outdir"), type="character", default='./', help="Output directory", metavar="path"),
                    make_option(c("-p", "--outprefix"), type="character", default='homer_annotation', help="Output prefix", metavar="string"))

opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)

if (is.null(opt$homer_files)){
    print_help(opt_parser)
    stop("At least one homer annotated file must be supplied", call.=FALSE)
}
if (is.null(opt$sample_ids)){
    print_help(opt_parser)
    stop("Please provide sample ids associated with homer files.", call.=FALSE)
}

if (file.exists(opt$outdir) == FALSE) {
    dir.create(dirname(opt$outdir),recursive=TRUE)  # AVI: -o now names the plot file, so create its parent directory
}

HomerFiles <- unlist(strsplit(opt$homer_files,","))
SampleIDs <- unlist(strsplit(opt$sample_ids,","))
if (length(HomerFiles) != length(SampleIDs)) {
    print_help(opt_parser)
    stop("Number of sample ids must equal number of homer annotated files.", call.=FALSE)
}

################################################
################################################
## READ IN DATA                               ##
################################################
################################################

plot.dat <- data.frame()
plot.dist.dat <- data.frame()
plot.feature.dat <- data.frame()
for (idx in 1:length(HomerFiles)) {

    sampleid = SampleIDs[idx]
    anno.dat <- read.csv(HomerFiles[idx], sep="\t", header=TRUE)
    anno.dat <- anno.dat[,c("Annotation","Distance.to.TSS","Nearest.PromoterID")]

    ## REPLACE UNASSIGNED FEATURE ENTRIES WITH SENSIBLE VALUES
    unassigned <- which(is.na(as.character(anno.dat$Distance.to.TSS)))
    anno.dat$Distance.to.TSS[unassigned] <- 1000000

    anno.dat$Annotation <- as.character(anno.dat$Annotation)
    anno.dat$Annotation[unassigned] <- "Unassigned"
    anno.dat$Annotation <- as.factor(anno.dat$Annotation)

    anno.dat$Nearest.PromoterID <- as.character(anno.dat$Nearest.PromoterID)
    anno.dat$Nearest.PromoterID[unassigned] <- "Unassigned"
    anno.dat$Nearest.PromoterID <- as.factor(anno.dat$Nearest.PromoterID)

    anno.dat$name <- rep(sampleid,nrow(anno.dat))
    anno.dat$Distance.to.TSS <- abs(anno.dat$Distance.to.TSS) + 1
    plot.dat <- rbind(plot.dat,anno.dat)

    ## GET ANNOTATION COUNTS
    anno.freq <- as.character(lapply(strsplit(as.character(anno.dat$Annotation)," "), function(x) x[1]))
    anno.freq <- as.data.frame(table(anno.freq))
    colnames(anno.freq) <- c("feature",sampleid)
    anno.melt <- melt(anno.freq)
    plot.feature.dat <- rbind(plot.feature.dat,anno.melt)

    ## GET CLOSEST INSTANCE OF GENE TO ANY GIVEN PEAK
    unique.gene.dat <- anno.dat[order(anno.dat$Distance.to.TSS),]
    unique.gene.dat <- unique.gene.dat[!duplicated(unique.gene.dat$Nearest.PromoterID), ]
    dist.freq <- rep("> 10kb",nrow(unique.gene.dat))
    dist.freq[which(unique.gene.dat$Distance.to.TSS < 10000)] <- "< 10kb"
    dist.freq[which(unique.gene.dat$Distance.to.TSS < 5000)] <- "< 5kb"
    dist.freq[which(unique.gene.dat$Distance.to.TSS < 2000)] <- "< 2kb"
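    ## NB: bins are assigned from loosest to tightest, so each later assignment
    ## overwrites the earlier one and every gene lands in the smallest bracket it satisfies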
    dist.freq <- as.data.frame(table(dist.freq))
    colnames(dist.freq) <- c("distance",sampleid)
    dist.melt <- melt(dist.freq)
    plot.dist.dat <- rbind(plot.dist.dat,dist.melt)

}
plot.dat$name <- factor(plot.dat$name, levels=sort(unique(as.character(plot.dat$name))))
plot.dist.dat$variable <- factor(plot.dist.dat$variable, levels=sort(unique(as.character(plot.dist.dat$variable))))
plot.feature.dat$variable <- factor(plot.feature.dat$variable, levels=sort(unique(as.character(plot.feature.dat$variable))))

summary.dat <- dcast(plot.feature.dat, variable ~ feature, value.var="value")
colnames(summary.dat)[1] <- "sample"
SummaryFile <- file.path(opt$outprefix)  # AVI -p flag redefined as summary output
write.table(summary.dat,file=SummaryFile,sep="\t",row.names=F,col.names=T,quote=F)  # AVI

################################################
################################################
## PLOTS                                      ##
################################################
################################################

#PlotFile <- file.path(opt$outdir,paste(opt$outprefix,".plots.pdf",sep=""))  # AVI commented out
pdf(opt$outdir,height=6,width=3*length(HomerFiles))  # AVI -o flag redefined as plot output

## FEATURE COUNT STACKED BARPLOT
plot  <- ggplot(plot.feature.dat, aes(x=variable, y=value, group=feature)) +
         geom_bar(stat="identity", position = "fill", aes(colour=feature,fill=feature), alpha = 0.3) +
         xlab("") +
         ylab("% Feature") +
         ggtitle("Peak Location Relative to Annotation") +
         scale_y_continuous(labels = percent_format()) +
         theme(panel.grid.major = element_blank(),
               panel.grid.minor = element_blank(),
               panel.background = element_blank(),
               axis.text.y = element_text(colour="black"),
               axis.text.x= element_text(colour="black",face="bold"),
               axis.line.x = element_line(size = 1, colour = "black", linetype = "solid"),
               axis.line.y = element_line(size = 1, colour = "black", linetype = "solid"))
print(plot)

## DISTANCE TO CLOSEST GENE ACROSS ALL PEAKS STACKED BARPLOT
plot  <- ggplot(plot.dist.dat, aes(x=variable, y=value, group=distance)) +
         geom_bar(stat="identity", position = "fill", aes(colour=distance,fill=distance), alpha = 0.3) +
         xlab("") +
         ylab("% Unique genes to closest peak") +
         ggtitle("Distance of Closest Peak to Gene") +
         scale_y_continuous(labels = percent_format()) +
         theme(panel.grid.major = element_blank(),
               panel.grid.minor = element_blank(),
               panel.background = element_blank(),
               axis.text.y = element_text(colour="black"),
               axis.text.x= element_text(colour="black",face="bold"),
               axis.line.x = element_line(size = 1, colour = "black", linetype = "solid"),
               axis.line.y = element_line(size = 1, colour = "black", linetype = "solid"))
print(plot)

## VIOLIN PLOT OF PEAK DISTANCE TO TSS
plot  <- ggplot(plot.dat, aes(x=name, y=Distance.to.TSS)) +
         geom_violin(aes(colour=name,fill=name), alpha = 0.3) +
         geom_boxplot(width=0.1) +
         xlab("") +
         ylab(expression(log[10]*" distance to TSS")) +
         ggtitle("Peak Distribution Relative to TSS") +
         scale_y_continuous(trans='log10',breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x))) +
         theme(legend.position="none",
               panel.grid.major = element_blank(),
               panel.grid.minor = element_blank(),
               panel.background = element_blank(),
               axis.text.y = element_text(colour="black"),
               axis.text.x= element_text(colour="black",face="bold"),
               axis.line.x = element_line(size = 1, colour = "black", linetype = "solid"),
               axis.line.y = element_line(size = 1, colour = "black", linetype = "solid"))
print(plot)
dev.off()
library(optparse)
library(ggplot2)
library(reshape2)
library(scales)

################################################
################################################
## PARSE COMMAND-LINE PARAMETERS              ##
################################################
################################################

option_list <- list(make_option(c("-i", "--peak_files"), type="character", default=NULL, help="Comma-separated list of peak files.", metavar="path"),
                    make_option(c("-s", "--sample_ids"), type="character", default=NULL, help="Comma-separated list of sample ids associated with peak files. Must be unique and in same order as peaks files input.", metavar="string"),
                    make_option(c("-o", "--outdir"), type="character", default='./', help="Output directory", metavar="path"),
                    make_option(c("-p", "--outprefix"), type="character", default='macs2_peakqc', help="Output prefix", metavar="string"))

opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)

if (is.null(opt$peak_files)){
    print_help(opt_parser)
    stop("At least one peak file must be supplied", call.=FALSE)
}
if (is.null(opt$sample_ids)){
    print_help(opt_parser)
    stop("Please provide sample ids associated with peak files.", call.=FALSE)
}

if (file.exists(opt$outdir) == FALSE) {
    dir.create(dirname(opt$outdir),recursive=TRUE)  # AVI: -o now names the plot file, so create its parent directory
}

PeakFiles <- unlist(strsplit(opt$peak_files,","))
SampleIDs <- unlist(strsplit(opt$sample_ids,","))
if (length(PeakFiles) != length(SampleIDs)) {
    print_help(opt_parser)
    stop("Number of sample ids must equal number of homer annotated files.", call.=FALSE)
}

################################################
################################################
## READ IN DATA                               ##
################################################
################################################

plot.dat <- data.frame()
summary.dat <- data.frame()
for (idx in 1:length(PeakFiles)) {
    sampleid = SampleIDs[idx]
    isNarrow <- FALSE
    header <- c("chrom","start","end","name","pileup", "strand", "fold", "-log10(pvalue)","-log10(qvalue)")
    fsplit <- unlist(strsplit(basename(PeakFiles[idx]), split='.',fixed=TRUE))
    if (fsplit[length(fsplit)] == 'narrowPeak') {
        isNarrow <- TRUE
        header <- c(header,"summit")
    }
    peaks <- read.table(PeakFiles[idx], sep="\t", header=FALSE)
    colnames(peaks) <- header

    ## GET SUMMARY STATISTICS
    peaks.dat <- peaks[,c('fold','-log10(qvalue)','-log10(pvalue)')]
    peaks.dat$length <- (peaks$end - peaks$start)
    for (cname in colnames(peaks.dat)) {
        sdat <- summary(peaks.dat[,cname])
        sdat["num_peaks"] <- nrow(peaks.dat)
        sdat["measure"] <- cname
        sdat["sample"] <- sampleid
        sdat <- t(data.frame(x=matrix(sdat),row.names=names(sdat)))
        summary.dat <- rbind(summary.dat,sdat)
    }
    colnames(peaks.dat) <- c('fold','fdr','pvalue','length')
    peaks.dat$name <- rep(sampleid,nrow(peaks.dat))
    plot.dat <- rbind(plot.dat,peaks.dat)
}
plot.dat$name <- factor(plot.dat$name, levels=sort(unique(as.character(plot.dat$name))))

SummaryFile <- file.path(opt$outprefix)  # AVI -p flag redefined as summary output
write.table(summary.dat,file=SummaryFile,quote=FALSE,sep="\t",row.names=FALSE,col.names=TRUE)

################################################
################################################
## PLOTS                                      ##
################################################
################################################

## RETURNS VIOLIN PLOT OBJECT
violin.plot <- function(plot.dat,x,y,ylab,title,log) {

    plot  <- ggplot(plot.dat, aes_string(x=x, y=y)) +
             geom_violin(aes_string(colour=x,fill=x), alpha = 0.3) +
             geom_boxplot(width=0.1) +
             xlab("") +
             ylab(ylab) +
             ggtitle(title) +
             theme(legend.position="none",
                   panel.grid.major = element_blank(),
                   panel.grid.minor = element_blank(),
                   panel.background = element_blank(),
                   axis.text.y = element_text(colour="black"),
                   axis.text.x= element_text(colour="black",face="bold"),
                   axis.line.x = element_line(size = 1, colour = "black", linetype = "solid"),
                   axis.line.y = element_line(size = 1, colour = "black", linetype = "solid"))
    if (log == 10) {
        plot <- plot + scale_y_continuous(trans='log10',breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
    }
    if (log == 2) {
        plot <- plot + scale_y_continuous(trans='log2',breaks = trans_breaks("log2", function(x) 2^x), labels = trans_format("log2", math_format(2^.x)))
    }
    return(plot)
}

############################

#PlotFile <- file.path(opt$outdir,paste(opt$outprefix,".plots.pdf",sep="")) # AVI commented out
pdf(opt$outdir,height=6,width=3*length(unique(plot.dat$name)))  # AVI -o flag redefined as plot output

## PEAK COUNT PLOT
peak.count.dat <- as.data.frame(table(plot.dat$name))
colnames(peak.count.dat) <- c("name","count")
plot  <- ggplot(peak.count.dat, aes(x=name, y=count)) +
         geom_bar(stat="identity",aes(colour=name,fill=name), position = "dodge", width = 0.8, alpha = 0.3) +
         xlab("") +
         ylab("Number of peaks") +
         ggtitle("Peak count") +
         theme(legend.position="none",
               panel.grid.major = element_blank(),
               panel.grid.minor = element_blank(),
               panel.background = element_blank(),
               axis.text.y = element_text(colour="black"),
               axis.text.x= element_text(colour="black",face="bold"),
               axis.line.x = element_line(size = 1, colour = "black", linetype = "solid"),
               axis.line.y = element_line(size = 1, colour = "black", linetype = "solid")) +
         geom_text(aes(label = count, x = name, y = count), position = position_dodge(width = 0.8), vjust = -0.6)
print(plot)

## VIOLIN PLOTS
print(violin.plot(plot.dat=plot.dat,x="name",y="length",ylab=expression(log[10]*" peak length"),title="Peak length distribution",log=10))
print(violin.plot(plot.dat=plot.dat,x="name",y="fold",ylab=expression(log[2]*" fold-enrichment"),title="Fold-change distribution",log=2))
print(violin.plot(plot.dat=plot.dat,x="name",y="fdr",ylab=expression(-log[10]*" qvalue"),title="FDR distribution",log=-1))
print(violin.plot(plot.dat=plot.dat,x="name",y="pvalue",ylab=expression(-log[10]*" pvalue"),title="Pvalue distribution",log=-1))

dev.off()
library(optparse)
library(UpSetR)

################################################
################################################
## PARSE COMMAND-LINE PARAMETERS              ##
################################################
################################################

option_list <- list(make_option(c("-i", "--input_file"), type="character", default=NULL, help="Path to tab-delimited file containing two columns i.e sample1&sample2&sample3 indicating intersect between samples <TAB> set size.", metavar="path"),
                    make_option(c("-o", "--output_file"), type="character", default=NULL, help="Path to output file with '.pdf' extension.", metavar="path"))

opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)

if (is.null(opt$input_file)){
    print_help(opt_parser)
    stop("Input file must be supplied.", call.=FALSE)
}
if (is.null(opt$output_file)){
    print_help(opt_parser)
    stop("Output pdf file must be supplied.", call.=FALSE)
}

OutDir <- dirname(opt$output_file)
if (file.exists(OutDir) == FALSE) {
    dir.create(OutDir,recursive=TRUE)
}

################################################
################################################
## PLOT DATA                                  ##
################################################
################################################

comb.dat <- read.table(opt$input_file,sep="\t",header=FALSE)
comb.vec <- comb.dat[,2]
comb.vec <- setNames(comb.vec,comb.dat[,1])
sets <- sort(unique(unlist(strsplit(names(comb.vec),split='&'))), decreasing = TRUE)

nintersects = length(names(comb.vec))
if (nintersects > 70) {
    nintersects <- 70
    comb.vec <- sort(comb.vec, decreasing = TRUE)[1:70]
    sets <- sort(unique(unlist(strsplit(names(comb.vec),split='&'))), decreasing = TRUE)
}

pdf(opt$output_file,onefile=F,height=10,width=20)

upset(
    fromExpression(comb.vec),
    nsets = length(sets),
    nintersects = nintersects,
    sets = sets,
    keep.order = TRUE,
    sets.bar.color = "#56B4E9",
    point.size = 3,
    line.size = 1,
    mb.ratio = c(0.55, 0.45),
    order.by = "freq",
    number.angles = 30,
    text.scale = c(1.5, 1.5, 1.5, 1.5, 1.5, 1.2)
)

dev.off()
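
The expected input is the *.intersect.txt written by macs2_merged_expand above: one row per sample combination, with '&'-joined sample names, a tab, and the set size. A toy example of such a file, written with Python for illustration (contents are made up):

# Toy intersect file for the UpSet plot; rows and counts are made up.
rows = [("WT_IP_R1&WT_IP_R2", 812),   # peaks found in both replicates
        ("WT_IP_R1", 145),            # peaks private to replicate 1
        ("WT_IP_R2", 98)]
with open("consensus/boolean.intersect.txt", "w") as f:   # hypothetical path
    for samples, size in rows:
        f.write("{}\t{}\n".format(samples, size))
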
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("tidyverse")

data <- lapply(snakemake@input, read.table, header=F, stringsAsFactors = F)
counts <- as_tibble(do.call(rbind, data))
names(counts) <- c("sample_control", "count")

peaks_counts <- ggplot(counts, aes(x = count, y = sample_control, fill=sample_control)) +
  geom_bar(stat="Identity", color="black") +
  theme_minimal() +
  labs(x="Peak count", y="") +
  theme(legend.position = "right") +
  guides(fill=guide_legend("samples with controls")) +
  ggtitle("Total peak count")

ggsave(snakemake@output[[1]], peaks_counts)
import os
import errno  # used by makedir()'s fallback error handling below
import pysam
import argparse

############################################
############################################
## PARSE ARGUMENTS
############################################
############################################

Description = 'Remove singleton reads from paired-end BAM file i.e if read1 is present in BAM file without read 2 and vice versa.'
Epilog = """Example usage: bampe_rm_orphan.py <BAM_INPUT_FILE> <BAM_OUTPUT_FILE>"""

argParser = argparse.ArgumentParser(description=Description, epilog=Epilog)

## REQUIRED PARAMETERS
argParser.add_argument('BAM_INPUT_FILE', help="Input BAM file sorted by name.")
argParser.add_argument('BAM_OUTPUT_FILE', help="Output BAM file sorted by name.")

## OPTIONAL PARAMETERS
argParser.add_argument('-fr', '--only_fr_pairs', dest="ONLY_FR_PAIRS", help="Only keeps pairs that are in FR orientation on same chromosome.",action='store_true')
args = argParser.parse_args()

############################################
############################################
## HELPER FUNCTIONS
############################################
############################################

def makedir(path):

    if not len(path) == 0:
        try:
            #!# AVI: changed because of race conditions if directory exists, original code:  os.makedirs(path)
            os.makedirs(path, exist_ok=True)
        except OSError as exception:
            if exception.errno != errno.EEXIST:
                raise

############################################
############################################
## MAIN FUNCTION
############################################
############################################

def bampe_rm_orphan(BAMIn,BAMOut,onlyFRPairs=False):

    ## SETUP DIRECTORY/FILE STRUCTURE
    OutDir = os.path.dirname(BAMOut)
    makedir(OutDir)

    ## COUNT VARIABLES
    totalReads = 0
    totalOutputPairs = 0
    totalSingletons = 0
    totalImproperPairs = 0

    ## ITERATE THROUGH BAM FILE
    EOF = 0
    SAMFin = pysam.AlignmentFile(BAMIn,"rb")  #!# AVI: changed to new API from pysam.Samfile
    SAMFout = pysam.AlignmentFile(BAMOut, "wb",header=SAMFin.header)   #!# AVI: changed to new API from pysam.Samfile
    currRead = next(SAMFin)     #!# AVI: adapted for the use of the iterator, original code: currRead = SAMFin.next()

    for read in SAMFin.fetch(until_eof=True): #!# AVI: added .fetch() to explicitly use new API
        totalReads += 1
        if currRead.query_name == read.query_name:  # query_name is the current pysam spelling of qname
            pair1 = currRead
            pair2 = read

            ## FILTER FOR READS ON SAME CHROMOSOME IN FR ORIENTATION
            if onlyFRPairs:
                if pair1.tid == pair2.tid:

                    ## READ1 FORWARD AND READ2 REVERSE STRAND
                    if not pair1.is_reverse and pair2.is_reverse:
                        if pair1.reference_start <= pair2.reference_start:
                            totalOutputPairs += 1
                            SAMFout.write(pair1)
                            SAMFout.write(pair2)
                        else:
                            totalImproperPairs += 1

                    ## READ1 REVERSE AND READ2 FORWARD STRAND
                    elif pair1.is_reverse and not pair2.is_reverse:
                        if pair2.reference_start <= pair1.reference_start:
                            totalOutputPairs += 1
                            SAMFout.write(pair1)
                            SAMFout.write(pair2)
                        else:
                            totalImproperPairs += 1

                    else:
                        totalImproperPairs += 1
                else:
                    totalImproperPairs += 1
            else:
                totalOutputPairs += 1
                SAMFout.write(pair1)
                SAMFout.write(pair2)

            ## RESET COUNTER
            try:
                totalReads += 1
                currRead = next(SAMFin)   #!# AVI: adapted for the use of the iterator, original code: currRead = SAMFin.next()
            except StopIteration:
                EOF = 1

        ## READS WHERE ONLY ONE OF A PAIR IS IN FILE
        else:
            totalSingletons += 1
            pair1 = currRead
            currRead = read

    if not EOF:
        totalReads += 1
        totalSingletons += 1
        pair1 = currRead

    ## CLOSE ALL FILE HANDLES
    SAMFin.close()
    SAMFout.close()

    LogFile = os.path.join(OutDir,'%s_bampe_rm_orphan.log' % (os.path.basename(BAMOut[:-4])))
    SamLogFile = open(LogFile,'w')
    SamLogFile.write('\n##############################\n')
    SamLogFile.write('FILES/DIRECTORIES')
    SamLogFile.write('\n##############################\n\n')
    SamLogFile.write('Input File: ' + BAMIn + '\n')
    SamLogFile.write('Output File: ' + BAMOut + '\n')
    SamLogFile.write('\n##############################\n')
    SamLogFile.write('OVERALL COUNTS')
    SamLogFile.write('\n##############################\n\n')
    SamLogFile.write('Total Input Reads = ' + str(totalReads) + '\n')
    SamLogFile.write('Total Output Pairs = ' + str(totalOutputPairs) + '\n')
    SamLogFile.write('Total Singletons Excluded = ' + str(totalSingletons) + '\n')
    SamLogFile.write('Total Improper Pairs Excluded = ' + str(totalImproperPairs) + '\n')
    SamLogFile.write('\n##############################\n')
    SamLogFile.close()

############################################
############################################
## RUN FUNCTION
############################################
############################################

bampe_rm_orphan(BAMIn=args.BAM_INPUT_FILE,BAMOut=args.BAM_OUTPUT_FILE,onlyFRPairs=args.ONLY_FR_PAIRS)
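
As the argument help notes, the script assumes a name-sorted BAM so that mates sit on adjacent records. A hedged preparation sketch using pysam's samtools wrapper (file names are illustrative):

# bampe_rm_orphan expects mates on adjacent records, i.e. a name-sorted BAM.
import pysam
pysam.sort("-n", "-o", "sample.nsorted.bam", "sample.bam")   # illustrative names
# then: python bampe_rm_orphan.py sample.nsorted.bam sample.clean.bam --only_fr_pairs
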
URL: https://github.com/snakemake-workflows/chipseq
Name: chipseq
Version: 5
License: MIT License