Snakemake Workflow For WGBS Data

public 1yr ago 0 bookmarks

View Workflow

biscuit_snakemake_workflow — View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

This Snakemake workflow takes paired-end whole-genome bisulfite sequencing (WGBS) data and processes it using BISulfite-seq CUI Toolkit (BISCUIT).

BISCUIT was written to perform alignment, DNA methylation and mutation calling, and allele specific methylation from bisulfite sequencing data (https://huishenlab.github.io/biscuit/).

Download BISCUIT here: https://github.com/huishenlab/biscuit/releases/latest.

Components of the workflow

The following components are generally in order, but may run in a different order, depending on exact dependencies needed.

[default off] Generate asset files used during QC related rules
[default off] Modify and index genome reference to including methylation controls
[default off] Trim adapters and/or hard clip R2
[default off] Run Fastq Screen in bisulfite mode
Run FastQC on raw FASTQ files
Alignment, duplicate tagging, indexing, flagstat of input data (biscuitBlaster v1 and v2)
Methylation information extraction (BED Format)
Merge C and G beta values in CpG dinucleotide context
[default off] SNP and Epiread extraction
[default off] Run Preseq on aligned BAM
MultiQC with BICUIT QC modules specifically for methyaltion data
[default off] Generate plots of the observed / expected coverage ratio for different genomic features
[default off] Generate percentage of covered CpGs and CpG island coverage figures
[default off] QC methylated and unmethylated controls

Many options can be easily specified in the config.yaml ! Otherwise, the commands in the Snakefile can also be modified to meet different needs.

Dependencies

The following dependencies are downloaded when running with --use-conda , otherwise you must have these in your PATH.

snakemake (version 6.0+)
biscuit (version 1.0.1+)
htslib (version 1.12+)
samtools (version 1.12+)
samblaster
parallel (preferably version 20201122)
bedtools
preseq (version 3.1.2+, must be compiled with htslib enabled)
fastqc
trim_galore
fastq_screen (only required if running fastq_screen )
bismark (only required if running fastq_screen )
pigz
python (version 3.7+)
- pandas
- numpy
- matplotlib
multiqc
R (version 4.1.1+)
- tidyverse (only required for plotting methylation controls)
- ggplot2 (only required for plotting methylation controls)
- patchwork (only required for plotting methylation controls)
- viridislite (only required for plotting methylation controls)

Two things of note, 1) it is easiest when working with snakemake to install mamba using conda when running with --use-conda , and 2) it is preferable to install snakemake using conda , rather than using a module. This is due to potential conflicts between packages (such as matplotlib ) that can be found in the snakemake module's python distrubtion and the conda installed python distribution.

Running the workflow

Clone the repo (https://github.com/huishenlab/Biscuit_Snakemake_Workflow/tree/master).
- git clone [email protected]:huishenlab/Biscuit_Snakemake_Workflow.git (SSH)
- git clone https://github.com/huishenlab/Biscuit_Snakemake_Workflow.git (HTTPS)
Place gzipped FASTQ files into raw_data/ . Alternatively, you can specify the location of your gzipped FASTQ files in config/config.yaml .
Replace the example config/samples.tsv with your own sample sheet containing:
- A row for each sample
- The following three columns for each row:
  - A. sample
  - B. fq1 (name of R1 file for sample in your raw data directory)
  - C. fq2 (name of R2 file for sample in your raw data directory)
  - D. Any other columns included are ignored Note, you can either edit config/samples.tsv in place or specify the path to your sample sheet in config/config.yaml . If you create your own sample sheet, make sure to include the header line as is seen in the example file.
Modify the config.yaml to specify the appropriate
- Reference genome
- Biscuit index
- Biscuit QC assets (https://github.com/huishenlab/biscuit/releases/latest)
- Toggle optional workflow components
- Set other run parameters in config/config.yaml
- Turn on optional rules in config/config.yaml (change from False to True)
- If you are using environmental modules on your system, you can set the locations in the corresponding location. By default, the pipeline will use conda / mamba to download the required packages. Note, if using the modules and a module is not available, snakemake gives a warning but will run successfully as long as the required executables are in the path .
Then submit the rest workflow to an HPC using something similar to bin/run_snakemake_workflow.sh (e.g., qsub -q [queue_name] bin/run_snakemake_workflow.sh ). bin/run_snakemake_workflow.sh works for a PBS/Torque queue system, but will need to be modifed to work with a Slurm or other system.

After the workflow

The output files in config['output_directory']/analysis/pileup/ may be imported into a BSseq object using bicuiteer::readBiscuit() .
config['output_directory']/analysis/multiqc/multiqc_report.html contains the methylation-specific BISCUIT QC modules (https://huishenlab.github.io/biscuit/docs/alignment/QC.html)

Test dataset

This workflow comes with a working example dataset. To test the smakemake workflow on your system, place the 10 FASTQ files in bin/working_example_dataset into raw_data/ and use the default config/samples.tsv sample sheet. These example files can be mapped to the human genome.

Example default workflow - 1 sample

workflow diagram

Helpful snakemake commands for debugging a workflow

For more information on Snakemake: https://snakemake.readthedocs.io/en/stable/

Do a test run: snakemake -npr
Unlock directory after a manually aborted run: snakemake --unlock --cores 1
Create a workflow diagram for your run: snakemake --dag | dot -Tpng > my_dag.png
Run pipeline from the command line: snakemake --use-conda --cores 1

Code Snippets

shell:
    """
    set -euo pipefail

    bins=({params.bins})
    outfiles=({params.outfile_no_gz})

    for i in "${{!bins[@]}}"; do 
        bash workflow/scripts/find_binned_average.sh \
             {input.reference}.fai \
             ${{bins[$i]}} \
             {params.covFilter} \
             {input.bed} \
             ${{outfiles[$i]}} # find_binned_average gzips the file, so different than {{output.outfile}}
    done

    """   

SnakeMake From line 38 of rules/binned_average.smk

shell:
    """
    mkdir -p {output.refdir}

    if (file {input} | grep -q "extra field"); then
        cat <(bgzip -d {input}) <(zcat bin/puc19.fa.gz) <(zcat bin/lambda.fa.gz) | bgzip > {output.ref}
    elif (file {input} | grep -q "gzip compressed data, was"); then
        cat <(zcat {input}) <(zcat bin/puc19.fa.gz) <(zcat bin/lambda.fa.gz) | bgzip > {output.ref}
    else
        cat {input} <(zcat bin/puc19.fa.gz) <(zcat bin/lambda.fa.gz) | bgzip > {output.ref}
    fi

    biscuit index {output.ref}
    samtools faidx {output.ref}
    """

SnakeMake SAMtools From line 31 of rules/biscuit.smk

shell:
    """
    if [ {params.biscuit_version} == "v2" ]; then
        echo "biscuit blaster v2" 2> {log.biscuit_blaster_version}

        # biscuitBlaster pipeline
        biscuit align -@ {params.bb_threads} -b {params.lib_type} \
            -R '@RG\tLB:{params.LB}\tID:{params.ID}\tPL:{params.PL}\tPU:{params.PU}\tSM:{params.SM}' \
            {input.reference} <(zcat {input.R1}) <(zcat {input.R2}) 2> {log.biscuit} | \
        samblaster --addMateTags -d {params.disc} -s {params.split} -u {params.unmapped} 2> {log.samblaster} | \
        samtools sort -@ {params.st_threads} -m 5G -o {output.bam} -O BAM - 2> {log.samtools_sort}
        samtools index -@ {params.st_threads} {output.bam} 2> {log.samtools_index}

        # Get some initial stats
        samtools flagstat {output.bam} 1> {output.flagstat} 2> {log.samtools_flagstat}

        # Sort, compress, and index discordant read file
        samtools sort -@ {params.st_threads} -o {output.disc} -O BAM {params.disc} 2> {log.sort_disc}
        samtools index -@ {params.st_threads} {output.disc} 2> {log.index_disc}

        # Sort, compress, and index split read file
        samtools sort -@ {params.st_threads} -o {output.split} -O BAM {params.split} 2> {log.sort_split}
        samtools index -@ {params.st_threads} {output.split} 2> {log.index_split}

        # Compress unmapped/clipped FASTQ
        bgzip -@ {params.st_threads} {params.unmapped} 2> {log.bgzip_unmapped}

        # Clean up
        rm {params.disc}
        rm {params.split}
    elif [ {params.biscuit_version} == "v1" ]; then
        echo "biscuit blaster v1" 2> {log.biscuit_blaster_version}

        # biscuitBlaster pipeline
        biscuit align -@ {params.bb_threads} -b {params.lib_type} \
            -R '@RG\tLB:{params.LB}\tID:{params.ID}\tPL:{params.PL}\tPU:{params.PU}\tSM:{params.SM}' \
            {input.reference} <(zcat {input.R1}) <(zcat {input.R2}) 2> {log.biscuit} | \
        samblaster --addMateTags 2> {log.samblaster} | \
        samtools sort -@ {params.st_threads} -m 5G -o {output.bam} -O BAM - 2> {log.samtools_sort}
        samtools index -@ {params.st_threads} {output.bam} 2> {log.samtools_index}

        # Get some initial stats
        samtools flagstat {output.bam} 1> {output.flagstat} 2> {log.samtools_flagstat}

        # Add files to ensure smooth output transition
        touch {output.disc}
        touch {output.disc_bai}
        touch {output.split}
        touch {output.split_bai}
        touch {output.unmapped}
    else
        echo "biscuit: biscuit_blaster_version must be v1 or v2 in config/config.yaml" 2> {log.biscuit_blaster_version}
    fi
    """

SnakeMake SAMtools SAMBLASTER From line 129 of rules/biscuit.smk

shell:
    """
    if [ {params.nome} == "True" ]; then
        biscuit pileup -N -@ {threads} -o {params.vcf} {input.ref} {input.bam} 2> {log.pileup}
    else
        biscuit pileup -@ {threads} -o {params.vcf} {input.ref} {input.bam} 2> {log.pileup}
    fi

    bgzip {params.vcf} 2> {log.vcf_gz}
    tabix -p vcf {output.vcf_gz} 2> {log.vcf_tbi}

    biscuit vcf2bed -t cg {output.vcf_gz} 1> {params.bed} 2> {log.vcf2bed}
    bgzip {params.bed} 2> {log.bed_gz}
    tabix -p bed {output.bed_gz} 2> {log.bed_tbi}
    """

SnakeMake tabix From line 218 of rules/biscuit.smk

shell:
    """
    if [ {params.nome} == "True" ]; then
        biscuit mergecg -N {input.ref} {input.bed} 1> {params.mergecg} 2> {log.mergecg}
    else
        biscuit mergecg {input.ref} {input.bed} 1> {params.mergecg} 2> {log.mergecg}
    fi

    bgzip {params.mergecg} 2> {log.mergecg_gz}
    tabix -p bed {output.mergecg_gz} 2> {log.mergecg_tbi}
    """

SnakeMake tabix From line 261 of rules/biscuit.smk

shell:
    """
    biscuit vcf2bed -t snp {input.vcf_gz} > {params.snp_bed} 2> {log}
    bgzip {params.snp_bed}
    tabix -p bed {output.snp_bed_gz}
    """

SnakeMake tabix From line 295 of rules/biscuit.smk

shell:
    """
    if [[ "$(zcat {input.snps} | head -n 1 | wc -l)" == "1" ]]; then
        if [ {params.nome} == "True" ]; then
            biscuit epiread -N -@ {threads} -B <(zcat {input.snps}) {input.ref} {input.bam} | sort -k1,1 -k2,2n > {params.epibed} 2> {log}
        else
            biscuit epiread -@ {threads} -B <(zcat {input.snps}) {input.ref} {input.bam} | sort -k1,1 -k2,2n > {params.epibed} 2> {log}
        fi
    else
        if [ {params.nome} == "True" ]; then
            biscuit epiread -N {input.ref} {input.bam} | sort -k1,1 -k2,2n > {params.epibed} 2> {log}
        else
            biscuit epiread {input.ref} {input.bam} | sort -k1,1 -k2,2n > {params.epibed} 2> {log}
        fi
    fi

    bgzip {params.epibed}
    tabix -p bed {output.epibed_gz}
    """

SnakeMake tabix From line 327 of rules/biscuit.smk

shell:
    """
    set +o pipefail
    mkdir -p {params.outdir}

    bedtools genomecov -bga -split -ibam {input.bam} | bedtools map -a {params.bins} -b - -c 4 | gzip > {output.uni}
    """

SnakeMake BEDTools From line 30 of rules/covg_uniformity.smk

script:
    '../scripts/plot_covg_uniformity.py'

SnakeMake From line 64 of rules/covg_uniformity.smk

shell:
    """
    set +o pipefail
    mkdir -p {params.outdir}

    # Unfiltered coverage
    bedtools genomecov -bga -split -ibam {input.bam} | \
    LC_ALL=C sort --parallel={threads} -k1,1 -k2,2n -T {params.outdir} | \
    bedtools intersect -a {params.cpg} -b - -wo -sorted | \
    bedtools groupby -g 1-3 -c 7 -o min | \
    gzip -c > {output.unf}

    # MAPQ >= 40 filtered coverage
    samtools view -q 40 -b {input.bam} | \
    bedtools genomecov -bga -split -ibam stdin | \
    LC_ALL=C sort --parallel={threads} -k1,1 -k2,2n -T {params.outdir} | \
    bedtools intersect -a {params.cpg} -b - -wo -sorted | \
    bedtools groupby -g 1-3 -c 7 -o min | \
    gzip -c > {output.fil}
    """

SnakeMake SAMtools BEDTools From line 31 of rules/cpg_stats.smk

shell:
    """
    echo -e "Feature\tCpG_Count\tQ40_Reads\tAll_Reads" > {output}

    # All CpGs
    zcat {input.unf} | wc -l | awk '{{ printf("AllCpGs\\t%s", $1) }}' >> {output}
    zcat {input.fil} | awk '$4>0{{ a += 1 }} END{{ printf("\\t%d", a) }}' >> {output}
    zcat {input.unf} | awk '$4>0{{ a += 1 }} END{{ printf("\\t%d\\n", a) }}' >> {output}

    # CpG Island CpGs
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.cgi}) -sorted | wc -l | \
    awk '{{ printf("CGICpGs\\t%s", $1) }}' >> {output}
    bedtools intersect -a {input.fil} -b <(bedtools merge -i {params.cgi}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d", a) }}' >> {output}
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.cgi}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d\\n", a) }}' >> {output}

    # Exonic CpGs
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.exn}) -sorted | wc -l | \
    awk '{{ printf("ExonicCpGs\\t%s", $1) }}' >> {output}
    bedtools intersect -a {input.fil} -b <(bedtools merge -i {params.exn}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d", a) }}' >> {output}
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.exn}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d\\n", a) }}' >> {output}

    # Genic CpGs
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.gen}) -sorted | wc -l | \
    awk '{{ printf("GenicCpGs\\t%s", $1) }}' >> {output}
    bedtools intersect -a {input.fil} -b <(bedtools merge -i {params.gen}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d", a) }}' >> {output}
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.gen}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d\\n", a) }}' >> {output}

    # Repeat-masked CpGs
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.msk}) -sorted | wc -l | \
    awk '{{ printf("RepeatCpGs\\t%s", $1) }}' >> {output}
    bedtools intersect -a {input.fil} -b <(bedtools merge -i {params.msk}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d", a) }}' >> {output}
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.msk}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d\\n", a) }}' >> {output}
    """

SnakeMake BEDTools From line 75 of rules/cpg_stats.smk

shell:
    """
    # Total number of CpG islands
    zcat {params.cgi} | wc -l | awk '{{ printf("n_cpg_islands\\t%s\\n", $1) }}' > {output}

    # Number of CpG islands with at least one read with MAPQ>=40 covering 1+, 3+, 5+, and 7+ CpGs in that CpG island
    bedtools intersect -a {input.fil} -b <(bedtools merge -i {params.cgi}) -sorted -wo | \
    awk '$4>0 {{ print $5":"$6"-"$7 }}' | uniq -c | \
    awk '{{ if ($1 >= 1) {{ a += 1; }} if ($1 >= 3) {{ b += 1; }} if ($1 >= 5) {{ c += 1; }} if ($1 >= 7) {{ d += 1; }} }} END {{ printf("%d %d %d %d", a, b, c, d) }}' | \
    awk '{{ printf("one_cpg\\t%s\\nthree_cpgs\\t%s\\nfive_cpgs\\t%s\\nseven_cpgs\\t%s\\n", $1, $2, $3, $4) }}' >> {output}
    """

SnakeMake BEDTools From line 137 of rules/cpg_stats.smk

script:
    '../scripts/plot_cpg_stats_features.py'

SnakeMake From line 177 of rules/cpg_stats.smk

script:
    '../scripts/plot_cpg_stats_cgi.py'

SnakeMake From line 204 of rules/cpg_stats.smk

shell:
    """
    set +o pipefail;

    echo "Create output directory" 1> {log}
    mkdir -p {params.dir}

    echo "Set reference location" 1> {log}
    # Reference location
    REFLOC={input.ref}.fai

    echo "Retrieve CpG island file" 1> {log}
    # CpG Islands from UCSC on only the canonical chromosomes
    wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/{params.gen}/database/cpgIslandExt.txt.gz | \
    gunzip -c | \
    awk 'BEGIN{{ OFS="\\t"; }}{{ print $2, $3, $4; }}' | \
    awk '{{ if ($1 ~ /^chr[1234567890XYM]{{1,2}}$/) {{ print }} }}' | \
    bedtools sort -i - | \
    gzip -c > {output.cgi}
    echo "CpG island file retrieved" 1> {log}

    # Create middles for finding locations
    bedtools slop \
        -i {output.cgi} \
        -g ${{REFLOC}} \
        -b 2000 | \
    bedtools merge -i - \
    > {params.tmp_sho}

    bedtools slop \
        -i {output.cgi} \
        -g ${{REFLOC}} \
        -b 4000 | \
    bedtools merge -i - \
    > {params.tmp_shl}

    # CpG Open Seas (intervening locations)
    sort -k1,1 -k2,2n ${{REFLOC}} | \
    bedtools complement -L -i {params.tmp_shl} -g - | \
    gzip -c > {output.opn}

    # CpG Shelves (shores +/- 2kb)
    bedtools subtract \
        -a {params.tmp_shl} \
        -b {params.tmp_sho} | \
    gzip -c > {output.shl}

    # CpG Shores (island +/- 2kb)
    bedtools subtract \
        -a {params.tmp_sho} \
        -b {output.cgi} | \
    gzip -c > {output.sho}

    # Clean up CpG temporary files
    rm -f {params.tmp_shl} {params.tmp_sho}
    echo "CpG seascape created and cleaned up" 1> {log}

    # BISCUIT assets (includes CpG and top/bottom 10% GC-content files)
    build_biscuit_QC_assets.pl --outdir {params.dir} --ref {input.ref} 1> {log}

    # Repeat-masked file
    if [[ {params.gen} != "mm9" ]]; then
        wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/{params.gen}/database/rmsk.txt.gz | \
        gunzip -c | \
        awk 'BEGIN{{ OFS="\\t"; }}{{ print $6, $7, $8; }}' | \
        awk '{{ if ($1 ~ /^chr[1234567890XYM]{{1,2}}$/) {{ print }} }}' | \
        bedtools sort -i - | \
        bedtools merge -i - | \
        gzip -c > {output.msk}
    else
        wget -q -nd -r -nH -np --accept-regex 'chr[1234567890XYM]*_rmsk.txt.gz$' \
            --regex-type=posix http://hgdownload.cse.ucsc.edu/goldenpath/mm9/database/

        cat chr*_rmsk.txt.gz | \
        gunzip -c | \
        awk 'BEGIN{{ OFS="\\t"; }}{{ print $6, $7, $8; }}' | \
        bedtools sort -i - | \
        bedtools merge -i - | \
        gzip -c > {output.msk}

        if [[ -f index.html ]]; then rm index.html; fi
        if [[ -f robots.txt ]]; then rm robots.txt; fi
        rm chr*_rmsk.txt.gz
    fi

    # Genic, intergenic, and exon files
    wget -qO- {params.annt} | \
    gunzip -c | \
    awk 'BEGIN{{ OFS="\\t"; }}{{ if ($3 == "gene") {{ print $1, $4, $5 > "{params.tmp_gen}"; }} else if ($3 == "exon") {{ print $1, $4, $5 > "{params.tmp_exn}"; }} }}'

    bedtools sort -i {params.tmp_gen} | \
    bedtools merge -i - | \
    gzip -c > {output.gen}

    bedtools sort -i {params.tmp_exn} | \
    bedtools merge -i - | \
    gzip -c > {output.exn}

    sort -k1,1 -k2,2n {input.ref}.fai | \
    bedtools complement -L -i {output.gen} -g - | \
    gzip -c > {output.inr}

    # Clean up
    rm -f {params.tmp_gen} {params.tmp_exn}
    """       

SnakeMake BEDTools From line 67 of rules/create_assets.smk

shell:
    """
    set +o pipefail;

    #cd {params.dir}

    wget -P {params.dir} --no-check-certificate -q https://bismap.hoffmanlab.org/raw/{params.gen}/k100.bismap.bedgraph.gz

    # Create average mappability scores in 10kb windows
    zcat {output.bis} | sort -k1,1 -k2,2n | gzip > {output.sor}

    bedtools makewindows -w 10000 -g {input.ref}.fai | gzip 1> {output.byn} 2> {log}
    bedtools map -a {output.byn} -b {output.sor} -c 4 -o mean | gzip 1> {output.wgt} 2> {log}

    bedtools intersect -a {output.bis} -b {input.cpg} | gzip -c 1> {output.cpg} 2> {log}
    bedtools intersect -a {output.bis} -b {input.cgi} | gzip -c 1> {output.cgi} 2> {log}
    bedtools intersect -a {output.bis} -b {input.exn} | gzip -c 1> {output.exn} 2> {log}
    bedtools intersect -a {output.bis} -b {input.gen} | gzip -c 1> {output.gen} 2> {log}
    bedtools intersect -a {output.bis} -b {input.inr} | gzip -c 1> {output.inr} 2> {log}
    bedtools intersect -a {output.bis} -b {input.msk} | gzip -c 1> {output.msk} 2> {log}
    """

SnakeMake BEDTools From line 208 of rules/create_assets.smk

shell:
    """
    bash workflow/scripts/create_context_beds.sh \
        -t {threads} \
        -o {params.dir} \
        {input.ref} \
        {params.cpg}/{input.cpg} 2> {log}
    """

SnakeMake From line 255 of rules/create_assets.smk

shell:
    """
    set +o pipefail
    mkdir -p {params.outdir}

    # Find genomic coverage
    samtools view -hb -F 0x4 -q 40 {input.bam} | \
    bedtools genomecov -bg -ibam stdin | \
    gzip -c > {output.cov}
    """

SnakeMake SAMtools BEDTools From line 30 of rules/obs_exp.smk

shell:
    """
    set +o pipefail

    # Find intersections
    bedtools intersect -a {params.cpg} -b {input.cov} -wo | gzip -c > {output.cpg}
    """

SnakeMake BEDTools From line 60 of rules/obs_exp.smk

shell:
    """
    set +o pipefail

    # Find intersections
    bedtools intersect -a {params.cgi} -b {input.cov} -wo | gzip -c > {output.cgi}
    """

SnakeMake BEDTools From line 87 of rules/obs_exp.smk

shell:
    """
    set +o pipefail

    # Find intersections
    bedtools intersect -a {params.intergenic} -b {input.cov} -wo | gzip -c > {output.intergenic}
    """

SnakeMake BEDTools From line 114 of rules/obs_exp.smk

shell:
    """
    set +o pipefail

    # Find intersections
    bedtools intersect -a {params.exon} -b {input.cov} -wo | gzip -c > {output.exon}
    """

SnakeMake BEDTools From line 141 of rules/obs_exp.smk

shell:
    """
    set +o pipefail

    # Find intersections
    bedtools intersect -a {params.genic} -b {input.cov} -wo | gzip -c > {output.genic}
    """

SnakeMake BEDTools From line 168 of rules/obs_exp.smk

shell:
    """
    set +o pipefail

    # Find intersections
    bedtools intersect -a {params.rmsk} -b {input.cov} -wo | gzip -c > {output.rmsk}
    """

SnakeMake BEDTools From line 195 of rules/obs_exp.smk

shell:
    """
    set +o pipefail

    # Find intersections
    bedtools intersect -a {params.mapped} -b {input.cov} -wo | gzip -c > {output.mapped}
    """

SnakeMake BEDTools From line 222 of rules/obs_exp.smk

script:
    '../scripts/calculate_feature_sizes.py'

SnakeMake From line 259 of rules/obs_exp.smk

script:
    '../scripts/plot_obs_exp.py'

SnakeMake From line 291 of rules/obs_exp.smk

script:
    '../scripts/rename.R'

SnakeMake From line 25 of rules/preprocessing.smk

shell:
    """
    mkdir -p {params.dir}
    fastqc --outdir {params.dir} --threads {threads} {input} 2> {log.stderr} 1> {log.stdout}
    """

SnakeMake FastQC From line 61 of rules/preprocessing.smk

shell:
    """
    trim_galore \
        --paired \
        {input} \
        --output_dir {params.outdir} \
        --cores {threads} \
        --fastqc \
        {params.args_list} \
        2> {log.stderr} 1> {log.stdout}

    # Create merged R1 and R2 FASTQs, clean up files that were merged
    cat {params.outdir}/{wildcards.sample}-*-R1_val_1.fq.gz > {params.outdir}/{wildcards.sample}-R1_val_1_merged.fq.gz
    cat {params.outdir}/{wildcards.sample}-*-R2_val_2.fq.gz > {params.outdir}/{wildcards.sample}-R2_val_2_merged.fq.gz
    rm {params.outdir}/{wildcards.sample}-*-R1_val_1.fq.gz
    rm {params.outdir}/{wildcards.sample}-*-R2_val_2.fq.gz
    """

SnakeMake Trim_Galore From line 90 of rules/preprocessing.smk

shell:
    """
    fastq_screen --bisulfite --conf {params.conf} --outdir {params.output_dir} {input} 2> {log.fastq_screen}
    """

SnakeMake From line 133 of rules/preprocessing.smk

shell:
    """
    set +o pipefail;
    QC.sh \
        -o {params.output_dir} \
        --vcf {input.vcf} \
        {params.assets} \
        {input.ref} \
        {wildcards.sample} \
        {input.bam} \
        2> {log}
    """

SnakeMake From line 55 of rules/qc.smk

shell:
    """
    mkdir -p {params.dir}
    preseq c_curve {params.opt} -o {params.out} -P -B {input.bam} 2> {log}
    """

SnakeMake preseq From line 89 of rules/qc.smk

shell:
    """
    multiqc -f -o {params.output_dir} -n multiqc_report.html {params.mqc_dirs} 2> {log}
    """

SnakeMake MultiQC From line 151 of rules/qc.smk

script:
    '../scripts/plot_percent_covered.py'

SnakeMake From line 172 of rules/qc.smk

shell:
    """
    mkdir -p {params.lambda_dir}
    mkdir -p {params.puc19_dir}

    # >J02459.1 Escherichia phage Lambda, complete genome - UNMETHYLATED CONTROL
    zcat {input.bed} | {{ grep '^J02459.1' || true; }} > {output.lambda_bed} 2> {log.lambda_log}

    # >M77789.2 Cloning vector pUC19, complete sequence - METHYLATED CONTROL
    zcat {input.bed} | {{ grep '^M77789.2' || true; }}  > {output.puc19_bed} 2> {log.puc19_log}
    """

SnakeMake From line 199 of rules/qc.smk

script:
    '../scripts/control_vector.R'

SnakeMake From line 229 of rules/qc.smk

shell:
    """
    python3 workflow/scripts/region_centered_bin_averages.py {params.args_list} {input.region_bed} | \
    sort -k1,1 -k2,2n > {output.region_centered_bins}
    """

SnakeMake From line 30 of rules/region_centered_bin_averages.smk

shell:
    """
    # do the intersecting for this sample
    # column 14 has the methylation
    bedtools intersect \
        -a {input.region_bins} \
        -b {input.merged_sample_bed} \
        -sorted -wo | \
    bedtools groupby \
        -g 1-10 \
        -c 14 \
        -o mean | \
    awk '{{print $7,"\\t",$8,"\\t",$9,"\\t",$4,"\\t",$10,"\\t",$11}}' > {output.bed}
    """

SnakeMake BEDTools From line 54 of rules/region_centered_bin_averages.smk

import pandas as pd

DEFAULT_2 = {'cov': [0], 'ol': [0]}
DEFAULT_3 = {'start': [0], 'end': [0], 'weight': [0]}

NAMES_2 = ['cov', 'ol']
NAMES_3 = ['start', 'end', 'weight']

# TODO: This can probably be adjusted to read files in chunks to reduce memory overhead
def calculate_feature_sizes(samp, infiles, paramfiles, outfile):
    # Load the asset files
    try:
        df_map_p = pd.read_csv(paramfiles['bismap'], sep='\t', header=None, usecols=[1, 2, 3], skiprows=1, names=NAMES_3)
    except pandas.errors.EmptyDataError:
        df_map_p = pd.DataFrame(DEFAULT_3)
    try:
        df_cpg_p = pd.read_csv(paramfiles['cpg'], sep='\t', header=None, usecols=[1, 2, 3], skiprows=1, names=NAMES_3)
    except pandas.errors.EmptyDataError:
        df_cpg_p = pd.DataFrame(DEFAULT_3)
    try:
        df_cgi_p = pd.read_csv(paramfiles['cgi'], sep='\t', header=None, usecols=[1, 2, 3], skiprows=1, names=NAMES_3)
    except pandas.errors.EmptyDataError:
        df_cgi_p = pd.DataFrame(DEFAULT_3)
    try:
        df_exn_p = pd.read_csv(paramfiles['exon'], sep='\t', header=None, usecols=[1, 2, 3], skiprows=1, names=NAMES_3)
    except pandas.errors.EmptyDataError:
        df_exn_p = pd.DataFrame(DEFAULT_3)
    try:
        df_gen_p = pd.read_csv(paramfiles['genic'], sep='\t', header=None, usecols=[1, 2, 3], skiprows=1, names=NAMES_3)
    except pandas.errors.EmptyDataError:
        df_gen_p = pd.DataFrame(DEFAULT_3)
    try:
        df_int_p = pd.read_csv(paramfiles['intergenic'], sep='\t', header=None, usecols=[1, 2, 3], skiprows=1, names=NAMES_3)
    except pandas.errors.EmptyDataError:
        df_int_p = pd.DataFrame(DEFAULT_3)
    try:
        df_msk_p = pd.read_csv(paramfiles['rmsk'], sep='\t', header=None, usecols=[1, 2, 3], skiprows=1, names=NAMES_3)
    except pandas.errors.EmptyDataError:
        df_msk_p = pd.DataFrame(DEFAULT_3)

    # Load the Snakemake generated files
    try:
        df_map_i = pd.read_csv(infiles['mapped'], sep='\t', header=None, usecols=[3, 8], names=NAMES_2)
    except pandas.errors.EmptyDataError:
        df_map_i = pd.DataFrame(DEFAULT_2)
    try:
        df_cpg_i = pd.read_csv(infiles['cpg'], sep='\t', header=None, usecols=[3, 8], names=NAMES_2)
    except pandas.errors.EmptyDataError:
        df_cpg_i = pd.DataFrame(DEFAULT_3)
    try:
        df_cgi_i = pd.read_csv(infiles['cgi'], sep='\t', header=None, usecols=[3, 8], names=NAMES_2)
    except pandas.errors.EmptyDataError:
        df_cgi_i = pd.DataFrame(DEFAULT_3)
    try:
        df_exn_i = pd.read_csv(infiles['exon'], sep='\t', header=None, usecols=[3, 8], names=NAMES_2)
    except pandas.errors.EmptyDataError:
        df_exn_i = pd.DataFrame(DEFAULT_3)
    try:
        df_gen_i = pd.read_csv(infiles['genic'], sep='\t', header=None, usecols=[3, 8], names=NAMES_2)
    except pandas.errors.EmptyDataError:
        df_gen_i = pd.DataFrame(DEFAULT_3)
    try:
        df_int_i = pd.read_csv(infiles['intergenic'], sep='\t', header=None, usecols=[3, 8], names=NAMES_2)
    except pandas.errors.EmptyDataError:
        df_int_i = pd.DataFrame(DEFAULT_3)
    try:
        df_msk_i = pd.read_csv(infiles['rmsk'], sep='\t', header=None, usecols=[3, 8], names=NAMES_2)
    except pandas.errors.EmptyDataError:
        df_msk_i = pd.DataFrame(DEFAULT_3)

    # Calculate sizes for finding the sum (weight * (end - start))
    df_map_p['size'] = df_map_p['weight'] * (df_map_p['end'] - df_map_p['start'])
    df_cpg_p['size'] = df_cpg_p['weight'] * (df_cpg_p['end'] - df_cpg_p['start'])
    df_cgi_p['size'] = df_cgi_p['weight'] * (df_cgi_p['end'] - df_cgi_p['start'])
    df_exn_p['size'] = df_exn_p['weight'] * (df_exn_p['end'] - df_exn_p['start'])
    df_gen_p['size'] = df_gen_p['weight'] * (df_gen_p['end'] - df_gen_p['start'])
    df_int_p['size'] = df_int_p['weight'] * (df_int_p['end'] - df_int_p['start'])
    df_msk_p['size'] = df_msk_p['weight'] * (df_msk_p['end'] - df_msk_p['start'])

    # Calculate sizes for finding the sum (coverage * number of overlapping bases)
    df_map_i['size'] = df_map_i['cov'] * df_map_i['ol']
    df_cpg_i['size'] = df_cpg_i['cov'] * df_cpg_i['ol']
    df_cgi_i['size'] = df_cgi_i['cov'] * df_cgi_i['ol']
    df_exn_i['size'] = df_exn_i['cov'] * df_exn_i['ol']
    df_gen_i['size'] = df_gen_i['cov'] * df_gen_i['ol']
    df_int_i['size'] = df_int_i['cov'] * df_int_i['ol']
    df_msk_i['size'] = df_msk_i['cov'] * df_msk_i['ol']

    # Calculate the necessary sums
    sum_map_p = df_map_p['size'].sum(); sum_map_i = df_map_i['size'].sum()
    sum_cpg_p = df_cpg_p['size'].sum(); sum_cpg_i = df_cpg_i['size'].sum()
    sum_cgi_p = df_cgi_p['size'].sum(); sum_cgi_i = df_cgi_i['size'].sum()
    sum_exn_p = df_exn_p['size'].sum(); sum_exn_i = df_exn_i['size'].sum()
    sum_gen_p = df_gen_p['size'].sum(); sum_gen_i = df_gen_i['size'].sum()
    sum_int_p = df_int_p['size'].sum(); sum_int_i = df_int_i['size'].sum()
    sum_msk_p = df_msk_p['size'].sum(); sum_msk_i = df_msk_i['size'].sum()

    with open(outfile, 'w') as f:
        f.write(
            '{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}'.format(
                samp,
                sum_map_p, sum_cpg_p, sum_cgi_p, sum_msk_p, sum_exn_p, sum_gen_p, sum_int_p,
                sum_map_i, sum_cpg_i, sum_cgi_i, sum_msk_i, sum_exn_i, sum_gen_i, sum_int_i,
            )
        )

calculate_feature_sizes(
    snakemake.wildcards['sample'],
    snakemake.input,
    snakemake.params,
    snakemake.output['data']
)

Python Pandas From line 1 of scripts/calculate_feature_sizes.py

require(tidyverse)
require(ggplot2)
require(patchwork)
require(viridisLite)

import_data <- function(files, logfile) {
    df <- NULL

    for (f in files) {
        if (file.info(f)$size > 0) {
            name <- gsub("\\.bed", "", basename(f))
            cat("Loading", f, "as", name, "\n", file=logfile, append=TRUE)

            my_bed <- read.delim(f, sep="\t", header=FALSE)
            colnames(my_bed) <- c("chr", "start", "end", "beta", "depth", "context")
            my_bed$sample <- rep(name, nrow(my_bed))

            df <- rbind(df, my_bed)
        }
    }

    return(df)
}

create_plot <- function(lam_files, puc_files, out_files, log_file) {
    cat("Loading data\n", file=log_file)
    lam <- import_data(lam_files, log_file)
    puc <- import_data(puc_files, log_file)

    n_samples_l <- length(unique(lam$sample))
    n_samples_p <- length(unique(puc$sample))

    # Unmethylated control
    if (!is.null(lam)) {
        cat("lambda phage data found. making lambda plots\n", file=log_file, append=TRUE)
        topleft <- ggplot(lam, aes(x=sample, y=depth)) +
            geom_boxplot(color='#357BA2FF') +
            theme_bw() +
            theme(
                axis.text.x = element_blank(),
                axis.text.y = element_text(size=12),
                axis.title.y = element_text(size=25),
                plot.title = element_text(size=25, hjust=0.5),
                plot.subtitle = element_text(size=15, hjust=0.5),
            ) +
            scale_color_manual() +
            ggtitle("Unmethylated", subtitle=paste("N =", n_samples_l, "Samples")) +
            expand_limits(y=0) +
            ylab("Coverage") +
            xlab("")

        bottomleft <- ggplot(lam, aes(x=sample, y=beta)) +
            geom_boxplot(color='#357BA2FF') +
            theme_bw() +
            theme(
                axis.text.x = element_text(size=12, angle=45, hjust=1, vjust=1),
                axis.text.y = element_text(size=12),
                axis.title.y = element_text(size=25),
                plot.title = element_text(size=25, hjust=0.5),
                plot.subtitle = element_text(size=15, hjust=0.5),
            ) +
            scale_color_manual() +
            ylim(c(0, 1)) +
            ylab("Beta") +
            xlab("")
    }

    # Methylated control
    if (!is.null(puc)) {
        cat("pUC19 data found. making puc19 plots\n", file=log_file, append=TRUE)
        topright <- ggplot(puc, aes(x=sample, y=depth)) +
            geom_boxplot(color='#357BA2FF') +
            theme_bw() +
            theme(
                axis.text.x = element_blank(),
                axis.text.y = element_text(size=12),
                axis.title.y = element_text(size=25),
                plot.title = element_text(size=25, hjust=0.5),
                plot.subtitle = element_text(size=15, hjust=0.5),
            ) +
            scale_color_manual() +
            ggtitle("Methylated", subtitle=paste("N =", n_samples_p, "Samples")) +
            expand_limits(y=0) +
            ylab("") +
            xlab("")

        bottomright <- ggplot(puc, aes(x=sample, y=beta)) +
            geom_boxplot(color='#357BA2FF') +
            theme_bw() +
            theme(
                axis.text.x = element_text(size=12, angle=45, hjust=1, vjust=1),
                axis.text.y = element_text(size=12),
                axis.title.y = element_text(size=25),
                plot.title = element_text(size=25, hjust=0.5),
                plot.subtitle = element_text(size=15, hjust=0.5),
            ) +
            scale_color_manual() +
            ylim(c(0, 1)) +
            ylab("") +
            xlab("")
    }

    # Create plot
    if (exists("topleft") & exists("topright")) {
        cat("found plots for both lambda phage and pUC19. attempting to make patchwork plot\n", file=log_file, append=TRUE)
        layout <- "
        AB
        CD
        "

        pw <- topleft + topright + bottomleft + bottomright +
            patchwork::plot_layout(design = layout) +
            patchwork::plot_annotation(tag_levels = "A", title = "Control Vectors") &
            theme(plot.tag = element_text(face = "bold")) &
            theme(plot.title = element_text(size=25, hjust=0.5))

        ggsave(out_files, plot=pw, width=7, height=10)
    } else {
        cat("could not find plots for both lambda phage and pUC19. filling placeholder file\n", file=log_file, append=TRUE)
        pdf(file=out_files, width=7, height=10)

        plot.new()
        text(x=0.5, y=0.5, "NO PLOT CREATED.\nLIKELY REASON: NO DEPTH FOR CONTROL VECTOR(S)\nIN SUPPLED BED FILES")

        dev.off()
    }
}

create_plot(snakemake@input[["lambda_files"]], snakemake@input[["puc19_files"]], snakemake@output[["pdf"]], snakemake@log[["fn"]])

R ggplot2 tidyverse patcHwork viridisLite From line 1 of scripts/control_vector.R

set -euo pipefail

# Check for bedtools, awk, and parallel in PATH
function check_path {
    if [[ `which bedtools 2>&1 > /dev/null` ]]; then
        >&2 echo "bedtools does not exist in PATH"
        exit 1
    else
        >&2 echo "Using bedtools found at: `which bedtools`"
    fi
    if [[ `which awk 2>&1 > /dev/null` ]]; then
        >&2 echo "awk does not exist in PATH"
        exit 1
    else
        if awk --version | grep -q GNU; then
            >&2 echo "Using GNU awk found at: `which awk`"
        else
            >&2 echo "It doesn't appear you are using GNU awk"
            >&2 echo "Try adding GNU awk at the front of PATH"
            exit 1
        fi
    fi
    if [[ `which parallel 2>&1 > /dev/null` ]]; then
        >&2 echo "parallel does not exist in PATH"
        >&2 echo "Make sure to add GNU parallel to PATH"
        exit 1
    else
        if parallel --version | grep -q GNU; then
            >&2 echo "Using GNU parallel found at: `which parallel`"
        else
            >&2 echo "It doesn't appear you are using GNU parallel."
            >&2 echo "Try adding GNU parallel at the front of PATH"
            exit 1
        fi
    fi
}

expand_windows() {
    # Load chromosome sizes to help with keeping extracted ranges within bounds of the chromosomes
    chr_sizes=""
    while read line; do
        chr="$(echo ${line} | awk '{ print $1 }')"
        siz="$(echo ${line} | awk '{ print $2 }')"

        if [ -z "$chr_sizes" ]; then
            chr_sizes="${chr},${siz}"
        else
            chr_sizes="${chr_sizes};${chr},${siz}"
        fi
    done < $1

    awk -v sizes="${chr_sizes}" \
    'BEGIN { 
        OFS="\t"

        # Create a key-value dictionary of chromosome sizes
        split(sizes, tmp1, ";")
        for (i in tmp1) {
            split(tmp1[i], tmp2, ",")
            chr_lengths[tmp2[1]] = tmp2[2]
        }
    } {
        chr = $1
        if ($2 - 35 < 0) { beg = 0 }
        else { beg = $2 - 35 }
        if ($3 + 35 > chr_lengths[chr]) { end = chr_lengths[chr] }
        else { end = $3 + 35 }

        print chr, beg, end
    }' $2
}
export -f expand_windows

find_context() {
    awk \
    'BEGIN { OFS = "\t" } {
        # Extract position
        match($1, /([^:]*):([0-9]*)-([0-9]*)/, position)
        chr = position[1]
        beg = position[2] + 35
        end = position[3] - 35

        # Reference sequence
        a1=toupper($2)
        a2=a1
        a3=a1

        # 4-base sequence and context
        rseq_4 = substr(a1,35,4)
        if (rseq_4~/[CG]CG[CG]/) { context = "SCGS" }
        else if (rseq_4~/[AT]CG[AT]/) { context = "WCGW" }
        else { context = "SCGW" }

        n_cpgs = gsub(/CG/,"",a1)                # number of CpGs in string
        gc_per = gsub(/[CG]/,"",a2) / length(a3) # GC-content of string
        rseq_6 = substr(a3,34,6)                 # 6-base sequence
        rseq_2 = substr(a3,36,2)                 # 2-base sequence

        # Need to have an easy way to extract CpGs into specific context and number of neighboring CpGs files
        if (n_cpgs == 1) { tag = "0" }
        else if (n_cpgs == 2) { tag = "1" }
        else if (n_cpgs == 3) { tag = "2" }
        else { tag = "3" }

        if (!(rseq_4~/N/)) print chr, beg, end, n_cpgs-1, gc_per, rseq_4, rseq_6, context, rseq_2, context"_"tag
    }' $1;
}
export -f find_context

create_files() {
    check_path

    if [ ! -d ${outdir} ]; then
        mkdir -p ${outdir}
    fi

    cd ${outdir}

    # Split up CpGs for quicker processing
    zcat ${cpgbed} | split -d -l 1000000 - cpgsplit_

    # Retrieve sequences around CpGs from genome FASTA file
    export genome
    parallel -j ${thread} 'expand_windows ${genome}.fai {} | bedtools getfasta -bed - -fi ${genome} -tab -fo cpgcontext_{}.tab' ::: cpgsplit_*

    # Annotate CpGs
    parallel -j ${thread} 'find_context {} > {.}.context' ::: cpgcontext_*.tab

    # Extract CpGs into specific context and number of neighboring CpG files
    grep --no-filename WCGW_0 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > wcgw_0_neighbors.bed.gz
    grep --no-filename WCGW_1 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > wcgw_1_neighbors.bed.gz
    grep --no-filename WCGW_2 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > wcgw_2_neighbors.bed.gz
    grep --no-filename WCGW_3 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > wcgw_3p_neighbors.bed.gz
    grep --no-filename SCGW_0 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgw_0_neighbors.bed.gz
    grep --no-filename SCGW_1 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgw_1_neighbors.bed.gz
    grep --no-filename SCGW_2 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgw_2_neighbors.bed.gz
    grep --no-filename SCGW_3 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgw_3p_neighbors.bed.gz
    grep --no-filename SCGS_0 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgs_0_neighbors.bed.gz
    grep --no-filename SCGS_1 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgs_1_neighbors.bed.gz
    grep --no-filename SCGS_2 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgs_2_neighbors.bed.gz
    grep --no-filename SCGS_3 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgs_3p_neighbors.bed.gz

    rm cpgsplit_* cpgcontext_*
}

usage() {
    >&2 echo -e "\nUsage: create_context_beds.sh [-h,--help] [-o,--outdir] genome cpgbed\n"
    >&2 echo -e "Required inputs:"
    >&2 echo -e "\tgenome       : Path to reference FASTA file used in creating CpG BED"
    >&2 echo -e "\tcpgbed       : CpG BED file containing CpG locations in genome\n"
    >&2 echo -e "Optional inputs:"
    >&2 echo -e "\t-h,--help    : Print help message and exit"
    >&2 echo -e "\t-o,--outdir  : Output directory [DEFAULT: assets]"
    >&2 echo -e "\t-t,--threads : Number of threads to use [DEFAULT: 1]"
}

# Initialize default variable values
outdir="assets"
thread=1

# Process command line arguments
OPTS=$(getopt \
    --options ho:t: \
    --long help,outdir:,threads: \
    --name "$(basename "$0")" \
    -- "$@"
)
eval set -- ${OPTS}

while true; do
    case "$1" in
        -h|--help )
            usage
            exit 0
            ;;
        -o|--outdir )
            outdir="$2"
            shift 2
            ;;
        -t|--threads )
            thread="$2"
            shift 2
            ;;
        -- )
            shift
            break
            ;;
        * )
            >&2 echo "Unknown option: $1"
            usage
            exit 1
            ;;
    esac
done

# Make sure there are the correct number of inputs
if [[ $# -ne 2 ]]; then
    >&2 echo "$0: Missing inputs"
    usage
    exit 1
fi

# Fill required positional arguments
genome=$1
cpgbed=$2

# Input checks
if [[ ! -f "${genome}.fai" ]]; then
    >&2 echo "Cannot locate fai-indexed reference: ${genome}.fai"
    >&2 echo "Please provide a viable path to the reference genome FASTA file."
    exit 1
fi

if [[ ! -f "${cpgbed}" ]]; then
    >&2 echo "Cannot locate CpG BED file: ${cpgbed}"
    >&2 echo "Please provide an existing CpG BED file"
    exit 1
fi

>&2 echo -e "Inputs:"
>&2 echo -e "\tOutput directory : ${outdir}"
>&2 echo -e "\tNumber of threads: ${thread}"
>&2 echo -e "\tGenome Reference : ${genome}"
>&2 echo -e "\tCpG Location BED : ${cpgbed}"

create_files

Shell BEDTools From line 20 of scripts/create_context_beds.sh

set -euo pipefail



# Find binned average values
function find_averages {
    tmp=$(echo $RANDOM|md5sum|head -c 20; echo) # get random name for tmp file
    if [[ `file ${infile}` =~ "gzip" ]]; then
        zcat ${infile} |
        awk -v filt=${filter} '{ if ($5 >= filt) { print } }' > $tmp
    else
        awk -v filt=${filter} '{ if ($5 >= filt) { print } }' \
        ${infile} > $tmp
    fi

    # Find average beta value across bins
    bedtools makewindows -w ${window} \
        -g ${reffai} | \
    sort -k1,1 -k2,2n | \
    bedtools map \
        -prec 2 \
        -a - \
        -b $tmp \
        -c 4 -o mean |
    gzip > ${otfile}.gz

    rm -f $tmp
}

# Print helpful usage information
function usage {
    >&2 echo -e "\nUsage: find_binned_average.sh [-h,--help] reference window filter infile outfile"
    >&2 echo -e "Required inputs:"
    >&2 echo -e "\treference    : FAI-index for reference to create windows from"
    >&2 echo -e "\twindow       : Window size for binned average finding"
    >&2 echo -e "\tfilter       : Coverage filter to apply before finding averages"
    >&2 echo -e "\tinfile       : BISCUIT CG BED file to calculate methylation average from"
    >&2 echo -e "\toutfile      : Name of BED file to write output to"
    >&2 echo -e "Optional inputs:"
    >&2 echo -e "\t-h,--help    : Print help message and exit\n"
}

# Process command line arguments
OPTS=$(getopt \
    --options h \
    --long help \
    --name "$(basename "$0")" \
    -- "$@"
)
eval set -- ${OPTS}

while true; do
    case "$1" in
        -h|--help )
            usage
            exit 0
            ;;
        -- )
            shift
            break
            ;;
        * )
            >&2 echo "Unknown option: $1"
            usage
            exit 1
            ;;
    esac
done

# Make sure there are the correct number of inputs
if [[ $# -ne 5 ]]; then
    >&2 echo "$0: Missing inputs"
    usage
    exit 1
fi

# Fill required positional arguments
reffai="${1}"
window="${2}"
filter="${3}"
infile="${4}"
otfile="${5}"

# Do some checks on the given inputs
if [[ ! -f ${reffai} ]]; then
    >&2 echo "Doesn't appear like you provided an existing FAI file: ${reffai}"
    exit 1
fi

if [[ ! "${reffai}" =~ .fai$ ]]; then
    >&2 echo "Doesn't appear like you provided a viable FAI file: ${reffai}"
    exit 1
fi

if ! [[ "${window}" =~ ^[0-9]+$ ]]; then
    >&2 echo "window needs to be an integer: ${window}"
    exit 1
fi

if ! [[ ${filter} =~ ^[0-9]+$ ]]; then
    >&2 echo "filter needs to be an integer: ${filter}"
    exit 1
fi

if [[ ! -f ${infile} ]]; then
    >&2 echo "Doesn't appear that infile exists: ${infile}"
    exit 1
fi

find_averages

Shell BEDTools From line 20 of scripts/find_binned_average.sh

from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import os

# Create list of chromosomes (include enough to cover human and mouse)
CHROMS = [f'chr{i}' for i in range(1, 23)]
CHROMS.extend(['chrX', 'chrY', 'chrM'])

def find_x_ticks_labels(chrs):
    """Create list of chromosomes that are included in BED file

    Inputs -
        chrs - chr column from BED file loaded as a DataFrame
    Returns -
        dict, keys are the chromosomes and values are the first index location of the chromosome
    """
    out = {}
    for c in CHROMS:
        try:
            out[c] = list(chrs).index(c)
        except ValueError: # chromosome isn't in list, so ignore
            continue

    return out

def create_plot(files, params, outfile):
    # Setup the order that chromosomes need to be sorted into
    chr_sort_order = pd.api.types.CategoricalDtype(CHROMS, ordered=True)

    # Average bismap mappability scores for each bin
    # Treat bins with no score as having a weight of 0 since there's no data there
    weights = pd.read_csv(
        params['map_scores'],
        sep='\t',
        header=None,
        names=['chr', 'start', 'end', 'raw_weight'],
        na_values='.'
    )
    weights['weight'] = weights['raw_weight'].fillna(0)
    weights['ideal'] = weights['weight'] / (weights['end'] - weights['start'])

    # Setup figure for coverage across genome
    fig, ax = plt.subplots(figsize=(9,5))
    plt.tight_layout()

    plt.title('Weighted Coverage in 10kb Windows', fontsize=24)
    ax.set_xlabel('')
    ax.set_ylabel('Weighted Coverage / Window Width', fontsize=18)

    # Process data and fill coverage across genome figure
    x_values = []
    x_tk_lab = []
    plot_data = {
        'samp': [],
        'frac': [],
        'mean': [],
        'stdv': [],
        'lavg': [],
        'lstd': [],
    }

    for s in files:
        samp = os.path.basename(s).replace('.10kb_binned_coverage.bed.gz', '')

        # Read data
        in_df = pd.read_csv(s, sep='\t', header=None, names=['chr', 'start', 'end', 'covg'], na_values='.')

        # Merge average bismap mappability scores as a weight column
        df = in_df.merge(weights, on=['chr', 'start', 'end'], how='outer')

        # Sort chromosomes to put chrM at the end
        df['chr'] = df['chr'].astype(chr_sort_order)
        df = df.sort_values(['chr', 'start'], ignore_index=True)

        # Calculate the weighted coverage [(avg mappability) * (coverage) / (window width)]
        df['weighted_covg'] = df['weight'] * df['covg'] / (df['end'] - df['start'])

        # Bins with both a raw weight and at least one base covered
        to_plot = df[(df['raw_weight'].notna()) & (df['covg'] > 0)]

        # Determine number of bins that had a bismap weight and at least one base covered
        n_weights = len(df[df['raw_weight'].notna()]['chr'])
        n_wgt_cov = len(to_plot[to_plot['weighted_covg'] > 0.01]['chr'])

        # Non-zero bins
        non_zero = [i if i > 0 else None for i in list(to_plot['weighted_covg'])]

        # Setup data that will be saved to output file and plotted in other figures
        plot_data['samp'].append(samp)
        plot_data['frac'].append(n_wgt_cov / n_weights)
        plot_data['mean'].append(np.average([i for i in non_zero if i]))
        plot_data['stdv'].append(np.std([i for i in non_zero if i]))
        plot_data['lavg'].append(np.average([np.log10(i) for i in non_zero if i]))
        plot_data['lstd'].append(np.std([np.log10(i) for i in non_zero if i]))

        # Fill x values and ticks/labels if that hasn't been done yet
        if len(x_values) == 0 or len(x_tk_lab) == 0:
            x_values = list(df.index)
            x_tk_lab = find_x_ticks_labels(df['chr'])

        # Add weighted coverage to coverage across genome figure
        ax.plot(x_values, df['weighted_covg'], 'k-')

    # Finish up coverage across genome figure
    ax.set_xlim(-500, len(x_values)+500)
    ax.set_xticks(list(x_tk_lab.values()))
    ax.set_xticklabels(x_tk_lab.keys(), rotation=90)

    plt.savefig(outfile['covg'], bbox_inches='tight')
    plt.close('all')

    # Create other figures
    df = pd.DataFrame(plot_data)

    fig, ax = plt.subplots()
    plt.tight_layout()

    sns.scatterplot(ax=ax, x='frac', y='lstd', data=df)

    line_loc = np.std([np.log10(i) for i in weights['ideal'] if i > 0])
    ax.axhline(line_loc, alpha=0.6, color='grey', linestyle='--')
    ax.text(x = 0.01, y = line_loc-0.01, s = 'Ideal Std. Dev.', ha='left', va='top', fontsize=14)

    ax.set_xlim(0, 1); ax.set_ylim(0, 1.1*max(plot_data['lstd']))

    plt.title('Whole Genome Coverage Uniformity', fontsize=24)
    plt.xlabel('(# Bins with Weighted Coverage > 0.01) /\n(# Bins with Nonzero Weights)', fontsize=18, va='top')
    plt.ylabel('std[ log10(weighted cov.) ]', fontsize=18)

    plt.savefig(outfile['frac'], bbox_inches='tight')
    plt.close('all')

    # Write data to output file
    df.to_csv(outfile['data'], index=False, sep='\t')

create_plot(list(snakemake.input['files']), snakemake.params, snakemake.output)

Python Pandas numpy matplotlib seaborn From line 1 of scripts/plot_covg_uniformity.py

from matplotlib import pyplot as plt
import pandas as pd
import os

def create_plot(samples, outfiles):
    # Load data
    l_df = []
    for s in samples:
        df = pd.read_csv(s, sep='\t', header=None, names=['group', 'count']).set_index(['group'])

        d = {
            'sample': [os.path.basename(s).replace('.cpg_stats_cgi_table.txt', '')],
            '1': [100.0 * df['count']['one_cpg'] / df['count']['n_cpg_islands']],
            '3': [100.0 * df['count']['three_cpgs'] / df['count']['n_cpg_islands']],
            '5': [100.0 * df['count']['five_cpgs'] / df['count']['n_cpg_islands']],
            '7': [100.0 * df['count']['seven_cpgs'] / df['count']['n_cpg_islands']],
            'n_cgis': [df['count']['n_cpg_islands']]
        }

        l_df.append(pd.DataFrame(d))

    df = pd.concat(l_df)
    df = df.sort_values(by = 'sample')

    y = [i for i, s in enumerate(list(df['sample']))]
    n = [s for i, s in enumerate(list(df['sample']))]

    # Create plot
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, sharex='col', sharey='row')
    plt.tight_layout()

    ax1.barh(y, df['1'], height=0.7, color='black')
    ax2.barh(y, df['3'], height=0.7, color='black')
    ax3.barh(y, df['5'], height=0.7, color='black')
    ax4.barh(y, df['7'], height=0.7, color='black')

    ax1.set_xlim(0, 110); ax1.set_ylim(-1, len(n)+1)
    ax2.set_xlim(0, 110); ax3.set_ylim(-1, len(n)+1)
    ax1.set_xticks(range(0, 110, 10)); ax1.set_xticklabels([str(i) for i in range(0, 110, 10)], fontsize=18)
    ax2.set_xticks(range(0, 110, 10)); ax2.set_xticklabels([str(i) for i in range(0, 110, 10)], fontsize=18)

    ax1.set_yticks(y); ax1.set_yticklabels(n, fontsize=18)
    ax3.set_yticks(y); ax3.set_yticklabels(n, fontsize=18)

    ax1.text(50, len(n), '1 Read', va='center', ha='center', size=16)
    ax2.text(50, len(n), '3 Reads', va='center', ha='center', size=16)
    ax3.text(50, len(n), '5 Reads', va='center', ha='center', size=16)
    ax4.text(50, len(n), '7 Reads', va='center', ha='center', size=16)

    plt.suptitle('# Reads Covering CpG Island', x=0.5, y=1.05, fontsize=24)
    ax1.set_title('', fontsize=16)
    ax2.set_title('', fontsize=16)
    ax3.set_title('', fontsize=16)
    ax4.set_title('', fontsize=16)

    ax1.set_xlabel('', fontsize=20)
    ax2.set_xlabel('', fontsize=20)
    ax3.set_xlabel('Percent Covered', fontsize=20)
    ax4.set_xlabel('Percent Covered', fontsize=20)

    ax1.set_ylabel('')
    ax2.set_ylabel('')
    ax3.set_ylabel('')
    ax4.set_ylabel('')

    plt.savefig(outfiles['cgi'], bbox_inches='tight')
    plt.close('all')

create_plot(snakemake.input['files'], snakemake.output)

Python Pandas matplotlib From line 1 of scripts/plot_cpg_stats_cgi.py

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import os

TITLE = {
    'AllCpGs': 'All CpGs in Genome',
    'CGICpGs': 'CpGs in CpG Islands',
    'ExonicCpGs': 'CpGs in Exons',
    'GenicCpGs': 'CpGs in Genes',
    'RepeatCpGs': 'CpGs in Repeat-masked Space'
}

def create_plot(samples, outfiles):
    # Load the data and rearrange by feature type
    d_df = {}
    for s in samples:
        df = pd.read_csv(s, sep='\t').set_index(['Feature'], drop=False)
        all_cnt = df['CpG_Count']['AllCpGs']
        all_q40 = df['Q40_Reads']['AllCpGs']
        all_all = df['All_Reads']['AllCpGs']

        for row in df.itertuples():
            data = {'sample': [], 'l2_q40': [], 'l2_all': [], 'q40_percent': [], 'all_percent': []}
            data['sample'].append(os.path.basename(s).replace('.cpg_stats_feature_table.tsv', ''))
            data['l2_q40'].append(np.log2( (row.Q40_Reads / all_q40) / (row.CpG_Count / all_cnt) ))
            data['l2_all'].append(np.log2( (row.All_Reads / all_all) / (row.CpG_Count / all_cnt) ))
            data['q40_percent'].append(100.0 * row.Q40_Reads / row.CpG_Count)
            data['all_percent'].append(100.0 * row.All_Reads / row.CpG_Count)

            if row.Feature not in d_df.keys():
                d_df[row.Feature] = pd.DataFrame(data)
            else:
                d_df[row.Feature] = d_df[row.Feature].append(pd.DataFrame(data), sort=True)

    for key, df in d_df.items():
        y = [i for i, s in enumerate(list(df['sample']))]
        n = [s for i, s in enumerate(list(df['sample']))]

        fig, ((ax1,ax2), (ax3,ax4)) = plt.subplots(nrows=2, ncols=2, sharex='col', sharey='row')
        plt.tight_layout()

        # Plots
        ax1.barh(y, df['all_percent'], height=0.7, color='black')
        ax2.plot(df['l2_all'], y, 'kx', markersize=8)
        ax2.axvline(0.0, alpha=0.6, color='grey', linestyle='--')

        ax3.barh(y, df['q40_percent'], height=0.7, color='black')
        ax4.plot(df['l2_q40'], y, 'kx', markersize=8)
        ax4.axvline(0.0, alpha=0.6, color='grey', linestyle='--')

        # Axis ticks and labels
        ax1.set_xlim(0, 110)
        ax2.set_xlim(-3.5, 5.5)

        ax1.set_xticks(range(0, 110, 10)); ax1.set_xticklabels([str(i) for i in range(0, 110, 10)], fontsize=18)
        ax2.set_xticks(range(-3, 6))     ; ax2.set_xticklabels([str(i) for i in range(-3, 6)], fontsize=18)

        ax1.set_ylim(-1, len(n))
        ax3.set_ylim(-1, len(n))

        ax1.set_yticks(y); ax1.set_yticklabels(n, fontsize=18)
        ax3.set_yticks(y); ax3.set_yticklabels(n, fontsize=18)

        # Titles and such
        plt.suptitle(TITLE[key], x=0.5, y=1.05, fontsize=24)
        ax1.set_title('', fontsize=20)
        ax2.set_title('', fontsize=20)
        ax3.set_title('', fontsize=20)
        ax4.set_title('', fontsize=20)

        ax1.set_xlabel('', fontsize=20)
        ax2.set_xlabel('', fontsize=20)
        ax3.set_xlabel('Percent Covered', fontsize=20)
        ax4.set_xlabel('log2(obs / exp)', fontsize=20)

        ax1.set_ylabel('All Reads', fontsize=20)
        ax2.set_ylabel('')
        ax3.set_ylabel('Q40 Reads', fontsize=20)
        ax4.set_ylabel('')

        plt.savefig(outfiles[key], bbox_inches='tight')
        plt.close('all')

create_plot(snakemake.input['files'], snakemake.output)

Python Pandas numpy matplotlib From line 1 of scripts/plot_cpg_stats_features.py

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import os

# Column names for Snakemake generated files
cols = [
    'sample',
    'refln', 'cpgln', 'cgiln', 'mskln', 'exnln', 'genln', 'intln',
    'mapln', 'fcpgs', 'fcgis', 'frmsk', 'fexon', 'fgene', 'fintr'
]

# Make it easier to parse the files based on their column names
ORDER = {
    'cpgs': ('cpgln', 'fcpgs'),
    'cgis': ('cgiln', 'fcgis'),
    'rmsk': ('mskln', 'frmsk'),
    'exon': ('exnln', 'fexon'),
    'gene': ('genln', 'fgene'),
    'intr': ('intln', 'fintr')
}

# Which columns to keep for generating plot
keepers = ['sample'] + list(ORDER.keys())

# Plot titles for each column
TITLE = {
    'cpgs': 'Observed / Expected Coverage for All CpGs',
    'cgis': 'Observed / Expected Coverage for CpG Islands',
    'rmsk': 'Observed / Expected Coverage for Repeat-Masked Space',
    'exon': 'Observed / Expected Coverage for Exons',
    'gene': 'Observed / Expected Coverage for Genic Space',
    'intr': 'Observed / Expected Coverage for Intergenic Space'
}

def create_plot(samples, outfiles):
    # Load the data and calculate the obs/exp ratios
    l_df = []
    for s in samples:
        df = pd.read_csv(s, sep='\t', header=None, names=cols)
        for k, v in ORDER.items():
            num = df[v[0]] / df['refln']
            den = df[v[1]] / df['mapln']

            df[k] = np.log2(num / den)
        l_df.append(df)

    # Combine dataframes for plotting
    df = pd.concat(l_df)
    df = df[keepers]
    df = df.sort_values(by = 'sample')

    y     = [i for i, s in enumerate(list(df['sample']))]
    names = [s for i, s in enumerate(list(df['sample']))]

    # Create a plot for each of the features
    for k in ORDER.keys():
        fig, ax = plt.subplots()
        plt.tight_layout()

        plt.plot(list(df[k]), y, 'kx', markersize=8)
        ax.axvline(0.0, alpha=0.6, color='grey', linestyle='--')

        plt.title(TITLE[k], fontsize=24)
        plt.xlabel('log2(obs / exp)', fontsize=20)
        plt.ylabel('', fontsize=20)

        plt.xlim(-3.5, 5.5)
        plt.ylim(-1, len(names))

        plt.xticks(
            [i for i in np.arange(-3, 6, 1)],
            [str(i) for i in np.arange(-3, 6, 1)],
            fontsize=18
        )
        plt.yticks(y, names, fontsize=18)

        plt.savefig(outfiles[k], bbox_inches='tight')
        plt.close('all')

create_plot(list(snakemake.input['files']), snakemake.output)

Python Pandas numpy matplotlib From line 1 of scripts/plot_obs_exp.py

from matplotlib import pyplot as plt
import os

def create_plot(data, outfile):
    # Load data
    all = {}
    for s in data['all']:
        with open(s, 'r') as f:
            fd = f.readlines()[2:]

        dd = {}
        for l in fd:
            fields = l.strip().split()
            dd[int(float(fields[0]))] = int(float(fields[1]))

        total = sum(dd.values())
        covrd = total - dd[0]

        samp = os.path.basename(s).replace('_covdist_all_base_table.txt', '')
        all[samp] = 100 * covrd / total

    q40 = {}
    for s in data['q40']:
        with open(s, 'r') as f:
            fd = f.readlines()[2:]

        dd = {}
        for l in fd:
            fields = l.strip().split()
            dd[int(float(fields[0]))] = int(float(fields[1]))

        total = sum(dd.values())
        covrd = total - dd[0]

        samp = os.path.basename(s).replace('_covdist_q40_base_table.txt', '')
        q40[samp] = 100 * covrd / total

    y = [i for i, s in enumerate(list(all.keys()))]
    n = [s for i, s in enumerate(list(all.keys()))]

    # Create plot
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey='row')
    plt.tight_layout()

    ax1.barh(y, all.values(), height=0.7, color='black')
    ax2.barh(y, q40.values(), height=0.7, color='black')

    ax1.set_xlim(0, 110);
    ax2.set_xlim(0, 110);
    ax1.set_xticks(range(0, 110, 20)); ax1.set_xticklabels([str(i) for i in range(0, 110, 20)], fontsize=18)
    ax2.set_xticks(range(0, 110, 20)); ax2.set_xticklabels([str(i) for i in range(0, 110, 20)], fontsize=18)

    ax1.set_ylim(-1, len(n))
    ax1.set_yticks(y); ax1.set_yticklabels(n, fontsize=18)

    plt.suptitle('Percent of Genome Covered', x=0.5, y=1.10, fontsize=24)
    ax1.set_title('All Reads', fontsize=16)
    ax2.set_title('Q40 Reads', fontsize=16)

    ax1.set_xlabel('Percent Covered', fontsize=20)
    ax2.set_xlabel('Percent Covered', fontsize=20)

    plt.savefig(outfile['out'], bbox_inches='tight')
    plt.close('all')

create_plot(snakemake.input, snakemake.output)

Python matplotlib From line 1 of scripts/plot_percent_covered.py

import argparse
import numpy as np

# A terminology note:
#
# The "start" of the chromosome is base 0 (or 1 in 1-based indexing) in the reference. The "end" of the chromosome is,
# therefore, the last (highest numbered) base in the reference.

class Record(object):

    def __init__(self, fields, args):
        """Initialize record.

        Inputs -
            fields - list of fields from BED file
            args   - argparse.parse_args() from running script
        """
        self.fields = fields       # list of fields from BED file entry
        self.chrm = fields[0]      # chromosome
        self.beg  = int(fields[1]) # start of region
        self.end  = int(fields[2]) # end of region

        # Collapse initial region to the middle of the region
        if args.collapse:
            self.mid = (self.beg + self.end + 1) / 2 # midpoint of initial region
            self.beg = self.mid # set start to midpoint
            self.end = self.mid # set end to midpoint

        self.step1 = args.flankstep # step size in reverse direction (relative to reference and strand)
        self.step2 = args.flankstep # step size in forward direction (relative to reference and strand)

    def __str__(self):
        """String method for print()."""
        return f'chrm: {self.chrm}, start: {self.beg}, end: {self.end}, step1: {self.step1}, step2: {self.step2}'

    def sample_forward(self, args, index_func):
        """Move towards the end of the chromosome (relative to the reference) from the target region.

        Inputs -
            args       - argparse.parse_args() from running script
            index_func - function for how to adjust the index value for bin
        Returns -
            None
        """
        # If forward step is negative, then don't create any windows towards the end of the chromosome
        if self.step2 < 0:
            return

        # When moving towards the end of the chromosome, the windows begin at the 3' end (relative to the reference) of
        # the region of interest.
        # NB: this variable is a temporary variable and may be adjusted depending on provided inputs
        _window_beg = self.end

        # Create args.flanknumber of windows while moving towards the end of the chromosome
        for i in range(args.flanknumber):
            # This is a temporary variable and may be adjusted depending on inputs
            _window_end = _window_beg + self.step2

            if args.outer:
                # If args.outer, set the end of the window to be the 3' end of the temporary window:
                # (_window_end-1, _window_end)
                window_end = int(_window_end)
                window_beg = window_end - 1
            elif args.middle:
                # If args.middle, set the end of the window to be the middle of the temporary window:
                # (middle of window - 1, middle of window)
                window_mid = int((_window_beg + _window_end)/2.0)
                window_beg = window_mid-1
                window_end = window_mid
            else:
                # Otherwise use the whole temporary window
                window_beg = int(_window_beg)
                window_end = int(_window_end)

            # Index of the probe window moving towards the end of the chromosome (relative to the reference)
            index = index_func(i)

            # The index is negative for reverse strand (-) regions
            if index < 0:
                if args.flankbygene:
                    # For args.flankbygene, set the region column as the number index of the current window
                    reg = '({:d})-({:d})'.format(-i-1, -i)
                elif args.flanktoneighbor:
                    # For args.flanktoneighbor, set the region column as the corresponding percentage covered by the
                    # current window between the current interval and the next
                    if args.fold:
                        # If args.fold, halve region percentage to account for setting the forward and backward regions
                        # indices the same
                        reg = '({:d})-({:d})%'.format(
                            int(float(-i-1) / args.flanknumber / 2 * 100),
                            int(float(-i) / args.flanknumber / 2 * 100)
                        )
                    else:
                        # Otherwise, you can leave the percentage as is
                        reg = '({:d})-({:d})%'.format(
                            int(float(-i-1) / args.flanknumber * 100),
                            int(float(-i) / args.flanknumber * 100)
                        )
                else:
                    # Otherwise, set the region column as bases covered relative the end of the interval
                    reg = '({:d})-({:d})'.format(int(self.end - window_end), int(self.end - window_beg))

            # Handle forward strand intervals
            else:
                # The descriptions are the same as the negative index case, so won't rehash them here
                if args.flankbygene:
                    reg = '{:d}-{:d}'.format(i, i+1)
                elif args.flanktoneighbor:
                    # FIXME: This should behave in a similar manner to the args.flanktoneighbor above (where there is an
                    #        if-else block for args.fold. This is how it's done for sample_backward(), but for some
                    #        reason it's not done here.
                    reg = '{:d}-{:d}%'.format(
                        int(float(i) / args.flanknumber * 100),
                        int(float(i+1) / args.flanknumber * 100)
                    )
                else:
                    reg = '{:d}-{:d}'.format(int(window_beg - self.end), int(window_end - self.end))

            # When args.collapse is set, you can have index = 0
            # To be honest, I don't know what the area value is supposed to represent...
            if index >= 0:
                area = 1
            else:
                area = -1

            # Expand probe window if desired
            if args.expansion > 0:
                # Start properly accounts for running off the start of the chromosome, but end does not account for
                # running off the end of the chromosome. Maybe this can be accounted for somehow?
                window_beg = max(window_beg - args.expansion,0)
                window_end = window_end + args.expansion

            # Potential FIXME: I think we would still want to print out this information, even if window_beg == 0.
            #                  Therefore, I think that window_beg > 0 should be changed to window_beg >= 0
            if window_beg > 0 and window_end > window_beg:
                print(
                    '{}\t{:d}\t{:d}\t{:d}\t{}\t{:d}\t{}'.format(
                        self.chrm,  # chromosome of probe window
                        window_beg, # start of probe window (0-based)
                        window_end, # end of probe window (1-based, non-inclusive)
                        index,      # 0-indexed index of the probe window
                        reg,        # region covered by probe window
                        area,       # not exactly sure what this is....
                        '\t'.join(self.fields) # original values from input region of interest
                    )
                )

            # Move current temporary end to start of next probe window
            _window_beg = _window_end

    def sample_backward(self, args, index_func):
        """Move towards the start of the chromosome (relative to the reference) from target region.

        Inputs -
            args       - argparse.parse_args() from running script
            index_func - function for how to adjust the index value for bin
        Returns -
            None
        """
        # If backward step is negative, then don't create any windows towards the start of the chromosome
        if self.step1 < 0:
            return

        # When moving towards the start of the chromosome, the windows begin at the 5' end (relative to the reference)
        # of the region of interest.
        # NB: this variable is a temporary variable and may be adjusted depending on provided inputs
        _window_end = self.beg

        # Create args.flanknumber of windows while moving towards the start of the chromosome
        for i in range(args.flanknumber):
            # This is a temporary variable and may be adjusted depending on inputs
            _window_beg = _window_end - self.step1

            if args.outer:
                # If args.outer, set the end of the window to be the 5' end of the temporary window:
                # (_window_beg, _window_beg+1)
                window_beg = int(_window_beg)
                window_end = window_beg + 1
            elif args.middle:
                # If args.middle, set the end of the window to be the middle of the temporary window:
                # (middle of window - 1, middle of window)
                window_mid = int((_window_beg + _window_end)/2.0)
                window_beg = window_mid-1
                window_end = window_mid
            else:
                # Otherwise use the whole temporary window
                window_beg = int(_window_beg)
                window_end = int(_window_end)

            # Index of the probe window moving towards the start of the chromosome (relative to the reference)
            index = index_func(i)

            # The index is negative for forward strand (+) regions
            if index < 0:
                if args.flankbygene:
                    # For args.flankbygene, set the region column as the number index of the current window
                    reg = '({:d})-({:d})'.format(-i-1, -i)
                elif args.flanktoneighbor:
                    # For args.flanktoneighbor, set the region column as the corresponding percentage covered by the
                    # current window between the current interval and the previous
                    if args.fold:
                        # If args.fold, halve region percentage to account for setting the forward and backward regions
                        # indices the same
                        reg = '({:d})-({:d})%'.format(
                            int(float(-i-1) / args.flanknumber / 2 * 100),
                            int(float(-i) / args.flanknumber / 2 * 100)
                        )
                    else:
                        # Otherwise, you can leave the percentage as is
                        reg = '({:d})-({:d})%'.format(
                            int(float(-i-1) / args.flanknumber * 100),
                            int(float(-i) / args.flanknumber * 100)
                        )
                else:
                    # Otherwise, set the region column as bases covered relative the end of the interval
                    reg = '({:d})-({:d})'.format(int(window_beg - self.beg), int(window_end - self.beg))

            # Handle reverse strand intervals
            else:
                # The descriptions are the same as the negative index case, so won't rehash them here
                if args.flankbygene:
                    reg = '{:d}-{:d}'.format(i, i+1)
                elif args.flanktoneighbor:
                    if args.fold:
                        reg = '{:d}-{:d}%'.format(
                            int(float(i) / args.flanknumber / 2 * 100),
                            int(float(i+1) / args.flanknumber / 2 * 100)
                        )
                    else:
                        reg = '{:d}-{:d}%'.format(
                            int(float(i) / args.flanknumber * 100),
                            int(float(i+1) / args.flanknumber * 100)
                        )
                else:
                    reg = '{:d}-{:d}'.format(int(self.beg - window_end), int(self.beg - window_beg))

            # When args.collapse is set, you can have index = 0
            # To be honest, I don't know what the area value is supposed to represent...
            if index >= 0:
                area = 1
            else:
                area = -1

            # Expand probe window if desired
            if args.expansion > 0:
                # Start properly accounts for running off the start of the chromosome, but end does not account for
                # running off the end of the chromosome. Maybe this can be accounted for somehow?
                window_beg = max(window_beg - args.expansion,0)
                window_end = window_end + args.expansion

            # Potential FIXME: I think we would still want to print out this information, even if window_beg == 0.
            #                  Therefore, I think that window_beg > 0 should be changed to window_beg >= 0
            if window_beg > 0 and window_end > window_beg:
                print(
                    '{}\t{:d}\t{:d}\t{:d}\t{}\t{:d}\t{}'.format(
                        self.chrm,  # chromosome of probe window
                        window_beg, # start of probe window (0-based)
                        window_end, # end of probe window (1-based, non-inclusive)
                        index,      # 0-indexed index of the probe window
                        reg,        # region covered by probe window
                        area,       # not exactly sure what this is....
                        '\t'.join(self.fields) # original values from input region of interest
                    )
                )

            # Move current temporary start to end of next probe window
            _window_end = _window_beg

    def sample_internal(self, args, index_func):
        """Move within the target region.

        Inputs -
            args       - argparse.parse_args() from running script
            index_func - function for how to adjust the index value for bin
        Returns -
            None
        """
        if args.outer:
            # sentinels is a list of locations to sample from inside the region of interest
            # NB: If args.numinternal = 1, then the location for the internal region will always be:
            #     (region start, region start + 1)
            # NB: Here, start is set to be 1-based
            sentinels = list(np.linspace(self.beg+1, self.end, args.numinternal))

            for i in range(len(sentinels)):
                window_end = int(sentinels[i]) # end of window to probe
                window_beg = window_end - 1    # start of window to probe (reset to 0-based)

                # Expand probe window if desired
                if args.expansion > 0:
                    # Start properly accounts for running off the start of the chromosome, but end does not account for
                    # running off the end of the chromosome. Maybe this can be accounted for somehow?
                    window_beg = max(window_beg - args.expansion,0)
                    window_end = window_end + args.expansion

                # Index of the internal probe window
                index = index_func(i)

                # Potential FIXME: I think we would still want to print out this information, even if window_beg == 0.
                #                  Therefore, I think that window_beg > 0 should be changed to window_beg >= 0
                if window_beg > 0 and window_end > window_beg:
                    print(
                        '{}\t{:d}\t{:d}\t{:d}\t{:d}-{:d}%\t0\t{}'.format(
                            self.chrm,  # chromosome of probe window
                            window_beg, # start of probe window (0-based)
                            window_end, # end of probe window (1-based, non-inclusive)
                            index,      # 0-indexed index of the probe window
                            int(float(index)/args.numinternal*100),   # low end percent of range of the region covered by the probe window
                            int(float(index+1)/args.numinternal*100), # high end percent of range of the region covered by the probe window
                            '\t'.join(self.fields) # original values from input region of interest
                        )
                    )

            return

        # If args.outer is not set, then break up the internal region of interest into numinternal regions
        # For example, for region (0, 100), then if numinternal = 1 the probe window will cover (0, 100). If
        # numinternal = 2, the probe windows will be (0, 50) and (50, 100)
        # NB: Everything stays in BED 0-based indexing for this part of the function
        sentinels = list(np.linspace(self.beg, self.end, args.numinternal+1))

        for i in range(len(sentinels)-1):
            window_beg = int(sentinels[i])   # start of probe window
            window_end = int(sentinels[i+1]) # end of probe window

            # If args.middle is set, we want to set the probe window as the middle position of the window. So, for the
            # previously listed examples, the (0, 100) interval would be set to (49, 50) and the (0, 50) and (50, 100)
            # intervals would become (24, 25) and (74, 75)
            if args.middle:
                window_mid = int((sentinels[i] + sentinels[i+1])/2) # find middle of the current window
                window_beg = window_mid-1 # reset the start to the middle (setting value to 0-based index)
                window_end = window_mid   # reset the end to the middle location

            # Expand probe window if desired
            # See above comments for handling the end position relative to the end of the chromosome
            if args.expansion > 0:
                window_beg = max(window_beg - args.expansion,0)
                window_end = window_end + args.expansion

            # Index of the internal probe window (0 for internal window with only 1 internal probe window)
            index = index_func(i)

            # See the above potential FIXME, as this applies here
            if window_beg > 0 and window_end > window_beg:
                print(
                    '{}\t{:d}\t{:d}\t{:d}\t{:d}-{:d}%\t0\t{}'.format(
                        self.chrm,  # chromosome of probe window
                        window_beg, # start of probe window (0-based)
                        window_end, # end of probe window (1-based, non-inclusive)
                        index,      # 0-indexed index of the probe window
                        int(float(index)/args.numinternal*100),   # low end percent of range of the region covered by the probe window
                        int(float(index+1)/args.numinternal*100), # high end percent of range of the region covered by the probe window
                        '\t'.join(self.fields) # original values from input region of interest
                    )
                )

        return

def process_record(r, r0, r2):
    """Parse target region, creating windows based on the flanking inputs given on the command line.

    Inputs -
        r  - record to process
        r0 - previous record in file
        r2 - next record in file
    Returns -
        None
    """
    if args.flankbygene:
        # When flanking by gene (or more accurately, flanking by region of interest), set the step size to be equal to
        # the size of the region of interest
        # NB: specifically, this is related to the size of the windows in the region of interest, so the number of
        #     internal windows (args.numinternal) plays into this size
        r.step1 = float(r.end - r.beg + 1) / (args.numinternal)
        r.step2 = r.step1

    elif args.flanktoneighbor:
        # When flanking by neighbor, the entire space between the end of the previous region and the start of the
        # current region will be broken up into args.flanknumber number of windows (and likewise for the space between
        # the end of the current region and the start of the next region)

        # If no previous region exists or the end of the previous region overlaps the start of the current region, then
        # don't create any windows upstream (towards the start of the chromosome relative to the reference) of the
        # current region
        if r0 is None or r.beg < r0.end:
            r.step1 = -1
        else:
            # Evenly divide up region between end of previous region and start of current region into args.flanknumber
            # number of windows
            r.step1 = float(r.beg - r0.end + 1) / (args.flanknumber)

            # Based on the description of args.fold, I'm not entirely sure why the the step size is halved when
            # args.fold is set
            if args.fold:
                r.step1 /= 2

        # If no next region exists or the start of the next region overlaps the end of the current region, then don't
        # create any windows downstream (towards the end of the chromosome relative to the reference) of the current
        # region
        if r2 is None or r.end > r2.beg:
            r.step2 = -1
        else:
            # Evenly divide up region between end of current region and start of next region into args.flanknumber
            # number of windows
            r.step2 = float(r2.beg - r.end + 1) / (args.flanknumber)

            # Based on the description of args.fold, I'm not entirely sure why the the step size is halved when
            # args.fold is set
            if args.fold:
                r.step2 /= 2

    # If strand information is not included, assume all regions are on the forward (+) strand
    # Otherwise pick off the strand information from the input file
    if args.strand is None:
        strand = '+'
    else:
        strand = r.fields[args.strand-1]

    # If args.ignoreend, then the end of the region is ignored. For forward strand regions, this is the 3' end relative
    # to the reference, so the region will be set to (start, start+1). For reverse strand regions, this is the 5' end
    # relative to the reference, so the region will be set to (end-1, end)
    if args.ignoreend:
        args.numinternal = 1
        if strand == '+':
            r.end = r.beg + 1
        else:
            r.beg = r.end - 1

    # Potential FIXME: If args.fold is included, then every line will be printed twice - once for this if-block and once
    #                  for the strand if-else block. Further, the indices between the two prints will differ, as
    #                  args.fold makes the index equivalent between the upstream and downstream windows. This could be
    #                  solved by placing a return as the last expression in the block
    if args.fold:
        r.sample_backward(args, lambda i: -i-1)
        r.sample_internal(args, lambda i: min(i, args.numinternal-1-i))
        r.sample_forward(args, lambda i: -i-1)

    if strand == '+':
        # Create probe regions for forward strand (+) intervals
        r.sample_backward(args, lambda i: -i-1)
        r.sample_internal(args, lambda i: i)
        r.sample_forward(args, lambda i: args.numinternal+i)
    else:
        # For reverse strand (-) intervals, moving upstream from the CTCF cite is actually moving downstream relative to
        # the reference (i.e., the position value gets larger)
        r.sample_forward(args, lambda i: -i-1)
        r.sample_internal(args, lambda i: args.numinternal-1-i)
        r.sample_backward(args, lambda i: args.numinternal+i)

    return

def main(args):
    """Main function to run.

    Inputs -
        args   - argparse.parse_args() from running script
    Returns -
        None
    """
    # If collapsing interval to the middle, then (re)set the number of points sampled in the internal interval
    if args.collapse:
        if args.outer: # if args.outer, then keep the collapsed middle for internal probe
            args.numinternal = 1
        else: # ignore internal interval if --outer is not set
            args.numinternal = 0

    # Loop through intervals in BED file
    r0 = None # previous record (interval), only used with args.flanktoneighbor
    r  = None # current record (interval)
    r2 = None # next record (interval), only used with args.flanktoneighbor
    for line in args.table:
        r2 = Record(line.strip('\n').split('\t'), args)

        if r is not None: # skip the first line
            if r.chrm == r2.chrm: # current and next intervals fall on the same chromosome
                process_record(r, r0, r2)
            else: # next interval falls on a different chromosome
                process_record(r, r0, None)

        r0 = r
        r  = r2

    # Handle the last record
    r2 = None
    process_record(r, r0, r2)

    return

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate meta gene')
    parser.add_argument('table', help="Input bed file", type = argparse.FileType('r'), default='-')

    # How to handle intervals
    parser.add_argument(
        '--middle',
        action = 'store_true',
        help = 'use middle point of each probe window'
    )
    parser.add_argument(
        '--outer',
        action = 'store_true',
        help = 'use outer point of each probe window (w.r.t. to the target region)'
    )
    parser.add_argument(
        '--collapse',
        action = 'store_true',
        help = 'collapse internal interval to middle - all probe windows will now be relative to this location'
    )

    # Control how the probe window sizes are determined
    # Set step size to region of interest size - highest precedence when setting probe step size
    parser.add_argument(
        '--flankbygene',
        action = 'store_true',
        help = 'allow the size of steps to vary according to the region of interest length'
    )
    # Set step size to be flanknumber equal steps to next region - second highest precedence when setting step size
    parser.add_argument(
        '--flanktoneighbor',
        action = 'store_true',
        help = 'size of step is dependent on the closest region of interest to current region'
    )
    # Create equal size steps with length args.flankstep - used when flankbygene and flanktoneighbor are not set
    parser.add_argument(
        '-f', '--flankstep',
        type = int,
        default=100,
        help = 'set step size to X bases (default 100)'
    )
    parser.add_argument(
        '-m', '--flanknumber',
        type = int,
        default=30 ,
        help = 'number of points to sample outside of the region of interest (default 30)')

    # Expand probe windows - expands window in both directions, occurs for internal and external windows
    parser.add_argument(
        '--expansion',
        type = int,
        default = 0,
        help = 'number of bases to expand window, expands window in both directions (default 0)'
    )

    # Control internal sampling
    parser.add_argument(
        '-n', '--numinternal',
        type = int,
        default = 30,
        help = 'number of points to sample in the region of interest, --middle ignores this (default 30)'
    )

    # Other inputs
    parser.add_argument(
        '--fold',
        action = 'store_true',
        help = 'use the same index for intervals on both sides of the target region (usually used when strand is irrelevant)'
    )
    parser.add_argument(
        '-s', '--strand',
        type = int,
        default = None,
        help = 'the column which contains strand information in input file - if None then ignore strand (default None)'
    )
    parser.add_argument(
        '--ignoreend',
        action ='store_true',
        help = 'ignore the end of the input interval'
    )

    parser.set_defaults(func=main)
    args = parser.parse_args()

    try:
        args.func(args)
    except IOError:
        exit

Python numpy From line 14 of scripts/region_centered_bin_averages.py

rename_fastqs <- function(samplesheet, fastq_dir, out_dir) {
    sheet <- read.delim(samplesheet, sep="\t")

    ifelse(!dir.exists(out_dir), dir.create(file.path(out_dir), recursive = TRUE), FALSE)

    for (samp in sheet$sample) {
        print(paste("Renaming PE reads for:", samp))

        # Rename R1 reads
        file_1 <- paste0(fastq_dir, "/", unlist(strsplit(subset(sheet, sample == samp)$fq1, split = ",")))
        print(paste("Found", length(file_1), "R1 files for", samp))

        for (i in 1:length(file_1)) {
            if (file.exists(file_1[i])) {
                new_name <- paste0(out_dir, "/", samp, "-", i, "-R1.fastq.gz")
                system(paste0("ln -sr ", file_1[i], " ", new_name))
                print(paste(file_1[i], "successfully symlinked to", new_name))
            } else {
                stop(paste("Error:", file_1[i], "not found in", fastq_dir))
            }
        }

        # Rename R2 reads
        file_2 <- paste0(fastq_dir, "/", unlist(strsplit(subset(sheet, sample == samp)$fq2, split = ",")))
        print(paste("Found", length(file_2), "R2 files for", samp))

        for (i in 1:length(file_2)) {
            if (file.exists(file_2[i])) {
                new_name <- paste0(out_dir, "/", samp, "-", i, "-R2.fastq.gz")
                system(paste0("ln -sr ", file_2[i], " ", new_name))
                print(paste(file_2[i], "successfully symlinked to", new_name))
            } else {
                stop(paste("Error:", file_2[i], "not found in", fastq_dir))
            }
        }
    }
}

rename_fastqs(snakemake@params[["samplesheet"]], snakemake@params[["fastq_dir"]], snakemake@output[["symlink_dir"]])