Snakemake Workflow For WGBS Data


This Snakemake workflow takes paired-end whole-genome bisulfite sequencing (WGBS) data and processes it using BISulfite-seq CUI Toolkit (BISCUIT).

BISCUIT was written to perform alignment, DNA methylation and mutation calling, and allele-specific methylation analysis of bisulfite sequencing data (https://huishenlab.github.io/biscuit/).

Download BISCUIT here: https://github.com/huishenlab/biscuit/releases/latest.

Components of the workflow

The following components are listed in their general run order, though the exact order may vary depending on the dependencies between rules.

  • [default off] Generate asset files used by QC-related rules

  • [default off] Modify and index the genome reference to include methylation controls

  • [default off] Trim adapters and/or hard clip R2

  • [default off] Run Fastq Screen in bisulfite mode

  • Run FastQC on raw FASTQ files

  • Alignment, duplicate tagging, indexing, flagstat of input data (biscuitBlaster v1 and v2)

  • Methylation information extraction (BED Format)

  • Merge C and G beta values in CpG dinucleotide context

  • [default off] SNP and Epiread extraction

  • [default off] Run Preseq on aligned BAM

  • MultiQC with BISCUIT QC modules designed specifically for methylation data

  • [default off] Generate plots of the observed / expected coverage ratio for different genomic features

  • [default off] Generate percentage of covered CpGs and CpG island coverage figures

  • [default off] QC methylated and unmethylated controls

Many options can be specified directly in config/config.yaml; otherwise, the commands in the Snakefile can be modified to meet different needs.
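For illustration, a trimmed-down config/config.yaml might look like the sketch below. Only output_directory is a key actually referenced elsewhere in this document; the remaining key names are hypothetical placeholders, so consult the shipped config file for the exact names and defaults.

    # Hypothetical excerpt of config/config.yaml -- key names are illustrative only
    samples: config/samples.tsv        # path to the sample sheet
    output_directory: results          # referenced as config['output_directory'] below
    reference:
      fasta: /path/to/genome.fa.gz     # reference genome FASTA
      index: /path/to/biscuit_index    # BISCUIT index location
    trim_galore: False                 # [default off] adapter trimming / hard clipping
    fastq_screen: False                # [default off] bisulfite-mode contamination screen
    preseq: False                      # [default off] library complexity estimate
    control_vectors: False             # [default off] QC of methylation controls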

Dependencies

The following dependencies are installed automatically when running with --use-conda; otherwise, you must have them available in your PATH.

  • snakemake (version 6.0+)

  • biscuit (version 1.0.1+)

  • htslib (version 1.12+)

  • samtools (version 1.12+)

  • samblaster

  • parallel (preferably version 20201122)

  • bedtools

  • preseq (version 3.1.2+, must be compiled with htslib enabled)

  • fastqc

  • trim_galore

  • fastq_screen (only required if running fastq_screen )

  • bismark (only required if running fastq_screen )

  • pigz

  • python (version 3.7+)

    • pandas

    • numpy

    • matplotlib

  • multiqc

  • R (version 4.1.1+)

    • tidyverse (only required for plotting methylation controls)

    • ggplot2 (only required for plotting methylation controls)

    • patchwork (only required for plotting methylation controls)

    • viridisLite (only required for plotting methylation controls)

Two things of note: 1) when running with --use-conda, it is easiest to install mamba using conda, and 2) it is preferable to install snakemake using conda rather than loading it as an environment module. This avoids potential conflicts between packages (such as matplotlib) found in the snakemake module's Python distribution and the conda-installed Python distribution.
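For example, a common setup following the Snakemake installation documentation looks like:

    # Install mamba into the base conda environment, then use it to create a
    # dedicated environment containing snakemake
    conda install -n base -c conda-forge mamba
    mamba create -c conda-forge -c bioconda -n snakemake snakemake

    # Activate the environment before launching the workflow
    conda activate snakemake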

Running the workflow

  • Clone the repo (https://github.com/huishenlab/Biscuit_Snakemake_Workflow/tree/master).

    • git clone [email protected]:huishenlab/Biscuit_Snakemake_Workflow.git (SSH)

    • git clone https://github.com/huishenlab/Biscuit_Snakemake_Workflow.git (HTTPS)

  • Place gzipped FASTQ files into raw_data/. Alternatively, you can specify the location of your gzipped FASTQ files in config/config.yaml.

  • Replace the example config/samples.tsv with your own sample sheet containing:

    • A row for each sample

    • The following three columns for each row:

      • A. sample

      • B. fq1 (name of R1 file for sample in your raw data directory)

      • C. fq2 (name of R2 file for sample in your raw data directory)

      • Any other columns included are ignored. Note: you can either edit config/samples.tsv in place or specify the path to your sample sheet in config/config.yaml. If you create your own sample sheet, make sure to include the header line as seen in the example file (a minimal example is shown after this list).

  • Modify config/config.yaml to specify the appropriate:

    • Reference genome

    • Biscuit index

    • Biscuit QC assets (https://github.com/huishenlab/biscuit/releases/latest)

    • Toggle optional workflow components (turn rules on in config/config.yaml by changing False to True)

    • Set other run parameters in config/config.yaml

    • If you are using environment modules on your system, you can set the module locations in the corresponding config entries. By default, the pipeline uses conda/mamba to download the required packages. Note: if a specified module is not available, snakemake gives a warning but will still run successfully as long as the required executables are in your PATH.

  • Submit the workflow to an HPC using something similar to bin/run_snakemake_workflow.sh (e.g., qsub -q [queue_name] bin/run_snakemake_workflow.sh). bin/run_snakemake_workflow.sh works with a PBS/Torque queue system, but will need to be modified to work with Slurm or another scheduler.
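As referenced above, a minimal config/samples.tsv might look like the following (tab-separated, with the required header line; the FASTQ names are placeholders for files in your raw data directory):

    sample	fq1	fq2
    sampleA	sampleA_R1.fastq.gz	sampleA_R2.fastq.gz
    sampleB	sampleB_R1.fastq.gz	sampleB_R2.fastq.gz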

After the workflow

  • The output files in config['output_directory']/analysis/pileup/ may be imported into a BSseq object using biscuiteer::readBiscuit() (see the sketch after this list).

  • config['output_directory']/analysis/multiqc/multiqc_report.html contains the methylation-specific BISCUIT QC modules (https://huishenlab.github.io/biscuit/docs/alignment/QC.html)
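As a minimal sketch of that import step (argument names are as documented for biscuiteer::readBiscuit(); the file paths are placeholders for the pileup outputs):

    library(biscuiteer)

    # Read a BISCUIT BED and its matching VCF into a BSseq object
    bsseq <- readBiscuit(
        BEDfile = "analysis/pileup/sampleA_mergecg.bed.gz",
        VCFfile = "analysis/pileup/sampleA.vcf.gz",
        merged  = TRUE  # TRUE when importing the merged-CpG output
    )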

Test dataset

This workflow comes with a working example dataset. To test the snakemake workflow on your system, place the 10 FASTQ files in bin/working_example_dataset into raw_data/ and use the default config/samples.tsv sample sheet. These example files can be mapped to the human genome.

Example default workflow - 1 sample

[Workflow DAG diagram for the default workflow run on one sample]

Helpful snakemake commands for debugging a workflow

For more information on Snakemake: https://snakemake.readthedocs.io/en/stable/

  • Do a test run: snakemake -npr

  • Unlock directory after a manually aborted run: snakemake --unlock --cores 1

  • Create a workflow diagram for your run: snakemake --dag | dot -Tpng > my_dag.png

  • Run pipeline from the command line: snakemake --use-conda --cores 1

Code Snippets

shell:
    """
    set -euo pipefail

    bins=({params.bins})
    outfiles=({params.outfile_no_gz})

    for i in "${{!bins[@]}}"; do 
        bash workflow/scripts/find_binned_average.sh \
             {input.reference}.fai \
             ${{bins[$i]}} \
             {params.covFilter} \
             {input.bed} \
             ${{outfiles[$i]}} # find_binned_average gzips its output, so the final name differs from {{output.outfile}}
    done

    """   
shell:
    """
    mkdir -p {output.refdir}

    if (file {input} | grep -q "extra field"); then
        cat <(bgzip -cd {input}) <(zcat bin/puc19.fa.gz) <(zcat bin/lambda.fa.gz) | bgzip > {output.ref}
    elif (file {input} | grep -q "gzip compressed data, was"); then
        cat <(zcat {input}) <(zcat bin/puc19.fa.gz) <(zcat bin/lambda.fa.gz) | bgzip > {output.ref}
    else
        cat {input} <(zcat bin/puc19.fa.gz) <(zcat bin/lambda.fa.gz) | bgzip > {output.ref}
    fi

    biscuit index {output.ref}
    samtools faidx {output.ref}
    """
shell:
    """
    if [ {params.biscuit_version} == "v2" ]; then
        echo "biscuit blaster v2" 2> {log.biscuit_blaster_version}

        # biscuitBlaster pipeline
        biscuit align -@ {params.bb_threads} -b {params.lib_type} \
            -R '@RG\tLB:{params.LB}\tID:{params.ID}\tPL:{params.PL}\tPU:{params.PU}\tSM:{params.SM}' \
            {input.reference} <(zcat {input.R1}) <(zcat {input.R2}) 2> {log.biscuit} | \
        samblaster --addMateTags -d {params.disc} -s {params.split} -u {params.unmapped} 2> {log.samblaster} | \
        samtools sort -@ {params.st_threads} -m 5G -o {output.bam} -O BAM - 2> {log.samtools_sort}
        samtools index -@ {params.st_threads} {output.bam} 2> {log.samtools_index}

        # Get some initial stats
        samtools flagstat {output.bam} 1> {output.flagstat} 2> {log.samtools_flagstat}

        # Sort, compress, and index discordant read file
        samtools sort -@ {params.st_threads} -o {output.disc} -O BAM {params.disc} 2> {log.sort_disc}
        samtools index -@ {params.st_threads} {output.disc} 2> {log.index_disc}

        # Sort, compress, and index split read file
        samtools sort -@ {params.st_threads} -o {output.split} -O BAM {params.split} 2> {log.sort_split}
        samtools index -@ {params.st_threads} {output.split} 2> {log.index_split}

        # Compress unmapped/clipped FASTQ
        bgzip -@ {params.st_threads} {params.unmapped} 2> {log.bgzip_unmapped}

        # Clean up
        rm {params.disc}
        rm {params.split}
    elif [ {params.biscuit_version} == "v1" ]; then
        echo "biscuit blaster v1" 2> {log.biscuit_blaster_version}

        # biscuitBlaster pipeline
        biscuit align -@ {params.bb_threads} -b {params.lib_type} \
            -R '@RG\tLB:{params.LB}\tID:{params.ID}\tPL:{params.PL}\tPU:{params.PU}\tSM:{params.SM}' \
            {input.reference} <(zcat {input.R1}) <(zcat {input.R2}) 2> {log.biscuit} | \
        samblaster --addMateTags 2> {log.samblaster} | \
        samtools sort -@ {params.st_threads} -m 5G -o {output.bam} -O BAM - 2> {log.samtools_sort}
        samtools index -@ {params.st_threads} {output.bam} 2> {log.samtools_index}

        # Get some initial stats
        samtools flagstat {output.bam} 1> {output.flagstat} 2> {log.samtools_flagstat}

        # Add files to ensure smooth output transition
        touch {output.disc}
        touch {output.disc_bai}
        touch {output.split}
        touch {output.split_bai}
        touch {output.unmapped}
    else
        echo "biscuit: biscuit_blaster_version must be v1 or v2 in config/config.yaml" 1> {log.biscuit_blaster_version}
        exit 1
    fi
    """
shell:
    """
    if [ {params.nome} == "True" ]; then
        biscuit pileup -N -@ {threads} -o {params.vcf} {input.ref} {input.bam} 2> {log.pileup}
    else
        biscuit pileup -@ {threads} -o {params.vcf} {input.ref} {input.bam} 2> {log.pileup}
    fi

    bgzip {params.vcf} 2> {log.vcf_gz}
    tabix -p vcf {output.vcf_gz} 2> {log.vcf_tbi}

    biscuit vcf2bed -t cg {output.vcf_gz} 1> {params.bed} 2> {log.vcf2bed}
    bgzip {params.bed} 2> {log.bed_gz}
    tabix -p bed {output.bed_gz} 2> {log.bed_tbi}
    """
shell:
    """
    if [ {params.nome} == "True" ]; then
        biscuit mergecg -N {input.ref} {input.bed} 1> {params.mergecg} 2> {log.mergecg}
    else
        biscuit mergecg {input.ref} {input.bed} 1> {params.mergecg} 2> {log.mergecg}
    fi

    bgzip {params.mergecg} 2> {log.mergecg_gz}
    tabix -p bed {output.mergecg_gz} 2> {log.mergecg_tbi}
    """
shell:
    """
    biscuit vcf2bed -t snp {input.vcf_gz} > {params.snp_bed} 2> {log}
    bgzip {params.snp_bed}
    tabix -p bed {output.snp_bed_gz}
    """
shell:
    """
    if [[ "$(zcat {input.snps} | head -n 1 | wc -l)" == "1" ]]; then
        if [ {params.nome} == "True" ]; then
            biscuit epiread -N -@ {threads} -B <(zcat {input.snps}) {input.ref} {input.bam} | sort -k1,1 -k2,2n > {params.epibed} 2> {log}
        else
            biscuit epiread -@ {threads} -B <(zcat {input.snps}) {input.ref} {input.bam} | sort -k1,1 -k2,2n > {params.epibed} 2> {log}
        fi
    else
        if [ {params.nome} == "True" ]; then
            biscuit epiread -N {input.ref} {input.bam} | sort -k1,1 -k2,2n > {params.epibed} 2> {log}
        else
            biscuit epiread {input.ref} {input.bam} | sort -k1,1 -k2,2n > {params.epibed} 2> {log}
        fi
    fi

    bgzip {params.epibed}
    tabix -p bed {output.epibed_gz}
    """
shell:
    """
    set +o pipefail
    mkdir -p {params.outdir}

    bedtools genomecov -bga -split -ibam {input.bam} | bedtools map -a {params.bins} -b - -c 4 | gzip > {output.uni}
    """
script:
    '../scripts/plot_covg_uniformity.py'
shell:
    """
    set +o pipefail
    mkdir -p {params.outdir}

    # Unfiltered coverage
    bedtools genomecov -bga -split -ibam {input.bam} | \
    LC_ALL=C sort --parallel={threads} -k1,1 -k2,2n -T {params.outdir} | \
    bedtools intersect -a {params.cpg} -b - -wo -sorted | \
    bedtools groupby -g 1-3 -c 7 -o min | \
    gzip -c > {output.unf}

    # MAPQ >= 40 filtered coverage
    samtools view -q 40 -b {input.bam} | \
    bedtools genomecov -bga -split -ibam stdin | \
    LC_ALL=C sort --parallel={threads} -k1,1 -k2,2n -T {params.outdir} | \
    bedtools intersect -a {params.cpg} -b - -wo -sorted | \
    bedtools groupby -g 1-3 -c 7 -o min | \
    gzip -c > {output.fil}
    """
shell:
    """
    echo -e "Feature\tCpG_Count\tQ40_Reads\tAll_Reads" > {output}

    # All CpGs
    zcat {input.unf} | wc -l | awk '{{ printf("AllCpGs\\t%s", $1) }}' >> {output}
    zcat {input.fil} | awk '$4>0{{ a += 1 }} END{{ printf("\\t%d", a) }}' >> {output}
    zcat {input.unf} | awk '$4>0{{ a += 1 }} END{{ printf("\\t%d\\n", a) }}' >> {output}

    # CpG Island CpGs
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.cgi}) -sorted | wc -l | \
    awk '{{ printf("CGICpGs\\t%s", $1) }}' >> {output}
    bedtools intersect -a {input.fil} -b <(bedtools merge -i {params.cgi}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d", a) }}' >> {output}
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.cgi}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d\\n", a) }}' >> {output}

    # Exonic CpGs
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.exn}) -sorted | wc -l | \
    awk '{{ printf("ExonicCpGs\\t%s", $1) }}' >> {output}
    bedtools intersect -a {input.fil} -b <(bedtools merge -i {params.exn}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d", a) }}' >> {output}
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.exn}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d\\n", a) }}' >> {output}

    # Genic CpGs
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.gen}) -sorted | wc -l | \
    awk '{{ printf("GenicCpGs\\t%s", $1) }}' >> {output}
    bedtools intersect -a {input.fil} -b <(bedtools merge -i {params.gen}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d", a) }}' >> {output}
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.gen}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d\\n", a) }}' >> {output}

    # Repeat-masked CpGs
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.msk}) -sorted | wc -l | \
    awk '{{ printf("RepeatCpGs\\t%s", $1) }}' >> {output}
    bedtools intersect -a {input.fil} -b <(bedtools merge -i {params.msk}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d", a) }}' >> {output}
    bedtools intersect -a {input.unf} -b <(bedtools merge -i {params.msk}) -sorted | \
    awk '$4>0{{ a += 1 }} END{{ printf("\\t%d\\n", a) }}' >> {output}
    """
shell:
    """
    # Total number of CpG islands
    zcat {params.cgi} | wc -l | awk '{{ printf("n_cpg_islands\\t%s\\n", $1) }}' > {output}

    # Number of CpG islands with at least one read with MAPQ>=40 covering 1+, 3+, 5+, and 7+ CpGs in that CpG island
    bedtools intersect -a {input.fil} -b <(bedtools merge -i {params.cgi}) -sorted -wo | \
    awk '$4>0 {{ print $5":"$6"-"$7 }}' | uniq -c | \
    awk '{{ if ($1 >= 1) {{ a += 1; }} if ($1 >= 3) {{ b += 1; }} if ($1 >= 5) {{ c += 1; }} if ($1 >= 7) {{ d += 1; }} }} END {{ printf("%d %d %d %d", a, b, c, d) }}' | \
    awk '{{ printf("one_cpg\\t%s\\nthree_cpgs\\t%s\\nfive_cpgs\\t%s\\nseven_cpgs\\t%s\\n", $1, $2, $3, $4) }}' >> {output}
    """
script:
    '../scripts/plot_cpg_stats_features.py'
script:
    '../scripts/plot_cpg_stats_cgi.py'
shell:
    """
    set +o pipefail;

    echo "Create output directory" 1> {log}
    mkdir -p {params.dir}

    echo "Set reference location" 1> {log}
    # Reference location
    REFLOC={input.ref}.fai

    echo "Retrieve CpG island file" 1> {log}
    # CpG Islands from UCSC on only the canonical chromosomes
    wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/{params.gen}/database/cpgIslandExt.txt.gz | \
    gunzip -c | \
    awk 'BEGIN{{ OFS="\\t"; }}{{ print $2, $3, $4; }}' | \
    awk '{{ if ($1 ~ /^chr[1234567890XYM]{{1,2}}$/) {{ print }} }}' | \
    bedtools sort -i - | \
    gzip -c > {output.cgi}
    echo "CpG island file retrieved" 1> {log}

    # Create middles for finding locations
    bedtools slop \
        -i {output.cgi} \
        -g ${{REFLOC}} \
        -b 2000 | \
    bedtools merge -i - \
    > {params.tmp_sho}

    bedtools slop \
        -i {output.cgi} \
        -g ${{REFLOC}} \
        -b 4000 | \
    bedtools merge -i - \
    > {params.tmp_shl}

    # CpG Open Seas (intervening locations)
    sort -k1,1 -k2,2n ${{REFLOC}} | \
    bedtools complement -L -i {params.tmp_shl} -g - | \
    gzip -c > {output.opn}

    # CpG Shelves (shores +/- 2kb)
    bedtools subtract \
        -a {params.tmp_shl} \
        -b {params.tmp_sho} | \
    gzip -c > {output.shl}

    # CpG Shores (island +/- 2kb)
    bedtools subtract \
        -a {params.tmp_sho} \
        -b {output.cgi} | \
    gzip -c > {output.sho}

    # Clean up CpG temporary files
    rm -f {params.tmp_shl} {params.tmp_sho}
    echo "CpG seascape created and cleaned up" 1> {log}

    # BISCUIT assets (includes CpG and top/bottom 10% GC-content files)
    build_biscuit_QC_assets.pl --outdir {params.dir} --ref {input.ref} 1>> {log}

    # Repeat-masked file
    if [[ {params.gen} != "mm9" ]]; then
        wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/{params.gen}/database/rmsk.txt.gz | \
        gunzip -c | \
        awk 'BEGIN{{ OFS="\\t"; }}{{ print $6, $7, $8; }}' | \
        awk '{{ if ($1 ~ /^chr[1234567890XYM]{{1,2}}$/) {{ print }} }}' | \
        bedtools sort -i - | \
        bedtools merge -i - | \
        gzip -c > {output.msk}
    else
        wget -q -nd -r -nH -np --accept-regex 'chr[1234567890XYM]*_rmsk.txt.gz$' \
            --regex-type=posix http://hgdownload.cse.ucsc.edu/goldenpath/mm9/database/

        cat chr*_rmsk.txt.gz | \
        gunzip -c | \
        awk 'BEGIN{{ OFS="\\t"; }}{{ print $6, $7, $8; }}' | \
        bedtools sort -i - | \
        bedtools merge -i - | \
        gzip -c > {output.msk}

        if [[ -f index.html ]]; then rm index.html; fi
        if [[ -f robots.txt ]]; then rm robots.txt; fi
        rm chr*_rmsk.txt.gz
    fi

    # Genic, intergenic, and exon files
    wget -qO- {params.annt} | \
    gunzip -c | \
    awk 'BEGIN{{ OFS="\\t"; }}{{ if ($3 == "gene") {{ print $1, $4, $5 > "{params.tmp_gen}"; }} else if ($3 == "exon") {{ print $1, $4, $5 > "{params.tmp_exn}"; }} }}'

    bedtools sort -i {params.tmp_gen} | \
    bedtools merge -i - | \
    gzip -c > {output.gen}

    bedtools sort -i {params.tmp_exn} | \
    bedtools merge -i - | \
    gzip -c > {output.exn}

    sort -k1,1 -k2,2n {input.ref}.fai | \
    bedtools complement -L -i {output.gen} -g - | \
    gzip -c > {output.inr}

    # Clean up
    rm -f {params.tmp_gen} {params.tmp_exn}
    """       
shell:
    """
    set +o pipefail;

    #cd {params.dir}

    wget -P {params.dir} --no-check-certificate -q https://bismap.hoffmanlab.org/raw/{params.gen}/k100.bismap.bedgraph.gz

    # Create average mappability scores in 10kb windows
    zcat {output.bis} | sort -k1,1 -k2,2n | gzip > {output.sor}

    bedtools makewindows -w 10000 -g {input.ref}.fai | gzip 1> {output.byn} 2> {log}
    bedtools map -a {output.byn} -b {output.sor} -c 4 -o mean | gzip 1> {output.wgt} 2>> {log}

    bedtools intersect -a {output.bis} -b {input.cpg} | gzip -c 1> {output.cpg} 2>> {log}
    bedtools intersect -a {output.bis} -b {input.cgi} | gzip -c 1> {output.cgi} 2>> {log}
    bedtools intersect -a {output.bis} -b {input.exn} | gzip -c 1> {output.exn} 2>> {log}
    bedtools intersect -a {output.bis} -b {input.gen} | gzip -c 1> {output.gen} 2>> {log}
    bedtools intersect -a {output.bis} -b {input.inr} | gzip -c 1> {output.inr} 2>> {log}
    bedtools intersect -a {output.bis} -b {input.msk} | gzip -c 1> {output.msk} 2>> {log}
    """
shell:
    """
    bash workflow/scripts/create_context_beds.sh \
        -t {threads} \
        -o {params.dir} \
        {input.ref} \
        {params.cpg}/{input.cpg} 2> {log}
    """
shell:
    """
    set +o pipefail
    mkdir -p {params.outdir}

    # Find genomic coverage
    samtools view -hb -F 0x4 -q 40 {input.bam} | \
    bedtools genomecov -bg -ibam stdin | \
    gzip -c > {output.cov}
    """
shell:
    """
    set +o pipefail

    # Find intersections
    bedtools intersect -a {params.cpg} -b {input.cov} -wo | gzip -c > {output.cpg}
    """
shell:
    """
    set +o pipefail

    # Find intersections
    bedtools intersect -a {params.cgi} -b {input.cov} -wo | gzip -c > {output.cgi}
    """
shell:
    """
    set +o pipefail

    # Find intersections
    bedtools intersect -a {params.intergenic} -b {input.cov} -wo | gzip -c > {output.intergenic}
    """
shell:
    """
    set +o pipefail

    # Find intersections
    bedtools intersect -a {params.exon} -b {input.cov} -wo | gzip -c > {output.exon}
    """
shell:
    """
    set +o pipefail

    # Find intersections
    bedtools intersect -a {params.genic} -b {input.cov} -wo | gzip -c > {output.genic}
    """
shell:
    """
    set +o pipefail

    # Find intersections
    bedtools intersect -a {params.rmsk} -b {input.cov} -wo | gzip -c > {output.rmsk}
    """
shell:
    """
    set +o pipefail

    # Find intersections
    bedtools intersect -a {params.mapped} -b {input.cov} -wo | gzip -c > {output.mapped}
    """
script:
    '../scripts/calculate_feature_sizes.py'
script:
    '../scripts/plot_obs_exp.py'
script:
    '../scripts/rename.R'
shell:
    """
    mkdir -p {params.dir}
    fastqc --outdir {params.dir} --threads {threads} {input} 2> {log.stderr} 1> {log.stdout}
    """
shell:
    """
    trim_galore \
        --paired \
        {input} \
        --output_dir {params.outdir} \
        --cores {threads} \
        --fastqc \
        {params.args_list} \
        2> {log.stderr} 1> {log.stdout}

    # Create merged R1 and R2 FASTQs, clean up files that were merged
    cat {params.outdir}/{wildcards.sample}-*-R1_val_1.fq.gz > {params.outdir}/{wildcards.sample}-R1_val_1_merged.fq.gz
    cat {params.outdir}/{wildcards.sample}-*-R2_val_2.fq.gz > {params.outdir}/{wildcards.sample}-R2_val_2_merged.fq.gz
    rm {params.outdir}/{wildcards.sample}-*-R1_val_1.fq.gz
    rm {params.outdir}/{wildcards.sample}-*-R2_val_2.fq.gz
    """
shell:
    """
    fastq_screen --bisulfite --conf {params.conf} --outdir {params.output_dir} {input} 2> {log.fastq_screen}
    """
shell:
    """
    set +o pipefail;
    QC.sh \
        -o {params.output_dir} \
        --vcf {input.vcf} \
        {params.assets} \
        {input.ref} \
        {wildcards.sample} \
        {input.bam} \
        2> {log}
    """
(from line 55 of rules/qc.smk)
shell:
    """
    mkdir -p {params.dir}
    preseq c_curve {params.opt} -o {params.out} -P -B {input.bam} 2> {log}
    """
shell:
    """
    multiqc -f -o {params.output_dir} -n multiqc_report.html {params.mqc_dirs} 2> {log}
    """
script:
    '../scripts/plot_percent_covered.py'
(from line 172 of rules/qc.smk)
shell:
    """
    mkdir -p {params.lambda_dir}
    mkdir -p {params.puc19_dir}

    # >J02459.1 Escherichia phage Lambda, complete genome - UNMETHYLATED CONTROL
    zcat {input.bed} | {{ grep '^J02459.1' || true; }} > {output.lambda_bed} 2> {log.lambda_log}

    # >M77789.2 Cloning vector pUC19, complete sequence - METHYLATED CONTROL
    zcat {input.bed} | {{ grep '^M77789.2' || true; }}  > {output.puc19_bed} 2> {log.puc19_log}
    """
(from line 199 of rules/qc.smk)
script:
    '../scripts/control_vector.R'
(from line 229 of rules/qc.smk)
shell:
    """
    python3 workflow/scripts/region_centered_bin_averages.py {params.args_list} {input.region_bed} | \
    sort -k1,1 -k2,2n > {output.region_centered_bins}
    """
shell:
    """
    # do the intersecting for this sample
    # column 14 has the methylation
    bedtools intersect \
        -a {input.region_bins} \
        -b {input.merged_sample_bed} \
        -sorted -wo | \
    bedtools groupby \
        -g 1-10 \
        -c 14 \
        -o mean | \
    awk 'BEGIN{{ OFS="\\t" }} {{ print $7, $8, $9, $4, $10, $11 }}' > {output.bed}
    """
import pandas as pd

DEFAULT_2 = {'cov': [0], 'ol': [0]}
DEFAULT_3 = {'start': [0], 'end': [0], 'weight': [0]}

NAMES_2 = ['cov', 'ol']
NAMES_3 = ['start', 'end', 'weight']

# TODO: This can probably be adjusted to read files in chunks to reduce memory overhead
def calculate_feature_sizes(samp, infiles, paramfiles, outfile):
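    """Sum weighted feature sizes (asset files) and coverage overlaps
    (per-sample intersect files), writing one tab-separated line per sample.

    Empty input files are replaced with zero-filled frames so the sums below
    still evaluate cleanly.
    """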
    # Load the asset files
    try:
        df_map_p = pd.read_csv(paramfiles['bismap'], sep='\t', header=None, usecols=[1, 2, 3], skiprows=1, names=NAMES_3)
    except pd.errors.EmptyDataError:
        df_map_p = pd.DataFrame(DEFAULT_3)
    try:
        df_cpg_p = pd.read_csv(paramfiles['cpg'], sep='\t', header=None, usecols=[1, 2, 3], skiprows=1, names=NAMES_3)
    except pd.errors.EmptyDataError:
        df_cpg_p = pd.DataFrame(DEFAULT_3)
    try:
        df_cgi_p = pd.read_csv(paramfiles['cgi'], sep='\t', header=None, usecols=[1, 2, 3], skiprows=1, names=NAMES_3)
    except pd.errors.EmptyDataError:
        df_cgi_p = pd.DataFrame(DEFAULT_3)
    try:
        df_exn_p = pd.read_csv(paramfiles['exon'], sep='\t', header=None, usecols=[1, 2, 3], skiprows=1, names=NAMES_3)
    except pd.errors.EmptyDataError:
        df_exn_p = pd.DataFrame(DEFAULT_3)
    try:
        df_gen_p = pd.read_csv(paramfiles['genic'], sep='\t', header=None, usecols=[1, 2, 3], skiprows=1, names=NAMES_3)
    except pd.errors.EmptyDataError:
        df_gen_p = pd.DataFrame(DEFAULT_3)
    try:
        df_int_p = pd.read_csv(paramfiles['intergenic'], sep='\t', header=None, usecols=[1, 2, 3], skiprows=1, names=NAMES_3)
    except pd.errors.EmptyDataError:
        df_int_p = pd.DataFrame(DEFAULT_3)
    try:
        df_msk_p = pd.read_csv(paramfiles['rmsk'], sep='\t', header=None, usecols=[1, 2, 3], skiprows=1, names=NAMES_3)
    except pd.errors.EmptyDataError:
        df_msk_p = pd.DataFrame(DEFAULT_3)

    # Load the Snakemake generated files
    try:
        df_map_i = pd.read_csv(infiles['mapped'], sep='\t', header=None, usecols=[3, 8], names=NAMES_2)
    except pd.errors.EmptyDataError:
        df_map_i = pd.DataFrame(DEFAULT_2)
    try:
        df_cpg_i = pd.read_csv(infiles['cpg'], sep='\t', header=None, usecols=[3, 8], names=NAMES_2)
    except pd.errors.EmptyDataError:
        df_cpg_i = pd.DataFrame(DEFAULT_2)
    try:
        df_cgi_i = pd.read_csv(infiles['cgi'], sep='\t', header=None, usecols=[3, 8], names=NAMES_2)
    except pd.errors.EmptyDataError:
        df_cgi_i = pd.DataFrame(DEFAULT_2)
    try:
        df_exn_i = pd.read_csv(infiles['exon'], sep='\t', header=None, usecols=[3, 8], names=NAMES_2)
    except pd.errors.EmptyDataError:
        df_exn_i = pd.DataFrame(DEFAULT_2)
    try:
        df_gen_i = pd.read_csv(infiles['genic'], sep='\t', header=None, usecols=[3, 8], names=NAMES_2)
    except pd.errors.EmptyDataError:
        df_gen_i = pd.DataFrame(DEFAULT_2)
    try:
        df_int_i = pd.read_csv(infiles['intergenic'], sep='\t', header=None, usecols=[3, 8], names=NAMES_2)
    except pd.errors.EmptyDataError:
        df_int_i = pd.DataFrame(DEFAULT_2)
    try:
        df_msk_i = pd.read_csv(infiles['rmsk'], sep='\t', header=None, usecols=[3, 8], names=NAMES_2)
    except pd.errors.EmptyDataError:
        df_msk_i = pd.DataFrame(DEFAULT_2)

    # Calculate sizes for finding the sum (weight * (end - start))
    df_map_p['size'] = df_map_p['weight'] * (df_map_p['end'] - df_map_p['start'])
    df_cpg_p['size'] = df_cpg_p['weight'] * (df_cpg_p['end'] - df_cpg_p['start'])
    df_cgi_p['size'] = df_cgi_p['weight'] * (df_cgi_p['end'] - df_cgi_p['start'])
    df_exn_p['size'] = df_exn_p['weight'] * (df_exn_p['end'] - df_exn_p['start'])
    df_gen_p['size'] = df_gen_p['weight'] * (df_gen_p['end'] - df_gen_p['start'])
    df_int_p['size'] = df_int_p['weight'] * (df_int_p['end'] - df_int_p['start'])
    df_msk_p['size'] = df_msk_p['weight'] * (df_msk_p['end'] - df_msk_p['start'])

    # Calculate sizes for finding the sum (coverage * number of overlapping bases)
    df_map_i['size'] = df_map_i['cov'] * df_map_i['ol']
    df_cpg_i['size'] = df_cpg_i['cov'] * df_cpg_i['ol']
    df_cgi_i['size'] = df_cgi_i['cov'] * df_cgi_i['ol']
    df_exn_i['size'] = df_exn_i['cov'] * df_exn_i['ol']
    df_gen_i['size'] = df_gen_i['cov'] * df_gen_i['ol']
    df_int_i['size'] = df_int_i['cov'] * df_int_i['ol']
    df_msk_i['size'] = df_msk_i['cov'] * df_msk_i['ol']

    # Calculate the necessary sums
    sum_map_p = df_map_p['size'].sum(); sum_map_i = df_map_i['size'].sum()
    sum_cpg_p = df_cpg_p['size'].sum(); sum_cpg_i = df_cpg_i['size'].sum()
    sum_cgi_p = df_cgi_p['size'].sum(); sum_cgi_i = df_cgi_i['size'].sum()
    sum_exn_p = df_exn_p['size'].sum(); sum_exn_i = df_exn_i['size'].sum()
    sum_gen_p = df_gen_p['size'].sum(); sum_gen_i = df_gen_i['size'].sum()
    sum_int_p = df_int_p['size'].sum(); sum_int_i = df_int_i['size'].sum()
    sum_msk_p = df_msk_p['size'].sum(); sum_msk_i = df_msk_i['size'].sum()

    with open(outfile, 'w') as f:
        f.write(
            '{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}'.format(
                samp,
                sum_map_p, sum_cpg_p, sum_cgi_p, sum_msk_p, sum_exn_p, sum_gen_p, sum_int_p,
                sum_map_i, sum_cpg_i, sum_cgi_i, sum_msk_i, sum_exn_i, sum_gen_i, sum_int_i,
            )
        )

calculate_feature_sizes(
    snakemake.wildcards['sample'],
    snakemake.input,
    snakemake.params,
    snakemake.output['data']
)
require(tidyverse)
require(ggplot2)
require(patchwork)
require(viridisLite)

import_data <- function(files, logfile) {
    df <- NULL

    for (f in files) {
        if (file.info(f)$size > 0) {
            name <- gsub("\\.bed", "", basename(f))
            cat("Loading", f, "as", name, "\n", file=logfile, append=TRUE)

            my_bed <- read.delim(f, sep="\t", header=FALSE)
            colnames(my_bed) <- c("chr", "start", "end", "beta", "depth", "context")
            my_bed$sample <- rep(name, nrow(my_bed))

            df <- rbind(df, my_bed)
        }
    }

    return(df)
}

create_plot <- function(lam_files, puc_files, out_files, log_file) {
    cat("Loading data\n", file=log_file)
    lam <- import_data(lam_files, log_file)
    puc <- import_data(puc_files, log_file)

    n_samples_l <- length(unique(lam$sample))
    n_samples_p <- length(unique(puc$sample))

    # Unmethylated control
    if (!is.null(lam)) {
        cat("lambda phage data found. making lambda plots\n", file=log_file, append=TRUE)
        topleft <- ggplot(lam, aes(x=sample, y=depth)) +
            geom_boxplot(color='#357BA2FF') +
            theme_bw() +
            theme(
                axis.text.x = element_blank(),
                axis.text.y = element_text(size=12),
                axis.title.y = element_text(size=25),
                plot.title = element_text(size=25, hjust=0.5),
                plot.subtitle = element_text(size=15, hjust=0.5),
            ) +
            scale_color_manual() +
            ggtitle("Unmethylated", subtitle=paste("N =", n_samples_l, "Samples")) +
            expand_limits(y=0) +
            ylab("Coverage") +
            xlab("")

        bottomleft <- ggplot(lam, aes(x=sample, y=beta)) +
            geom_boxplot(color='#357BA2FF') +
            theme_bw() +
            theme(
                axis.text.x = element_text(size=12, angle=45, hjust=1, vjust=1),
                axis.text.y = element_text(size=12),
                axis.title.y = element_text(size=25),
                plot.title = element_text(size=25, hjust=0.5),
                plot.subtitle = element_text(size=15, hjust=0.5),
            ) +
            scale_color_manual() +
            ylim(c(0, 1)) +
            ylab("Beta") +
            xlab("")
    }

    # Methylated control
    if (!is.null(puc)) {
        cat("pUC19 data found. making puc19 plots\n", file=log_file, append=TRUE)
        topright <- ggplot(puc, aes(x=sample, y=depth)) +
            geom_boxplot(color='#357BA2FF') +
            theme_bw() +
            theme(
                axis.text.x = element_blank(),
                axis.text.y = element_text(size=12),
                axis.title.y = element_text(size=25),
                plot.title = element_text(size=25, hjust=0.5),
                plot.subtitle = element_text(size=15, hjust=0.5),
            ) +
            scale_color_manual() +
            ggtitle("Methylated", subtitle=paste("N =", n_samples_p, "Samples")) +
            expand_limits(y=0) +
            ylab("") +
            xlab("")

        bottomright <- ggplot(puc, aes(x=sample, y=beta)) +
            geom_boxplot(color='#357BA2FF') +
            theme_bw() +
            theme(
                axis.text.x = element_text(size=12, angle=45, hjust=1, vjust=1),
                axis.text.y = element_text(size=12),
                axis.title.y = element_text(size=25),
                plot.title = element_text(size=25, hjust=0.5),
                plot.subtitle = element_text(size=15, hjust=0.5),
            ) +
            scale_color_manual() +
            ylim(c(0, 1)) +
            ylab("") +
            xlab("")
    }

    # Create plot
    if (exists("topleft") & exists("topright")) {
        cat("found plots for both lambda phage and pUC19. attempting to make patchwork plot\n", file=log_file, append=TRUE)
        layout <- "
        AB
        CD
        "

        pw <- topleft + topright + bottomleft + bottomright +
            patchwork::plot_layout(design = layout) +
            patchwork::plot_annotation(tag_levels = "A", title = "Control Vectors") &
            theme(plot.tag = element_text(face = "bold")) &
            theme(plot.title = element_text(size=25, hjust=0.5))

        ggsave(out_files, plot=pw, width=7, height=10)
    } else {
        cat("could not find plots for both lambda phage and pUC19. filling placeholder file\n", file=log_file, append=TRUE)
        pdf(file=out_files, width=7, height=10)

        plot.new()
        text(x=0.5, y=0.5, "NO PLOT CREATED.\nLIKELY REASON: NO DEPTH FOR CONTROL VECTOR(S)\nIN SUPPLIED BED FILES")

        dev.off()
    }
}

create_plot(snakemake@input[["lambda_files"]], snakemake@input[["puc19_files"]], snakemake@output[["pdf"]], snakemake@log[["fn"]])
set -euo pipefail

# Check for bedtools, awk, and parallel in PATH
function check_path {
    if [[ `which bedtools 2>&1 > /dev/null` ]]; then
        >&2 echo "bedtools does not exist in PATH"
        exit 1
    else
        >&2 echo "Using bedtools found at: `which bedtools`"
    fi
    if [[ `which awk 2>&1 > /dev/null` ]]; then
        >&2 echo "awk does not exist in PATH"
        exit 1
    else
        if awk --version | grep -q GNU; then
            >&2 echo "Using GNU awk found at: `which awk`"
        else
            >&2 echo "It doesn't appear you are using GNU awk"
            >&2 echo "Try adding GNU awk at the front of PATH"
            exit 1
        fi
    fi
    if [[ `which parallel 2>&1 > /dev/null` ]]; then
        >&2 echo "parallel does not exist in PATH"
        >&2 echo "Make sure to add GNU parallel to PATH"
        exit 1
    else
        if parallel --version | grep -q GNU; then
            >&2 echo "Using GNU parallel found at: `which parallel`"
        else
            >&2 echo "It doesn't appear you are using GNU parallel."
            >&2 echo "Try adding GNU parallel at the front of PATH"
            exit 1
        fi
    fi
}

expand_windows() {
    # Load chromosome sizes to help with keeping extracted ranges within bounds of the chromosomes
    chr_sizes=""
    while read line; do
        chr="$(echo ${line} | awk '{ print $1 }')"
        siz="$(echo ${line} | awk '{ print $2 }')"

        if [ -z "$chr_sizes" ]; then
            chr_sizes="${chr},${siz}"
        else
            chr_sizes="${chr_sizes};${chr},${siz}"
        fi
    done < $1

    awk -v sizes="${chr_sizes}" \
    'BEGIN { 
        OFS="\t"

        # Create a key-value dictionary of chromosome sizes
        split(sizes, tmp1, ";")
        for (i in tmp1) {
            split(tmp1[i], tmp2, ",")
            chr_lengths[tmp2[1]] = tmp2[2]
        }
    } {
        chr = $1
        if ($2 - 35 < 0) { beg = 0 }
        else { beg = $2 - 35 }
        if ($3 + 35 > chr_lengths[chr]) { end = chr_lengths[chr] }
        else { end = $3 + 35 }

        print chr, beg, end
    }' $2
}
export -f expand_windows

find_context() {
    awk \
    'BEGIN { OFS = "\t" } {
        # Extract position
        match($1, /([^:]*):([0-9]*)-([0-9]*)/, position)
        chr = position[1]
        beg = position[2] + 35
        end = position[3] - 35

        # Reference sequence
        a1=toupper($2)
        a2=a1
        a3=a1

        # 4-base sequence and context
        rseq_4 = substr(a1,35,4)
        if (rseq_4~/[CG]CG[CG]/) { context = "SCGS" }
        else if (rseq_4~/[AT]CG[AT]/) { context = "WCGW" }
        else { context = "SCGW" }

        n_cpgs = gsub(/CG/,"",a1)                # number of CpGs in string
        gc_per = gsub(/[CG]/,"",a2) / length(a3) # GC-content of string
        rseq_6 = substr(a3,34,6)                 # 6-base sequence
        rseq_2 = substr(a3,36,2)                 # 2-base sequence

        # Need to have an easy way to extract CpGs into specific context and number of neighboring CpGs files
        if (n_cpgs == 1) { tag = "0" }
        else if (n_cpgs == 2) { tag = "1" }
        else if (n_cpgs == 3) { tag = "2" }
        else { tag = "3" }

        if (!(rseq_4~/N/)) print chr, beg, end, n_cpgs-1, gc_per, rseq_4, rseq_6, context, rseq_2, context"_"tag
    }' $1;
}
export -f find_context

create_files() {
    check_path

    if [ ! -d ${outdir} ]; then
        mkdir -p ${outdir}
    fi

    cd ${outdir}

    # Split up CpGs for quicker processing
    zcat ${cpgbed} | split -d -l 1000000 - cpgsplit_

    # Retrieve sequences around CpGs from genome FASTA file
    export genome
    parallel -j ${thread} 'expand_windows ${genome}.fai {} | bedtools getfasta -bed - -fi ${genome} -tab -fo cpgcontext_{}.tab' ::: cpgsplit_*

    # Annotate CpGs
    parallel -j ${thread} 'find_context {} > {.}.context' ::: cpgcontext_*.tab

    # Extract CpGs into specific context and number of neighboring CpG files
    grep --no-filename WCGW_0 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > wcgw_0_neighbors.bed.gz
    grep --no-filename WCGW_1 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > wcgw_1_neighbors.bed.gz
    grep --no-filename WCGW_2 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > wcgw_2_neighbors.bed.gz
    grep --no-filename WCGW_3 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > wcgw_3p_neighbors.bed.gz
    grep --no-filename SCGW_0 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgw_0_neighbors.bed.gz
    grep --no-filename SCGW_1 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgw_1_neighbors.bed.gz
    grep --no-filename SCGW_2 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgw_2_neighbors.bed.gz
    grep --no-filename SCGW_3 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgw_3p_neighbors.bed.gz
    grep --no-filename SCGS_0 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgs_0_neighbors.bed.gz
    grep --no-filename SCGS_1 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgs_1_neighbors.bed.gz
    grep --no-filename SCGS_2 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgs_2_neighbors.bed.gz
    grep --no-filename SCGS_3 *.context | cut -f1-9 | sort -k1,1 -k2,2n -k3,3n | gzip > scgs_3p_neighbors.bed.gz

    rm cpgsplit_* cpgcontext_*
}

usage() {
    >&2 echo -e "\nUsage: create_context_beds.sh [-h,--help] [-o,--outdir] genome cpgbed\n"
    >&2 echo -e "Required inputs:"
    >&2 echo -e "\tgenome       : Path to reference FASTA file used in creating CpG BED"
    >&2 echo -e "\tcpgbed       : CpG BED file containing CpG locations in genome\n"
    >&2 echo -e "Optional inputs:"
    >&2 echo -e "\t-h,--help    : Print help message and exit"
    >&2 echo -e "\t-o,--outdir  : Output directory [DEFAULT: assets]"
    >&2 echo -e "\t-t,--threads : Number of threads to use [DEFAULT: 1]"
}

# Initialize default variable values
outdir="assets"
thread=1

# Process command line arguments
OPTS=$(getopt \
    --options ho:t: \
    --long help,outdir:,threads: \
    --name "$(basename "$0")" \
    -- "$@"
)
eval set -- ${OPTS}

while true; do
    case "$1" in
        -h|--help )
            usage
            exit 0
            ;;
        -o|--outdir )
            outdir="$2"
            shift 2
            ;;
        -t|--threads )
            thread="$2"
            shift 2
            ;;
        -- )
            shift
            break
            ;;
        * )
            >&2 echo "Unknown option: $1"
            usage
            exit 1
            ;;
    esac
done

# Make sure there are the correct number of inputs
if [[ $# -ne 2 ]]; then
    >&2 echo "$0: Missing inputs"
    usage
    exit 1
fi

# Fill required positional arguments
genome=$1
cpgbed=$2

# Input checks
if [[ ! -f "${genome}.fai" ]]; then
    >&2 echo "Cannot locate fai-indexed reference: ${genome}.fai"
    >&2 echo "Please provide a viable path to the reference genome FASTA file."
    exit 1
fi

if [[ ! -f "${cpgbed}" ]]; then
    >&2 echo "Cannot locate CpG BED file: ${cpgbed}"
    >&2 echo "Please provide an existing CpG BED file"
    exit 1
fi

>&2 echo -e "Inputs:"
>&2 echo -e "\tOutput directory : ${outdir}"
>&2 echo -e "\tNumber of threads: ${thread}"
>&2 echo -e "\tGenome Reference : ${genome}"
>&2 echo -e "\tCpG Location BED : ${cpgbed}"

create_files
set -euo pipefail



# Find binned average values
function find_averages {
    tmp=$(mktemp) # temporary file for the coverage-filtered input
    if [[ `file ${infile}` =~ "gzip" ]]; then
        zcat ${infile} |
        awk -v filt=${filter} '{ if ($5 >= filt) { print } }' > $tmp
    else
        awk -v filt=${filter} '{ if ($5 >= filt) { print } }' \
        ${infile} > $tmp
    fi

    # Find average beta value across bins
    bedtools makewindows -w ${window} \
        -g ${reffai} | \
    sort -k1,1 -k2,2n | \
    bedtools map \
        -prec 2 \
        -a - \
        -b $tmp \
        -c 4 -o mean |
    gzip > ${otfile}.gz

    rm -f $tmp
}

# Print helpful usage information
function usage {
    >&2 echo -e "\nUsage: find_binned_average.sh [-h,--help] reference window filter infile outfile"
    >&2 echo -e "Required inputs:"
    >&2 echo -e "\treference    : FAI-index for reference to create windows from"
    >&2 echo -e "\twindow       : Window size for binned average finding"
    >&2 echo -e "\tfilter       : Coverage filter to apply before finding averages"
    >&2 echo -e "\tinfile       : BISCUIT CG BED file to calculate methylation average from"
    >&2 echo -e "\toutfile      : Name of BED file to write output to"
    >&2 echo -e "Optional inputs:"
    >&2 echo -e "\t-h,--help    : Print help message and exit\n"
}

# Process command line arguments
OPTS=$(getopt \
    --options h \
    --long help \
    --name "$(basename "$0")" \
    -- "$@"
)
eval set -- ${OPTS}

while true; do
    case "$1" in
        -h|--help )
            usage
            exit 0
            ;;
        -- )
            shift
            break
            ;;
        * )
            >&2 echo "Unknown option: $1"
            usage
            exit 1
            ;;
    esac
done

# Make sure there are the correct number of inputs
if [[ $# -ne 5 ]]; then
    >&2 echo "$0: Missing inputs"
    usage
    exit 1
fi

# Fill required positional arguments
reffai="${1}"
window="${2}"
filter="${3}"
infile="${4}"
otfile="${5}"

# Do some checks on the given inputs
if [[ ! -f ${reffai} ]]; then
    >&2 echo "Doesn't appear like you provided an existing FAI file: ${reffai}"
    exit 1
fi

if [[ ! "${reffai}" =~ .fai$ ]]; then
    >&2 echo "Doesn't appear like you provided a viable FAI file: ${reffai}"
    exit 1
fi

if ! [[ "${window}" =~ ^[0-9]+$ ]]; then
    >&2 echo "window needs to be an integer: ${window}"
    exit 1
fi

if ! [[ ${filter} =~ ^[0-9]+$ ]]; then
    >&2 echo "filter needs to be an integer: ${filter}"
    exit 1
fi

if [[ ! -f ${infile} ]]; then
    >&2 echo "Doesn't appear that infile exists: ${infile}"
    exit 1
fi

find_averages
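For reference, a typical invocation of this helper matching the usage message above (all paths and file names are hypothetical):

    bash workflow/scripts/find_binned_average.sh \
        /path/to/genome.fa.fai \
        10000 \
        5 \
        sampleA.cg.bed.gz \
        sampleA.binned_averages.bed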
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import os

# Create list of chromosomes (include enough to cover human and mouse)
CHROMS = [f'chr{i}' for i in range(1, 23)]
CHROMS.extend(['chrX', 'chrY', 'chrM'])

def find_x_ticks_labels(chrs):
    """Create list of chromosomes that are included in BED file

    Inputs -
        chrs - chr column from BED file loaded as a DataFrame
    Returns -
        dict, keys are the chromosomes and values are the first index location of the chromosome
    """
    out = {}
    for c in CHROMS:
        try:
            out[c] = list(chrs).index(c)
        except ValueError: # chromosome isn't in list, so ignore
            continue

    return out

def create_plot(files, params, outfile):
    # Setup the order that chromosomes need to be sorted into
    chr_sort_order = pd.api.types.CategoricalDtype(CHROMS, ordered=True)

    # Average bismap mappability scores for each bin
    # Treat bins with no score as having a weight of 0 since there's no data there
    weights = pd.read_csv(
        params['map_scores'],
        sep='\t',
        header=None,
        names=['chr', 'start', 'end', 'raw_weight'],
        na_values='.'
    )
    weights['weight'] = weights['raw_weight'].fillna(0)
    weights['ideal'] = weights['weight'] / (weights['end'] - weights['start'])

    # Setup figure for coverage across genome
    fig, ax = plt.subplots(figsize=(9,5))
    plt.tight_layout()

    plt.title('Weighted Coverage in 10kb Windows', fontsize=24)
    ax.set_xlabel('')
    ax.set_ylabel('Weighted Coverage / Window Width', fontsize=18)

    # Process data and fill coverage across genome figure
    x_values = []
    x_tk_lab = []
    plot_data = {
        'samp': [],
        'frac': [],
        'mean': [],
        'stdv': [],
        'lavg': [],
        'lstd': [],
    }

    for s in files:
        samp = os.path.basename(s).replace('.10kb_binned_coverage.bed.gz', '')

        # Read data
        in_df = pd.read_csv(s, sep='\t', header=None, names=['chr', 'start', 'end', 'covg'], na_values='.')

        # Merge average bismap mappability scores as a weight column
        df = in_df.merge(weights, on=['chr', 'start', 'end'], how='outer')

        # Sort chromosomes to put chrM at the end
        df['chr'] = df['chr'].astype(chr_sort_order)
        df = df.sort_values(['chr', 'start'], ignore_index=True)

        # Calculate the weighted coverage [(avg mappability) * (coverage) / (window width)]
        df['weighted_covg'] = df['weight'] * df['covg'] / (df['end'] - df['start'])

        # Bins with both a raw weight and at least one base covered
        to_plot = df[(df['raw_weight'].notna()) & (df['covg'] > 0)]

        # Determine number of bins that had a bismap weight and at least one base covered
        n_weights = len(df[df['raw_weight'].notna()]['chr'])
        n_wgt_cov = len(to_plot[to_plot['weighted_covg'] > 0.01]['chr'])

        # Non-zero bins
        non_zero = [i if i > 0 else None for i in list(to_plot['weighted_covg'])]

        # Setup data that will be saved to output file and plotted in other figures
        plot_data['samp'].append(samp)
        plot_data['frac'].append(n_wgt_cov / n_weights)
        plot_data['mean'].append(np.average([i for i in non_zero if i]))
        plot_data['stdv'].append(np.std([i for i in non_zero if i]))
        plot_data['lavg'].append(np.average([np.log10(i) for i in non_zero if i]))
        plot_data['lstd'].append(np.std([np.log10(i) for i in non_zero if i]))

        # Fill x values and ticks/labels if that hasn't been done yet
        if len(x_values) == 0 or len(x_tk_lab) == 0:
            x_values = list(df.index)
            x_tk_lab = find_x_ticks_labels(df['chr'])

        # Add weighted coverage to coverage across genome figure
        ax.plot(x_values, df['weighted_covg'], 'k-')

    # Finish up coverage across genome figure
    ax.set_xlim(-500, len(x_values)+500)
    ax.set_xticks(list(x_tk_lab.values()))
    ax.set_xticklabels(x_tk_lab.keys(), rotation=90)

    plt.savefig(outfile['covg'], bbox_inches='tight')
    plt.close('all')

    # Create other figures
    df = pd.DataFrame(plot_data)

    fig, ax = plt.subplots()
    plt.tight_layout()

    sns.scatterplot(ax=ax, x='frac', y='lstd', data=df)

    line_loc = np.std([np.log10(i) for i in weights['ideal'] if i > 0])
    ax.axhline(line_loc, alpha=0.6, color='grey', linestyle='--')
    ax.text(x = 0.01, y = line_loc-0.01, s = 'Ideal Std. Dev.', ha='left', va='top', fontsize=14)

    ax.set_xlim(0, 1); ax.set_ylim(0, 1.1*max(plot_data['lstd']))

    plt.title('Whole Genome Coverage Uniformity', fontsize=24)
    plt.xlabel('(# Bins with Weighted Coverage > 0.01) /\n(# Bins with Nonzero Weights)', fontsize=18, va='top')
    plt.ylabel('std[ log10(weighted cov.) ]', fontsize=18)

    plt.savefig(outfile['frac'], bbox_inches='tight')
    plt.close('all')

    # Write data to output file
    df.to_csv(outfile['data'], index=False, sep='\t')

create_plot(list(snakemake.input['files']), snakemake.params, snakemake.output)
from matplotlib import pyplot as plt
import pandas as pd
import os

def create_plot(samples, outfiles):
    # Load data
    l_df = []
    for s in samples:
        df = pd.read_csv(s, sep='\t', header=None, names=['group', 'count']).set_index(['group'])

        d = {
            'sample': [os.path.basename(s).replace('.cpg_stats_cgi_table.txt', '')],
            '1': [100.0 * df['count']['one_cpg'] / df['count']['n_cpg_islands']],
            '3': [100.0 * df['count']['three_cpgs'] / df['count']['n_cpg_islands']],
            '5': [100.0 * df['count']['five_cpgs'] / df['count']['n_cpg_islands']],
            '7': [100.0 * df['count']['seven_cpgs'] / df['count']['n_cpg_islands']],
            'n_cgis': [df['count']['n_cpg_islands']]
        }

        l_df.append(pd.DataFrame(d))

    df = pd.concat(l_df)
    df = df.sort_values(by = 'sample')

    y = [i for i, s in enumerate(list(df['sample']))]
    n = [s for i, s in enumerate(list(df['sample']))]

    # Create plot
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, sharex='col', sharey='row')
    plt.tight_layout()

    ax1.barh(y, df['1'], height=0.7, color='black')
    ax2.barh(y, df['3'], height=0.7, color='black')
    ax3.barh(y, df['5'], height=0.7, color='black')
    ax4.barh(y, df['7'], height=0.7, color='black')

    ax1.set_xlim(0, 110); ax1.set_ylim(-1, len(n)+1)
    ax2.set_xlim(0, 110); ax3.set_ylim(-1, len(n)+1)
    ax1.set_xticks(range(0, 110, 10)); ax1.set_xticklabels([str(i) for i in range(0, 110, 10)], fontsize=18)
    ax2.set_xticks(range(0, 110, 10)); ax2.set_xticklabels([str(i) for i in range(0, 110, 10)], fontsize=18)

    ax1.set_yticks(y); ax1.set_yticklabels(n, fontsize=18)
    ax3.set_yticks(y); ax3.set_yticklabels(n, fontsize=18)

    ax1.text(50, len(n), '1 Read', va='center', ha='center', size=16)
    ax2.text(50, len(n), '3 Reads', va='center', ha='center', size=16)
    ax3.text(50, len(n), '5 Reads', va='center', ha='center', size=16)
    ax4.text(50, len(n), '7 Reads', va='center', ha='center', size=16)

    plt.suptitle('# Reads Covering CpG Island', x=0.5, y=1.05, fontsize=24)
    ax1.set_title('', fontsize=16)
    ax2.set_title('', fontsize=16)
    ax3.set_title('', fontsize=16)
    ax4.set_title('', fontsize=16)

    ax1.set_xlabel('', fontsize=20)
    ax2.set_xlabel('', fontsize=20)
    ax3.set_xlabel('Percent Covered', fontsize=20)
    ax4.set_xlabel('Percent Covered', fontsize=20)

    ax1.set_ylabel('')
    ax2.set_ylabel('')
    ax3.set_ylabel('')
    ax4.set_ylabel('')

    plt.savefig(outfiles['cgi'], bbox_inches='tight')
    plt.close('all')

create_plot(snakemake.input['files'], snakemake.output)
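
This script assumes each *.cpg_stats_cgi_table.txt input is a headerless, two-column, tab-delimited file of group/count pairs that includes at least the five rows indexed above. A hypothetical input for one sample:

n_cpg_islands	27949
one_cpg	25103
three_cpgs	23870
five_cpgs	22011
seven_cpgs	20458

Each bar is then 100 * count / n_cpg_islands, e.g. 100 * 25103 / 27949, or roughly 89.8% of CpG islands covered by at least one read.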
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import os

TITLE = {
    'AllCpGs': 'All CpGs in Genome',
    'CGICpGs': 'CpGs in CpG Islands',
    'ExonicCpGs': 'CpGs in Exons',
    'GenicCpGs': 'CpGs in Genes',
    'RepeatCpGs': 'CpGs in Repeat-masked Space'
}

def create_plot(samples, outfiles):
    # Load the data and rearrange by feature type
    d_df = {}
    for s in samples:
        df = pd.read_csv(s, sep='\t').set_index(['Feature'], drop=False)
        all_cnt = df['CpG_Count']['AllCpGs']
        all_q40 = df['Q40_Reads']['AllCpGs']
        all_all = df['All_Reads']['AllCpGs']

        for row in df.itertuples():
            data = {'sample': [], 'l2_q40': [], 'l2_all': [], 'q40_percent': [], 'all_percent': []}
            data['sample'].append(os.path.basename(s).replace('.cpg_stats_feature_table.tsv', ''))
            data['l2_q40'].append(np.log2( (row.Q40_Reads / all_q40) / (row.CpG_Count / all_cnt) ))
            data['l2_all'].append(np.log2( (row.All_Reads / all_all) / (row.CpG_Count / all_cnt) ))
            data['q40_percent'].append(100.0 * row.Q40_Reads / row.CpG_Count)
            data['all_percent'].append(100.0 * row.All_Reads / row.CpG_Count)

            if row.Feature not in d_df:
                d_df[row.Feature] = pd.DataFrame(data)
            else:
                # DataFrame.append() was removed in pandas 2.0; concatenate instead
                d_df[row.Feature] = pd.concat([d_df[row.Feature], pd.DataFrame(data)], sort=True)

    for key, df in d_df.items():
        y = [i for i, s in enumerate(list(df['sample']))]
        n = [s for i, s in enumerate(list(df['sample']))]

        fig, ((ax1,ax2), (ax3,ax4)) = plt.subplots(nrows=2, ncols=2, sharex='col', sharey='row')
        plt.tight_layout()

        # Plots
        ax1.barh(y, df['all_percent'], height=0.7, color='black')
        ax2.plot(df['l2_all'], y, 'kx', markersize=8)
        ax2.axvline(0.0, alpha=0.6, color='grey', linestyle='--')

        ax3.barh(y, df['q40_percent'], height=0.7, color='black')
        ax4.plot(df['l2_q40'], y, 'kx', markersize=8)
        ax4.axvline(0.0, alpha=0.6, color='grey', linestyle='--')

        # Axis ticks and labels
        ax1.set_xlim(0, 110)
        ax2.set_xlim(-3.5, 5.5)

        ax1.set_xticks(range(0, 110, 10)); ax1.set_xticklabels([str(i) for i in range(0, 110, 10)], fontsize=18)
        ax2.set_xticks(range(-3, 6))     ; ax2.set_xticklabels([str(i) for i in range(-3, 6)], fontsize=18)

        ax1.set_ylim(-1, len(n))
        ax3.set_ylim(-1, len(n))

        ax1.set_yticks(y); ax1.set_yticklabels(n, fontsize=18)
        ax3.set_yticks(y); ax3.set_yticklabels(n, fontsize=18)

        # Titles and such
        plt.suptitle(TITLE[key], x=0.5, y=1.05, fontsize=24)
        ax1.set_title('', fontsize=20)
        ax2.set_title('', fontsize=20)
        ax3.set_title('', fontsize=20)
        ax4.set_title('', fontsize=20)

        ax1.set_xlabel('', fontsize=20)
        ax2.set_xlabel('', fontsize=20)
        ax3.set_xlabel('Percent Covered', fontsize=20)
        ax4.set_xlabel('log2(obs / exp)', fontsize=20)

        ax1.set_ylabel('All Reads', fontsize=20)
        ax2.set_ylabel('')
        ax3.set_ylabel('Q40 Reads', fontsize=20)
        ax4.set_ylabel('')

        plt.savefig(outfiles[key], bbox_inches='tight')
        plt.close('all')

create_plot(snakemake.input['files'], snakemake.output)
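
Here each *.cpg_stats_feature_table.tsv is assumed to be a tab-delimited table with a header row and (at least) the columns Feature, CpG_Count, Q40_Reads, All_Reads, where the AllCpGs row supplies the genome-wide totals used for normalization. A hypothetical example:

Feature	CpG_Count	Q40_Reads	All_Reads
AllCpGs	29401360	26461224	28519319
CGICpGs	2089538	1984561	2026852
ExonicCpGs	1914414	1823105	1856981
GenicCpGs	14054381	12831047	13529710
RepeatCpGs	12603553	10788422	12048396

For the CGICpGs row above, q40_percent = 100 * 1984561 / 2089538 (about 95.0), and l2_q40 = log2((1984561 / 26461224) / (2089538 / 29401360)).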
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import os

# Column names for Snakemake generated files
cols = [
    'sample',
    'refln', 'cpgln', 'cgiln', 'mskln', 'exnln', 'genln', 'intln',
    'mapln', 'fcpgs', 'fcgis', 'frmsk', 'fexon', 'fgene', 'fintr'
]

# Make it easier to parse the files based on their column names
ORDER = {
    'cpgs': ('cpgln', 'fcpgs'),
    'cgis': ('cgiln', 'fcgis'),
    'rmsk': ('mskln', 'frmsk'),
    'exon': ('exnln', 'fexon'),
    'gene': ('genln', 'fgene'),
    'intr': ('intln', 'fintr')
}

# Which columns to keep for generating plot
keepers = ['sample'] + list(ORDER.keys())

# Plot titles for each column
TITLE = {
    'cpgs': 'Observed / Expected Coverage for All CpGs',
    'cgis': 'Observed / Expected Coverage for CpG Islands',
    'rmsk': 'Observed / Expected Coverage for Repeat-Masked Space',
    'exon': 'Observed / Expected Coverage for Exons',
    'gene': 'Observed / Expected Coverage for Genic Space',
    'intr': 'Observed / Expected Coverage for Intergenic Space'
}

def create_plot(samples, outfiles):
    # Load the data and calculate the obs/exp ratios
    l_df = []
    for s in samples:
        df = pd.read_csv(s, sep='\t', header=None, names=cols)
        for k, v in ORDER.items():
            num = df[v[0]] / df['refln']
            den = df[v[1]] / df['mapln']

            df[k] = np.log2(num / den)
        l_df.append(df)

    # Combine dataframes for plotting
    df = pd.concat(l_df)
    df = df[keepers]
    df = df.sort_values(by = 'sample')

    y     = [i for i, s in enumerate(list(df['sample']))]
    names = [s for i, s in enumerate(list(df['sample']))]

    # Create a plot for each of the features
    for k in ORDER.keys():
        fig, ax = plt.subplots()
        plt.tight_layout()

        plt.plot(list(df[k]), y, 'kx', markersize=8)
        ax.axvline(0.0, alpha=0.6, color='grey', linestyle='--')

        plt.title(TITLE[k], fontsize=24)
        plt.xlabel('log2(obs / exp)', fontsize=20)
        plt.ylabel('', fontsize=20)

        plt.xlim(-3.5, 5.5)
        plt.ylim(-1, len(names))

        plt.xticks(
            [i for i in np.arange(-3, 6, 1)],
            [str(i) for i in np.arange(-3, 6, 1)],
            fontsize=18
        )
        plt.yticks(y, names, fontsize=18)

        plt.savefig(outfiles[k], bbox_inches='tight')
        plt.close('all')

create_plot(list(snakemake.input['files']), snakemake.output)
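
As a worked example of reading these plots: when a feature's length fraction and coverage fraction are equal, the ratio inside the log is 1 and the point sits on the dashed line at log2(1) = 0; a two-fold difference between the two fractions gives log2(2) = 1 (or -1, depending on which fraction is larger). The x-axis is fixed to (-3.5, 5.5), so any more extreme values simply fall outside the plotted range.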
from matplotlib import pyplot as plt
import os

def parse_covdist(files, suffix):
    """Parse covdist tables, returning {sample: percent of bases covered}."""
    pct = {}
    for s in files:
        with open(s, 'r') as f:
            fd = f.readlines()[2:] # skip the two header lines

        dd = {}
        for l in fd:
            fields = l.strip().split()
            dd[int(float(fields[0]))] = int(float(fields[1]))

        total = sum(dd.values())
        covrd = total - dd.get(0, 0) # bases with nonzero coverage (guard against a missing depth-0 row)

        samp = os.path.basename(s).replace(suffix, '')
        pct[samp] = 100 * covrd / total

    return pct

def create_plot(data, outfile):
    # Load data (the two input sets differ only in their filename suffix)
    all_covg = parse_covdist(data['all'], '_covdist_all_base_table.txt')
    q40_covg = parse_covdist(data['q40'], '_covdist_q40_base_table.txt')

    y = list(range(len(all_covg)))
    n = list(all_covg.keys())

    # Create plot
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey='row')
    plt.tight_layout()

    ax1.barh(y, all_covg.values(), height=0.7, color='black')
    ax2.barh(y, q40_covg.values(), height=0.7, color='black')

    ax1.set_xlim(0, 110)
    ax2.set_xlim(0, 110)
    ax1.set_xticks(range(0, 110, 20)); ax1.set_xticklabels([str(i) for i in range(0, 110, 20)], fontsize=18)
    ax2.set_xticks(range(0, 110, 20)); ax2.set_xticklabels([str(i) for i in range(0, 110, 20)], fontsize=18)

    ax1.set_ylim(-1, len(n))
    ax1.set_yticks(y); ax1.set_yticklabels(n, fontsize=18)

    plt.suptitle('Percent of Genome Covered', x=0.5, y=1.10, fontsize=24)
    ax1.set_title('All Reads', fontsize=16)
    ax2.set_title('Q40 Reads', fontsize=16)

    ax1.set_xlabel('Percent Covered', fontsize=20)
    ax2.set_xlabel('Percent Covered', fontsize=20)

    plt.savefig(outfile['out'], bbox_inches='tight')
    plt.close('all')

create_plot(snakemake.input, snakemake.output)
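
The covdist tables consumed here are expected to have two header lines followed by tab-separated depth/count pairs (both header lines are skipped, so their exact text does not matter). A hypothetical *_covdist_all_base_table.txt:

depth distribution, all bases
depth	count
0	100
1	250
2	400
3	250

With these numbers, total = 1000, covrd = 900, and the sample's bar shows 90% of the genome covered.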
import argparse
import numpy as np

# A terminology note:
#
# The "start" of the chromosome is base 0 (or 1 in 1-based indexing) in the reference. The "end" of the chromosome is,
# therefore, the last (highest numbered) base in the reference.

class Record(object):

    def __init__(self, fields, args):
        """Initialize record.

        Inputs -
            fields - list of fields from BED file
            args   - argparse.parse_args() from running script
        """
        self.fields = fields       # list of fields from BED file entry
        self.chrm = fields[0]      # chromosome
        self.beg  = int(fields[1]) # start of region
        self.end  = int(fields[2]) # end of region

        # Collapse initial region to the middle of the region
        if args.collapse:
            self.mid = (self.beg + self.end + 1) // 2 # midpoint of initial region (integer division keeps coordinates integral)
            self.beg = self.mid # set start to midpoint
            self.end = self.mid # set end to midpoint

        self.step1 = args.flankstep # step size in reverse direction (relative to reference and strand)
        self.step2 = args.flankstep # step size in forward direction (relative to reference and strand)

    def __str__(self):
        """String method for print()."""
        return f'chrm: {self.chrm}, start: {self.beg}, end: {self.end}, step1: {self.step1}, step2: {self.step2}'

    def sample_forward(self, args, index_func):
        """Move towards the end of the chromosome (relative to the reference) from the target region.

        Inputs -
            args       - argparse.parse_args() from running script
            index_func - function for how to adjust the index value for bin
        Returns -
            None
        """
        # If forward step is negative, then don't create any windows towards the end of the chromosome
        if self.step2 < 0:
            return

        # When moving towards the end of the chromosome, the windows begin at the 3' end (relative to the reference) of
        # the region of interest.
        # NB: this variable is a temporary variable and may be adjusted depending on provided inputs
        _window_beg = self.end

        # Create args.flanknumber of windows while moving towards the end of the chromosome
        for i in range(args.flanknumber):
            # This is a temporary variable and may be adjusted depending on inputs
            _window_end = _window_beg + self.step2

            if args.outer:
                # If args.outer, set the end of the window to be the 3' end of the temporary window:
                # (_window_end-1, _window_end)
                window_end = int(_window_end)
                window_beg = window_end - 1
            elif args.middle:
                # If args.middle, set the end of the window to be the middle of the temporary window:
                # (middle of window - 1, middle of window)
                window_mid = int((_window_beg + _window_end)/2.0)
                window_beg = window_mid-1
                window_end = window_mid
            else:
                # Otherwise use the whole temporary window
                window_beg = int(_window_beg)
                window_end = int(_window_end)

            # Index of the probe window moving towards the end of the chromosome (relative to the reference)
            index = index_func(i)

            # The index is negative for reverse strand (-) regions
            if index < 0:
                if args.flankbygene:
                    # For args.flankbygene, set the region column as the number index of the current window
                    reg = '({:d})-({:d})'.format(-i-1, -i)
                elif args.flanktoneighbor:
                    # For args.flanktoneighbor, set the region column as the corresponding percentage covered by the
                    # current window between the current interval and the next
                    if args.fold:
                        # If args.fold, halve region percentage to account for setting the forward and backward regions
                        # indices the same
                        reg = '({:d})-({:d})%'.format(
                            int(float(-i-1) / args.flanknumber / 2 * 100),
                            int(float(-i) / args.flanknumber / 2 * 100)
                        )
                    else:
                        # Otherwise, you can leave the percentage as is
                        reg = '({:d})-({:d})%'.format(
                            int(float(-i-1) / args.flanknumber * 100),
                            int(float(-i) / args.flanknumber * 100)
                        )
                else:
                    # Otherwise, set the region column as bases covered relative the end of the interval
                    reg = '({:d})-({:d})'.format(int(self.end - window_end), int(self.end - window_beg))

            # Handle forward strand intervals
            else:
                # The descriptions are the same as the negative index case, so won't rehash them here
                if args.flankbygene:
                    reg = '{:d}-{:d}'.format(i, i+1)
                elif args.flanktoneighbor:
                    # FIXME: This should behave in a similar manner to the args.flanktoneighbor above (where there is an
                    #        if-else block for args.fold. This is how it's done for sample_backward(), but for some
                    #        reason it's not done here.
                    reg = '{:d}-{:d}%'.format(
                        int(float(i) / args.flanknumber * 100),
                        int(float(i+1) / args.flanknumber * 100)
                    )
                else:
                    reg = '{:d}-{:d}'.format(int(window_beg - self.end), int(window_end - self.end))

            # When args.collapse is set, you can have index = 0
            # To be honest, I don't know what the area value is supposed to represent...
            if index >= 0:
                area = 1
            else:
                area = -1

            # Expand probe window if desired
            if args.expansion > 0:
                # Start properly accounts for running off the start of the chromosome, but end does not account for
                # running off the end of the chromosome. Maybe this can be accounted for somehow?
                window_beg = max(window_beg - args.expansion,0)
                window_end = window_end + args.expansion

            # Potential FIXME: I think we would still want to print out this information, even if window_beg == 0.
            #                  Therefore, I think that window_beg > 0 should be changed to window_beg >= 0
            if window_beg > 0 and window_end > window_beg:
                print(
                    '{}\t{:d}\t{:d}\t{:d}\t{}\t{:d}\t{}'.format(
                        self.chrm,  # chromosome of probe window
                        window_beg, # start of probe window (0-based)
                        window_end, # end of probe window (1-based, non-inclusive)
                        index,      # 0-indexed index of the probe window
                        reg,        # region covered by probe window
                        area,       # not exactly sure what this is....
                        '\t'.join(self.fields) # original values from input region of interest
                    )
                )

            # Move current temporary end to start of next probe window
            _window_beg = _window_end

    def sample_backward(self, args, index_func):
        """Move towards the start of the chromosome (relative to the reference) from target region.

        Inputs -
            args       - argparse.parse_args() from running script
            index_func - function for how to adjust the index value for bin
        Returns -
            None
        """
        # If backward step is negative, then don't create any windows towards the start of the chromosome
        if self.step1 < 0:
            return

        # When moving towards the start of the chromosome, the windows begin at the 5' end (relative to the reference)
        # of the region of interest.
        # NB: this variable is a temporary variable and may be adjusted depending on provided inputs
        _window_end = self.beg

        # Create args.flanknumber of windows while moving towards the start of the chromosome
        for i in range(args.flanknumber):
            # This is a temporary variable and may be adjusted depending on inputs
            _window_beg = _window_end - self.step1

            if args.outer:
                # If args.outer, set the end of the window to be the 5' end of the temporary window:
                # (_window_beg, _window_beg+1)
                window_beg = int(_window_beg)
                window_end = window_beg + 1
            elif args.middle:
                # If args.middle, set the end of the window to be the middle of the temporary window:
                # (middle of window - 1, middle of window)
                window_mid = int((_window_beg + _window_end)/2.0)
                window_beg = window_mid-1
                window_end = window_mid
            else:
                # Otherwise use the whole temporary window
                window_beg = int(_window_beg)
                window_end = int(_window_end)

            # Index of the probe window moving towards the start of the chromosome (relative to the reference)
            index = index_func(i)

            # The index is negative for forward strand (+) regions
            if index < 0:
                if args.flankbygene:
                    # For args.flankbygene, set the region column as the number index of the current window
                    reg = '({:d})-({:d})'.format(-i-1, -i)
                elif args.flanktoneighbor:
                    # For args.flanktoneighbor, set the region column as the corresponding percentage covered by the
                    # current window between the current interval and the previous
                    if args.fold:
                        # If args.fold, halve region percentage to account for setting the forward and backward regions
                        # indices the same
                        reg = '({:d})-({:d})%'.format(
                            int(float(-i-1) / args.flanknumber / 2 * 100),
                            int(float(-i) / args.flanknumber / 2 * 100)
                        )
                    else:
                        # Otherwise, you can leave the percentage as is
                        reg = '({:d})-({:d})%'.format(
                            int(float(-i-1) / args.flanknumber * 100),
                            int(float(-i) / args.flanknumber * 100)
                        )
                else:
                    # Otherwise, set the region column as bases covered relative the end of the interval
                    reg = '({:d})-({:d})'.format(int(window_beg - self.beg), int(window_end - self.beg))

            # Handle reverse strand intervals
            else:
                # The descriptions are the same as the negative index case, so won't rehash them here
                if args.flankbygene:
                    reg = '{:d}-{:d}'.format(i, i+1)
                elif args.flanktoneighbor:
                    if args.fold:
                        reg = '{:d}-{:d}%'.format(
                            int(float(i) / args.flanknumber / 2 * 100),
                            int(float(i+1) / args.flanknumber / 2 * 100)
                        )
                    else:
                        reg = '{:d}-{:d}%'.format(
                            int(float(i) / args.flanknumber * 100),
                            int(float(i+1) / args.flanknumber * 100)
                        )
                else:
                    reg = '{:d}-{:d}'.format(int(self.beg - window_end), int(self.beg - window_beg))

            # When args.collapse is set, you can have index = 0
            # To be honest, I don't know what the area value is supposed to represent...
            if index >= 0:
                area = 1
            else:
                area = -1

            # Expand probe window if desired
            if args.expansion > 0:
                # Start properly accounts for running off the start of the chromosome, but end does not account for
                # running off the end of the chromosome. Maybe this can be accounted for somehow?
                window_beg = max(window_beg - args.expansion,0)
                window_end = window_end + args.expansion

            # Potential FIXME: I think we would still want to print out this information, even if window_beg == 0.
            #                  Therefore, I think that window_beg > 0 should be changed to window_beg >= 0
            if window_beg > 0 and window_end > window_beg:
                print(
                    '{}\t{:d}\t{:d}\t{:d}\t{}\t{:d}\t{}'.format(
                        self.chrm,  # chromosome of probe window
                        window_beg, # start of probe window (0-based)
                        window_end, # end of probe window (1-based, non-inclusive)
                        index,      # 0-indexed index of the probe window
                        reg,        # region covered by probe window
                        area,       # not exactly sure what this is....
                        '\t'.join(self.fields) # original values from input region of interest
                    )
                )

            # Move current temporary start to end of next probe window
            _window_end = _window_beg

    def sample_internal(self, args, index_func):
        """Move within the target region.

        Inputs -
            args       - argparse.parse_args() from running script
            index_func - function for how to adjust the index value for bin
        Returns -
            None
        """
        if args.outer:
            # sentinels is a list of locations to sample from inside the region of interest
            # NB: If args.numinternal = 1, then the location for the internal region will always be:
            #     (region start, region start + 1)
            # NB: Here, start is set to be 1-based
            sentinels = list(np.linspace(self.beg+1, self.end, args.numinternal))

            for i in range(len(sentinels)):
                window_end = int(sentinels[i]) # end of window to probe
                window_beg = window_end - 1    # start of window to probe (reset to 0-based)

                # Expand probe window if desired
                if args.expansion > 0:
                    # Start properly accounts for running off the start of the chromosome, but end does not account for
                    # running off the end of the chromosome. Maybe this can be accounted for somehow?
                    window_beg = max(window_beg - args.expansion,0)
                    window_end = window_end + args.expansion

                # Index of the internal probe window
                index = index_func(i)

                # Potential FIXME: I think we would still want to print out this information, even if window_beg == 0.
                #                  Therefore, I think that window_beg > 0 should be changed to window_beg >= 0
                if window_beg > 0 and window_end > window_beg:
                    print(
                        '{}\t{:d}\t{:d}\t{:d}\t{:d}-{:d}%\t0\t{}'.format(
                            self.chrm,  # chromosome of probe window
                            window_beg, # start of probe window (0-based)
                            window_end, # end of probe window (1-based, non-inclusive)
                            index,      # 0-indexed index of the probe window
                            int(float(index)/args.numinternal*100),   # low end percent of range of the region covered by the probe window
                            int(float(index+1)/args.numinternal*100), # high end percent of range of the region covered by the probe window
                            '\t'.join(self.fields) # original values from input region of interest
                        )
                    )

            return

        # If args.outer is not set, then break up the internal region of interest into numinternal regions
        # For example, for region (0, 100), then if numinternal = 1 the probe window will cover (0, 100). If
        # numinternal = 2, the probe windows will be (0, 50) and (50, 100)
        # NB: Everything stays in BED 0-based indexing for this part of the function
        sentinels = list(np.linspace(self.beg, self.end, args.numinternal+1))

        for i in range(len(sentinels)-1):
            window_beg = int(sentinels[i])   # start of probe window
            window_end = int(sentinels[i+1]) # end of probe window

            # If args.middle is set, we want to set the probe window as the middle position of the window. So, for the
            # previously listed examples, the (0, 100) interval would be set to (49, 50) and the (0, 50) and (50, 100)
            # intervals would become (24, 25) and (74, 75)
            if args.middle:
                window_mid = int((sentinels[i] + sentinels[i+1])/2) # find middle of the current window
                window_beg = window_mid-1 # reset the start to the middle (setting value to 0-based index)
                window_end = window_mid   # reset the end to the middle location

            # Expand probe window if desired
            # See above comments for handling the end position relative to the end of the chromosome
            if args.expansion > 0:
                window_beg = max(window_beg - args.expansion,0)
                window_end = window_end + args.expansion

            # Index of the internal probe window (0 for internal window with only 1 internal probe window)
            index = index_func(i)

            # See the above potential FIXME, as this applies here
            if window_beg > 0 and window_end > window_beg:
                print(
                    '{}\t{:d}\t{:d}\t{:d}\t{:d}-{:d}%\t0\t{}'.format(
                        self.chrm,  # chromosome of probe window
                        window_beg, # start of probe window (0-based)
                        window_end, # end of probe window (1-based, non-inclusive)
                        index,      # 0-indexed index of the probe window
                        int(float(index)/args.numinternal*100),   # low end percent of range of the region covered by the probe window
                        int(float(index+1)/args.numinternal*100), # high end percent of range of the region covered by the probe window
                        '\t'.join(self.fields) # original values from input region of interest
                    )
                )

        return

def process_record(r, r0, r2):
    """Parse target region, creating windows based on the flanking inputs given on the command line.

    Inputs -
        r  - record to process
        r0 - previous record in file
        r2 - next record in file
    Returns -
        None
    """
    if args.flankbygene:
        # When flanking by gene (or more accurately, flanking by region of interest), set the step size to be equal to
        # the size of the region of interest
        # NB: specifically, this is related to the size of the windows in the region of interest, so the number of
        #     internal windows (args.numinternal) plays into this size
        r.step1 = float(r.end - r.beg + 1) / (args.numinternal)
        r.step2 = r.step1

    elif args.flanktoneighbor:
        # When flanking by neighbor, the entire space between the end of the previous region and the start of the
        # current region will be broken up into args.flanknumber number of windows (and likewise for the space between
        # the end of the current region and the start of the next region)

        # If no previous region exists or the end of the previous region overlaps the start of the current region, then
        # don't create any windows upstream (towards the start of the chromosome relative to the reference) of the
        # current region
        if r0 is None or r.beg < r0.end:
            r.step1 = -1
        else:
            # Evenly divide up region between end of previous region and start of current region into args.flanknumber
            # number of windows
            r.step1 = float(r.beg - r0.end + 1) / (args.flanknumber)

            # Based on the description of args.fold, I'm not entirely sure why the step size is halved when
            # args.fold is set
            if args.fold:
                r.step1 /= 2

        # If no next region exists or the start of the next region overlaps the end of the current region, then don't
        # create any windows downstream (towards the end of the chromosome relative to the reference) of the current
        # region
        if r2 is None or r.end > r2.beg:
            r.step2 = -1
        else:
            # Evenly divide up region between end of current region and start of next region into args.flanknumber
            # number of windows
            r.step2 = float(r2.beg - r.end + 1) / (args.flanknumber)

            # Based on the description of args.fold, I'm not entirely sure why the step size is halved when
            # args.fold is set
            if args.fold:
                r.step2 /= 2

    # If strand information is not included, assume all regions are on the forward (+) strand
    # Otherwise pick off the strand information from the input file
    if args.strand is None:
        strand = '+'
    else:
        strand = r.fields[args.strand-1]

    # If args.ignoreend, then the end of the region is ignored. For forward strand regions, this is the 3' end relative
    # to the reference, so the region will be set to (start, start+1). For reverse strand regions, this is the 5' end
    # relative to the reference, so the region will be set to (end-1, end)
    if args.ignoreend:
        args.numinternal = 1
        if strand == '+':
            r.end = r.beg + 1
        else:
            r.beg = r.end - 1

    # Potential FIXME: If args.fold is included, then every line will be printed twice - once for this if-block and once
    #                  for the strand if-else block. Further, the indices between the two prints will differ, as
    #                  args.fold makes the index equivalent between the upstream and downstream windows. This could be
    #                  solved by placing a return as the last expression in the block
    if args.fold:
        r.sample_backward(args, lambda i: -i-1)
        r.sample_internal(args, lambda i: min(i, args.numinternal-1-i))
        r.sample_forward(args, lambda i: -i-1)

    if strand == '+':
        # Create probe regions for forward strand (+) intervals
        r.sample_backward(args, lambda i: -i-1)
        r.sample_internal(args, lambda i: i)
        r.sample_forward(args, lambda i: args.numinternal+i)
    else:
        # For reverse strand (-) intervals, moving upstream from the CTCF site is actually moving downstream relative
        # to the reference (i.e., the position value gets larger)
        r.sample_forward(args, lambda i: -i-1)
        r.sample_internal(args, lambda i: args.numinternal-1-i)
        r.sample_backward(args, lambda i: args.numinternal+i)

    return

def main(args):
    """Main function to run.

    Inputs -
        args   - argparse.parse_args() from running script
    Returns -
        None
    """
    # If collapsing interval to the middle, then (re)set the number of points sampled in the internal interval
    if args.collapse:
        if args.outer: # if args.outer, then keep the collapsed middle for internal probe
            args.numinternal = 1
        else: # ignore internal interval if --outer is not set
            args.numinternal = 0

    # Loop through intervals in BED file
    r0 = None # previous record (interval), only used with args.flanktoneighbor
    r  = None # current record (interval)
    r2 = None # next record (interval), only used with args.flanktoneighbor
    for line in args.table:
        r2 = Record(line.strip('\n').split('\t'), args)

        if r is not None: # skip the first line
            if r.chrm == r2.chrm: # current and next intervals fall on the same chromosome
                process_record(r, r0, r2)
            else: # next interval falls on a different chromosome
                process_record(r, r0, None)

        r0 = r
        r  = r2

    # Handle the last record
    r2 = None
    process_record(r, r0, r2)

    return

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate meta gene')
    parser.add_argument('table', help="Input bed file (use '-' for stdin)", type = argparse.FileType('r'), nargs='?', default='-')

    # How to handle intervals
    parser.add_argument(
        '--middle',
        action = 'store_true',
        help = 'use middle point of each probe window'
    )
    parser.add_argument(
        '--outer',
        action = 'store_true',
        help = 'use outer point of each probe window (w.r.t. the target region)'
    )
    parser.add_argument(
        '--collapse',
        action = 'store_true',
        help = 'collapse internal interval to middle - all probe windows will now be relative to this location'
    )

    # Control how the probe window sizes are determined
    # Set step size to region of interest size - highest precedence when setting probe step size
    parser.add_argument(
        '--flankbygene',
        action = 'store_true',
        help = 'allow the size of steps to vary according to the region of interest length'
    )
    # Set step size to be flanknumber equal steps to next region - second highest precedence when setting step size
    parser.add_argument(
        '--flanktoneighbor',
        action = 'store_true',
        help = 'size of step is dependent on the closest region of interest to current region'
    )
    # Create equal size steps with length args.flankstep - used when flankbygene and flanktoneighbor are not set
    parser.add_argument(
        '-f', '--flankstep',
        type = int,
        default=100,
        help = 'set step size to X bases (default 100)'
    )
    parser.add_argument(
        '-m', '--flanknumber',
        type = int,
        default = 30,
        help = 'number of points to sample outside of the region of interest (default 30)'
    )

    # Expand probe windows - expands window in both directions, occurs for internal and external windows
    parser.add_argument(
        '--expansion',
        type = int,
        default = 0,
        help = 'number of bases to expand window, expands window in both directions (default 0)'
    )

    # Control internal sampling
    parser.add_argument(
        '-n', '--numinternal',
        type = int,
        default = 30,
        help = 'number of points to sample in the region of interest, --middle ignores this (default 30)'
    )

    # Other inputs
    parser.add_argument(
        '--fold',
        action = 'store_true',
        help = 'use the same index for intervals on both sides of the target region (usually used when strand is irrelevant)'
    )
    parser.add_argument(
        '-s', '--strand',
        type = int,
        default = None,
        help = 'the column which contains strand information in input file - if None then ignore strand (default None)'
    )
    parser.add_argument(
        '--ignoreend',
        action ='store_true',
        help = 'ignore the end of the input interval'
    )

    parser.set_defaults(func=main)
    args = parser.parse_args()

    try:
        args.func(args)
    except IOError:
        # silently absorb broken-pipe errors (e.g., when output is piped into head)
        pass
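
Since the script writes BED-like windows to stdout, it can also be run by hand; a hypothetical invocation (the script's filename is assumed here):

python make_metagene_windows.py --flanktoneighbor -m 30 -n 30 -s 6 regions.bed > windows.bed

This samples 30 windows on each side of every region (sized to span the gap to the neighboring region), 30 windows inside it, and takes strand from column 6 of regions.bed. Each output line is: chromosome, window start, window end, window index, region label, area flag, followed by the original BED fields.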
rename_fastqs <- function(samplesheet, fastq_dir, out_dir) {
    sheet <- read.delim(samplesheet, sep="\t")

    if (!dir.exists(out_dir)) {
        dir.create(file.path(out_dir), recursive = TRUE)
    }

    for (samp in sheet$sample) {
        print(paste("Renaming PE reads for:", samp))

        # Rename R1 reads
        file_1 <- paste0(fastq_dir, "/", unlist(strsplit(subset(sheet, sample == samp)$fq1, split = ",")))
        print(paste("Found", length(file_1), "R1 files for", samp))

        for (i in seq_along(file_1)) {
            if (file.exists(file_1[i])) {
                new_name <- paste0(out_dir, "/", samp, "-", i, "-R1.fastq.gz")
                # -r creates a relative symlink and requires GNU coreutils ln
                system(paste0("ln -sr ", file_1[i], " ", new_name))
                print(paste(file_1[i], "successfully symlinked to", new_name))
            } else {
                stop(paste("Error:", file_1[i], "not found in", fastq_dir))
            }
        }

        # Rename R2 reads
        file_2 <- paste0(fastq_dir, "/", unlist(strsplit(subset(sheet, sample == samp)$fq2, split = ",")))
        print(paste("Found", length(file_2), "R2 files for", samp))

        for (i in seq_along(file_2)) {
            if (file.exists(file_2[i])) {
                new_name <- paste0(out_dir, "/", samp, "-", i, "-R2.fastq.gz")
                # -r creates a relative symlink and requires GNU coreutils ln
                system(paste0("ln -sr ", file_2[i], " ", new_name))
                print(paste(file_2[i], "successfully symlinked to", new_name))
            } else {
                stop(paste("Error:", file_2[i], "not found in", fastq_dir))
            }
        }
    }
}

rename_fastqs(snakemake@params[["samplesheet"]], snakemake@params[["fastq_dir"]], snakemake@output[["symlink_dir"]])
From scripts/rename.R
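
rename.R assumes the sample sheet is tab-delimited with sample, fq1, and fq2 columns, where samples split across multiple FASTQ pairs list them comma-separated. A hypothetical sheet:

sample	fq1	fq2
sampleA	sampleA_L001_R1.fastq.gz,sampleA_L002_R1.fastq.gz	sampleA_L001_R2.fastq.gz,sampleA_L002_R2.fastq.gz
sampleB	sampleB_R1.fastq.gz	sampleB_R2.fastq.gz

This would produce the symlinks sampleA-1-R1.fastq.gz, sampleA-2-R1.fastq.gz, sampleA-1-R2.fastq.gz, sampleA-2-R2.fastq.gz, sampleB-1-R1.fastq.gz, and sampleB-1-R2.fastq.gz in the output directory.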