Pipeline for generating official Genome in a Bottle stratifications


This repository contains the pipeline and configuration used for generating and evaluating the Genome in a Bottle stratifications. The stratification files were developed with the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team, the Genome in a Bottle Consortium (GIAB), and the Telomere-to-Telomere Consortium (T2T). They are intended as a standard resource of BED files for stratifying true positive, false positive, and false negative variant calls in challenging and targeted regions of the genome.

The stratification BED files are only accessible on the GIAB FTP site. They can be used with GA4GH benchmarking tools such as hap.py.
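To give an intuition for what "stratifying" variant calls means, the sketch below checks which call positions fall inside one stratification's intervals. The interval data here is entirely made up for illustration; real benchmarking should go through hap.py with the provided .tsv files rather than ad-hoc code like this.

```python
# Minimal sketch: a stratification is a set of BED-style intervals
# (0-based, half-open); stratifying calls means partitioning them by
# whether their position falls inside those intervals.
from bisect import bisect_right

def in_strata(intervals, pos):
    """intervals: sorted, non-overlapping (start, end) pairs; pos is 0-based."""
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

# hypothetical low-complexity intervals on one chromosome
strata = [(100, 200), (500, 800)]

calls = [150, 250, 500, 799, 800]
inside = [p for p in calls if in_strata(strata, p)]
print(inside)  # [150, 500, 799]
```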

Note that this pipeline applies to stratification versions starting with v3.2. Older pipelines and scripts can be found here.

Within this repository, further details (including all source files) can be found in the configuration at config/all.yml. The pipeline itself is maintained in a separate, reference-agnostic repository here. Please refer to that pipeline for specific methodological information about how these files are produced.

Currently, this pipeline is set up to build stratifications for the following references:

  • GRCh37

  • GRCh38

  • CHM13

Versioning

Version tags follow the format X.Y-Z, where X is incremented for removal of old stratifications and/or large deletions, Y is incremented for additions and minor revisions, and Z is incremented for documentation updates in this repository. Note that X.Y (without Z) corresponds to the versions located on the FTP site.
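The X.Y-Z scheme above can be made concrete with a small parsing sketch (the helper names are hypothetical, not part of this repository):

```python
# Parse an X.Y-Z version tag into a comparable (X, Y, Z) tuple; the
# X.Y prefix (without Z) is what corresponds to the FTP-site version.
import re

def parse_tag(tag):
    m = re.fullmatch(r"(\d+)\.(\d+)-(\d+)", tag)
    if m is None:
        raise ValueError(f"not an X.Y-Z tag: {tag}")
    return tuple(map(int, m.groups()))

def ftp_version(tag):
    x, y, _ = parse_tag(tag)
    return f"{x}.{y}"

print(ftp_version("3.2-1"))                       # 3.2
print(parse_tag("3.3-0") > parse_tag("3.2-1"))    # True (tuple comparison)
```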

For precise differences between each stratification version, see the stratdiff tool.

General Information

Author Information

  • Principal Investigator: Justin Zook, NIST, jzook@nist.gov

  • Nate Olson, NIST, nathanael.olson@nist.gov

  • Justin Wagner, NIST, justin.wagner@nist.gov

  • Jennifer McDaniel, NIST, jennifer.mcdaniel@nist.gov

  • Nate Dwarshuis, NIST, nathan.dwarshuis@nist.gov

Sharing/Access Information

Licenses/restrictions placed on the data, or limitations of reuse: Publicly released data are freely available for reuse without embargo.

Citations for stratifications are located in the associated READMEs.

If stratifications were used in benchmarking with GA4GH/GIAB best practices or hap.py, please reference:

Krusche, P., Trigg, L., Boutros, P.C. et al. 
Best practices for benchmarking germline small-variant calls in human genomes. 
Nat Biotechnol 37, 555-560 (2019). https://doi.org/10.1038/s41587-019-0054-x

Links to publicly accessible locations of the data:

GIAB FTP URL

  • Individual stratification BED files as well as zipped directories (tar.gz) of files

  • stratification READMEs

  • .tsv files for benchmarking with hap.py

  • MD5 checksums

Summary Of Stratifications

Stratifications are binned into several types: Low Complexity, Segmental Duplications, XY, Functional Regions, GC Content, Mappability, Telomeres, Union, Ancestry, Genome Specific, Functional Technically Difficult, and Other Difficult. General information for each stratification type is provided below.

These categories are present for all indicated references unless otherwise noted.

Low Complexity

Regions with different types and sizes of low complexity sequence, e.g., homopolymers, STRs, VNTRs and other locally repetitive sequences.
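As a toy illustration of one low-complexity class, the sketch below finds perfect homopolymer runs of at least a given length and reports them as 0-based BED-like intervals. The production pipeline uses a dedicated repseq tool (see the Code Snippets below); this regex version is for intuition only.

```python
# Find homopolymer runs (one of the low-complexity classes) of length
# >= min_len, returned as (start, end, base) with 0-based half-open ends.
import re

def homopolymers(seq, min_len):
    return [(m.start(), m.end(), m.group()[0])
            for m in re.finditer(r"(A+|C+|G+|T+)", seq.upper())
            if m.end() - m.start() >= min_len]

print(homopolymers("ACGTTTTTACAAAAG", 4))  # [(3, 8, 'T'), (10, 14, 'A')]
```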

Segmental Duplications

Regions with segmental duplications (generally defined as repeated regions >1kb with >90% similarity).

XY

Chromosome X and Y specific regions such as PAR, XTR, or ampliconic regions.

Functional Regions

Regions to stratify variants inside and outside coding regions.

GC Content

Regions with different ranges (%) of GC content.
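The following is an illustrative pure-Python analogue of binning fixed-size sequence windows by GC content. The real pipeline computes this with seqtk gc plus bedtools (see the Code Snippets below); the sequence and thresholds here are made up.

```python
# Report windows of a fixed size whose GC percentage falls in [lo, hi),
# mimicking what a GC-content stratification measures.
def gc_fraction(seq):
    seq = seq.upper()
    return sum(seq.count(b) for b in "GC") / len(seq)

def windows_in_range(seq, size, lo, hi):
    out = []
    for start in range(0, len(seq) - size + 1, size):
        gc = 100 * gc_fraction(seq[start:start + size])
        if lo <= gc < hi:
            out.append((start, start + size))
    return out

seq = "ATATATATAT" + "GCGCGCGCGC" + "ATGCATGCAT"
print(windows_in_range(seq, 10, 40, 90))  # [(20, 30)]
```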

Mappability

Regions where short read mapping can be challenging.

Telomeres

Telomeric regions. These are included for CHM13 only.

Union

Regions with different general types of difficult regions or any type of difficult region or complex variant. For example, performance can be measured in just "easy" or "all difficult" regions of the genome.
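The union idea above can be sketched with plain interval arithmetic: merge several difficult-region interval sets into one "all difficult" set, whose complement over a chromosome span is the "easy" set. The intervals here are hypothetical; the pipeline itself does this with multiIntersectBed, mergeBed, and complementBed.

```python
# Union of difficult-region sets via interval merging; the complement
# over a chromosome span gives the "easy" regions.
def merge(intervals):
    out = []
    for s, e in sorted(intervals):
        if out and s <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out

def complement(intervals, span):
    easy, prev = [], span[0]
    for s, e in merge(intervals):
        if s > prev:
            easy.append((prev, s))
        prev = max(prev, e)
    if prev < span[1]:
        easy.append((prev, span[1]))
    return easy

difficult = merge([(0, 50), (40, 100), (300, 400)])   # union of two sources
print(difficult)                        # [(0, 100), (300, 400)]
print(complement(difficult, (0, 500)))  # [(100, 300), (400, 500)]
```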

Ancestry

Regions with inferred patterns of local ancestry.

These only exist for GRCh38.

Genome Specific (GIAB benchmark v4.2.1)

Difficult regions due to potentially difficult variation in a NIST/GIAB sample, including 1) regions containing putative compound heterozygous variants, 2) small regions containing multiple phased variants, and 3) regions with potential structural or copy number variation.

These only exist for GRCh37/38.

Functional Technically Difficult

Functional, or potentially functional, regions that are also likely to be technically difficult to sequence.

These only exist for GRCh37/38.

Other Difficult

These are regions that are difficult to sequence for miscellaneous reasons. In the case of GRCh37/38, these include regions near gaps or errors in the reference, or regions known to be expanded/collapsed relative to most other genomes.

This category also includes highly polymorphic regions, which are difficult to sequence for that reason, such as the major histocompatibility complex (MHC), the VDJ regions, and the regions coding the killer-cell immunoglobulin-like receptors (KIR).

Finally, in the case of CHM13, this category also includes the rDNA arrays.

Data-Use Policy

This data/work was created by employees of the National Institute of Standards and Technology (NIST), an agency of the Federal Government. Pursuant to title 17 United States Code Section 105, works of NIST employees are not subject to copyright protection in the United States. This data/work may be subject to foreign copyright.

The data/work is provided by NIST as a public service and is expressly provided AS IS. NIST MAKES NO WARRANTY OF ANY KIND, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT AND DATA ACCURACY. NIST does not warrant or make any representations regarding the use of the data or the results thereof, including but not limited to the correctness, accuracy, reliability or usefulness of the data. NIST SHALL NOT BE LIABLE AND YOU HEREBY RELEASE NIST FROM LIABILITY FOR ANY INDIRECT, CONSEQUENTIAL, SPECIAL, OR INCIDENTAL DAMAGES (INCLUDING DAMAGES FOR LOSS OF BUSINESS PROFITS, BUSINESS INTERRUPTION, LOSS OF BUSINESS INFORMATION, AND THE LIKE), WHETHER ARISING IN TORT, CONTRACT, OR OTHERWISE, ARISING FROM OR RELATING TO THE DATA (OR THE USE OF OR INABILITY TO USE THIS DATA), EVEN IF NIST HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

To the extent that NIST may hold copyright in countries other than the United States, you are hereby granted the non-exclusive irrevocable and unconditional right to print, publish, prepare derivative works and distribute the NIST data, in any medium, or authorize others to do so on your behalf, on a royalty-free basis throughout the world.

You may improve, modify, and create derivative works of the data or any portion of the data, and you may copy and distribute such modifications or works. Modified works should carry a notice stating that you changed the data and should note the date and nature of any such change. Please explicitly acknowledge the National Institute of Standards and Technology as the source of the data: Data citation recommendations are provided at https://www.nist.gov/open/license.

Permission to use this data is contingent upon your acceptance of the terms of this agreement and upon your providing appropriate acknowledgments of NIST’s creation of the data/work.

Running this pipeline

This repository's existence on GitHub is primarily for documentation purposes, as the runtime is defined in terms of NIST-specific resources.

However, for those wishing to run the pipeline themselves, there are two options (with or without a cluster). Both require cloning this repo with submodules and setting up the conda environment:

git clone --recurse-submodules https://github.com/ndwarshuis/giab-stratifications.git
cd giab-stratifications
conda env create -f env.yml
conda activate giab-strats-production

No Cluster Environment

Just call the pipeline directly using snakemake:

snakemake --cores 20 --configfile=config/all.yml all

Note that most rules in the pipeline require 12MB of memory or less, with the exception of mappability (which runs GEM and requires ~32GB for references of the size used here).

Cluster Environment

First, create a profile based on the one located at workflow/profiles/nisaba/config.yml (which is designed for a NIST-specific cluster running Slurm).

Then run snakemake with this profile:

snakemake --cores 20 --configfile=config/all.yml --profile=workflow/profiles/yerprofile all

Code Snippets

script:
    "../scripts/python/bedtools/functional/filter_cds.py"
shell:
    """
    mergeBed -i {input.bed} | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
shell:
    """
    complementBed -i {input.bed} -g {input.genome} | \
    intersectBed -a stdin -b {input.gapless} -sorted | \
    bgzip -c > {output}
    """
shell:
    """
    seqtk gc {params.args} -l 100 {input.ref} | \
    cut -f1-3 | \
    slopBed -i stdin -g {input.genome} -b 50 | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c \
    > {output}
    """
script:
    "../scripts/python/bedtools/gc/intersect_ranges.py"
SnakeMake From line 99 of rules/gc.smk
shell:
    """
    gunzip -c {input.ref} | \
    sed 's/^[A-Za-z]\\+$/\\U&/' | \
    {input.bin} {wildcards.unit_len} {wildcards.total_len} - 2> {log} \
    > {output} 
    """
shell:
    "sed -n '/unit=[{wildcards.bases}]\+/p' {input} > {output}"
shell:
    "subtractBed -a {input.a} -b {input.b} > {output}"
shell:
    "subtractBed -a {input.a} -b {input.b} > {output}"
shell:
    """
    slopBed -i {input} -b 5 -g {params.genome} | \
    cut -f1-3 | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {params.gapless} -sorted -g {params.genome} | \
    bgzip -c \
    > {output}
    """
shell:
    """
    grep 'unit={wildcards.base}' {input} | \
    mergeBed -i stdin -d 1 | \
    awk '$3-$2>={wildcards.merged_len}' > \
    {output}
    """
shell:
    """
    multiIntersectBed -i {input} | \
    slopBed -i stdin -b 5 -g {params.genome} | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {params.gapless} -sorted | \
    bgzip -c > {output}
    """
script:
    "../scripts/python/bedtools/low_complexity/filter_sort_trf.py"
shell:
    """
    gunzip -c {input} | \
    mergeBed -i stdin | \
    bgzip -c > {output}
    """
script:
    "../scripts/python/bedtools/low_complexity/filter_sort_rmsk.py"
shell:
    """
    gunzip -c {input.rmsk} | \
    grep {wildcards.rmsk_class} | \
    mergeBed -i stdin | \
    bgzip -c > {output}
    """
script:
    "../scripts/python/bedtools/low_complexity/filter_sort_censat.py"
shell:
    "mergeBed -i {input} | bgzip -c > {output}"
shell:
    """
    slopBed -i {input.bed} -b 5 -g {input.genome} | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
shell:
    """
    complementBed -i {input.bed} -g {params.genome} |
    intersectBed -a stdin -b {params.gapless} -sorted | \
    bgzip -c > {output}
    """
shell:
    """
    slopBed -i {input.perfect} -b 5 -g {input.genome} | \
    mergeBed -i stdin | \
    multiIntersectBed -i stdin {input.imperfect} | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c \
    > {output}
    """
shell:
    """
    multiIntersectBed -i {input.beds} | \
    slopBed -i stdin -b 5 -g {input.genome} | \
    mergeBed -i stdin > {output}
    """
shell:
    """
    awk '({params.lower}+10)<=($3-$2) && ($3-$2)<({params.upper}+10)' {input.tr} | \
    subtractBed -a stdin -b {input.hp} | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
shell:
    """
    multiIntersectBed -i {input} | \
    mergeBed -i stdin | \
    bgzip -c > {output}
    """
shell:
    "curl -sS -L -o {output} {params.url}"
shell:
    """
    mkdir -p {config.tools_bin_dir} && \
    tar xjf {input} \
    --directory {config.tools_bin_dir} \
    --strip-components=2 \
    {params.bins}
    """
script:
    "../scripts/python/bedtools/mappability/filter_ref.py"
shell:
    """
    PATH={config.tools_bin_dir}:$PATH
    {input.bin} \
    --complement emulate \
    -T {threads} \
    -i {input.fa} \
    -o {params.base} > {log} 2>&1
    """
shell:
    """
    {input.bin} \
    -m {wildcards.m} \
    -e {wildcards.e} \
    -l {wildcards.l} \
    -T {threads} \
    -I {input.fa} \
    -o {params.base} > {log} 2>&1
    """
shell:
    """
    {input.bin} \
    -I {input.idx} \
    -i {input.map} \
    -o {params.base} 2> {log}
    """
shell:
    """
    sed 's/ AC//' {input} | \
    wig2bed -d | \
    awk '$5>0.9' | \
    cut -f1-3 | \
    gzip -c > \
    {output}
    """
script:
    "../scripts/python/bedtools/mappability/merge_nonunique.py"
shell:
    """
    complementBed -i {input} -g {params.genome} | \
    intersectBed -a stdin -b {params.gapless} -sorted -g {params.genome} | \
    bgzip -c > {output}
    """
shell:
    """
    complementBed -i {input.gapless} -g {input.genome} | \
    slopBed -i stdin -b 15000 -g {input.genome} | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
script:
    "../scripts/python/bedtools/otherdifficult/filter_vdj.py"
shell:
    """
    mergeBed -i {input.bed} | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
script:
    "../scripts/python/bedtools/other/filter_sort_other.py"
script:
    "../scripts/python/bedtools/postprocess/list_strats.py"
script:
    "../scripts/python/bedtools/postprocess/list_md5.py"
script:
    "../scripts/python/bedtools/postprocess/generate_tsv.py"
script:
    "../scripts/python/bedtools/postprocess/run_unit_tests.py"
shell:
    """
    curl -fsSq -o {params.tar} {params.url} && \
    mkdir {output} && \
    tar xzf {params.tar} --directory {output} --strip-components=1
    """
script:
    "../scripts/python/bedtools/postprocess/diff_previous_strats.py"
run:
    with open(output[0], "w") as f:
        for ref_key, build_key in zip(rks, bks):
            for i in config.buildkey_to_chr_indices(ref_key, build_key):
                pattern = config.refkey_to_final_chr_pattern(ref_key)
                line = [
                    str(i.value),
                    f"{ref_key}@{build_key}",
                    i.chr_name_full(pattern),
                ]
                f.write("\t".join(line) + "\n")
script:
    "../scripts/rmarkdown/rmarkdown/validate.Rmd"
shell:
    "gunzip -c {input} > {output}"
shell:
    """
    samtools faidx {input}
    """
script:
    "../scripts/python/bedtools/postprocess/run_happy.py"
script:
    "../scripts/rmarkdown/rmarkdown/benchmark.Rmd"
shell:
    """
    cp {input.main} {output.main}
    cp {input.validation} {output.validation}
    """
shell:
    """
    tar czf {output} -C {params.parent} {params.target}
    """
shell:
    """
    find {params.root} -type f -exec md5sum {{}} + | \
    sed 's|{params.root}|\.|' | \
    grep -v "\./genome-stratifications-md5s\.txt" | \
    sort -k2,2 > {output} && \
    cd {params.root} && \
    md5sum -c --quiet genome-stratifications-md5s.txt
    """
script:
    "../scripts/python/bedtools/xy/write_par.py"
SnakeMake From line 21 of rules/ref.smk
script:
    "../scripts/python/bedtools/misc/get_file.py"
SnakeMake From line 35 of rules/ref.smk
shell:
    """
    gunzip -c {input} | \
    samtools faidx - -o - \
    2> {log} > {output}
    """
script:
    "../scripts/python/bedtools/ref/get_genome.py"
SnakeMake From line 65 of rules/ref.smk
shell:
    """
    samtools faidx {input.fa} $(cut -f1 {input.genome} | tr '\n' ' ') 2> {log} | \
    bgzip -c > {output}
    """
script:
    "../scripts/python/bedtools/ref/get_gapless.py"
SnakeMake From line 123 of rules/ref.smk
script:
    "../scripts/python/bedtools/ref/get_ftbl_mapper.py"
SnakeMake From line 154 of rules/ref.smk
shell:
    """
    gunzip -c {input} | \
    grep -v '^#' | \
    awk -F "\t" 'OFS="\t" {{ print $1,$4,$5,$2,$3,$9 }}' | \
    bgzip -c > {output}
    """
SnakeMake From line 165 of rules/ref.smk
script:
    "../scripts/python/bedtools/ref/filter_sort_bench_bed.py"
SnakeMake From line 213 of rules/ref.smk
shell:
    "curl -sS -L -o {output} {params.url}"
shell:
    """
    mkdir {output} && \
    tar xzf {input} --directory {output} --strip-components=1
    """
shell:
    "make -C {input} 2>&1 > {log} && mv {input}/repseq {output}"
script:
    "../scripts/python/bedtools/segdups/filter_sort_superdups.py"
shell:
    """
    mergeBed -i {input.bed} -d 100 | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
shell:
    """
    gunzip -c {input} | \
    awk '($3-$2 > 10000)' | \
    bgzip -c > {output}
    """
shell:
    """
    complementBed -i {input.bed} -g {input.genome} | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
shell:
    """
    seqtk telo {input.ref} 2> {log} | \
    cut -f1-3 | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c \
    > {output}
    """
shell:
    """
    multiIntersectBed -i {input} | \
    mergeBed -i stdin | \
    bgzip -c \
    > {output}
    """
shell:
    """
    complementBed -i {input.bed} -g {params.genome} |
    intersectBed -a stdin -b {params.gapless} -sorted -g {params.genome} | \
    bgzip -c \
    > {output}
    """
shell:
    """
    intersectBed -a {input.bed} -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
script:
    "../scripts/python/bedtools/xy/filter_sort_features.py"
SnakeMake From line 54 of rules/xy.smk
shell:
    """
    complementBed -i {input.bed} -g {input.genome} | \
    grep {wildcards.sex_chr} | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
shell:
    """
    awk -v OFS='\t' {{'print $1,\"0\",$2'}} {input.bed} | \
    grep -v \"X\|Y\" | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
import json
from typing import Any
import common.config as cfg
from common.bed import filter_sort_bed_inner, read_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    to_map = sconf.buildkey_to_final_chr_mapping(rk, bk)

    with open(smk.input["mapper"], "r") as f:
        from_map = json.load(f)

    df = read_bed(smk.input["bed"], more=[3, 4])
    df = df[df[3].str.contains("RefSeq") & (df[4] == "CDS")][[0, 1, 2]].copy()
    df = filter_sort_bed_inner(from_map, to_map, df)

    df.to_csv(
        smk.output[0],
        header=False,
        index=False,
        sep="\t",
    )


main(snakemake, snakemake.config)  # type: ignore
from pathlib import Path
from typing import Any, NamedTuple, Callable
import subprocess as sp
import common.config as cfg
import json

# use subprocess here because I don't see an easy way to stream a bedtools
# python object to a bgzip file (it only seems to do gzip...oh well)


class GCInput(NamedTuple):
    bed: Path
    fraction: int
    is_range_bound: bool


def write_simple_range_beds(
    final_path: Callable[[str], Path],
    gs: list[GCInput],
    is_low: bool,
) -> list[str]:
    def fmt_out(bigger_frac: int, smaller_frac: int) -> Path:
        lower_frac, upper_frac = (
            (smaller_frac, bigger_frac) if is_low else (bigger_frac, smaller_frac)
        )
        return final_path(f"gc{lower_frac}to{upper_frac}_slop50")

    pairs = zip(gs[1:], gs[:-1]) if is_low else zip(gs[:-1], gs[1:])
    torun = [
        (fmt_out(bigger.fraction, smaller.fraction), bigger, smaller)
        for bigger, smaller in pairs
    ]
    for out, bigger, smaller in torun:
        with open(out, "wb") as f:
            p0 = sp.Popen(
                ["subtractBed", "-a", bigger.bed, "-b", smaller.bed],
                stdout=sp.PIPE,
            )
            p1 = sp.run(["bgzip", "-c"], stdin=p0.stdout, stdout=f)
            p0.wait()
            if not (p0.returncode == p1.returncode == 0):
                exit(1)
    return [str(t[0]) for t in torun]


def write_middle_range_bed(
    final_path: Callable[[str], Path],
    lower: GCInput,
    upper: GCInput,
    genome: Path,
    gapless: Path,
) -> str:
    out = final_path(f"gc{lower.fraction}to{upper.fraction}_slop50")
    with open(out, "wb") as f:
        p0 = sp.Popen(
            ["complementBed", "-i", upper.bed, "-g", genome],
            stdout=sp.PIPE,
        )
        p1 = sp.Popen(
            ["subtractBed", "-a", "stdin", "-b", lower.bed],
            stdin=p0.stdout,
            stdout=sp.PIPE,
        )
        p2 = sp.Popen(
            ["intersectBed", "-a", "stdin", "-b", gapless, "-sorted", "-g", genome],
            stdin=p1.stdout,
            stdout=sp.PIPE,
        )
        p3 = sp.run(["bgzip", "-c"], stdin=p2.stdout, stdout=f)
        p0.wait()
        p1.wait()
        p2.wait()
        if not (p0.returncode == p1.returncode == p2.returncode == p3.returncode == 0):
            exit(1)
    return str(out)


def write_intersected_range_beds(
    final_path: Callable[[str], Path],
    low: list[GCInput],
    high: list[GCInput],
) -> list[str]:
    pairs = zip(
        [x for x in low if x.is_range_bound],
        [x for x in reversed(high) if x.is_range_bound],
    )
    torun = [
        (i, final_path(f"gclt{b1.fraction}orgt{b2.fraction}_slop50"), b1, b2)
        for i, (b1, b2) in enumerate(pairs)
    ]
    for i, bed_out, b1, b2 in torun:
        with open(bed_out, "wb") as f:
            p0 = sp.Popen(
                ["multiIntersectBed", "-i", b1.bed, b2.bed],
                stdout=sp.PIPE,
            )
            p1 = sp.Popen(
                ["mergeBed", "-i", "stdin"],
                stdin=p0.stdout,
                stdout=sp.PIPE,
            )
            p2 = sp.run(["bgzip", "-c"], stdin=p1.stdout, stdout=f)
            p0.wait()
            p1.wait()
            if not (p0.returncode == p1.returncode == p2.returncode == 0):
                exit(1)

    return [str(t[1]) for t in torun]


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards.ref_key)
    bk = cfg.BuildKey(smk.wildcards.build_key)
    gps = sconf.buildkey_to_include(rk, bk).gc
    assert gps is not None, "this should not happen"
    # ASSUME both of these input lists are sorted by GC fraction
    low = [GCInput(p, f, r) for p, (f, r) in zip(smk.input.low, gps.low)]
    high = [GCInput(p, f, r) for p, (f, r) in zip(smk.input.high, gps.high)]
    genome = Path(smk.input.genome)
    gapless = Path(smk.input.gapless)

    def final_path(name: str) -> Path:
        p = Path(str(smk.params.path_pattern).format(name))
        p.parent.mkdir(exist_ok=True, parents=True)
        return p

    low_strats = write_simple_range_beds(final_path, low, True)
    high_strats = write_simple_range_beds(final_path, high, False)
    range_strat = write_middle_range_bed(
        final_path,
        low[-1],
        high[0],
        genome,
        gapless,
    )
    inter_strats = write_intersected_range_beds(
        final_path,
        low,
        high,
    )

    with open(smk.output[0], "w") as f:
        # put the first low and last high input here since these are already
        # in the final directory
        obj = {
            "gc_ranges": [
                low[0][0],
                *low_strats,
                range_strat,
                *high_strats,
                high[-1][0],
            ],
            "widest_extreme": inter_strats[0],
            "other_extremes": inter_strats[1:],
        }
        json.dump(obj, f, indent=4)


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
import common.config as cfg
from common.bed import read_bed, filter_sort_bed, write_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])

    bedfile = sconf.refkey_to_strat(rk).low_complexity.satellites
    assert bedfile is not None, "this should not happen"

    ps = bedfile.params
    conv = sconf.buildkey_to_chr_conversion(rk, bk, ps.chr_pattern)

    df = read_bed(smk.input[0], ps, [bedfile.sat_col])
    df = filter_sort_bed(conv, df)
    df = df[~df[3].str.startswith("ct_")]
    write_bed(smk.output[0], df)


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
import common.config as cfg
from common.bed import read_filter_sort_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    rm = sconf.refkey_to_strat(rk).low_complexity.rmsk
    assert rm is not None, "this should not happen"
    read_filter_sort_bed(
        sconf, smk.input[0], smk.output[0], rm.params, rk, bk, [rm.class_col]
    )


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
import common.config as cfg
from common.bed import read_filter_sort_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    ss = sconf.refkey_to_strat(rk).low_complexity.simreps
    assert ss is not None, "this should not happen"
    read_filter_sort_bed(sconf, smk.input[0], smk.output[0], ss.params, rk, bk)


main(snakemake, snakemake.config)  # type: ignore
import re
from typing import Any
import pandas as pd
import common.config as cfg
import subprocess as sp
from pathlib import Path

# This is the one exception to the pattern wherein we filter all input data to
# the chromosomes we actually care about. In the case of mappability, we want to
# run GEM against all random/unplaced contigs, since these are hard to map to
# begin with (otherwise they wouldn't be unplaced). Here, we need to filter
# the fasta to include all the primary contigs we care about + all the other
# unplaced/random contigs. Later we will filter out all the non-primary contigs.


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])

    pattern = sconf.refkey_to_final_chr_pattern(rk)
    pats = sconf.refkey_to_mappability_patterns(rk)
    cis = sconf.buildkey_to_chr_indices(rk, bk)

    idx = pd.read_table(
        Path(smk.input["idx"]),
        header=None,
        dtype={0: str},
        usecols=[0],
    )

    chrs = [
        *filter(
            lambda c: any(i.chr_name_full(pattern) == c for i in cis)
            or any(re.match(p, c) for p in pats),
            idx[0].tolist(),
        )
    ]

    with open(Path(smk.output[0]), "w") as f:
        # ASSUME sort order doesn't matter and ref is bgzip'd or unzip'd
        # NOTE the output is not bgzip'd because GEM doesn't like it
        p = sp.Popen(["samtools", "faidx", Path(smk.input["fa"]), *chrs], stdout=f)
        p.wait()


main(snakemake, snakemake.config)  # type: ignore
import re
from typing import Any
from pathlib import Path
import common.config as cfg
from common.bed import read_bed, filter_sort_bed, write_bed
from pybedtools import BedTool as bt  # type: ignore
import json


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])

    pattern = sconf.refkey_to_final_chr_pattern(rk)
    conv = cfg.fullset_conv(pattern)

    inputs = smk.input["bed"]
    genome = Path(smk.input["genome"][0])
    gapless = bt(smk.input["gapless"])

    def final_path(name: str) -> Path:
        p = Path(str(smk.params.path_pattern).format(name))
        p.parent.mkdir(exist_ok=True, parents=True)
        return p

    def to_single_output(basename: str) -> Path:
        m = re.match(".*(l\\d+_m\\d+_e\\d+).*", basename)
        assert m is not None, f"malformed mappability file name: {basename}"
        return final_path(f"nonunique_{m[1]}")

    def read_sort_bed(p: Path) -> bt:
        df = read_bed(p)
        # Sort here because we can't assume wig2bed sorts its output. Also,
        # filtering is necessary because the output should have unplaced contigs
        # in it that we don't want.
        return bt().from_dataframe(filter_sort_bed(conv, df))

    def merge_bed(bed: bt, out: Path) -> None:
        df = bed.merge(d=100).intersect(b=gapless, sorted=True, g=genome).to_dataframe()
        write_bed(out, df)

    def merge_single(bed: bt, out: Path) -> None:
        comp = bed.complement(g=str(genome))
        merge_bed(comp, out)

    all_lowmap = final_path("lowmappabilityall")
    single_lowmap = []

    # If there is only one input, merge this to make one "all_nonunique" bed.
    # Otherwise, merge each individual input and then combine these with
    # multi-intersect to make the "all_nonunique" bed.
    if len(inputs) == 1:
        bed = read_sort_bed(Path(inputs[0]))
        merge_single(bed, all_lowmap)
    else:
        single = [
            (
                to_single_output((p := Path(i)).name),
                read_sort_bed(p),
            )
            for i in inputs
        ]
        for sout, bed in single:
            merge_single(bed, sout)
        mint = bt().multi_intersect(i=[s[0] for s in single])
        merge_bed(mint, all_lowmap)
        single_lowmap = [str(s[0]) for s in single]

    with open(smk.output[0], "w") as f:
        obj = {"all_lowmap": str(all_lowmap), "single_lowmap": single_lowmap}
        json.dump(obj, f)


main(snakemake, snakemake.config)  # type: ignore
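The `to_single_output` helper above keys off the GEM-style parameter triplet embedded in each mappability file name. A minimal sketch of that extraction (the file name below is a hypothetical example, not taken from the pipeline config):

```python
import re


def nonunique_name(basename: str) -> str:
    # pull the lXX_mYY_eZZ (read length / mismatches / edit distance) triplet
    # out of a mappability file name and build the single-file strat name
    m = re.match(r".*(l\d+_m\d+_e\d+).*", basename)
    assert m is not None, f"malformed mappability file name: {basename}"
    return f"nonunique_{m[1]}"


print(nonunique_name("GRCh38_unique_l100_m2_e1.bed.gz"))
# → nonunique_l100_m2_e1
```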
from pathlib import Path
import subprocess as sp
from typing import Callable
from typing_extensions import assert_never
from tempfile import NamedTemporaryFile as Tmp
from common.io import is_bgzip, is_gzip
import common.config as cfg
from common.io import get_md5, setup_logging

# hacky curl/gzip wrapper; this exists because I got tired of writing
# specialized rules to convert gzip/nozip files to bgzip and back :/
# Solution: force bgzip for references and gzip for bed

smk_log = snakemake.log[0]  # type: ignore

log = setup_logging(smk_log)

GUNZIP = ["gunzip", "-c"]
BGZIP = ["bgzip", "-c"]
GZIP = ["gzip", "-c"]
CURL = ["curl", "-f", "-Ss", "-L", "-q"]


def main(opath: str, src: cfg.BedSrc | cfg.RefSrc | None) -> None:
    if isinstance(src, cfg.FileSrc_):
        # ASSUME these are already tested via the pydantic class for the
        # proper file format
        Path(opath).symlink_to(Path(src.filepath).resolve())

    elif isinstance(src, cfg.HttpSrc_):
        curlcmd = [*CURL, src.url]

        # to test the format of downloaded files, sample the first 65000 bytes
        # (which should be enough to get one block of a bgzip file, which will
        # allow us to test for it)
        curltestcmd = [*CURL, "-r", "0-65000", src.url]

        with open(opath, "wb") as f, Tmp() as tf, open(smk_log, "w") as lf:

            def curl() -> None:
                if sp.run(curlcmd, stdout=f, stderr=lf).returncode != 0:
                    exit(1)

            def curl_test(testfun: Callable[[Path], bool]) -> bool:
                if sp.run(curltestcmd, stdout=tf, stderr=lf).returncode != 0:
                    exit(1)
                return testfun(Path(tf.name))

            def curl_gzip(cmd: list[str]) -> None:
                p1 = sp.Popen(curlcmd, stdout=sp.PIPE, stderr=lf)
                p2 = sp.run(cmd, stdin=p1.stdout, stdout=f, stderr=lf)
                p1.wait()
                if not (p1.returncode == p2.returncode == 0):
                    exit(1)

            # if we are getting a bed file (or related) ensure it is in gzip
            # format
            if isinstance(src, cfg.BedHttpSrc):
                if curl_test(is_gzip):
                    curl()
                else:
                    curl_gzip(GZIP)

            # if we are getting a fasta file ensure it is in bgzip format
            elif isinstance(src, cfg.RefHttpSrc):
                if curl_test(is_bgzip):
                    curl()
                elif curl_test(is_gzip):
                    p1 = sp.Popen(curlcmd, stdout=sp.PIPE, stderr=lf)
                    p2 = sp.Popen(GUNZIP, stdin=p1.stdout, stdout=sp.PIPE, stderr=lf)
                    p3 = sp.run(BGZIP, stdin=p2.stdout, stdout=f, stderr=lf)
                    p1.wait()
                    p2.wait()
                    if not (p1.returncode == p2.returncode == p3.returncode == 0):
                        exit(1)
                else:
                    curl_gzip(BGZIP)
            else:
                assert_never(src)

    elif src is None:
        assert False, "file src is null; this should not happen"
    else:
        assert_never(src)

    if src.md5 is not None and src.md5 != (actual := get_md5(opath, True)):
        log.error("md5s don't match; wanted %s, actual %s", src.md5, actual)
        exit(1)


main(snakemake.output[0], snakemake.params.src)  # type: ignore
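The format tests above delegate to `is_gzip`/`is_bgzip` in `common.io`. A minimal sketch of how such magic-byte checks might be implemented (the real helpers may be stricter): any gzip member starts with `1f 8b`, and a BGZF block is a gzip member with the FEXTRA flag set so the header can carry the block-size subfield, giving the four-byte magic `1f 8b 08 04`.

```python
from pathlib import Path


def is_gzip(p: Path) -> bool:
    # any gzip member starts with the two magic bytes 1f 8b
    with open(p, "rb") as f:
        return f.read(2) == b"\x1f\x8b"


def is_bgzip(p: Path) -> bool:
    # BGZF additionally sets FEXTRA (0x04) in the flags byte; the extra
    # field holds the "BC" block-size subfield that tabix/hap.py rely on
    with open(p, "rb") as f:
        return f.read(4) == b"\x1f\x8b\x08\x04"
```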
import json
from typing import Any
import common.config as cfg
from common.bed import filter_sort_bed_inner, read_bed, write_bed

VDJ_PAT = "^ID=gene-(IGH|IGK|IGL|TRA|TRB|TRG);"


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    to_map = sconf.buildkey_to_final_chr_mapping(rk, bk)

    with open(smk.input["mapper"], "r") as f:
        from_map = json.load(f)

    df = read_bed(smk.input["bed"], more=[5])
    df = df[df[3].str.match(VDJ_PAT)][[0, 1, 2]].copy()
    df = filter_sort_bed_inner(from_map, to_map, df)

    write_bed(smk.output[0], df)


main(snakemake, snakemake.config)  # type: ignore
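The filter above keys off the GFF3 attribute column; a quick demonstration of what `VDJ_PAT` does and does not match (the attribute strings below are made-up examples):

```python
import re

VDJ_PAT = "^ID=gene-(IGH|IGK|IGL|TRA|TRB|TRG);"

# matches a VDJ locus gene record...
assert re.match(VDJ_PAT, "ID=gene-TRA;Name=TRA;gbkey=Gene") is not None
# ...but not an unrelated gene
assert re.match(VDJ_PAT, "ID=gene-BRCA1;Name=BRCA1") is None
```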
from typing import Any
import common.config as cfg
from common.bed import read_bed, filter_sort_bed, write_bed
from pybedtools import BedTool as bt  # type: ignore


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    lk = cfg.OtherLevelKey(smk.wildcards["other_level_key"])
    sk = cfg.OtherStratKey(smk.wildcards["other_strat_key"])
    bed = sconf.otherkey_to_bed(rk, bk, lk, sk)
    ps = bed.params
    conv = sconf.buildkey_to_chr_conversion(rk, bk, ps.chr_pattern)
    df = filter_sort_bed(conv, read_bed(smk.input["bed"][0], ps))
    if bed.remove_gaps:
        df = (
            bt()
            .from_dataframe(df)
            .intersect(
                smk.input["gapless"],
                sorted=True,
                g=smk.input["genome"][0],
            )
            .to_dataframe()
        )
    write_bed(smk.output[0], df)


main(snakemake, snakemake.config)  # type: ignore
from pathlib import Path
from typing import Any
import common.config as cfg
from common.stratdiff.lib.diff import compare_all
from common.io import setup_logging


log = setup_logging(snakemake.log[0])  # type: ignore


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards.ref_key)
    bk = cfg.BuildKey(smk.wildcards.build_key)
    comparison = sconf.buildkey_to_build(rk, bk).comparison
    pattern = sconf.refkey_to_final_chr_pattern(rk)
    ixs = sconf.buildkey_to_chr_indices(rk, bk)

    assert comparison is not None, "this should not happen"

    outdir = Path(smk.output[0]).parent

    logged = compare_all(
        Path(smk.input["new_list"]).parent,
        smk.input["old"],
        outdir,
        comparison.path_mapper,
        comparison.replacements,
        [i.chr_name_full(pattern) for i in ixs],
        comparison.ignore_generated,
        comparison.ignore_other,
    )
    for x in logged:
        log.info(x)


main(snakemake, snakemake.config)  # type: ignore
from pathlib import Path
from typing import Any
import re


def main(smk: Any) -> None:
    with open(smk.input[0], "r") as i, open(smk.output[0], "w") as o:
        for f in i:
            fp = Path(f.strip())
            level = re.sub("^[^_]+_", "", fp.name.replace(".bed.gz", ""))
            o.write(f"{level}\t{Path(fp.parent.name) / fp.name}\n")


main(snakemake)  # type: ignore
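The level extraction strips the reference prefix from each file name; a worked example with a hypothetical stratification path:

```python
import re
from pathlib import Path

# hypothetical final stratification path: <level dir>/<ref>_<name>.bed.gz
fp = Path("GRCh38/LowComplexity/GRCh38_AllTandemRepeats.bed.gz")

# drop the extension, then strip everything up to the first underscore
# (the reference prefix) to recover the stratification name
level = re.sub("^[^_]+_", "", fp.name.replace(".bed.gz", ""))
entry = f"{level}\t{Path(fp.parent.name) / fp.name}"
print(entry)
# → AllTandemRepeats	LowComplexity/GRCh38_AllTandemRepeats.bed.gz
```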
from typing import Any
from os.path import dirname
from common.io import get_md5


def strat_files(path: str) -> list[str]:
    with open(path, "r") as f:
        return [s.strip() for s in f]


def main(smk: Any) -> None:
    out = str(smk.output)
    root = dirname(out)
    ss = strat_files(str(smk.input))
    with open(out, "w") as op:
        for s in ss:
            h = get_md5(s)
            p = s.replace(root + "/", "")
            op.write(f"{h}  {p}\n")


main(snakemake)  # type: ignore
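`get_md5` comes from `common.io`; a minimal stand-in using only the standard library might look like this (chunked so large bgzip'ed beds never need to fit in memory):

```python
import hashlib


def get_md5(path: str) -> str:
    # stream the file in 64 KiB chunks to keep memory usage flat
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```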
from typing import Any


def main(smk: Any) -> None:
    with open(smk.output[0], "w") as f:
        for s in sorted(smk.input):
            f.write(s + "\n")


main(snakemake)  # type: ignore
import gzip
import os
import subprocess as sp
from typing import Any
from pathlib import Path


def gzip_empty(path: str) -> bool:
    with gzip.open(path, "rb") as f:
        return len(f.read(1)) == 0


def main(smk: Any) -> None:
    # ASSUME if filtered benchmark bed is empty then we have nothing to
    # benchmark because the desired chromosomes in this build are not included
    # in the benchmark
    with open(smk.log[0], "w") as f:
        if gzip_empty(smk.input["bench_bed"][0]):
            Path(smk.output[0]).touch()
            f.write("no overlapping chromosomes, writing empty dummy file\n")
        else:
            cmd = [
                "hap.py",
                *["--engine", "vcfeval"],
                "--verbose",
                *["--threads", str(smk.threads)],
                *["--stratification", smk.input["strats"][0]],
                *["-f", smk.input["bench_bed"][0]],
                *["-o", smk.params["prefix"]],
                smk.input["bench_vcf"][0],
                smk.input["query_vcf"][0],
            ]
            res = sp.run(
                cmd,
                stderr=sp.STDOUT,
                stdout=f,
                # spread os.environ first so the explicit HGREF always wins
                env={**os.environ, "HGREF": smk.input["ref"][0]},
            )
            if res.returncode != 0:
                exit(1)


main(snakemake)  # type: ignore
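The empty-file short-circuit above reads a single byte through the gzip layer; this matters because an empty `.gz` file still has header bytes on disk, so checking file size alone would not work. A small self-contained demonstration:

```python
import gzip
import os
import tempfile


def gzip_empty(path: str) -> bool:
    # an empty .gz file still has ~20 header/trailer bytes on disk, so test
    # emptiness by attempting to read one decompressed byte instead
    with gzip.open(path, "rb") as f:
        return len(f.read(1)) == 0


fd, empty = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(empty, "wb"):
    pass  # header + empty payload
print(gzip_empty(empty))  # → True
os.unlink(empty)
```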
import pandas as pd
from typing import Any, NamedTuple
from pathlib import Path
from pybedtools import BedTool as bt  # type: ignore
from os.path import dirname, basename
from os import scandir
import common.config as cfg
from common.io import setup_logging, is_bgzip
import subprocess as sp

log = setup_logging(snakemake.log[0])  # type: ignore


class GaplessBT(NamedTuple):
    auto: bt
    parY: bt


def test_bgzip(strat_file: Path) -> list[str]:
    """Each bed file should be bgzip'ed (not just gzip'ed)."""
    return [] if is_bgzip(strat_file) else ["is not bgzip file"]


def test_bed(
    strat_file: Path,
    reverse_map: dict[str, int],
    gapless: GaplessBT,
) -> list[str]:
    """Each stratification should be a valid bed file."""

    # test bed files in multiple phases, where each phase depends on the
    # previous being valid
    #
    # Phase 1: try to import the dataframe (with almost no assumptions except
    # that it has no header)
    try:
        # no low mem since we don't want to assume what dtypes each column has,
        # and by default pandas will chunk the input which could lead to mixed
        # types
        df = pd.read_table(strat_file, header=None, low_memory=False)
    except pd.errors.EmptyDataError:
        # if df is empty, nothing to do
        return []
    except Exception as e:
        # catch any other weird read errors
        return [str(e)]

    # Phase 2: ensure "bed file" has 3 columns
    if len(df.columns) != 3:
        return ["bed file has wrong number of columns"]

    # Phase 3: ensure "bed file" has column of the correct type
    cols = {"chrom": str, "start": int, "end": int}
    df.columns = pd.Index([*cols])

    try:
        df = df.astype(cols)
    except ValueError as e:
        return [str(e)]

    # Phase 4: we now know we have a real bed file, ensure that we didn't
    # invent any new chromosomes (try to convert chrom column to ints, which
    # will produce NaNs on error)
    chr_ints = df["chrom"].map(reverse_map).astype("Int64")
    invalid_chrs = df["chrom"][chr_ints.isnull()].unique().tolist()
    if len(invalid_chrs) > 0:
        return [f"invalid chrs: {invalid_chrs}"]

    # Phase 5: test remaining assumptions:
    # - chromosome column is sorted
    # - each region is disjoint and ascending (eg the start of a region is at
    #   least 1bp away from the end of the previous)
    # - does not intersect with gap regions (if any) or extend beyond chr bounds
    same_chrom = df["chrom"] == df["chrom"].shift(1)
    distance = (df["start"] - df["end"].shift(1))[same_chrom]
    overlapping_lines = distance[distance < 1].index.tolist()

    # choose gap file depending on if this is an XY strat or not; note that the
    # built-in gaps file (if applicable) is obviously whitelisted from this
    # check
    if "gaps_slop15kb" in basename(strat_file):
        no_gaps = True
    else:
        isXY = basename(dirname(strat_file)) == "XY"
        gapless_bt = gapless.parY if isXY else gapless.auto
        gaps = bt.from_dataframe(df).subtract(gapless_bt).to_dataframe()
        no_gaps = len(gaps) == 0

    return [
        msg
        for result, msg in [
            (
                pd.Index(chr_ints).is_monotonic_increasing,
                "chromosomes not sorted",
            ),
            (
                len(overlapping_lines) == 0,
                f"non-disjoint regions at lines: {overlapping_lines}",
            ),
            (no_gaps, "bed contains gap regions"),
        ]
        if result is False
    ]


def test_checksums(checksums: Path) -> list[str]:
    """The md5sum utility should pass when invoked."""
    res = sp.run(
        ["md5sum", "-c", "--strict", "--quiet", checksums.name],
        cwd=checksums.parent,
        capture_output=True,
        text=True,
    )
    errors = [] if (s := res.stdout) == "" else s.strip().split("\n")
    return [f"checksum error: {e}" for e in errors]


def test_tsv_list(tsv_list: Path) -> list[str]:
    """The final stratifications list should match final beds exactly.

    We are running our hotel on very tight margins; no extra or missing beds
    allowed.
    """

    def strat_set(root: Path, sub: Path) -> set[str]:
        if root.is_dir():
            deeper = (Path(p.path) for p in scandir(root))
            return {str(q) for d in deeper for q in strat_set(d, sub / d.name)}
        elif root.is_file() and root.name.endswith(".bed.gz"):
            return {str(sub)}
        else:
            return set()

    current = strat_set(tsv_list.parent, Path("./"))
    with open(tsv_list, "r") as f:
        listed = {line.strip().split("\t")[1] for line in f}
        missing = [f"not in final directory: {p}" for p in listed - current]
        extra = [f"not in final list: {p}" for p in current - listed]
        return missing + extra


def run_all_tests(
    strat_file: Path,
    reverse_map: dict[str, int],
    gapless: GaplessBT,
) -> list[tuple[Path, str]]:
    return [
        (strat_file, msg)
        for msg in test_bgzip(strat_file) + test_bed(strat_file, reverse_map, gapless)
    ]


def strat_files(path: str) -> list[Path]:
    with open(path, "r") as f:
        return [Path(s.strip()) for s in f]


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    reverse_map = {v: k for k, v in sconf.buildkey_to_final_chr_mapping(rk, bk).items()}

    # check global stuff first (since this is faster)

    global_failures: list[str] = test_checksums(
        Path(smk.input["checksums"])
    ) + test_tsv_list(Path(smk.input["strat_list"]))

    for j in global_failures:
        log.error(j)

    if len(global_failures) > 0:
        exit(1)

    # check individual stratification files

    gapless = GaplessBT(
        auto=bt(str(smk.input["gapless_auto"])),
        parY=bt(str(smk.input["gapless_parY"])),
    )

    strat_failures = [
        res
        for p in strat_files(str(smk.input["strats"]))
        for res in run_all_tests(p, reverse_map, gapless)
    ]

    for i in strat_failures:
        log.error("%s: %s" % i)

    if len(strat_failures) > 0:
        exit(1)


main(snakemake, snakemake.config)  # type: ignore
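Phase 5's disjointness test above is a vectorized pandas check; a toy example (regions are made up) showing how the shifted comparison flags a region whose start is not at least 1 bp past the previous end:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1", "chr1"],
        "start": [0, 11, 15],
        "end": [10, 20, 30],
    }
)
# compare each start against the previous row's end on the same chromosome
same_chrom = df["chrom"] == df["chrom"].shift(1)
distance = (df["start"] - df["end"].shift(1))[same_chrom]
overlapping_lines = distance[distance < 1].index.tolist()
print(overlapping_lines)  # → [2] (third region starts inside the second)
```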
from typing import Any
import common.config as cfg
from common.bed import read_filter_sort_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    bench = sconf.stratifications[rk].builds[bk].bench
    assert bench is not None, "this should not happen"
    ps = bench.bench_bed.params
    read_filter_sort_bed(sconf, smk.input[0], smk.output[0], ps, rk, bk)


main(snakemake, snakemake.config)  # type: ignore
import json
import pandas as pd
from typing import Any
import common.config as cfg


def read_ftbl(path: str) -> "pd.Series[str]":
    filter_cols = ["assembly_unit", "seq_type"]
    map_cols = ["chromosome", "genomic_accession"]
    df = pd.read_table(path, header=0, usecols=filter_cols + map_cols, dtype=str)
    chr_mask = df["seq_type"] == "chromosome"
    asm_mask = df["assembly_unit"] == "Primary Assembly"
    return (
        df[chr_mask & asm_mask][map_cols]
        .drop_duplicates()
        .copy()
        .set_index("chromosome")["genomic_accession"]
    )


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    cis = sconf.buildkey_to_chr_indices(rk, bk)

    # if this fails then something is wrong with the ftbl (eg it doesn't have
    # a complete set of chromosomes)
    ser = read_ftbl(smk.input[0])
    mapper = {ser[i.chr_name]: i.value for i in cis}

    with open(smk.output[0], "w") as f:
        json.dump(mapper, f)


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
from pathlib import Path
import common.config as cfg
from pybedtools import BedTool as bt  # type: ignore
from common.bed import filter_sort_bed, read_bed, write_bed
import pandas as pd


def read_genome_bed(p: Path) -> pd.DataFrame:
    df = pd.read_table(
        p,
        header=None,
        names=["chrom", "end"],
        dtype={"chrom": str, "end": int},
    )
    df["start"] = 0
    return df[["chrom", "start", "end"]].copy()


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    inputs = smk.input
    auto_out = Path(smk.output["auto"])
    parY_out = Path(smk.output["parY"])

    # convert genome to bed file (where each region is just the length of
    # one chromosome)
    genome_bed = read_genome_bed(Path(inputs["genome"]))

    # If we have gap input, make the gapless file, otherwise just symlink to the
    # genome bed file (which just means the entire genome is gapless)
    if hasattr(inputs, "gaps"):
        rk = cfg.RefKey(smk.wildcards["ref_key"])
        bk = cfg.BuildKey(smk.wildcards["build_key"])

        bedfile = sconf.stratifications[rk].gap
        assert bedfile is not None, "this should not happen"
        ps = bedfile.params

        conv = sconf.buildkey_to_chr_conversion(rk, bk, ps.chr_pattern)

        gaps_src = Path(inputs["gaps"])

        gaps = read_bed(gaps_src, ps)
        gaps = filter_sort_bed(conv, gaps)
        gaps = bt.from_dataframe(gaps).merge(d=100).to_dataframe()
        gaps_bed = bt().from_dataframe(gaps)
        gaps_with_parY = bt().from_dataframe(genome_bed).subtract(gaps_bed)

        # If we have a parY bed, subtract parY from the gaps bed, otherwise
        # just link them since we have nothing to subtract off
        if hasattr(inputs, "parY"):
            parY_src = Path(inputs["parY"])
            gaps_no_parY = gaps_with_parY.subtract(bt(parY_src))

            write_bed(parY_out, gaps_with_parY.to_dataframe())
            write_bed(auto_out, gaps_no_parY.to_dataframe())
        else:
            # write the complement (the gapless regions), not the gaps
            write_bed(auto_out, gaps_with_parY.to_dataframe())
            parY_out.symlink_to(auto_out.resolve())
    else:
        write_bed(auto_out, genome_bed)
        parY_out.symlink_to(auto_out.resolve())


main(snakemake, snakemake.config)  # type: ignore
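`read_genome_bed` turns a two-column `.genome` file (chromosome, length) into whole-chromosome BED records; a small self-contained demonstration (the chromosome lengths below are illustrative):

```python
from io import StringIO

import pandas as pd

genome = StringIO("chr1\t248956422\nchr2\t242193529\n")

df = pd.read_table(
    genome,
    header=None,
    names=["chrom", "end"],
    dtype={"chrom": str, "end": int},
)
df["start"] = 0  # every region spans its chromosome from the origin
bed = df[["chrom", "start", "end"]]
print(bed.values.tolist())
# → [['chr1', 0, 248956422], ['chr2', 0, 242193529]]
```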
from typing import Any
import pandas as pd
import common.config as cfg
from common.bed import filter_sort_bed_inner


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])

    to_map = sconf.buildkey_to_final_chr_mapping(rk, bk)
    # since this is generated from the reference itself, the chr -> int map
    # is just the reverse of the final int -> chr map
    from_map = {v: k for k, v in to_map.items()}

    # ASSUME the input for this is a .fa.fai file (columns = chr, length)
    df = pd.read_table(
        smk.input[0],
        header=None,
        dtype={0: str, 1: int},
        usecols=[0, 1],
    )

    filtered = filter_sort_bed_inner(from_map, to_map, df, 2)
    filtered.to_csv(smk.output[0], sep="\t", header=False, index=False)


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
import common.config as cfg
from common.bed import read_filter_sort_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    bedfile = sconf.refkey_to_strat(rk).segdups.superdups
    assert bedfile is not None, "this should not happen"
    read_filter_sort_bed(sconf, smk.input[0], smk.output[0], bedfile.params, rk, bk)


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
import common.config as cfg
from pybedtools import BedTool as bt  # type: ignore
from common.bed import read_bed, filter_sort_bed, write_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    level = smk.params["level"]
    i = cfg.ChrIndex.from_name(smk.wildcards["sex_chr"])

    assert i in [cfg.ChrIndex.CHRX, cfg.ChrIndex.CHRY], "invalid sex chr"

    fs = sconf.refkey_to_strat(rk).xy.features
    assert fs is not None, "this should not happen"
    bedfile = fs.x_bed if i is cfg.ChrIndex.CHRX else fs.y_bed
    ps = bedfile.params

    level_col = bedfile.level_col
    conv = sconf.buildkey_to_chr_conversion(rk, bk, ps.chr_pattern)

    df = read_bed(smk.input["bed"], ps, [level_col])
    filtsort = filter_sort_bed(conv, df)
    filtsort = filtsort[filtsort[level_col].str.contains(level)].drop(
        columns=[level_col]
    )
    gapless = bt(smk.input["gapless"])
    nogaps = (
        bt.from_dataframe(filtsort)
        .intersect(b=gapless, sorted=True, g=str(smk.input["genome"]))
        .to_dataframe()
    )
    write_bed(smk.output[0], nogaps)


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
from Bio import bgzf  # type: ignore
import common.config as cfg


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    i = cfg.ChrIndex.from_name(smk.wildcards["sex_chr"])
    k = cfg.RefKey(smk.wildcards["ref_key"])
    pattern = sconf.refkey_to_final_chr_pattern(k)
    cxy = sconf.stratifications[k].xy

    assert i in [cfg.ChrIndex.CHRX, cfg.ChrIndex.CHRY], "invalid sex chr"

    par_fun = cxy.fmt_x_par if i == cfg.ChrIndex.CHRX else cxy.fmt_y_par
    out = par_fun(pattern)
    assert out is not None, "this should not happen"

    with bgzf.BgzfWriter(smk.output[0], "w") as f:
        f.write(out)


main(snakemake, snakemake.config)  # type: ignore
library(tidyverse)

subsets <- snakemake@params$subsets

read_summary <- function(path) {
  build <- path %>%
    dirname() %>%
    dirname() %>%
    dirname() %>%
    basename()
  readr::read_csv(
    path,
    col_types = cols(
      Type = "c",
      Subtype = "c",
      Subset = "c",
      Filter = "c",
      METRIC.Recall = "d",
      METRIC.Precision = "d",
      METRIC.Frac_NA = "d",
      TRUTH.TP = "d",
      TRUTH.TP.het = "d",
      TRUTH.TP.homalt = "d",
      .default = "-"
    )
  ) %>%
    filter(Filter == "PASS" &
             Subtype %in% c("*", "I16_PLUS", "D16_PLUS"),
             Subset %in% subsets)  %>%
    rename(Recall = METRIC.Recall,
           Precision = METRIC.Precision,
           Frac_NA = METRIC.Frac_NA) %>%
    mutate("-log10(FN_Rate)" = -log10(1 - Recall),
           "-log10(FP_Rate)" = -log10(1 - Precision),
           build = build) %>%
    mutate(across(contains("TRUTH.TP"), log10)) %>%
    rename_with(~ sprintf("log10(%s)", .x), contains("TRUTH.TP")) %>%
    select(-Filter, -starts_with("TRUTH.TP")) %>%
    pivot_longer(cols = c(-Type, -Subtype, -Subset, -build),
                 names_to = "metric",
                 values_to = "value") %>%
    mutate(metric = fct_relevel(
      factor(metric),
      "Precision",
      "Recall",
      "Frac_NA",
      after = 0
    )) %>%
    filter(!is.na(value))
}

df <- snakemake@input %>%
  keep(~ file.size(.x) > 0) %>%
  map_dfr(read_summary)

nsubs <- df %>%
  pull(Subset) %>%
  unique() %>%
  length()

nbuilds <- df %>%
  pull(build) %>%
  unique() %>%
  length()

knitr::opts_chunk$set(
  echo = TRUE,
  fig.width=10,
  fig.height=(3 + nsubs * nbuilds * 0.5)
)

plot_metric <- function(df) {
  p <- ggplot(df, aes(value, fct_rev(Subset), fill = build)) +
    geom_col(position = "dodge") +
    facet_wrap(c("metric"), scales = "free_x", ncol = 2) +
    labs(x = NULL,
         y = NULL,
         fill = "Build") +
    theme(legend.position = "top")
  print(p)
  cat("\n\n")
}
df %>%
  filter(Type == "SNP") %>%
  plot_metric()
df %>%
  filter(Type == "INDEL" & Subtype == "*") %>%
  plot_metric()
df %>%
  filter(Type == "INDEL" & Subtype == "I16_PLUS") %>%
  plot_metric()
df %>%
  filter(Type == "INDEL" & Subtype == "D16_PLUS") %>%
  plot_metric()
library(tidyverse)

knitr::opts_chunk$set(fig.width=25, fig.height=6, fig_caption=TRUE)

chr_mapper <- readr::read_tsv(
  snakemake@input$chr_mapper,
  col_names = c("chrom_index", "build_key", "chrom"),
  col_types = "icc"
) %>%
  arrange(chrom_index, build_key)

bed_cols <- c("chrom", "start", "end")

read_bed <- function(path) {
  readr::read_tsv(path,
                  comment = "#",
                  col_types = "cii",
                  col_names = bed_cols) %>%
  mutate(region_size = end - start)
}

read_nonN <- function(path) {
  read_bed(path) %>%
    mutate(build_key = path %>% dirname() %>% dirname() %>% basename()) %>%
    left_join(chr_mapper, by = c("chrom", "build_key")) %>%
    select(-chrom)
}

read_strat <- function(path) {
  strat_filename <- basename(path)
  read_bed(path) %>%
    group_by(chrom) %>%
    summarise(total_region = sum(region_size), .groups = "drop") %>%
    mutate(build_key = path %>% dirname() %>% dirname() %>% basename(),
           ## genome = str_extract(strat_filename, "^[^_]+_(HG[0-9]+)_.bed.gz", 1),
           ## strat_level = str_extract(strat_filename, 
           ##                           if_else(is.na(genome),
           ##                                   "^[^_]+_(.*).bed.gz",
           ##                                   "^[^_]+_HG[0-9]+_(.*).bed.gz"),
           ##                           1),
           strat_name = str_extract(strat_filename, "^[^_]+_(.*).bed.gz", 1)) %>%
    left_join(chr_mapper, by = c("chrom", "build_key")) %>%
    select(-chrom)
}

df_chr_sizes <- snakemake@input$nonN %>%
  map_dfr(read_nonN) %>%
  group_by(chrom_index, build_key) %>%
  summarise(nonN_ref_len = sum(region_size), .groups = "drop")

chr_build_combinations <- chr_mapper %>%
  expand(build_key, chrom_index)

file_list <- snakemake@input$strats %>%
  map(readLines) %>%
  flatten()

chr_name <- function(i) {
  case_when(i == 23 ~ "X",
            i == 24 ~ "Y",
            0 < i & i < 23 ~ as.character(i),
            TRUE ~ "oops")
}

chr_levels <- chr_name(1:24)

read_strats <- function(stype, gaps) {
  fs <- keep(file_list, ~ basename(dirname(.x)) == stype)
  if (length(fs) > 0) {
    df <- map_dfr(fs, read_strat)
    all_combos <- chr_build_combinations %>%
      expand(nesting(chrom_index, build_key), strat_name = df$strat_name)
    df %>%
      # join the full cartesian product of all chromosomes and paths, then fill
      # missing with 0 so that we get an empty space on the graphs indicating
      # zero coverage
      full_join(all_combos, by = c("chrom_index", "strat_name", "build_key")) %>%
      left_join(df_chr_sizes, by = c("chrom_index", "build_key")) %>%
      mutate(strat_type = stype,
             coverage = total_region / nonN_ref_len,
             chrom = factor(chr_name(chrom_index), levels = chr_levels)) %>%
      replace_na(list(coverage = 0))
  } else {
    NULL
  }
}

coverage_plot <- function(df, level) {
  cat(sprintf("#### %s\n\n", level))
  p <- df %>%
    ggplot(aes(chrom, coverage, fill = build_key)) +
    geom_col(position = "dodge") +
    scale_y_continuous(labels = scales::percent) +
    theme(legend.position = "top",
          text = element_text(size = 30),
          axis.text.x = element_text(angle = 90),
          strip.text.y = element_text(angle = 0),
          panel.spacing.y = unit(2, "lines")) +
    labs(x = NULL,
         fill = "build")
  print(p)
  cat("\n\n")
}

coverage_plots <- function(stype, gaps) {
  cat(sprintf("## %s\n\n", stype))
  df <- read_strats(stype, gaps)
  if (is.null(df)) {
    cat("None\n\n")
  } else {
    df %>%
      group_by(strat_name) %>%
      group_walk(coverage_plot)
  }
}
snakemake@params$core_levels %>%
  union(snakemake@params$other_levels) %>%
  walk(coverage_plots)
Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/ndwarshuis/giab-stratifications
Name: giab-stratifications
Version: 1
Copyright: Public Domain
License: None