Pipeline for generating official Genome-in-a-Bottle stratifications
This repository contains the pipeline and configuration used for generating and evaluating the Genome-in-a-Bottle stratifications. The stratification files were developed with the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team, the Genome in a Bottle Consortium (GIAB), and the Telomere-to-Telomere Consortium (T2T). They are intended as a standard resource of BED files for stratifying true positive, false positive, and false negative variant calls in challenging and targeted regions of the genome.
The stratification BED files are accessible on the GIAB FTP site. These files can be used as a standard resource of BED files with GA4GH benchmarking tools such as hap.py.
Note that this pipeline applies to stratification versions starting with v3.2. Older pipelines and scripts can be found here.
Within this repository, further details (including all source files) can be found in the configuration at config/all.yml. The pipeline itself is maintained in a separate, reference-agnostic repository here. Please refer to that pipeline for specific methodological information regarding how these files are produced.
Currently, this pipeline is set up to build stratifications for the following references:
- GRCh37
- GRCh38
- CHM13
Versioning
Version tags follow the format X.Y-Z, where X is incremented for removal of old stratifications and/or large deletions, Y is incremented for additions and minor revisions, and Z is incremented for documentation updates in this repository. Note that X.Y (without Z) corresponds to the versions located on the FTP site.
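As an illustration of this scheme, a version tag can be split mechanically into its components (this helper is hypothetical and not part of the pipeline):

```python
import re

def parse_version_tag(tag: str) -> tuple[int, int, int]:
    """Split a tag like '3.2-1' into its X, Y, Z components."""
    m = re.fullmatch(r"(\d+)\.(\d+)-(\d+)", tag)
    if m is None:
        raise ValueError(f"not a valid version tag: {tag}")
    x, y, z = (int(g) for g in m.groups())
    return x, y, z

# The FTP site carries only the X.Y part of the tag
x, y, z = parse_version_tag("3.2-1")
ftp_version = f"{x}.{y}"  # -> "3.2"
```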
For those looking for precise differences between each stratification version, see the stratdiff tool.
General Information
Author Information
- Principal Investigator: Justin Zook, NIST, jzook@nist.gov
- Nate Olson, NIST, nathanael.olson@nist.gov
- Justin Wagner, NIST, justin.wagner@nist.gov
- Jennifer McDaniel, NIST, jennifer.mcdaniel@nist.gov
- Nate Dwarshuis, NIST, nathan.dwarshuis@nist.gov
Sharing/Access Information
Licenses/restrictions placed on the data, or limitations of reuse: Publicly released data are freely available for reuse without embargo.
Citations for stratifications are located in the associated READMEs.
If stratifications were used in benchmarking with GA4GH/GIAB best practices or hap.py, please reference:
Krusche, P., Trigg, L., Boutros, P.C. et al.
Best practices for benchmarking germline small-variant calls in human genomes.
Nat Biotechnol 37, 555-560 (2019). https://doi.org/10.1038/s41587-019-0054-x
Links to publicly accessible locations of the data:
- Individual stratification BED files as well as zipped directories (tar.gz) of files
- stratification READMEs
- .tsvs for benchmarking with hap.py
- MD5 checksums
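For orientation, the .tsv files above are supplied to hap.py via its --stratification option; each .tsv row maps a stratification name to a BED path. A sketch of such an invocation (all file names here are placeholders):

```shell
# Benchmark a query VCF against a GIAB truth set, reporting performance
# within each stratification listed in the .tsv (paths are hypothetical)
hap.py truth.vcf.gz query.vcf.gz \
    -f truth_confident_regions.bed \
    -r GRCh38.fa \
    -o happy_output_prefix \
    --stratification GRCh38-all-stratifications.tsv
```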
Summary Of Stratifications
Stratifications can be binned into several types: Low Complexity, Functional Technically Difficult, Genome Specific, Functional Regions, GC Content, Mappability, Other Difficult, Segmental Duplications, Union, Ancestry, Telomeres, and XY. General information for each stratification type is provided below.
These categories are present for all indicated references unless otherwise noted.
Low Complexity
Regions with different types and sizes of low complexity sequence, e.g., homopolymers, STRs, VNTRs and other locally repetitive sequences.
Segmental Duplications
Regions with segmental duplications (generally defined as repeated regions >1kb with >90% similarity).
XY
Chromosome X and Y specific regions, such as PAR, XTR, or ampliconic regions.
Functional Regions
Regions to stratify variants inside and outside coding regions.
GC Content
Regions with different ranges (%) of GC content.
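To make the binning concrete, the GC content of a window is simply the fraction of G/C bases in it. A minimal sketch (the pipeline itself computes this with seqtk, not this function):

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in a sequence window (case-insensitive)."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)

# A fixed-size window would be assigned to a stratification bin by its GC%
pct = gc_fraction("ATGCGCATTA") * 100  # -> 40.0
```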
Mappability
Regions where short read mapping can be challenging.
Telomeres
Telomeric regions. These are included for CHM13 only.
Union
Regions with different general types of difficult regions or any type of difficult region or complex variant. For example, performance can be measured in just "easy" or "all difficult" regions of the genome.
Ancestry
Regions with inferred patterns of local ancestry.
These only exist for GRCh38.
Genome Specific (GIAB benchmark v4.2.1)
Difficult regions due to potentially difficult variation in a NIST/GIAB sample, including: 1) regions containing putative compound heterozygous variants, 2) small regions containing multiple phased variants, and 3) regions with potential structural or copy number variation.
These only exist for GRCh37 and GRCh38.
Functional Technically Difficult
Functional, or potentially functional, regions that are also likely to be technically difficult to sequence.
These only exist for GRCh37 and GRCh38.
Other Difficult
These are regions that are difficult to sequence for miscellaneous reasons. In the case of GRCh37/GRCh38, these include regions near gaps or errors in the reference, or regions known to be expanded or collapsed relative to most other genomes.
This category also includes highly polymorphic regions such as those coding the major histocompatibility complex (MHC), the VDJ regions, and the killer-cell immunoglobulin-like receptor (KIR) genes; their highly polymorphic nature makes them difficult to sequence.
Finally, in the case of CHM13, this category also includes the rDNA arrays.
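Regardless of category, every stratification is a plain BED file, so classifying a variant call as inside or outside a stratification is just an interval lookup. A self-contained sketch with hypothetical inputs (tools like hap.py or bedtools do this at scale):

```python
def load_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples (0-based, half-open)."""
    regions = []
    for line in lines:
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        regions.append((chrom, int(start), int(end)))
    return regions

def in_strat(chrom, pos, regions):
    """True if a 0-based variant position falls in any stratification region."""
    return any(c == chrom and s <= pos < e for c, s, e in regions)

strat = load_bed(["chr1\t100\t200", "chr2\t0\t50"])
assert in_strat("chr1", 150, strat)       # inside the first interval
assert not in_strat("chr1", 250, strat)   # outside all intervals
```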
Data-Use Policy
This data/work was created by employees of the National Institute of Standards and Technology (NIST), an agency of the Federal Government. Pursuant to title 17 United States Code Section 105, works of NIST employees are not subject to copyright protection in the United States. This data/work may be subject to foreign copyright.
The data/work is provided by NIST as a public service and is expressly provided AS IS. NIST MAKES NO WARRANTY OF ANY KIND, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT AND DATA ACCURACY. NIST does not warrant or make any representations regarding the use of the data or the results thereof, including but not limited to the correctness, accuracy, reliability or usefulness of the data. NIST SHALL NOT BE LIABLE AND YOU HEREBY RELEASE NIST FROM LIABILITY FOR ANY INDIRECT, CONSEQUENTIAL, SPECIAL, OR INCIDENTAL DAMAGES (INCLUDING DAMAGES FOR LOSS OF BUSINESS PROFITS, BUSINESS INTERRUPTION, LOSS OF BUSINESS INFORMATION, AND THE LIKE), WHETHER ARISING IN TORT, CONTRACT, OR OTHERWISE, ARISING FROM OR RELATING TO THE DATA (OR THE USE OF OR INABILITY TO USE THIS DATA), EVEN IF NIST HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
To the extent that NIST may hold copyright in countries other than the United States, you are hereby granted the non-exclusive irrevocable and unconditional right to print, publish, prepare derivative works and distribute the NIST data, in any medium, or authorize others to do so on your behalf, on a royalty-free basis throughout the world.
You may improve, modify, and create derivative works of the data or any portion of the data, and you may copy and distribute such modifications or works. Modified works should carry a notice stating that you changed the data and should note the date and nature of any such change. Please explicitly acknowledge the National Institute of Standards and Technology as the source of the data. Data citation recommendations are provided at https://www.nist.gov/open/license.
Permission to use this data is contingent upon your acceptance of the terms of this agreement and upon your providing appropriate acknowledgments of NIST’s creation of the data/work.
Running this pipeline
This repository exists on GitHub primarily for documentation purposes, as the runtime is defined in terms of NIST-specific resources.
However, for those wishing to run the pipeline themselves, there are two options (with or without a cluster). Both require cloning this repository with submodules and setting up the conda environment:
git clone --recurse-submodules https://github.com/ndwarshuis/giab-stratifications.git
cd giab-stratifications
conda env create -f env.yml
conda activate giab-strats-production
No Cluster Environment
Just call the pipeline directly using snakemake:
snakemake --cores 20 --configfile=config/all.yml all
Note that most rules in the pipeline require 12MB of memory or less, with the exception of mappability (which runs GEM and requires ~32GB for the references used here).
Cluster Environment
First, create a profile from the one located at workflow/profiles/nisaba/config.yml (which is designed for a NIST-specific cluster running Slurm).
Then run snakemake with this profile:
snakemake --cores 20 --configfile=config/all.yml --profile=workflow/profiles/yerprofile all
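For orientation, a Snakemake profile is an ordinary config.yml of default command-line options. A minimal Slurm-flavored sketch (all values here are hypothetical; adapt the submission string and limits to your scheduler, and see the NIST profile above for the real settings):

```
# workflow/profiles/yerprofile/config.yml (hypothetical)
cluster: "sbatch --mem={resources.mem_mb} --cpus-per-task={threads}"
jobs: 50
latency-wait: 60
restart-times: 1
```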
Code Snippets
script: "../scripts/python/bedtools/functional/filter_cds.py"

shell:
    """
    mergeBed -i {input.bed} | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """

shell:
    """
    complementBed -i {input.bed} -g {input.genome} | \
    intersectBed -a stdin -b {input.gapless} -sorted | \
    bgzip -c > {output}
    """

shell:
    """
    seqtk gc {params.args} -l 100 {input.ref} | \
    cut -f1-3 | \
    slopBed -i stdin -g {input.genome} -b 50 | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c \
    > {output}
    """
script: "../scripts/python/bedtools/gc/intersect_ranges.py"

shell:
    """
    gunzip -c {input.ref} | \
    sed 's/^[A-Za-z]\\+$/\\U&/' | \
    {input.bin} {wildcards.unit_len} {wildcards.total_len} - 2> {log} \
    > {output}
    """

shell:
    "sed -n '/unit=[{wildcards.bases}]\+/p' {input} > {output}"

shell:
    "subtractBed -a {input.a} -b {input.b} > {output}"

shell:
    "subtractBed -a {input.a} -b {input.b} > {output}"

shell:
    """
    slopBed -i {input} -b 5 -g {params.genome} | \
    cut -f1-3 | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {params.gapless} -sorted -g {params.genome} | \
    bgzip -c \
    > {output}
    """

shell:
    """
    grep 'unit={wildcards.base}' {input} | \
    mergeBed -i stdin -d 1 | \
    awk '$3-$2>={wildcards.merged_len}' > \
    {output}
    """

shell:
    """
    multiIntersectBed -i {input} | \
    slopBed -i stdin -b 5 -g {params.genome} | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {params.gapless} -sorted | \
    bgzip -c > {output}
    """
script: "../scripts/python/bedtools/low_complexity/filter_sort_trf.py"

shell:
    """
    gunzip -c {input} | \
    mergeBed -i stdin | \
    bgzip -c > {output}
    """

script: "../scripts/python/bedtools/low_complexity/filter_sort_rmsk.py"

shell:
    """
    gunzip -c {input.rmsk} | \
    grep {wildcards.rmsk_class} | \
    mergeBed -i stdin | \
    bgzip -c > {output}
    """

script: "../scripts/python/bedtools/low_complexity/filter_sort_censat.py"

shell:
    "mergeBed -i {input} | bgzip -c > {output}"
shell:
    """
    slopBed -i {input.bed} -b 5 -g {input.genome} | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """

shell:
    """
    complementBed -i {input.bed} -g {params.genome} |
    intersectBed -a stdin -b {params.gapless} -sorted | \
    bgzip -c > {output}
    """

shell:
    """
    slopBed -i {input.perfect} -b 5 -g {input.genome} | \
    mergeBed -i stdin | \
    multiIntersectBed -i stdin {input.imperfect} | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c \
    > {output}
    """

shell:
    """
    multiIntersectBed -i {input.beds} | \
    slopBed -i stdin -b 5 -g {input.genome} | \
    mergeBed -i stdin > {output}
    """

shell:
    """
    awk '({params.lower}+10)<=($3-$2) && ($3-$2)<({params.upper}+10)' {input.tr} | \
    subtractBed -a stdin -b {input.hp} | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """

shell:
    """
    multiIntersectBed -i {input} | \
    mergeBed -i stdin | \
    bgzip -c > {output}
    """
shell:
    "curl -sS -L -o {output} {params.url}"

shell:
    """
    mkdir -p {config.tools_bin_dir} && \
    tar xjf {input} \
    --directory {config.tools_bin_dir} \
    --strip-components=2 \
    {params.bins}
    """

script: "../scripts/python/bedtools/mappability/filter_ref.py"

shell:
    """
    PATH={config.tools_bin_dir}:$PATH {input.bin} \
    --complement emulate \
    -T {threads} \
    -i {input.fa} \
    -o {params.base} > {log} 2>&1
    """

shell:
    """
    {input.bin} \
    -m {wildcards.m} \
    -e {wildcards.e} \
    -l {wildcards.l} \
    -T {threads} \
    -I {input.fa} \
    -o {params.base} > {log} 2>&1
    """

shell:
    """
    {input.bin} \
    -I {input.idx} \
    -i {input.map} \
    -o {params.base} 2> {log}
    """

shell:
    """
    sed 's/ AC//' {input} | \
    wig2bed -d | \
    awk '$5>0.9' | \
    cut -f1-3 | \
    gzip -c > \
    {output}
    """
script: "../scripts/python/bedtools/mappability/merge_nonunique.py"

shell:
    """
    complementBed -i {input} -g {params.genome} | \
    intersectBed -a stdin -b {params.gapless} -sorted -g {params.genome} | \
    bgzip -c > {output}
    """

shell:
    """
    complementBed -i {input.gapless} -g {input.genome} | \
    slopBed -i stdin -b 15000 -g {input.genome} | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """

script: "../scripts/python/bedtools/otherdifficult/filter_vdj.py"

shell:
    """
    mergeBed -i {input.bed} | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
script: "../scripts/python/bedtools/other/filter_sort_other.py"

script: "../scripts/python/bedtools/postprocess/list_strats.py"

script: "../scripts/python/bedtools/postprocess/list_md5.py"

script: "../scripts/python/bedtools/postprocess/generate_tsv.py"

script: "../scripts/python/bedtools/postprocess/run_unit_tests.py"

shell:
    """
    curl -fsSq -o {params.tar} {params.url} && \
    mkdir {output} && \
    tar xzf {params.tar} --directory {output} --strip-components=1
    """

script: "../scripts/python/bedtools/postprocess/diff_previous_strats.py"

run:
    with open(output[0], "w") as f:
        for ref_key, build_key in zip(rks, bks):
            for i in config.buildkey_to_chr_indices(ref_key, build_key):
                pattern = config.refkey_to_final_chr_pattern(ref_key)
                line = [
                    str(i.value),
                    f"{ref_key}@{build_key}",
                    i.chr_name_full(pattern),
                ]
                f.write("\t".join(line) + "\n")
script: "../scripts/rmarkdown/rmarkdown/validate.Rmd"

shell:
    "gunzip -c {input} > {output}"

shell:
    """
    samtools faidx {input}
    """

script: "../scripts/python/bedtools/postprocess/run_happy.py"

script: "../scripts/rmarkdown/rmarkdown/benchmark.Rmd"

shell:
    """
    cp {input.main} {output.main}
    cp {input.validation} {output.validation}
    """

shell:
    """
    tar czf {output} -C {params.parent} {params.target}
    """

shell:
    """
    find {params.root} -type f -exec md5sum {{}} + | \
    sed 's|{params.root}|\.|' | \
    grep -v "\./genome-stratifications-md5s\.txt" | \
    sort -k2,2 > {output} && \
    cd {params.root} && \
    md5sum -c --quiet genome-stratifications-md5s.txt
    """
script: "../scripts/python/bedtools/xy/write_par.py"

script: "../scripts/python/bedtools/misc/get_file.py"

shell:
    """
    gunzip -c {input} | \
    samtools faidx - -o - \
    2> {log} > {output}
    """

script: "../scripts/python/bedtools/ref/get_genome.py"

shell:
    """
    samtools faidx {input.fa} $(cut -f1 {input.genome} | tr '\n' ' ') 2> {log} | \
    bgzip -c > {output}
    """

script: "../scripts/python/bedtools/ref/get_gapless.py"

script: "../scripts/python/bedtools/ref/get_ftbl_mapper.py"

shell:
    """
    gunzip -c {input} | \
    grep -v '^#' | \
    awk -F "\t" 'OFS="\t" {{ print $1,$4,$5,$2,$3,$9 }}' | \
    bgzip -c > {output}
    """

script: "../scripts/python/bedtools/ref/filter_sort_bench_bed.py"

shell:
    "curl -sS -L -o {output} {params.url}"

shell:
    """
    mkdir {output} && \
    tar xzf {input} --directory {output} --strip-components=1
    """

shell:
    "make -C {input} 2>&1 > {log} && mv {input}/repseq {output}"
script: "../scripts/python/bedtools/segdups/filter_sort_superdups.py"

shell:
    """
    mergeBed -i {input.bed} -d 100 | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """

shell:
    """
    gunzip -c {input} | \
    awk '($3-$2 > 10000)' | \
    bgzip -c > {output}
    """

shell:
    """
    complementBed -i {input.bed} -g {input.genome} | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """

shell:
    """
    seqtk telo {input.ref} 2> {log} | \
    cut -f1-3 | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c \
    > {output}
    """

shell:
    """
    multiIntersectBed -i {input} | \
    mergeBed -i stdin | \
    bgzip -c \
    > {output}
    """

shell:
    """
    complementBed -i {input.bed} -g {params.genome} |
    intersectBed -a stdin -b {params.gapless} -sorted -g {params.genome} | \
    bgzip -c \
    > {output}
    """

shell:
    """
    intersectBed -a {input.bed} -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """

script: "../scripts/python/bedtools/xy/filter_sort_features.py"

shell:
    """
    complementBed -i {input.bed} -g {input.genome} | \
    grep {wildcards.sex_chr} | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """

shell:
    """
    awk -v OFS='\t' {{'print $1,\"0\",$2'}} {input.bed} | \
    grep -v \"X\|Y\" | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
import json
from typing import Any

import common.config as cfg
from common.bed import filter_sort_bed_inner, read_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    to_map = sconf.buildkey_to_final_chr_mapping(rk, bk)

    with open(smk.input["mapper"], "r") as f:
        from_map = json.load(f)

    df = read_bed(smk.input["bed"], more=[3, 4])
    df = df[df[3].str.contains("RefSeq") & (df[4] == "CDS")][[0, 1, 2]].copy()
    df = filter_sort_bed_inner(from_map, to_map, df)
    df.to_csv(
        smk.output[0],
        header=False,
        index=False,
        sep="\t",
    )


main(snakemake, snakemake.config)  # type: ignore
from pathlib import Path
from typing import Any, NamedTuple, Callable
import subprocess as sp
import common.config as cfg
import json

# use subprocess here because I don't see an easy way to stream a bedtools
# python object to a bgzip file (it only seems to do gzip...oh well)


class GCInput(NamedTuple):
    bed: Path
    fraction: int
    is_range_bound: bool


def write_simple_range_beds(
    final_path: Callable[[str], Path],
    gs: list[GCInput],
    is_low: bool,
) -> list[str]:
    def fmt_out(bigger_frac: int, smaller_frac: int) -> Path:
        lower_frac, upper_frac = (
            (smaller_frac, bigger_frac) if is_low else (bigger_frac, smaller_frac)
        )
        return final_path(f"gc{lower_frac}to{upper_frac}_slop50")

    pairs = zip(gs[1:], gs[:-1]) if is_low else zip(gs[:-1], gs[1:])
    torun = [
        (fmt_out(bigger.fraction, smaller.fraction), bigger, smaller)
        for bigger, smaller in pairs
    ]
    for out, bigger, smaller in torun:
        with open(out, "wb") as f:
            p0 = sp.Popen(
                ["subtractBed", "-a", bigger.bed, "-b", smaller.bed],
                stdout=sp.PIPE,
            )
            p1 = sp.run(["bgzip", "-c"], stdin=p0.stdout, stdout=f)
            p0.wait()
            if not (p0.returncode == p1.returncode == 0):
                exit(1)
    return [str(t[0]) for t in torun]


def write_middle_range_bed(
    final_path: Callable[[str], Path],
    lower: GCInput,
    upper: GCInput,
    genome: Path,
    gapless: Path,
) -> str:
    out = final_path(f"gc{lower.fraction}to{upper.fraction}_slop50")
    with open(out, "wb") as f:
        p0 = sp.Popen(
            ["complementBed", "-i", upper.bed, "-g", genome],
            stdout=sp.PIPE,
        )
        p1 = sp.Popen(
            ["subtractBed", "-a", "stdin", "-b", lower.bed],
            stdin=p0.stdout,
            stdout=sp.PIPE,
        )
        p2 = sp.Popen(
            ["intersectBed", "-a", "stdin", "-b", gapless, "-sorted", "-g", genome],
            stdin=p1.stdout,
            stdout=sp.PIPE,
        )
        p3 = sp.run(["bgzip", "-c"], stdin=p2.stdout, stdout=f)
        p0.wait()
        p1.wait()
        p2.wait()
        if not (p0.returncode == p1.returncode == p2.returncode == p3.returncode == 0):
            exit(1)
    return str(out)


def write_intersected_range_beds(
    final_path: Callable[[str], Path],
    low: list[GCInput],
    high: list[GCInput],
) -> list[str]:
    pairs = zip(
        [x for x in low if x.is_range_bound],
        [x for x in reversed(high) if x.is_range_bound],
    )
    torun = [
        (i, final_path(f"gclt{b1.fraction}orgt{b2.fraction}_slop50"), b1, b2)
        for i, (b1, b2) in enumerate(pairs)
    ]
    for i, bed_out, b1, b2 in torun:
        with open(bed_out, "wb") as f:
            p0 = sp.Popen(
                ["multiIntersectBed", "-i", b1.bed, b2.bed],
                stdout=sp.PIPE,
            )
            p1 = sp.Popen(
                ["mergeBed", "-i", "stdin"],
                stdin=p0.stdout,
                stdout=sp.PIPE,
            )
            p2 = sp.run(["bgzip", "-c"], stdin=p1.stdout, stdout=f)
            p0.wait()
            p1.wait()
            if not (p0.returncode == p1.returncode == p2.returncode == 0):
                exit(1)
    return [str(t[1]) for t in torun]


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards.ref_key)
    bk = cfg.BuildKey(smk.wildcards.build_key)
    gps = sconf.buildkey_to_include(rk, bk).gc
    assert gps is not None, "this should not happen"
    # ASSUME both of these input lists are sorted by GC fraction
    low = [GCInput(p, f, r) for p, (f, r) in zip(smk.input.low, gps.low)]
    high = [GCInput(p, f, r) for p, (f, r) in zip(smk.input.high, gps.high)]
    genome = Path(smk.input.genome)
    gapless = Path(smk.input.gapless)

    def final_path(name: str) -> Path:
        p = Path(str(smk.params.path_pattern).format(name))
        p.parent.mkdir(exist_ok=True, parents=True)
        return p

    low_strats = write_simple_range_beds(final_path, low, True)
    high_strats = write_simple_range_beds(final_path, high, False)
    range_strat = write_middle_range_bed(
        final_path,
        low[-1],
        high[0],
        genome,
        gapless,
    )
    inter_strats = write_intersected_range_beds(
        final_path,
        low,
        high,
    )

    with open(smk.output[0], "w") as f:
        # put the first low and last high input here since these are already
        # in the final directory
        obj = {
            "gc_ranges": [
                low[0][0],
                *low_strats,
                range_strat,
                *high_strats,
                high[-1][0],
            ],
            "widest_extreme": inter_strats[0],
            "other_extremes": inter_strats[1:],
        }
        json.dump(obj, f, indent=4)


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
import common.config as cfg
from common.bed import read_bed, filter_sort_bed, write_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    bedfile = sconf.refkey_to_strat(rk).low_complexity.satellites
    assert bedfile is not None, "this should not happen"
    ps = bedfile.params
    conv = sconf.buildkey_to_chr_conversion(rk, bk, ps.chr_pattern)
    df = read_bed(smk.input[0], ps, [bedfile.sat_col])
    df = filter_sort_bed(conv, df)
    df = df[~df[3].str.startswith("ct_")]
    write_bed(smk.output[0], df)


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
import common.config as cfg
from common.bed import read_filter_sort_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    rm = sconf.refkey_to_strat(rk).low_complexity.rmsk
    assert rm is not None, "this should not happen"
    read_filter_sort_bed(
        sconf, smk.input[0], smk.output[0], rm.params, rk, bk, [rm.class_col]
    )


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
import common.config as cfg
from common.bed import read_filter_sort_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    ss = sconf.refkey_to_strat(rk).low_complexity.simreps
    assert ss is not None, "this should not happen"
    read_filter_sort_bed(sconf, smk.input[0], smk.output[0], ss.params, rk, bk)


main(snakemake, snakemake.config)  # type: ignore
import re
from typing import Any
import pandas as pd
import common.config as cfg
import subprocess as sp
from pathlib import Path

# This is the one exception to the pattern wherein we filter all input data to
# the chromosomes we actually care about. In the case of mappability, we want to
# run GEM against all random/unplaced contigs, since these are hard to map to
# begin with (otherwise they wouldn't be unplaced). Here, we need to filter
# the fasta to include all the primary contigs we care about + all the other
# unplaced/random contigs. Later we will filter out all the non-primary contigs.


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    pattern = sconf.refkey_to_final_chr_pattern(rk)
    pats = sconf.refkey_to_mappability_patterns(rk)
    cis = sconf.buildkey_to_chr_indices(rk, bk)
    idx = pd.read_table(
        Path(smk.input["idx"]),
        header=None,
        dtype={0: str},
        usecols=[0],
    )
    chrs = [
        *filter(
            lambda c: any(i.chr_name_full(pattern) == c for i in cis)
            or any(re.match(p, c) for p in pats),
            idx[0].tolist(),
        )
    ]
    with open(Path(smk.output[0]), "w") as f:
        # ASSUME sort order doesn't matter and ref is bgzip'd or unzip'd
        # NOTE the output is not bgzip'd because GEM doesn't like it
        p = sp.Popen(["samtools", "faidx", Path(smk.input["fa"]), *chrs], stdout=f)
        p.wait()


main(snakemake, snakemake.config)  # type: ignore
import re
from typing import Any
from pathlib import Path
import common.config as cfg
from common.bed import read_bed, filter_sort_bed, write_bed
from pybedtools import BedTool as bt  # type: ignore
import json


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    pattern = sconf.refkey_to_final_chr_pattern(rk)
    conv = cfg.fullset_conv(pattern)
    inputs = smk.input["bed"]
    genome = Path(smk.input["genome"][0])
    gapless = bt(smk.input["gapless"])

    def final_path(name: str) -> Path:
        p = Path(str(smk.params.path_pattern).format(name))
        p.parent.mkdir(exist_ok=True, parents=True)
        return p

    def to_single_output(basename: str) -> Path:
        m = re.match(".*(l\\d+_m\\d+_e\\d+).*", basename)
        assert m is not None, f"malformed mappability file name: {basename}"
        return final_path(f"nonunique_{m[1]}")

    def read_sort_bed(p: Path) -> bt:
        df = read_bed(p)
        # Sort here because we can't assume wig2bed sorts its output. Also,
        # filtering is necessary because the output should have unplaced contigs
        # in it that we don't want.
        return bt().from_dataframe(filter_sort_bed(conv, df))

    def merge_bed(bed: bt, out: Path) -> None:
        df = bed.merge(d=100).intersect(b=gapless, sorted=True, g=genome).to_dataframe()
        write_bed(out, df)

    def merge_single(bed: bt, out: Path) -> None:
        comp = bed.complement(g=str(genome))
        merge_bed(comp, out)

    all_lowmap = final_path("lowmappabilityall")
    single_lowmap = []

    # If there is only one input, merge this to make one "all_nonunique" bed.
    # Otherwise, merge each individual input and then combine these with
    # multi-intersect to make the "all_nonunique" bed.
    if len(inputs) == 1:
        bed = read_sort_bed(Path(inputs[0]))
        merge_single(bed, all_lowmap)
    else:
        single = [
            (
                to_single_output((p := Path(i)).name),
                read_sort_bed(p),
            )
            for i in inputs
        ]
        for sout, bed in single:
            merge_single(bed, sout)
        mint = bt().multi_intersect(i=[s[0] for s in single])
        merge_bed(mint, all_lowmap)
        single_lowmap = [str(s[0]) for s in single]

    with open(smk.output[0], "w") as f:
        obj = {"all_lowmap": str(all_lowmap), "single_lowmap": single_lowmap}
        json.dump(obj, f)


main(snakemake, snakemake.config)  # type: ignore
from pathlib import Path
import subprocess as sp
from typing import Callable
from typing_extensions import assert_never
from tempfile import NamedTemporaryFile as Tmp
from common.io import is_bgzip, is_gzip
import common.config as cfg
from common.io import get_md5, setup_logging

# hacky curl/gzip wrapper; this exists because I got tired of writing
# specialized rules to convert gzip/nozip files to bgzip and back :/
# Solution: force bgzip for references and gzip for bed

smk_log = snakemake.log[0]  # type: ignore

log = setup_logging(smk_log)

GUNZIP = ["gunzip", "-c"]
BGZIP = ["bgzip", "-c"]
GZIP = ["gzip", "-c"]
CURL = ["curl", "-f", "-Ss", "-L", "-q"]


def main(opath: str, src: cfg.BedSrc | cfg.RefSrc | None) -> None:
    if isinstance(src, cfg.FileSrc_):
        # ASSUME these are already tested via the pydantic class for the
        # proper file format
        Path(opath).symlink_to(Path(src.filepath).resolve())

    elif isinstance(src, cfg.HttpSrc_):
        curlcmd = [*CURL, src.url]

        # to test the format of downloaded files, sample the first 65000 bytes
        # (which should be enough to get one block of a bgzip file, which will
        # allow us to test for it)
        curltestcmd = [*CURL, "-r", "0-65000", src.url]

        with open(opath, "wb") as f, Tmp() as tf, open(smk_log, "w") as lf:

            def curl() -> None:
                if sp.run(curlcmd, stdout=f, stderr=lf).returncode != 0:
                    exit(1)

            def curl_test(testfun: Callable[[Path], bool]) -> bool:
                if sp.run(curltestcmd, stdout=tf, stderr=lf).returncode != 0:
                    exit(1)
                return testfun(Path(tf.name))

            def curl_gzip(cmd: list[str]) -> None:
                p1 = sp.Popen(curlcmd, stdout=sp.PIPE, stderr=lf)
                p2 = sp.run(cmd, stdin=p1.stdout, stdout=f, stderr=lf)
                p1.wait()
                if not (p1.returncode == p2.returncode == 0):
                    exit(1)

            # if we are getting a bed file (or related) ensure it is in gzip
            # format
            if isinstance(src, cfg.BedHttpSrc):
                if curl_test(is_gzip):
                    curl()
                else:
                    curl_gzip(GZIP)

            # if we are getting a fasta file ensure it is in bgzip format
            elif isinstance(src, cfg.RefHttpSrc):
                if curl_test(is_bgzip):
                    curl()
                elif curl_test(is_gzip):
                    p1 = sp.Popen(curlcmd, stdout=sp.PIPE, stderr=lf)
                    p2 = sp.Popen(GUNZIP, stdin=p1.stdout, stdout=sp.PIPE, stderr=lf)
                    p3 = sp.run(BGZIP, stdin=p2.stdout, stdout=f, stderr=lf)
                    p1.wait()
                    p2.wait()
                    if not (p1.returncode == p2.returncode == p3.returncode == 0):
                        exit(1)
                else:
                    curl_gzip(BGZIP)
            else:
                assert_never(src)

    elif src is None:
        assert False, "file src is null; this should not happen"

    else:
        assert_never(src)

    if src.md5 is not None and src.md5 != (actual := get_md5(opath, True)):
        log.error("md5s don't match; wanted %s, actual %s", src.md5, actual)
        exit(1)


main(snakemake.output[0], snakemake.params.src)  # type: ignore
```python
import json
from typing import Any
import common.config as cfg
from common.bed import filter_sort_bed_inner, read_bed, write_bed

VDJ_PAT = "^ID=gene-(IGH|IGK|IGL|TRA|TRB|TRG);"


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    to_map = sconf.buildkey_to_final_chr_mapping(rk, bk)

    with open(smk.input["mapper"], "r") as f:
        from_map = json.load(f)

    df = read_bed(smk.input["bed"], more=[5])
    df = df[df[3].str.match(VDJ_PAT)][[0, 1, 2]].copy()
    df = filter_sort_bed_inner(from_map, to_map, df)
    write_bed(smk.output[0], df)


main(snakemake, snakemake.config)  # type: ignore
```
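Because `VDJ_PAT` requires a `;` immediately after the locus name, it selects only the locus-level gene records and not individual V/D/J segment genes. A quick demonstration (the attribute strings below are illustrative, not taken from a real annotation file):

```python
import re

VDJ_PAT = "^ID=gene-(IGH|IGK|IGL|TRA|TRB|TRG);"

# the locus-level record matches...
locus = "ID=gene-IGH;gbkey=Gene;gene=IGH"
# ...but a segment gene does not, since ';' must directly follow the locus name
segment = "ID=gene-IGHV3-23;gbkey=Gene;gene=IGHV3-23"

print(bool(re.match(VDJ_PAT, locus)))    # True
print(bool(re.match(VDJ_PAT, segment)))  # False
```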
```python
from typing import Any
import common.config as cfg
from common.bed import read_bed, filter_sort_bed, write_bed
from pybedtools import BedTool as bt  # type: ignore


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    lk = cfg.OtherLevelKey(smk.wildcards["other_level_key"])
    sk = cfg.OtherStratKey(smk.wildcards["other_strat_key"])
    bed = sconf.otherkey_to_bed(rk, bk, lk, sk)
    ps = bed.params
    conv = sconf.buildkey_to_chr_conversion(rk, bk, ps.chr_pattern)
    df = filter_sort_bed(conv, read_bed(smk.input["bed"][0], ps))
    if bed.remove_gaps:
        df = (
            bt()
            .from_dataframe(df)
            .intersect(
                smk.input["gapless"],
                sorted=True,
                g=smk.input["genome"][0],
            )
            .to_dataframe()
        )
    write_bed(smk.output[0], df)


main(snakemake, snakemake.config)  # type: ignore
```
```python
from pathlib import Path
from typing import Any
import common.config as cfg
from common.stratdiff.lib.diff import compare_all
from common.io import setup_logging

log = setup_logging(snakemake.log[0])  # type: ignore


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards.ref_key)
    bk = cfg.BuildKey(smk.wildcards.build_key)
    comparison = sconf.buildkey_to_build(rk, bk).comparison
    pattern = sconf.refkey_to_final_chr_pattern(rk)
    ixs = sconf.buildkey_to_chr_indices(rk, bk)

    assert comparison is not None, "this should not happen"

    outdir = Path(smk.output[0]).parent

    logged = compare_all(
        Path(smk.input["new_list"]).parent,
        smk.input["old"],
        outdir,
        comparison.path_mapper,
        comparison.replacements,
        [i.chr_name_full(pattern) for i in ixs],
        comparison.ignore_generated,
        comparison.ignore_other,
    )
    for x in logged:
        log.info(x)


main(snakemake, snakemake.config)  # type: ignore
```
```python
from pathlib import Path
from typing import Any
import re


def main(smk: Any) -> None:
    with open(smk.input[0], "r") as i, open(smk.output[0], "w") as o:
        for f in i:
            fp = Path(f.strip())
            level = re.sub("^[^_]+_", "", fp.name.replace(".bed.gz", ""))
            o.write(f"{level}\t{Path(fp.parent.name) / fp.name}\n")


main(snakemake)  # type: ignore
```
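The regex strips the leading reference prefix (everything up to the first underscore) and the `.bed.gz` extension to recover the stratification level name. Tracing one hypothetical list entry (the path below is illustrative):

```python
import re
from pathlib import Path

# hypothetical entry from the final bed list, for illustration only
fp = Path("LowComplexity/GRCh38_AllTandemRepeats.bed.gz")

# drop the extension, then drop the reference prefix before the first '_'
level = re.sub("^[^_]+_", "", fp.name.replace(".bed.gz", ""))
line = f"{level}\t{Path(fp.parent.name) / fp.name}"
print(line)  # AllTandemRepeats<TAB>LowComplexity/GRCh38_AllTandemRepeats.bed.gz
```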
```python
from typing import Any
from os.path import dirname
from common.io import get_md5


def strat_files(path: str) -> list[str]:
    with open(path, "r") as f:
        return [s.strip() for s in f]


def main(smk: Any) -> None:
    out = str(smk.output)
    root = dirname(out)
    ss = strat_files(str(smk.input))
    with open(out, "w") as op:
        for s in ss:
            h = get_md5(s)
            p = s.replace(root + "/", "")
            # two spaces between hash and path: the format `md5sum -c` expects
            op.write(f"{h}  {p}\n")


main(snakemake)  # type: ignore
```
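The `get_md5` helper comes from `common.io` and is not shown here. A plausible sketch of such a helper, streaming the file in chunks and emitting an `md5sum`-compatible line (names here are illustrative assumptions, not the actual `common.io` code):

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk: int = 1 << 16) -> str:
    # stream in chunks so multi-GB references never load into memory at once
    h = hashlib.md5()
    with open(path, "rb") as f:
        while blk := f.read(chunk):
            h.update(blk)
    return h.hexdigest()


p = Path(tempfile.mkdtemp()) / "demo.bed.gz"
p.write_bytes(b"hello")
# "<hash>  <relative path>" (two spaces) is what `md5sum -c` checks against
print(f"{file_md5(p)}  {p.name}")  # 5d41402abc4b2a76b9719d911017c592  demo.bed.gz
```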
```python
from typing import Any


def main(smk: Any) -> None:
    with open(smk.output[0], "w") as f:
        for s in sorted(smk.input):
            f.write(s + "\n")


main(snakemake)  # type: ignore
```
```python
import gzip
import os
import subprocess as sp
from typing import Any
from pathlib import Path


def gzip_empty(path: str) -> bool:
    with gzip.open(path, "rb") as f:
        return len(f.read(1)) == 0


def main(smk: Any) -> None:
    # ASSUME if filtered benchmark bed is empty then we have nothing to
    # benchmark because the desired chromosomes in this build are not included
    # in the benchmark
    with open(smk.log[0], "w") as f:
        if gzip_empty(smk.input["bench_bed"][0]):
            Path(smk.output[0]).touch()
            f.write("no overlapping chromosomes, writing empty dummy file\n")
        else:
            cmd = [
                "hap.py",
                *["--engine", "vcfeval"],
                "--verbose",
                *["--threads", str(smk.threads)],
                *["--stratification", smk.input["strats"][0]],
                *["-f", smk.input["bench_bed"][0]],
                *["-o", smk.params["prefix"]],
                smk.input["bench_vcf"][0],
                smk.input["query_vcf"][0],
            ]
            res = sp.run(
                cmd,
                stderr=sp.STDOUT,
                stdout=f,
                env={"HGREF": smk.input["ref"][0], **os.environ},
            )
            if res.returncode != 0:
                exit(1)


main(snakemake)  # type: ignore
```
```python
import pandas as pd
from typing import Any, NamedTuple
from pathlib import Path
from pybedtools import BedTool as bt  # type: ignore
from os.path import dirname, basename
from os import scandir
import common.config as cfg
from common.io import setup_logging, is_bgzip
import subprocess as sp

log = setup_logging(snakemake.log[0])  # type: ignore


class GaplessBT(NamedTuple):
    auto: bt
    parY: bt


def test_bgzip(strat_file: Path) -> list[str]:
    """Each bed file should be bgzip'ed (not just gzip'ed)."""
    return [] if is_bgzip(strat_file) else ["is not bgzip file"]


def test_bed(
    strat_file: Path,
    reverse_map: dict[str, int],
    gapless: GaplessBT,
) -> list[str]:
    """Each stratification should be a valid bed file."""

    # test bed files in multiple phases, where each phase depends on the
    # previous being valid
    #
    # Phase 1: try to import the dataframe (with almost no assumptions except
    # that it has no header)
    try:
        # no low mem since we don't want to assume what dtypes each column has,
        # and by default pandas will chunk the input which could lead to mixed
        # types
        df = pd.read_table(strat_file, header=0, low_memory=False)
    except pd.errors.EmptyDataError:
        # if df is empty, nothing to do
        return []
    except Exception as e:
        # catch any other weird read errors
        return [str(e)]

    # Phase 2: ensure "bed file" has 3 columns
    if len(df.columns) != 3:
        return ["bed file has wrong number of columns"]

    # Phase 3: ensure "bed file" has columns of the correct type
    cols = {"chrom": str, "start": int, "end": int}
    df.columns = pd.Index([*cols])
    try:
        df = df.astype(cols)
    except ValueError as e:
        return [str(e)]

    # Phase 4: we now know we have a real bed file, ensure that we didn't
    # invent any new chromosomes (try to convert chrom column to ints, which
    # will produce NaNs on error)
    chr_ints = df["chrom"].map(reverse_map).astype("Int64")
    invalid_chrs = df["chrom"][chr_ints.isnull()].unique().tolist()
    if len(invalid_chrs) > 0:
        return [f"invalid chrs: {invalid_chrs}"]

    # Phase 5: test remaining assumptions:
    # - chromosome column is sorted
    # - each region is disjoint and ascending (eg the start of a region is at
    #   least 1bp away from the end of the previous)
    # - does not intersect with gap regions (if any) or extend beyond chr bounds
    same_chrom = df["chrom"] == df["chrom"].shift(1)
    distance = (df["start"] - df["end"].shift(1))[same_chrom]
    overlapping_lines = distance[distance < 1].index.tolist()

    # choose gap file depending on if this is an XY strat or not; note that the
    # built-in gaps file (if applicable) is obviously whitelisted from this
    # check
    if "gaps_slop15kb" in basename(strat_file):
        no_gaps = True
    else:
        isXY = basename(dirname(strat_file)) == "XY"
        gapless_bt = gapless.parY if isXY else gapless.auto
        gaps = bt.from_dataframe(df).subtract(gapless_bt).to_dataframe()
        no_gaps = len(gaps) == 0

    return [
        msg
        for result, msg in [
            (
                pd.Index(chr_ints).is_monotonic_increasing,
                "chromosomes not sorted",
            ),
            (
                len(overlapping_lines) == 0,
                f"non-disjoint regions at lines: {overlapping_lines}",
            ),
            (no_gaps, "bed contains gap regions"),
        ]
        if result is False
    ]


def test_checksums(checksums: Path) -> list[str]:
    """The md5sum utility should pass when invoked."""
    res = sp.run(
        ["md5sum", "-c", "--strict", "--quiet", checksums.name],
        cwd=checksums.parent,
        capture_output=True,
        text=True,
    )
    errors = [] if (s := res.stdout) == "" else s.strip().split("\n")
    return [f"checksum error: {e}" for e in errors]


def test_tsv_list(tsv_list: Path) -> list[str]:
    """The final stratifications list should match final beds exactly.

    We are running our hotel on very tight margins; no extra or missing beds
    allowed.
    """

    def strat_set(root: Path, sub: Path) -> set[str]:
        if root.is_dir():
            deeper = (Path(p.path) for p in scandir(root))
            return {str(q) for d in deeper for q in strat_set(d, sub / d.name)}
        elif root.is_file() and root.name.endswith(".bed.gz"):
            return {str(sub)}
        else:
            return set()

    current = strat_set(tsv_list.parent, Path("./"))
    with open(tsv_list, "r") as f:
        listed = {line.strip().split("\t")[1] for line in f}
    missing = [f"not in final directory: {p}" for p in listed - current]
    extra = [f"not in final list: {p}" for p in current - listed]
    return missing + extra


def run_all_tests(
    strat_file: Path,
    reverse_map: dict[str, int],
    gapless: GaplessBT,
) -> list[tuple[Path, str]]:
    return [
        (strat_file, msg)
        for msg in test_bgzip(strat_file) + test_bed(strat_file, reverse_map, gapless)
    ]


def strat_files(path: str) -> list[Path]:
    with open(path, "r") as f:
        return [Path(s.strip()) for s in f]


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    reverse_map = {v: k for k, v in sconf.buildkey_to_final_chr_mapping(rk, bk).items()}

    # check global stuff first (since this is faster)
    global_failures: list[str] = test_checksums(
        Path(smk.input["checksums"])
    ) + test_tsv_list(Path(smk.input["strat_list"]))

    for j in global_failures:
        log.error(j)

    if len(global_failures) > 0:
        exit(1)

    # check individual stratification files
    gapless = GaplessBT(
        auto=bt(str(smk.input["gapless_auto"])),
        parY=bt(str(smk.input["gapless_parY"])),
    )

    strat_failures = [
        res
        for p in strat_files(str(smk.input["strats"]))
        for res in run_all_tests(p, reverse_map, gapless)
    ]

    for i in strat_failures:
        log.error("%s: %s" % i)

    if len(strat_failures) > 0:
        exit(1)


main(snakemake, snakemake.config)  # type: ignore
```
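The Phase 5 disjointness check works by comparing each region's start against the previous region's end on the same chromosome; a distance below 1 bp flags the pair. A self-contained trace with a toy frame (the regions are invented for illustration):

```python
import pandas as pd

# toy frame: row 1 overlaps row 0, and row 2 abuts row 1 (start == previous
# end), which the "at least 1 bp apart" rule also flags; the chr2 row is clean
df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr2"],
    "start": [0, 8, 20, 5],
    "end":   [10, 20, 30, 15],
})
same_chrom = df["chrom"] == df["chrom"].shift(1)
distance = (df["start"] - df["end"].shift(1))[same_chrom]
overlapping_lines = distance[distance < 1].index.tolist()
print(overlapping_lines)  # [1, 2]
```

Masking with `same_chrom` matters: without it, the first region of each new chromosome would be compared against the last region of the previous one.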
```python
from typing import Any
import common.config as cfg
from common.bed import read_filter_sort_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    bench = sconf.stratifications[rk].builds[bk].bench
    assert bench is not None, "this should not happen"
    ps = bench.bench_bed.params
    read_filter_sort_bed(sconf, smk.input[0], smk.output[0], ps, rk, bk)


main(snakemake, snakemake.config)  # type: ignore
```
```python
import json
import pandas as pd
from typing import Any
import common.config as cfg


def read_ftbl(path: str) -> "pd.Series[str]":
    filter_cols = ["assembly_unit", "seq_type"]
    map_cols = ["chromosome", "genomic_accession"]
    df = pd.read_table(path, header=0, usecols=filter_cols + map_cols, dtype=str)
    chr_mask = df["seq_type"] == "chromosome"
    asm_mask = df["assembly_unit"] == "Primary Assembly"
    return (
        df[chr_mask & asm_mask][map_cols]
        .drop_duplicates()
        .copy()
        .set_index("chromosome")["genomic_accession"]
    )


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    cis = sconf.buildkey_to_chr_indices(rk, bk)

    # if this fails then something is wrong with the ftbl (eg it doesn't have
    # a complete set of chromosomes)
    ser = read_ftbl(smk.input[0])
    mapper = {ser[i.chr_name]: i.value for i in cis}

    with open(smk.output[0], "w") as f:
        json.dump(mapper, f)


main(snakemake, snakemake.config)  # type: ignore
```
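`read_ftbl` keeps only primary-assembly chromosome rows and maps chromosome names to accessions. The filtering can be traced on a tiny stand-in table (the rows below are illustrative, not excerpted from a real NCBI feature table):

```python
import pandas as pd

# stand-in for an ftbl: only row 0 is both a chromosome and Primary Assembly
df = pd.DataFrame({
    "assembly_unit": ["Primary Assembly", "Primary Assembly", "non-nuclear"],
    "seq_type": ["chromosome", "scaffold", "chromosome"],
    "chromosome": ["1", "1", "MT"],
    "genomic_accession": ["NC_000001.11", "NT_187361.1", "NC_012920.1"],
})
chr_mask = df["seq_type"] == "chromosome"
asm_mask = df["assembly_unit"] == "Primary Assembly"
ser = (
    df[chr_mask & asm_mask][["chromosome", "genomic_accession"]]
    .drop_duplicates()
    .set_index("chromosome")["genomic_accession"]
)
print(ser.to_dict())  # {'1': 'NC_000001.11'}
```

The scaffold and mitochondrial rows are dropped, leaving exactly one accession per nuclear chromosome, which is what the `mapper` lookup in `main` relies on.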
```python
from typing import Any
from pathlib import Path
import common.config as cfg
from pybedtools import BedTool as bt  # type: ignore
from common.bed import filter_sort_bed, read_bed, write_bed
import pandas as pd


def read_genome_bed(p: Path) -> pd.DataFrame:
    df = pd.read_table(
        p,
        header=None,
        names=["chrom", "end"],
        dtype={"chrom": str, "end": int},
    )
    df["start"] = 0
    return df[["chrom", "start", "end"]].copy()


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    inputs = smk.input
    auto_out = Path(smk.output["auto"])
    parY_out = Path(smk.output["parY"])

    # convert genome to bed file (where each region is just the length of
    # one chromosome)
    genome_bed = read_genome_bed(Path(inputs["genome"]))

    # If we have gap input, make the gapless file, otherwise just symlink to the
    # genome bed file (which just means the entire genome is gapless)
    if hasattr(inputs, "gaps"):
        rk = cfg.RefKey(smk.wildcards["ref_key"])
        bk = cfg.BuildKey(smk.wildcards["build_key"])
        bedfile = sconf.stratifications[rk].gap
        assert bedfile is not None, "this should not happen"

        ps = bedfile.params
        conv = sconf.buildkey_to_chr_conversion(rk, bk, ps.chr_pattern)

        gaps_src = Path(inputs["gaps"])
        gaps = read_bed(gaps_src, ps)
        gaps = filter_sort_bed(conv, gaps)
        gaps = bt.from_dataframe(gaps).merge(d=100).to_dataframe()
        gaps_bed = bt().from_dataframe(gaps)
        gaps_with_parY = bt().from_dataframe(genome_bed).subtract(gaps_bed)

        # If we have a parY bed, subtract parY from the gaps bed, otherwise
        # just link them since we have nothing to subtract off
        if hasattr(inputs, "parY"):
            parY_src = Path(inputs["parY"])

            gaps_no_parY = gaps_with_parY.subtract(bt(parY_src))

            write_bed(parY_out, gaps_with_parY.to_dataframe())
            write_bed(auto_out, gaps_no_parY.to_dataframe())
        else:
            write_bed(auto_out, gaps)
            parY_out.symlink_to(auto_out.resolve())
    else:
        write_bed(auto_out, genome_bed)
        parY_out.symlink_to(auto_out.resolve())


main(snakemake, snakemake.config)  # type: ignore
```
```python
from typing import Any
import pandas as pd
import common.config as cfg
from common.bed import filter_sort_bed_inner


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    to_map = sconf.buildkey_to_final_chr_mapping(rk, bk)
    # since this is generated from the reference itself, the chr -> int map
    # is just the reverse of the final int -> chr map
    from_map = {v: k for k, v in to_map.items()}
    # ASSUME the input for this is a .fa.fai file (columns = chr, length)
    df = pd.read_table(
        smk.input[0],
        header=None,
        dtype={0: str, 1: int},
        usecols=[0, 1],
    )
    filtered = filter_sort_bed_inner(from_map, to_map, df, 2)
    filtered.to_csv(smk.output[0], sep="\t", header=False, index=False)


main(snakemake, snakemake.config)  # type: ignore
```
```python
from typing import Any
import common.config as cfg
from common.bed import read_filter_sort_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    bedfile = sconf.refkey_to_strat(rk).segdups.superdups
    assert bedfile is not None, "this should not happen"
    read_filter_sort_bed(sconf, smk.input[0], smk.output[0], bedfile.params, rk, bk)


main(snakemake, snakemake.config)  # type: ignore
```
```python
from typing import Any
import common.config as cfg
from pybedtools import BedTool as bt  # type: ignore
from common.bed import read_bed, filter_sort_bed, write_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    level = smk.params["level"]
    i = cfg.ChrIndex.from_name(smk.wildcards["sex_chr"])
    assert i in [cfg.ChrIndex.CHRX, cfg.ChrIndex.CHRY], "invalid sex chr"

    fs = sconf.refkey_to_strat(rk).xy.features
    assert fs is not None, "this should not happen"
    bedfile = fs.x_bed if i is cfg.ChrIndex.CHRX else fs.y_bed

    ps = bedfile.params
    level_col = bedfile.level_col
    conv = sconf.buildkey_to_chr_conversion(rk, bk, ps.chr_pattern)
    df = read_bed(smk.input["bed"], ps, [level_col])
    filtsort = filter_sort_bed(conv, df)
    filtsort = filtsort[filtsort[level_col].str.contains(level)].drop(
        columns=[level_col]
    )
    gapless = bt(smk.input["gapless"])
    nogaps = (
        bt.from_dataframe(filtsort)
        .intersect(b=gapless, sorted=True, g=str(smk.input["genome"]))
        .to_dataframe()
    )
    write_bed(smk.output[0], nogaps)


main(snakemake, snakemake.config)  # type: ignore
```
```python
from typing import Any
from Bio import bgzf  # type: ignore
import common.config as cfg


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    i = cfg.ChrIndex.from_name(smk.wildcards["sex_chr"])
    k = cfg.RefKey(smk.wildcards["ref_key"])
    pattern = sconf.refkey_to_final_chr_pattern(k)
    cxy = sconf.stratifications[k].xy
    assert i in [cfg.ChrIndex.CHRX, cfg.ChrIndex.CHRY], "invalid sex chr"
    par_fun = cxy.fmt_x_par if i == cfg.ChrIndex.CHRX else cxy.fmt_y_par
    out = par_fun(pattern)
    assert out is not None, "this should not happen"
    with bgzf.BgzfWriter(smk.output[0], "w") as f:
        f.write(out)


main(snakemake, snakemake.config)  # type: ignore
```
```r
library(tidyverse)

subsets <- snakemake@params$subsets

read_summary <- function(path) {
  build <- path %>%
    dirname() %>%
    dirname() %>%
    dirname() %>%
    basename()
  readr::read_csv(
    path,
    col_types = cols(
      Type = "c",
      Subtype = "c",
      Subset = "c",
      Filter = "c",
      METRIC.Recall = "d",
      METRIC.Precision = "d",
      METRIC.Frac_NA = "d",
      TRUTH.TP = "d",
      TRUTH.TP.het = "d",
      TRUTH.TP.homalt = "d",
      .default = "-",
    )
  ) %>%
    filter(Filter == "PASS" &
             Subtype %in% c("*", "I16_PLUS", "D16_PLUS"),
           Subset %in% subsets) %>%
    rename(Recall = METRIC.Recall,
           Precision = METRIC.Precision,
           Frac_NA = METRIC.Frac_NA) %>%
    mutate("-log10(FN_Rate)" = -log10(1 - Recall),
           "-log10(FP_Rate)" = -log10(1 - Precision),
           build = build) %>%
    mutate(across(contains("TRUTH.TP"), log10)) %>%
    rename_with(~ sprintf("log10(%s)", .x), contains("TRUTH.TP")) %>%
    select(-Filter, -starts_with("TRUTH.TP")) %>%
    pivot_longer(cols = c(-Type, -Subtype, -Subset, -build),
                 names_to = "metric",
                 values_to = "value") %>%
    mutate(metric = fct_relevel(
      factor(metric),
      "Precision", "Recall", "Frac_NA",
      after = 0
    )) %>%
    filter(!is.na(value))
}

df <- snakemake@input %>%
  keep(~ file.size(.x) > 0) %>%
  map_dfr(read_summary)

nsubs <- df %>% pull(Subset) %>% unique() %>% length()
nbuilds <- df %>% pull(build) %>% unique() %>% length()

knitr::opts_chunk$set(
  echo = TRUE,
  fig.width = 10,
  fig.height = (3 + nsubs * nbuilds * 0.5)
)

plot_metric <- function(df, key) {
  p <- ggplot(df, aes(value, fct_rev(Subset), fill = build)) +
    geom_col(position = "dodge") +
    facet_wrap(c("metric"), scales = "free_x", ncol = 2) +
    labs(x = NULL, y = NULL, fill = "Build") +
    theme(legend.position = "top")
  print(p)
  cat("\n\n")
}
```
```r
df %>%
  filter(Type == "SNP") %>%
  plot_metric()
```

```r
df %>%
  filter(Type == "INDEL" & Subtype == "*") %>%
  plot_metric()
```

```r
df %>%
  filter(Type == "INDEL" & Subtype == "I16_PLUS") %>%
  plot_metric()
```

```r
df %>%
  filter(Type == "INDEL" & Subtype == "D16_PLUS") %>%
  plot_metric()
```
```r
library(tidyverse)

knitr::opts_chunk$set(fig.width = 25, fig.height = 6, fig_caption = TRUE)

chr_mapper <- readr::read_tsv(
  snakemake@input$chr_mapper,
  col_names = c("chrom_index", "build_key", "chrom"),
  col_types = "icc"
) %>%
  arrange(chrom_index, build_key)

bed_cols <- c("chrom", "start", "end")

read_bed <- function(path) {
  readr::read_tsv(path,
                  comment = "#",
                  col_types = "cii",
                  col_names = bed_cols) %>%
    mutate(region_size = end - start)
}

read_nonN <- function(path) {
  read_bed(path) %>%
    mutate(build_key = path %>% dirname() %>% dirname() %>% basename()) %>%
    left_join(chr_mapper, by = c("chrom", "build_key")) %>%
    select(-chrom)
}

read_strat <- function(path) {
  strat_filename <- basename(path)
  read_bed(path) %>%
    group_by(chrom) %>%
    summarise(total_region = sum(region_size), .groups = "drop") %>%
    mutate(build_key = path %>% dirname() %>% dirname() %>% basename(),
           ## genome = str_extract(strat_filename, "^[^_]+_(HG[0-9]+)_.bed.gz", 1),
           ## strat_level = str_extract(strat_filename,
           ##                           if_else(is.na(genome),
           ##                                   "^[^_]+_(.*).bed.gz",
           ##                                   "^[^_]+_HG[0-9]+_(.*).bed.gz"),
           ##                           1),
           strat_name = str_extract(strat_filename, "^[^_]+_(.*).bed.gz", 1)) %>%
    left_join(chr_mapper, by = c("chrom", "build_key")) %>%
    select(-chrom)
}

df_chr_sizes <- snakemake@input$nonN %>%
  map_dfr(read_nonN) %>%
  group_by(chrom_index, build_key) %>%
  summarise(nonN_ref_len = sum(region_size), .groups = "drop")

chr_build_combinations <- chr_mapper %>%
  expand(build_key, chrom_index)

file_list <- snakemake@input$strats %>%
  map(readLines) %>%
  flatten()

chr_name <- function(i) {
  case_when(i == 23 ~ "X",
            i == 24 ~ "Y",
            0 < i & i < 23 ~ as.character(i),
            TRUE ~ "oops")
}

chr_levels <- chr_name(1:24)

read_strats <- function(stype, gaps) {
  fs <- keep(file_list, ~ basename(dirname(.x)) == stype)
  if (length(fs) > 0) {
    df <- map_dfr(fs, read_strat)
    all_combos <- chr_build_combinations %>%
      expand(nesting(chrom_index, build_key), strat_name = df$strat_name)
    df %>%
      # join the full cartesian product of all chromosomes and paths, then fill
      # missing with 0 so that we get an empty space on the graphs indicating
      # zero coverage
      full_join(all_combos, by = c("chrom_index", "strat_name", "build_key")) %>%
      left_join(df_chr_sizes, by = c("chrom_index", "build_key")) %>%
      mutate(strat_type = stype,
             coverage = total_region / nonN_ref_len,
             chrom = factor(chr_name(chrom_index), levels = chr_levels)) %>%
      replace_na(list(coverage = 0))
  } else {
    NULL
  }
}

coverage_plot <- function(df, level) {
  cat(sprintf("#### %s\n\n", level))
  p <- df %>%
    ggplot(aes(chrom, coverage, fill = build_key)) +
    geom_col(position = "dodge") +
    scale_y_continuous(labels = scales::percent) +
    theme(legend.position = "top",
          text = element_text(size = 30),
          axis.text.x = element_text(angle = 90),
          strip.text.y = element_text(angle = 0),
          panel.spacing.y = unit(2, "lines")) +
    labs(x = NULL, fill = "build")
  print(p)
  cat("\n\n")
}

coverage_plots <- function(stype, gaps) {
  cat(sprintf("## %s\n\n", stype))
  df <- read_strats(stype, gaps)
  if (is.null(df)) {
    cat("None\n\n")
  } else {
    df %>%
      group_by(strat_name) %>%
      group_walk(coverage_plot)
  }
}
```
```r
snakemake@params$core_levels %>%
  union(snakemake@params$other_levels) %>%
  walk(coverage_plots)
```