Pipeline for generating official Genome in a Bottle stratifications


This repository contains the pipeline and configuration used for generating and evaluating the Genome in a Bottle stratifications. The stratification files were developed with the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team, the Genome in a Bottle Consortium (GIAB), and the Telomere-to-Telomere Consortium (T2T). They are intended as a standard resource of BED files for stratifying true positive, false positive, and false negative variant calls in challenging and targeted regions of the genome.

The stratification BED files are only accessible on the GIAB FTP site. They can be used with GA4GH benchmarking tools such as hap.py.
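To give an intuition for what "stratifying" variant calls means, the sketch below checks which call positions fall inside one stratification's intervals. The interval data here is entirely made up for illustration; real benchmarking should go through hap.py with the provided .tsv files rather than ad-hoc code like this.

```python
# Minimal sketch: a stratification is a set of BED-style intervals
# (0-based, half-open); stratifying calls means partitioning them by
# whether their position falls inside those intervals.
from bisect import bisect_right

def in_strata(intervals, pos):
    """intervals: sorted, non-overlapping (start, end) pairs; pos is 0-based."""
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

# hypothetical low-complexity intervals on one chromosome
strata = [(100, 200), (500, 800)]

calls = [150, 250, 500, 799, 800]
inside = [p for p in calls if in_strata(strata, p)]
print(inside)  # [150, 500, 799]
```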

Note that this pipeline applies to stratification versions starting with v3.2. Older pipelines and scripts can be found here.

Within this repository, further details (including all source files) can be found in the configuration at config/all.yml. The pipeline itself is maintained in a separate, reference-agnostic repository here. Please refer to that pipeline for specific methodological information about how these files are produced.

Currently, this pipeline is set up to build stratifications for the following references:

  • GRCh37

  • GRCh38

  • CHM13

Versioning

Version tags follow the format X.Y-Z, where X is incremented for removal of old stratifications and/or large deletions, Y is incremented for additions and minor revisions, and Z is incremented for documentation updates in this repository. Note that X.Y (without Z) corresponds to the versions located on the FTP site.
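The X.Y-Z scheme above can be made concrete with a small parsing sketch (the helper names are hypothetical, not part of this repository):

```python
# Parse an X.Y-Z version tag into a comparable (X, Y, Z) tuple; the
# X.Y prefix (without Z) is what corresponds to the FTP-site version.
import re

def parse_tag(tag):
    m = re.fullmatch(r"(\d+)\.(\d+)-(\d+)", tag)
    if m is None:
        raise ValueError(f"not an X.Y-Z tag: {tag}")
    return tuple(map(int, m.groups()))

def ftp_version(tag):
    x, y, _ = parse_tag(tag)
    return f"{x}.{y}"

print(ftp_version("3.2-1"))                       # 3.2
print(parse_tag("3.3-0") > parse_tag("3.2-1"))    # True (tuple comparison)
```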

For precise differences between each stratification version, see the stratdiff tool.

General Information

Author Information

  • Principal Investigator: Justin Zook, NIST, jzook@nist.gov

  • Nate Olson, NIST, nathanael.olson@nist.gov

  • Justin Wagner, NIST, justin.wagner@nist.gov

  • Jennifer McDaniel, NIST, jennifer.mcdaniel@nist.gov

  • Nate Dwarshuis, NIST, nathan.dwarshuis@nist.gov

Sharing/Access Information

Licenses/restrictions placed on the data, or limitations of reuse: Publicly released data are freely available for reuse without embargo.

Citations for stratifications are located in the associated READMEs.

If stratifications were used in benchmarking with GA4GH/GIAB best practices or hap.py, please reference:

Krusche, P., Trigg, L., Boutros, P.C. et al. 
Best practices for benchmarking germline small-variant calls in human genomes. 
Nat Biotechnol 37, 555-560 (2019). https://doi.org/10.1038/s41587-019-0054-x

Links to publicly accessible locations of the data:

GIAB FTP URL

  • Individual stratification BED files as well as zipped directories (tar.gz) of files

  • stratification READMEs

  • .tsv files for benchmarking with hap.py

  • MD5 checksums

Summary Of Stratifications

Stratifications are binned into several types: Low Complexity, Segmental Duplications, XY, Functional Regions, GC Content, Mappability, Telomeres, Union, Ancestry, Genome Specific, Functional Technically Difficult, and Other Difficult. General information for each stratification type is provided below.

These categories are present for all indicated references unless otherwise noted.

Low Complexity

Regions with different types and sizes of low complexity sequence, e.g., homopolymers, STRs, VNTRs and other locally repetitive sequences.
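As a toy illustration of one low-complexity class, the sketch below finds perfect homopolymer runs of at least a given length and reports them as 0-based BED-like intervals. The production pipeline uses a dedicated repseq tool (see the Code Snippets below); this regex version is for intuition only.

```python
# Find homopolymer runs (one of the low-complexity classes) of length
# >= min_len, returned as (start, end, base) with 0-based half-open ends.
import re

def homopolymers(seq, min_len):
    return [(m.start(), m.end(), m.group()[0])
            for m in re.finditer(r"(A+|C+|G+|T+)", seq.upper())
            if m.end() - m.start() >= min_len]

print(homopolymers("ACGTTTTTACAAAAG", 4))  # [(3, 8, 'T'), (10, 14, 'A')]
```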

Segmental Duplications

Regions with segmental duplications (generally defined as repeated regions >1kb with >90% similarity).

XY

Chromosome X and Y specific regions such as PAR, XTR, or ampliconic regions.

Functional Regions

Regions to stratify variants inside and outside coding regions.

GC Content

Regions with different ranges (%) of GC content.
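The following is an illustrative pure-Python analogue of binning fixed-size sequence windows by GC content. The real pipeline computes this with seqtk gc plus bedtools (see the Code Snippets below); the sequence and thresholds here are made up.

```python
# Report windows of a fixed size whose GC percentage falls in [lo, hi),
# mimicking what a GC-content stratification measures.
def gc_fraction(seq):
    seq = seq.upper()
    return sum(seq.count(b) for b in "GC") / len(seq)

def windows_in_range(seq, size, lo, hi):
    out = []
    for start in range(0, len(seq) - size + 1, size):
        gc = 100 * gc_fraction(seq[start:start + size])
        if lo <= gc < hi:
            out.append((start, start + size))
    return out

seq = "ATATATATAT" + "GCGCGCGCGC" + "ATGCATGCAT"
print(windows_in_range(seq, 10, 40, 90))  # [(20, 30)]
```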

Mappability

Regions where short read mapping can be challenging.

Telomeres

Telomeric regions. These are included for CHM13 only.

Union

Regions with different general types of difficult regions or any type of difficult region or complex variant. For example, performance can be measured in just "easy" or "all difficult" regions of the genome.
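The union idea above can be sketched with plain interval arithmetic: merge several difficult-region interval sets into one "all difficult" set, whose complement over a chromosome span is the "easy" set. The intervals here are hypothetical; the pipeline itself does this with multiIntersectBed, mergeBed, and complementBed.

```python
# Union of difficult-region sets via interval merging; the complement
# over a chromosome span gives the "easy" regions.
def merge(intervals):
    out = []
    for s, e in sorted(intervals):
        if out and s <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out

def complement(intervals, span):
    easy, prev = [], span[0]
    for s, e in merge(intervals):
        if s > prev:
            easy.append((prev, s))
        prev = max(prev, e)
    if prev < span[1]:
        easy.append((prev, span[1]))
    return easy

difficult = merge([(0, 50), (40, 100), (300, 400)])   # union of two sources
print(difficult)                        # [(0, 100), (300, 400)]
print(complement(difficult, (0, 500)))  # [(100, 300), (400, 500)]
```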

Ancestry

Regions with inferred patterns of local ancestry.

These only exist for GRCh38.

Genome Specific (GIAB benchmark v4.2.1)

Difficult regions due to potentially difficult variation in a NIST/GIAB sample, including 1) regions containing putative compound heterozygous variants, 2) small regions containing multiple phased variants, and 3) regions with potential structural or copy number variation.

These only exist for GRCh37/38.

Functional Technically Difficult

Functional, or potentially functional, regions that are also likely to be technically difficult to sequence.

These only exist for GRCh37/38.

Other Difficult

These are regions that are difficult to sequence for miscellaneous reasons. In the case of GRCh37/38, these include regions near gaps or errors in the reference, or regions known to be expanded/collapsed relative to most other genomes.

This category also includes highly polymorphic regions, which are difficult to sequence for that reason, such as the major histocompatibility complex (MHC), the VDJ regions, and the regions coding the killer-cell immunoglobulin-like receptors (KIR).

Finally, in the case of CHM13, this category also includes the rDNA arrays.

Data-Use Policy

This data/work was created by employees of the National Institute of Standards and Technology (NIST), an agency of the Federal Government. Pursuant to title 17 United States Code Section 105, works of NIST employees are not subject to copyright protection in the United States. This data/work may be subject to foreign copyright.

The data/work is provided by NIST as a public service and is expressly provided AS IS. NIST MAKES NO WARRANTY OF ANY KIND, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT AND DATA ACCURACY. NIST does not warrant or make any representations regarding the use of the data or the results thereof, including but not limited to the correctness, accuracy, reliability or usefulness of the data. NIST SHALL NOT BE LIABLE AND YOU HEREBY RELEASE NIST FROM LIABILITY FOR ANY INDIRECT, CONSEQUENTIAL, SPECIAL, OR INCIDENTAL DAMAGES (INCLUDING DAMAGES FOR LOSS OF BUSINESS PROFITS, BUSINESS INTERRUPTION, LOSS OF BUSINESS INFORMATION, AND THE LIKE), WHETHER ARISING IN TORT, CONTRACT, OR OTHERWISE, ARISING FROM OR RELATING TO THE DATA (OR THE USE OF OR INABILITY TO USE THIS DATA), EVEN IF NIST HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

To the extent that NIST may hold copyright in countries other than the United States, you are hereby granted the non-exclusive irrevocable and unconditional right to print, publish, prepare derivative works and distribute the NIST data, in any medium, or authorize others to do so on your behalf, on a royalty-free basis throughout the world.

You may improve, modify, and create derivative works of the data or any portion of the data, and you may copy and distribute such modifications or works. Modified works should carry a notice stating that you changed the data and should note the date and nature of any such change. Please explicitly acknowledge the National Institute of Standards and Technology as the source of the data: Data citation recommendations are provided at https://www.nist.gov/open/license.

Permission to use this data is contingent upon your acceptance of the terms of this agreement and upon your providing appropriate acknowledgments of NIST’s creation of the data/work.

Running this pipeline

This repository's existence on GitHub is primarily for documentation purposes, as the runtime is defined in terms of NIST-specific resources.

However, for those wishing to run the pipeline themselves, there are two options (with or without a cluster). Both require cloning this repo with submodules and setting up the conda environment:

git clone --recurse-submodules https://github.com/ndwarshuis/giab-stratifications.git
cd giab-stratifications
conda env create -f env.yml
conda activate giab-strats-production

No Cluster Environment

Just call the pipeline directly using snakemake:

snakemake --cores 20 --configfile=config/all.yml all

Note that most rules in the pipeline require 12MB of memory or less, with the exception of mappability (which runs GEM and requires ~32GB for references of the size used here).

Cluster Environment

First, create a profile based on the one located at workflow/profiles/nisaba/config.yml (which is designed for a NIST-specific cluster running Slurm).

Then run snakemake with this profile:

snakemake --cores 20 --configfile=config/all.yml --profile=workflow/profiles/yerprofile all

Code Snippets

script:
    "../scripts/python/bedtools/functional/filter_cds.py"
shell:
    """
    mergeBed -i {input.bed} | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
shell:
    """
    complementBed -i {input.bed} -g {input.genome} | \
    intersectBed -a stdin -b {input.gapless} -sorted | \
    bgzip -c > {output}
    """
shell:
    """
    seqtk gc {params.args} -l 100 {input.ref} | \
    cut -f1-3 | \
    slopBed -i stdin -g {input.genome} -b 50 | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c \
    > {output}
    """
script:
    "../scripts/python/bedtools/gc/intersect_ranges.py"
SnakeMake From line 99 of rules/gc.smk
shell:
    """
    gunzip -c {input.ref} | \
    sed 's/^[A-Za-z]\\+$/\\U&/' | \
    {input.bin} {wildcards.unit_len} {wildcards.total_len} - 2> {log} \
    > {output} 
    """
shell:
    "sed -n '/unit=[{wildcards.bases}]\+/p' {input} > {output}"
shell:
    "subtractBed -a {input.a} -b {input.b} > {output}"
shell:
    "subtractBed -a {input.a} -b {input.b} > {output}"
shell:
    """
    slopBed -i {input} -b 5 -g {params.genome} | \
    cut -f1-3 | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {params.gapless} -sorted -g {params.genome} | \
    bgzip -c \
    > {output}
    """
shell:
    """
    grep 'unit={wildcards.base}' {input} | \
    mergeBed -i stdin -d 1 | \
    awk '$3-$2>={wildcards.merged_len}' > \
    {output}
    """
shell:
    """
    multiIntersectBed -i {input} | \
    slopBed -i stdin -b 5 -g {params.genome} | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {params.gapless} -sorted | \
    bgzip -c > {output}
    """
script:
    "../scripts/python/bedtools/low_complexity/filter_sort_trf.py"
shell:
    """
    gunzip -c {input} | \
    mergeBed -i stdin | \
    bgzip -c > {output}
    """
script:
    "../scripts/python/bedtools/low_complexity/filter_sort_rmsk.py"
shell:
    """
    gunzip -c {input.rmsk} | \
    grep {wildcards.rmsk_class} | \
    mergeBed -i stdin | \
    bgzip -c > {output}
    """
script:
    "../scripts/python/bedtools/low_complexity/filter_sort_censat.py"
shell:
    "mergeBed -i {input} | bgzip -c > {output}"
shell:
    """
    slopBed -i {input.bed} -b 5 -g {input.genome} | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
shell:
    """
    complementBed -i {input.bed} -g {params.genome} |
    intersectBed -a stdin -b {params.gapless} -sorted | \
    bgzip -c > {output}
    """
shell:
    """
    slopBed -i {input.perfect} -b 5 -g {input.genome} | \
    mergeBed -i stdin | \
    multiIntersectBed -i stdin {input.imperfect} | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c \
    > {output}
    """
shell:
    """
    multiIntersectBed -i {input.beds} | \
    slopBed -i stdin -b 5 -g {input.genome} | \
    mergeBed -i stdin > {output}
    """
shell:
    """
    awk '({params.lower}+10)<=($3-$2) && ($3-$2)<({params.upper}+10)' {input.tr} | \
    subtractBed -a stdin -b {input.hp} | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
shell:
    """
    multiIntersectBed -i {input} | \
    mergeBed -i stdin | \
    bgzip -c > {output}
    """
shell:
    "curl -sS -L -o {output} {params.url}"
shell:
    """
    mkdir -p {config.tools_bin_dir} && \
    tar xjf {input} \
    --directory {config.tools_bin_dir} \
    --strip-components=2 \
    {params.bins}
    """
script:
    "../scripts/python/bedtools/mappability/filter_ref.py"
shell:
    """
    PATH={config.tools_bin_dir}:$PATH
    {input.bin} \
    --complement emulate \
    -T {threads} \
    -i {input.fa} \
    -o {params.base} > {log} 2>&1
    """
shell:
    """
    {input.bin} \
    -m {wildcards.m} \
    -e {wildcards.e} \
    -l {wildcards.l} \
    -T {threads} \
    -I {input.fa} \
    -o {params.base} > {log} 2>&1
    """
shell:
    """
    {input.bin} \
    -I {input.idx} \
    -i {input.map} \
    -o {params.base} 2> {log}
    """
shell:
    """
    sed 's/ AC//' {input} | \
    wig2bed -d | \
    awk '$5>0.9' | \
    cut -f1-3 | \
    gzip -c > \
    {output}
    """
script:
    "../scripts/python/bedtools/mappability/merge_nonunique.py"
shell:
    """
    complementBed -i {input} -g {params.genome} | \
    intersectBed -a stdin -b {params.gapless} -sorted -g {params.genome} | \
    bgzip -c > {output}
    """
shell:
    """
    complementBed -i {input.gapless} -g {input.genome} | \
    slopBed -i stdin -b 15000 -g {input.genome} | \
    mergeBed -i stdin | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
script:
    "../scripts/python/bedtools/otherdifficult/filter_vdj.py"
shell:
    """
    mergeBed -i {input.bed} | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
script:
    "../scripts/python/bedtools/other/filter_sort_other.py"
script:
    "../scripts/python/bedtools/postprocess/list_strats.py"
script:
    "../scripts/python/bedtools/postprocess/list_md5.py"
script:
    "../scripts/python/bedtools/postprocess/generate_tsv.py"
script:
    "../scripts/python/bedtools/postprocess/run_unit_tests.py"
shell:
    """
    curl -fsSq -o {params.tar} {params.url} && \
    mkdir {output} && \
    tar xzf {params.tar} --directory {output} --strip-components=1
    """
script:
    "../scripts/python/bedtools/postprocess/diff_previous_strats.py"
run:
    with open(output[0], "w") as f:
        for ref_key, build_key in zip(rks, bks):
            for i in config.buildkey_to_chr_indices(ref_key, build_key):
                pattern = config.refkey_to_final_chr_pattern(ref_key)
                line = [
                    str(i.value),
                    f"{ref_key}@{build_key}",
                    i.chr_name_full(pattern),
                ]
                f.write("\t".join(line) + "\n")
script:
    "../scripts/rmarkdown/rmarkdown/validate.Rmd"
shell:
    "gunzip -c {input} > {output}"
shell:
    """
    samtools faidx {input}
    """
script:
    "../scripts/python/bedtools/postprocess/run_happy.py"
script:
    "../scripts/rmarkdown/rmarkdown/benchmark.Rmd"
shell:
    """
    cp {input.main} {output.main}
    cp {input.validation} {output.validation}
    """
shell:
    """
    tar czf {output} -C {params.parent} {params.target}
    """
shell:
    """
    find {params.root} -type f -exec md5sum {{}} + | \
    sed 's|{params.root}|\.|' | \
    grep -v "\./genome-stratifications-md5s\.txt" | \
    sort -k2,2 > {output} && \
    cd {params.root} && \
    md5sum -c --quiet genome-stratifications-md5s.txt
    """
script:
    "../scripts/python/bedtools/xy/write_par.py"
SnakeMake From line 21 of rules/ref.smk
script:
    "../scripts/python/bedtools/misc/get_file.py"
SnakeMake From line 35 of rules/ref.smk
shell:
    """
    gunzip -c {input} | \
    samtools faidx - -o - \
    2> {log} > {output}
    """
script:
    "../scripts/python/bedtools/ref/get_genome.py"
SnakeMake From line 65 of rules/ref.smk
shell:
    """
    samtools faidx {input.fa} $(cut -f1 {input.genome} | tr '\n' ' ') 2> {log} | \
    bgzip -c > {output}
    """
script:
    "../scripts/python/bedtools/ref/get_gapless.py"
SnakeMake From line 123 of rules/ref.smk
script:
    "../scripts/python/bedtools/ref/get_ftbl_mapper.py"
SnakeMake From line 154 of rules/ref.smk
shell:
    """
    gunzip -c {input} | \
    grep -v '^#' | \
    awk -F "\t" 'OFS="\t" {{ print $1,$4,$5,$2,$3,$9 }}' | \
    bgzip -c > {output}
    """
SnakeMake From line 165 of rules/ref.smk
script:
    "../scripts/python/bedtools/ref/filter_sort_bench_bed.py"
SnakeMake From line 213 of rules/ref.smk
shell:
    "curl -sS -L -o {output} {params.url}"
shell:
    """
    mkdir {output} && \
    tar xzf {input} --directory {output} --strip-components=1
    """
shell:
    "make -C {input} 2>&1 > {log} && mv {input}/repseq {output}"
script:
    "../scripts/python/bedtools/segdups/filter_sort_superdups.py"
shell:
    """
    mergeBed -i {input.bed} -d 100 | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
shell:
    """
    gunzip -c {input} | \
    awk '($3-$2 > 10000)' | \
    bgzip -c > {output}
    """
shell:
    """
    complementBed -i {input.bed} -g {input.genome} | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
shell:
    """
    seqtk telo {input.ref} 2> {log} | \
    cut -f1-3 | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c \
    > {output}
    """
shell:
    """
    multiIntersectBed -i {input} | \
    mergeBed -i stdin | \
    bgzip -c \
    > {output}
    """
shell:
    """
    complementBed -i {input.bed} -g {params.genome} |
    intersectBed -a stdin -b {params.gapless} -sorted -g {params.genome} | \
    bgzip -c \
    > {output}
    """
shell:
    """
    intersectBed -a {input.bed} -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
script:
    "../scripts/python/bedtools/xy/filter_sort_features.py"
SnakeMake From line 54 of rules/xy.smk
shell:
    """
    complementBed -i {input.bed} -g {input.genome} | \
    grep {wildcards.sex_chr} | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
shell:
    """
    awk -v OFS='\t' {{'print $1,\"0\",$2'}} {input.bed} | \
    grep -v \"X\|Y\" | \
    intersectBed -a stdin -b {input.gapless} -sorted -g {input.genome} | \
    bgzip -c > {output}
    """
import json
from typing import Any
import common.config as cfg
from common.bed import filter_sort_bed_inner, read_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    to_map = sconf.buildkey_to_final_chr_mapping(rk, bk)

    with open(smk.input["mapper"], "r") as f:
        from_map = json.load(f)

    df = read_bed(smk.input["bed"], more=[3, 4])
    df = df[df[3].str.contains("RefSeq") & (df[4] == "CDS")][[0, 1, 2]].copy()
    df = filter_sort_bed_inner(from_map, to_map, df)

    df.to_csv(
        smk.output[0],
        header=False,
        index=False,
        sep="\t",
    )


main(snakemake, snakemake.config)  # type: ignore
from pathlib import Path
from typing import Any, NamedTuple, Callable
import subprocess as sp
import common.config as cfg
import json

# use subprocess here because I don't see an easy way to stream a bedtools
# python object to a bgzip file (it only seems to do gzip...oh well)


class GCInput(NamedTuple):
    bed: Path
    fraction: int
    is_range_bound: bool


def write_simple_range_beds(
    final_path: Callable[[str], Path],
    gs: list[GCInput],
    is_low: bool,
) -> list[str]:
    def fmt_out(bigger_frac: int, smaller_frac: int) -> Path:
        lower_frac, upper_frac = (
            (smaller_frac, bigger_frac) if is_low else (bigger_frac, smaller_frac)
        )
        return final_path(f"gc{lower_frac}to{upper_frac}_slop50")

    pairs = zip(gs[1:], gs[:-1]) if is_low else zip(gs[:-1], gs[1:])
    torun = [
        (fmt_out(bigger.fraction, smaller.fraction), bigger, smaller)
        for bigger, smaller in pairs
    ]
    for out, bigger, smaller in torun:
        with open(out, "wb") as f:
            p0 = sp.Popen(
                ["subtractBed", "-a", bigger.bed, "-b", smaller.bed],
                stdout=sp.PIPE,
            )
            p1 = sp.run(["bgzip", "-c"], stdin=p0.stdout, stdout=f)
            p0.wait()
            if not (p0.returncode == p1.returncode == 0):
                exit(1)
    return [str(t[0]) for t in torun]


def write_middle_range_bed(
    final_path: Callable[[str], Path],
    lower: GCInput,
    upper: GCInput,
    genome: Path,
    gapless: Path,
) -> str:
    out = final_path(f"gc{lower.fraction}to{upper.fraction}_slop50")
    with open(out, "wb") as f:
        p0 = sp.Popen(
            ["complementBed", "-i", upper.bed, "-g", genome],
            stdout=sp.PIPE,
        )
        p1 = sp.Popen(
            ["subtractBed", "-a", "stdin", "-b", lower.bed],
            stdin=p0.stdout,
            stdout=sp.PIPE,
        )
        p2 = sp.Popen(
            ["intersectBed", "-a", "stdin", "-b", gapless, "-sorted", "-g", genome],
            stdin=p1.stdout,
            stdout=sp.PIPE,
        )
        p3 = sp.run(["bgzip", "-c"], stdin=p2.stdout, stdout=f)
        p0.wait()
        p1.wait()
        p2.wait()
        if not (p0.returncode == p1.returncode == p2.returncode == p3.returncode == 0):
            exit(1)
    return str(out)


def write_intersected_range_beds(
    final_path: Callable[[str], Path],
    low: list[GCInput],
    high: list[GCInput],
) -> list[str]:
    pairs = zip(
        [x for x in low if x.is_range_bound],
        [x for x in reversed(high) if x.is_range_bound],
    )
    torun = [
        (i, final_path(f"gclt{b1.fraction}orgt{b2.fraction}_slop50"), b1, b2)
        for i, (b1, b2) in enumerate(pairs)
    ]
    for i, bed_out, b1, b2 in torun:
        with open(bed_out, "wb") as f:
            p0 = sp.Popen(
                ["multiIntersectBed", "-i", b1.bed, b2.bed],
                stdout=sp.PIPE,
            )
            p1 = sp.Popen(
                ["mergeBed", "-i", "stdin"],
                stdin=p0.stdout,
                stdout=sp.PIPE,
            )
            p2 = sp.run(["bgzip", "-c"], stdin=p1.stdout, stdout=f)
            p0.wait()
            p1.wait()
            if not (p0.returncode == p1.returncode == p2.returncode == 0):
                exit(1)

    return [str(t[1]) for t in torun]


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards.ref_key)
    bk = cfg.BuildKey(smk.wildcards.build_key)
    gps = sconf.buildkey_to_include(rk, bk).gc
    assert gps is not None, "this should not happen"
    # ASSUME both of these input lists are sorted by GC fraction
    low = [GCInput(p, f, r) for p, (f, r) in zip(smk.input.low, gps.low)]
    high = [GCInput(p, f, r) for p, (f, r) in zip(smk.input.high, gps.high)]
    genome = Path(smk.input.genome)
    gapless = Path(smk.input.gapless)

    def final_path(name: str) -> Path:
        p = Path(str(smk.params.path_pattern).format(name))
        p.parent.mkdir(exist_ok=True, parents=True)
        return p

    low_strats = write_simple_range_beds(final_path, low, True)
    high_strats = write_simple_range_beds(final_path, high, False)
    range_strat = write_middle_range_bed(
        final_path,
        low[-1],
        high[0],
        genome,
        gapless,
    )
    inter_strats = write_intersected_range_beds(
        final_path,
        low,
        high,
    )

    with open(smk.output[0], "w") as f:
        # put the first low and last high input here since these are already
        # in the final directory
        obj = {
            "gc_ranges": [
                low[0][0],
                *low_strats,
                range_strat,
                *high_strats,
                high[-1][0],
            ],
            "widest_extreme": inter_strats[0],
            "other_extremes": inter_strats[1:],
        }
        json.dump(obj, f, indent=4)


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
import common.config as cfg
from common.bed import read_bed, filter_sort_bed, write_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])

    bedfile = sconf.refkey_to_strat(rk).low_complexity.satellites
    assert bedfile is not None, "this should not happen"

    ps = bedfile.params
    conv = sconf.buildkey_to_chr_conversion(rk, bk, ps.chr_pattern)

    df = read_bed(smk.input[0], ps, [bedfile.sat_col])
    df = filter_sort_bed(conv, df)
    df = df[~df[3].str.startswith("ct_")]
    write_bed(smk.output[0], df)


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
import common.config as cfg
from common.bed import read_filter_sort_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    rm = sconf.refkey_to_strat(rk).low_complexity.rmsk
    assert rm is not None, "this should not happen"
    read_filter_sort_bed(
        sconf, smk.input[0], smk.output[0], rm.params, rk, bk, [rm.class_col]
    )


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
import common.config as cfg
from common.bed import read_filter_sort_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    ss = sconf.refkey_to_strat(rk).low_complexity.simreps
    assert ss is not None, "this should not happen"
    read_filter_sort_bed(sconf, smk.input[0], smk.output[0], ss.params, rk, bk)


main(snakemake, snakemake.config)  # type: ignore
import re
from typing import Any
import pandas as pd
import common.config as cfg
import subprocess as sp
from pathlib import Path

# This is the one exception to the pattern wherein we filter all input data to
# the chromosomes we actually care about. In the case of mappability, we want to
# run GEM against all random/unplaced contigs, since these are hard to map to
# begin with (otherwise they wouldn't be unplaced). Here, we need to filter
# the fasta to include all the primary contigs we care about + all the other
# unplaced/random contigs. Later we will filter out all the non-primary contigs.


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])

    pattern = sconf.refkey_to_final_chr_pattern(rk)
    pats = sconf.refkey_to_mappability_patterns(rk)
    cis = sconf.buildkey_to_chr_indices(rk, bk)

    idx = pd.read_table(
        Path(smk.input["idx"]),
        header=None,
        dtype={0: str},
        usecols=[0],
    )

    chrs = [
        *filter(
            lambda c: any(i.chr_name_full(pattern) == c for i in cis)
            or any(re.match(p, c) for p in pats),
            idx[0].tolist(),
        )
    ]

    with open(Path(smk.output[0]), "w") as f:
        # ASSUME sort order doesn't matter and ref is bgzip'd or unzip'd
        # NOTE the output is not bgzip'd because GEM doesn't like it
        p = sp.Popen(["samtools", "faidx", Path(smk.input["fa"]), *chrs], stdout=f)
        p.wait()


main(snakemake, snakemake.config)  # type: ignore
import re
from typing import Any
from pathlib import Path
import common.config as cfg
from common.bed import read_bed, filter_sort_bed, write_bed
from pybedtools import BedTool as bt  # type: ignore
import json


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])

    pattern = sconf.refkey_to_final_chr_pattern(rk)
    conv = cfg.fullset_conv(pattern)

    inputs = smk.input["bed"]
    genome = Path(smk.input["genome"][0])
    gapless = bt(smk.input["gapless"])

    def final_path(name: str) -> Path:
        p = Path(str(smk.params.path_pattern).format(name))
        p.parent.mkdir(exist_ok=True, parents=True)
        return p

    def to_single_output(basename: str) -> Path:
        m = re.match(".*(l\\d+_m\\d+_e\\d+).*", basename)
        assert m is not None, f"malformed mappability file name: {basename}"
        return final_path(f"nonunique_{m[1]}")

    def read_sort_bed(p: Path) -> bt:
        df = read_bed(p)
        # Sort here because we can't assume wig2bed sorts its output. Also,
        # filtering is necessary because the output should have unplaced contigs
        # in it that we don't want.
        return bt().from_dataframe(filter_sort_bed(conv, df))

    def merge_bed(bed: bt, out: Path) -> None:
        df = bed.merge(d=100).intersect(b=gapless, sorted=True, g=genome).to_dataframe()
        write_bed(out, df)

    def merge_single(bed: bt, out: Path) -> None:
        comp = bed.complement(g=str(genome))
        merge_bed(comp, out)

    all_lowmap = final_path("lowmappabilityall")
    single_lowmap = []

    # If there is only one input, merge this to make one "all_nonunique" bed.
    # Otherwise, merge each individual input and then combine these with
    # multi-intersect to make the "all_nonunique" bed.
    if len(inputs) == 1:
        bed = read_sort_bed(Path(inputs[0]))
        merge_single(bed, all_lowmap)
    else:
        single = [
            (
                to_single_output((p := Path(i)).name),
                read_sort_bed(p),
            )
            for i in inputs
        ]
        for sout, bed in single:
            merge_single(bed, sout)
        mint = bt().multi_intersect(i=[s[0] for s in single])
        merge_bed(mint, all_lowmap)
        single_lowmap = [str(s[0]) for s in single]

    with open(smk.output[0], "w") as f:
        obj = {"all_lowmap": str(all_lowmap), "single_lowmap": single_lowmap}
        json.dump(obj, f)


main(snakemake, snakemake.config)  # type: ignore
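The `to_single_output` helper above keys off the GEM-style parameter triplet embedded in each mappability file name. A minimal sketch of that extraction (the file name below is a hypothetical example, not taken from the pipeline config):

```python
import re


def nonunique_name(basename: str) -> str:
    # pull the lXX_mYY_eZZ (read length / mismatches / edit distance) triplet
    # out of a mappability file name and build the single-file strat name
    m = re.match(r".*(l\d+_m\d+_e\d+).*", basename)
    assert m is not None, f"malformed mappability file name: {basename}"
    return f"nonunique_{m[1]}"


print(nonunique_name("GRCh38_unique_l100_m2_e1.bed.gz"))
# → nonunique_l100_m2_e1
```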
from pathlib import Path
import subprocess as sp
from typing import Callable
from typing_extensions import assert_never
from tempfile import NamedTemporaryFile as Tmp
from common.io import is_bgzip, is_gzip
import common.config as cfg
from common.io import get_md5, setup_logging

# hacky curl/gzip wrapper; this exists because I got tired of writing
# specialized rules to convert gzip/nozip files to bgzip and back :/
# Solution: force bgzip for references and gzip for bed

smk_log = snakemake.log[0]  # type: ignore

log = setup_logging(smk_log)

GUNZIP = ["gunzip", "-c"]
BGZIP = ["bgzip", "-c"]
GZIP = ["gzip", "-c"]
CURL = ["curl", "-f", "-Ss", "-L", "-q"]


def main(opath: str, src: cfg.BedSrc | cfg.RefSrc | None) -> None:
    if isinstance(src, cfg.FileSrc_):
        # ASSUME these are already tested via the pydantic class for the
        # proper file format
        Path(opath).symlink_to(Path(src.filepath).resolve())

    elif isinstance(src, cfg.HttpSrc_):
        curlcmd = [*CURL, src.url]

        # to test the format of downloaded files, sample the first 65000 bytes
        # (which should be enough to get one block of a bgzip file, which will
        # allow us to test for it)
        curltestcmd = [*CURL, "-r", "0-65000", src.url]

        with open(opath, "wb") as f, Tmp() as tf, open(smk_log, "w") as lf:

            def curl() -> None:
                if sp.run(curlcmd, stdout=f, stderr=lf).returncode != 0:
                    exit(1)

            def curl_test(testfun: Callable[[Path], bool]) -> bool:
                if sp.run(curltestcmd, stdout=tf, stderr=lf).returncode != 0:
                    exit(1)
                return testfun(Path(tf.name))

            def curl_gzip(cmd: list[str]) -> None:
                p1 = sp.Popen(curlcmd, stdout=sp.PIPE, stderr=lf)
                p2 = sp.run(cmd, stdin=p1.stdout, stdout=f, stderr=lf)
                p1.wait()
                if not (p1.returncode == p2.returncode == 0):
                    exit(1)

            # if we are getting a bed file (or related) ensure it is in gzip
            # format
            if isinstance(src, cfg.BedHttpSrc):
                if curl_test(is_gzip):
                    curl()
                else:
                    curl_gzip(GZIP)

            # if we are getting a fasta file ensure it is in bgzip format
            elif isinstance(src, cfg.RefHttpSrc):
                if curl_test(is_bgzip):
                    curl()
                elif curl_test(is_gzip):
                    p1 = sp.Popen(curlcmd, stdout=sp.PIPE, stderr=lf)
                    p2 = sp.Popen(GUNZIP, stdin=p1.stdout, stdout=sp.PIPE, stderr=lf)
                    p3 = sp.run(BGZIP, stdin=p2.stdout, stdout=f, stderr=lf)
                    p1.wait()
                    p2.wait()
                    if not (p1.returncode == p2.returncode == p3.returncode == 0):
                        exit(1)
                else:
                    curl_gzip(BGZIP)
            else:
                assert_never(src)

    elif src is None:
        assert False, "file src is null; this should not happen"
    else:
        assert_never(src)

    if src.md5 is not None and src.md5 != (actual := get_md5(opath, True)):
        log.error("md5s don't match; wanted %s, actual %s", src.md5, actual)
        exit(1)


main(snakemake.output[0], snakemake.params.src)  # type: ignore
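The format tests above delegate to `is_gzip`/`is_bgzip` in `common.io`. A minimal sketch of how such magic-byte checks might be implemented (the real helpers may be stricter): any gzip member starts with `1f 8b`, and a BGZF block is a gzip member with the FEXTRA flag set so the header can carry the block-size subfield, giving the four-byte magic `1f 8b 08 04`.

```python
from pathlib import Path


def is_gzip(p: Path) -> bool:
    # any gzip member starts with the two magic bytes 1f 8b
    with open(p, "rb") as f:
        return f.read(2) == b"\x1f\x8b"


def is_bgzip(p: Path) -> bool:
    # BGZF additionally sets FEXTRA (0x04) in the flags byte; the extra
    # field holds the "BC" block-size subfield that tabix/hap.py rely on
    with open(p, "rb") as f:
        return f.read(4) == b"\x1f\x8b\x08\x04"
```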
import json
from typing import Any
import common.config as cfg
from common.bed import filter_sort_bed_inner, read_bed, write_bed

VDJ_PAT = "^ID=gene-(IGH|IGK|IGL|TRA|TRB|TRG);"


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    to_map = sconf.buildkey_to_final_chr_mapping(rk, bk)

    with open(smk.input["mapper"], "r") as f:
        from_map = json.load(f)

    df = read_bed(smk.input["bed"], more=[5])
    df = df[df[3].str.match(VDJ_PAT)][[0, 1, 2]].copy()
    df = filter_sort_bed_inner(from_map, to_map, df)

    write_bed(smk.output[0], df)


main(snakemake, snakemake.config)  # type: ignore
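The filter above keys off the GFF3 attribute column; a quick demonstration of what `VDJ_PAT` does and does not match (the attribute strings below are made-up examples):

```python
import re

VDJ_PAT = "^ID=gene-(IGH|IGK|IGL|TRA|TRB|TRG);"

# matches a VDJ locus gene record...
assert re.match(VDJ_PAT, "ID=gene-TRA;Name=TRA;gbkey=Gene") is not None
# ...but not an unrelated gene
assert re.match(VDJ_PAT, "ID=gene-BRCA1;Name=BRCA1") is None
```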
from typing import Any
import common.config as cfg
from common.bed import read_bed, filter_sort_bed, write_bed
from pybedtools import BedTool as bt  # type: ignore


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    lk = cfg.OtherLevelKey(smk.wildcards["other_level_key"])
    sk = cfg.OtherStratKey(smk.wildcards["other_strat_key"])
    bed = sconf.otherkey_to_bed(rk, bk, lk, sk)
    ps = bed.params
    conv = sconf.buildkey_to_chr_conversion(rk, bk, ps.chr_pattern)
    df = filter_sort_bed(conv, read_bed(smk.input["bed"][0], ps))
    if bed.remove_gaps:
        df = (
            bt()
            .from_dataframe(df)
            .intersect(
                smk.input["gapless"],
                sorted=True,
                g=smk.input["genome"][0],
            )
            .to_dataframe()
        )
    write_bed(smk.output[0], df)


main(snakemake, snakemake.config)  # type: ignore
from pathlib import Path
from typing import Any
import common.config as cfg
from common.stratdiff.lib.diff import compare_all
from common.io import setup_logging


log = setup_logging(snakemake.log[0])  # type: ignore


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards.ref_key)
    bk = cfg.BuildKey(smk.wildcards.build_key)
    comparison = sconf.buildkey_to_build(rk, bk).comparison
    pattern = sconf.refkey_to_final_chr_pattern(rk)
    ixs = sconf.buildkey_to_chr_indices(rk, bk)

    assert comparison is not None, "this should not happen"

    outdir = Path(smk.output[0]).parent

    logged = compare_all(
        Path(smk.input["new_list"]).parent,
        smk.input["old"],
        outdir,
        comparison.path_mapper,
        comparison.replacements,
        [i.chr_name_full(pattern) for i in ixs],
        comparison.ignore_generated,
        comparison.ignore_other,
    )
    for x in logged:
        log.info(x)


main(snakemake, snakemake.config)  # type: ignore
from pathlib import Path
from typing import Any
import re


def main(smk: Any) -> None:
    with open(smk.input[0], "r") as i, open(smk.output[0], "w") as o:
        for f in i:
            fp = Path(f.strip())
            level = re.sub("^[^_]+_", "", fp.name.replace(".bed.gz", ""))
            o.write(f"{level}\t{Path(fp.parent.name) / fp.name}\n")


main(snakemake)  # type: ignore
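The level extraction strips the reference prefix from each file name; a worked example with a hypothetical stratification path:

```python
import re
from pathlib import Path

# hypothetical final stratification path: <level dir>/<ref>_<name>.bed.gz
fp = Path("GRCh38/LowComplexity/GRCh38_AllTandemRepeats.bed.gz")

# drop the extension, then strip everything up to the first underscore
# (the reference prefix) to recover the stratification name
level = re.sub("^[^_]+_", "", fp.name.replace(".bed.gz", ""))
entry = f"{level}\t{Path(fp.parent.name) / fp.name}"
print(entry)
# → AllTandemRepeats	LowComplexity/GRCh38_AllTandemRepeats.bed.gz
```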
from typing import Any
from os.path import dirname
from common.io import get_md5


def strat_files(path: str) -> list[str]:
    with open(path, "r") as f:
        return [s.strip() for s in f]


def main(smk: Any) -> None:
    out = str(smk.output)
    root = dirname(out)
    ss = strat_files(str(smk.input))
    with open(out, "w") as op:
        for s in ss:
            h = get_md5(s)
            p = s.replace(root + "/", "")
            op.write(f"{h}  {p}\n")


main(snakemake)  # type: ignore
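`get_md5` comes from `common.io`; a minimal stand-in using only the standard library might look like this (chunked so large bgzip'ed beds never need to fit in memory):

```python
import hashlib


def get_md5(path: str) -> str:
    # stream the file in 64 KiB chunks to keep memory usage flat
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```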
from typing import Any


def main(smk: Any) -> None:
    with open(smk.output[0], "w") as f:
        for s in sorted(smk.input):
            f.write(s + "\n")


main(snakemake)  # type: ignore
import gzip
import os
import subprocess as sp
from typing import Any
from pathlib import Path


def gzip_empty(path: str) -> bool:
    with gzip.open(path, "rb") as f:
        return len(f.read(1)) == 0


def main(smk: Any) -> None:
    # ASSUME if filtered benchmark bed is empty then we have nothing to
    # benchmark because the desired chromosomes in this build are not included
    # in the benchmark
    with open(smk.log[0], "w") as f:
        if gzip_empty(smk.input["bench_bed"][0]):
            Path(smk.output[0]).touch()
            f.write("no overlapping chromosomes, writing empty dummy file\n")
        else:
            cmd = [
                "hap.py",
                *["--engine", "vcfeval"],
                "--verbose",
                *["--threads", str(smk.threads)],
                *["--stratification", smk.input["strats"][0]],
                *["-f", smk.input["bench_bed"][0]],
                *["-o", smk.params["prefix"]],
                smk.input["bench_vcf"][0],
                smk.input["query_vcf"][0],
            ]
            res = sp.run(
                cmd,
                stderr=sp.STDOUT,
                stdout=f,
                # spread os.environ first so the explicit HGREF always wins
                env={**os.environ, "HGREF": smk.input["ref"][0]},
            )
            if res.returncode != 0:
                exit(1)


main(snakemake)  # type: ignore
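The empty-file short-circuit above reads a single byte through the gzip layer; this matters because an empty `.gz` file still has header bytes on disk, so checking file size alone would not work. A small self-contained demonstration:

```python
import gzip
import os
import tempfile


def gzip_empty(path: str) -> bool:
    # an empty .gz file still has ~20 header/trailer bytes on disk, so test
    # emptiness by attempting to read one decompressed byte instead
    with gzip.open(path, "rb") as f:
        return len(f.read(1)) == 0


fd, empty = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(empty, "wb"):
    pass  # header + empty payload
print(gzip_empty(empty))  # → True
os.unlink(empty)
```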
import pandas as pd
from typing import Any, NamedTuple
from pathlib import Path
from pybedtools import BedTool as bt  # type: ignore
from os.path import dirname, basename
from os import scandir
import common.config as cfg
from common.io import setup_logging, is_bgzip
import subprocess as sp

log = setup_logging(snakemake.log[0])  # type: ignore


class GaplessBT(NamedTuple):
    auto: bt
    parY: bt


def test_bgzip(strat_file: Path) -> list[str]:
    """Each bed file should be bgzip'ed (not just gzip'ed)."""
    return [] if is_bgzip(strat_file) else ["is not bgzip file"]


def test_bed(
    strat_file: Path,
    reverse_map: dict[str, int],
    gapless: GaplessBT,
) -> list[str]:
    """Each stratification should be a valid bed file."""

    # test bed files in multiple phases, where each phase depends on the
    # previous being valid
    #
    # Phase 1: try to import the dataframe (with almost no assumptions except
    # that it has no header)
    try:
        # no low mem since we don't want to assume what dtypes each column has,
        # and by default pandas will chunk the input which could lead to mixed
        # types
        df = pd.read_table(strat_file, header=None, low_memory=False)
    except pd.errors.EmptyDataError:
        # if df is empty, nothing to do
        return []
    except Exception as e:
        # catch any other weird read errors
        return [str(e)]

    # Phase 2: ensure "bed file" has 3 columns
    if len(df.columns) != 3:
        return ["bed file has wrong number of columns"]

    # Phase 3: ensure "bed file" has column of the correct type
    cols = {"chrom": str, "start": int, "end": int}
    df.columns = pd.Index([*cols])

    try:
        df = df.astype(cols)
    except ValueError as e:
        return [str(e)]

    # Phase 4: we now know we have a real bed file, ensure that we didn't
    # invent any new chromosomes (try to convert chrom column to ints, which
    # will produce NaNs on error)
    chr_ints = df["chrom"].map(reverse_map).astype("Int64")
    invalid_chrs = df["chrom"][chr_ints.isnull()].unique().tolist()
    if len(invalid_chrs) > 0:
        return [f"invalid chrs: {invalid_chrs}"]

    # Phase 5: test remaining assumptions:
    # - chromosome column is sorted
    # - each region is disjoint and ascending (eg the start of a region is at
    #   least 1bp away from the end of the previous)
    # - does not intersect with gap regions (if any) or extend beyond chr bounds
    same_chrom = df["chrom"] == df["chrom"].shift(1)
    distance = (df["start"] - df["end"].shift(1))[same_chrom]
    overlapping_lines = distance[distance < 1].index.tolist()

    # choose gap file depending on if this is an XY strat or not; note that the
    # built-in gaps file (if applicable) is obviously whitelisted from this
    # check
    if "gaps_slop15kb" in basename(strat_file):
        no_gaps = True
    else:
        isXY = basename(dirname(strat_file)) == "XY"
        gapless_bt = gapless.parY if isXY else gapless.auto
        gaps = bt.from_dataframe(df).subtract(gapless_bt).to_dataframe()
        no_gaps = len(gaps) == 0

    return [
        msg
        for result, msg in [
            (
                pd.Index(chr_ints).is_monotonic_increasing,
                "chromosomes not sorted",
            ),
            (
                len(overlapping_lines) == 0,
                f"non-disjoint regions at lines: {overlapping_lines}",
            ),
            (no_gaps, "bed contains gap regions"),
        ]
        if result is False
    ]


def test_checksums(checksums: Path) -> list[str]:
    """The md5sum utility should pass when invoked."""
    res = sp.run(
        ["md5sum", "-c", "--strict", "--quiet", checksums.name],
        cwd=checksums.parent,
        capture_output=True,
        text=True,
    )
    errors = [] if (s := res.stdout) == "" else s.strip().split("\n")
    return [f"checksum error: {e}" for e in errors]


def test_tsv_list(tsv_list: Path) -> list[str]:
    """The final stratifications list should match final beds exactly.

    We are running our hotel on very tight margins; no extra or missing beds
    allowed.
    """

    def strat_set(root: Path, sub: Path) -> set[str]:
        if root.is_dir():
            deeper = (Path(p.path) for p in scandir(root))
            return {str(q) for d in deeper for q in strat_set(d, sub / d.name)}
        elif root.is_file() and root.name.endswith(".bed.gz"):
            return {str(sub)}
        else:
            return set()

    current = strat_set(tsv_list.parent, Path("./"))
    with open(tsv_list, "r") as f:
        listed = {line.strip().split("\t")[1] for line in f}
        missing = [f"not in final directory: {p}" for p in listed - current]
        extra = [f"not in final list: {p}" for p in current - listed]
        return missing + extra


def run_all_tests(
    strat_file: Path,
    reverse_map: dict[str, int],
    gapless: GaplessBT,
) -> list[tuple[Path, str]]:
    return [
        (strat_file, msg)
        for msg in test_bgzip(strat_file) + test_bed(strat_file, reverse_map, gapless)
    ]


def strat_files(path: str) -> list[Path]:
    with open(path, "r") as f:
        return [Path(s.strip()) for s in f]


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    reverse_map = {v: k for k, v in sconf.buildkey_to_final_chr_mapping(rk, bk).items()}

    # check global stuff first (since this is faster)

    global_failures: list[str] = test_checksums(
        Path(smk.input["checksums"])
    ) + test_tsv_list(Path(smk.input["strat_list"]))

    for j in global_failures:
        log.error(j)

    if len(global_failures) > 0:
        exit(1)

    # check individual stratification files

    gapless = GaplessBT(
        auto=bt(str(smk.input["gapless_auto"])),
        parY=bt(str(smk.input["gapless_parY"])),
    )

    strat_failures = [
        res
        for p in strat_files(str(smk.input["strats"]))
        for res in run_all_tests(p, reverse_map, gapless)
    ]

    for i in strat_failures:
        log.error("%s: %s" % i)

    if len(strat_failures) > 0:
        exit(1)


main(snakemake, snakemake.config)  # type: ignore
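Phase 5's disjointness test above is a vectorized pandas check; a toy example (regions are made up) showing how the shifted comparison flags a region whose start is not at least 1 bp past the previous end:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1", "chr1"],
        "start": [0, 11, 15],
        "end": [10, 20, 30],
    }
)
# compare each start against the previous row's end on the same chromosome
same_chrom = df["chrom"] == df["chrom"].shift(1)
distance = (df["start"] - df["end"].shift(1))[same_chrom]
overlapping_lines = distance[distance < 1].index.tolist()
print(overlapping_lines)  # → [2] (third region starts inside the second)
```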
from typing import Any
import common.config as cfg
from common.bed import read_filter_sort_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    bench = sconf.stratifications[rk].builds[bk].bench
    assert bench is not None, "this should not happen"
    ps = bench.bench_bed.params
    read_filter_sort_bed(sconf, smk.input[0], smk.output[0], ps, rk, bk)


main(snakemake, snakemake.config)  # type: ignore
import json
import pandas as pd
from typing import Any
import common.config as cfg


def read_ftbl(path: str) -> "pd.Series[str]":
    filter_cols = ["assembly_unit", "seq_type"]
    map_cols = ["chromosome", "genomic_accession"]
    df = pd.read_table(path, header=0, usecols=filter_cols + map_cols, dtype=str)
    chr_mask = df["seq_type"] == "chromosome"
    asm_mask = df["assembly_unit"] == "Primary Assembly"
    return (
        df[chr_mask & asm_mask][map_cols]
        .drop_duplicates()
        .copy()
        .set_index("chromosome")["genomic_accession"]
    )


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    cis = sconf.buildkey_to_chr_indices(rk, bk)

    # if this fails then something is wrong with the ftbl (eg it doesn't have
    # a complete set of chromosomes)
    ser = read_ftbl(smk.input[0])
    mapper = {ser[i.chr_name]: i.value for i in cis}

    with open(smk.output[0], "w") as f:
        json.dump(mapper, f)


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
from pathlib import Path
import common.config as cfg
from pybedtools import BedTool as bt  # type: ignore
from common.bed import filter_sort_bed, read_bed, write_bed
import pandas as pd


def read_genome_bed(p: Path) -> pd.DataFrame:
    df = pd.read_table(
        p,
        header=None,
        names=["chrom", "end"],
        dtype={"chrom": str, "end": int},
    )
    df["start"] = 0
    return df[["chrom", "start", "end"]].copy()


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    inputs = smk.input
    auto_out = Path(smk.output["auto"])
    parY_out = Path(smk.output["parY"])

    # convert genome to bed file (where each region is just the length of
    # one chromosome)
    genome_bed = read_genome_bed(Path(inputs["genome"]))

    # If we have gap input, make the gapless file, otherwise just symlink to the
    # genome bed file (which just means the entire genome is gapless)
    if hasattr(inputs, "gaps"):
        rk = cfg.RefKey(smk.wildcards["ref_key"])
        bk = cfg.BuildKey(smk.wildcards["build_key"])

        bedfile = sconf.stratifications[rk].gap
        assert bedfile is not None, "this should not happen"
        ps = bedfile.params

        conv = sconf.buildkey_to_chr_conversion(rk, bk, ps.chr_pattern)

        gaps_src = Path(inputs["gaps"])

        gaps = read_bed(gaps_src, ps)
        gaps = filter_sort_bed(conv, gaps)
        gaps = bt.from_dataframe(gaps).merge(d=100).to_dataframe()
        gaps_bed = bt().from_dataframe(gaps)
        gaps_with_parY = bt().from_dataframe(genome_bed).subtract(gaps_bed)

        # If we have a parY bed, subtract parY from the gaps bed, otherwise
        # just link them since we have nothing to subtract off
        if hasattr(inputs, "parY"):
            parY_src = Path(inputs["parY"])
            gaps_no_parY = gaps_with_parY.subtract(bt(parY_src))

            write_bed(parY_out, gaps_with_parY.to_dataframe())
            write_bed(auto_out, gaps_no_parY.to_dataframe())
        else:
            # write the complement (the gapless regions), not the gaps
            write_bed(auto_out, gaps_with_parY.to_dataframe())
            parY_out.symlink_to(auto_out.resolve())
    else:
        write_bed(auto_out, genome_bed)
        parY_out.symlink_to(auto_out.resolve())


main(snakemake, snakemake.config)  # type: ignore
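`read_genome_bed` turns a two-column `.genome` file (chromosome, length) into whole-chromosome BED records; a small self-contained demonstration (the chromosome lengths below are illustrative):

```python
from io import StringIO

import pandas as pd

genome = StringIO("chr1\t248956422\nchr2\t242193529\n")

df = pd.read_table(
    genome,
    header=None,
    names=["chrom", "end"],
    dtype={"chrom": str, "end": int},
)
df["start"] = 0  # every region spans its chromosome from the origin
bed = df[["chrom", "start", "end"]]
print(bed.values.tolist())
# → [['chr1', 0, 248956422], ['chr2', 0, 242193529]]
```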
from typing import Any
import pandas as pd
import common.config as cfg
from common.bed import filter_sort_bed_inner


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])

    to_map = sconf.buildkey_to_final_chr_mapping(rk, bk)
    # since this is generated from the reference itself, the chr -> int map
    # is just the reverse of the final int -> chr map
    from_map = {v: k for k, v in to_map.items()}

    # ASSUME the input for this is a .fa.fai file (columns = chr, length)
    df = pd.read_table(
        smk.input[0],
        header=None,
        dtype={0: str, 1: int},
        usecols=[0, 1],
    )

    filtered = filter_sort_bed_inner(from_map, to_map, df, 2)
    filtered.to_csv(smk.output[0], sep="\t", header=False, index=False)


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
import common.config as cfg
from common.bed import read_filter_sort_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    bedfile = sconf.refkey_to_strat(rk).segdups.superdups
    assert bedfile is not None, "this should not happen"
    read_filter_sort_bed(sconf, smk.input[0], smk.output[0], bedfile.params, rk, bk)


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
import common.config as cfg
from pybedtools import BedTool as bt  # type: ignore
from common.bed import read_bed, filter_sort_bed, write_bed


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    rk = cfg.RefKey(smk.wildcards["ref_key"])
    bk = cfg.BuildKey(smk.wildcards["build_key"])
    level = smk.params["level"]
    i = cfg.ChrIndex.from_name(smk.wildcards["sex_chr"])

    assert i in [cfg.ChrIndex.CHRX, cfg.ChrIndex.CHRY], "invalid sex chr"

    fs = sconf.refkey_to_strat(rk).xy.features
    assert fs is not None, "this should not happen"
    bedfile = fs.x_bed if i is cfg.ChrIndex.CHRX else fs.y_bed
    ps = bedfile.params

    level_col = bedfile.level_col
    conv = sconf.buildkey_to_chr_conversion(rk, bk, ps.chr_pattern)

    df = read_bed(smk.input["bed"], ps, [level_col])
    filtsort = filter_sort_bed(conv, df)
    filtsort = filtsort[filtsort[level_col].str.contains(level)].drop(
        columns=[level_col]
    )
    gapless = bt(smk.input["gapless"])
    nogaps = (
        bt.from_dataframe(filtsort)
        .intersect(b=gapless, sorted=True, g=str(smk.input["genome"]))
        .to_dataframe()
    )
    write_bed(smk.output[0], nogaps)


main(snakemake, snakemake.config)  # type: ignore
from typing import Any
from Bio import bgzf  # type: ignore
import common.config as cfg


def main(smk: Any, sconf: cfg.GiabStrats) -> None:
    i = cfg.ChrIndex.from_name(smk.wildcards["sex_chr"])
    k = cfg.RefKey(smk.wildcards["ref_key"])
    pattern = sconf.refkey_to_final_chr_pattern(k)
    cxy = sconf.stratifications[k].xy

    assert i in [cfg.ChrIndex.CHRX, cfg.ChrIndex.CHRY], "invalid sex chr"

    par_fun = cxy.fmt_x_par if i == cfg.ChrIndex.CHRX else cxy.fmt_y_par
    out = par_fun(pattern)
    assert out is not None, "this should not happen"

    with bgzf.BgzfWriter(smk.output[0], "w") as f:
        f.write(out)


main(snakemake, snakemake.config)  # type: ignore
library(tidyverse)

subsets <- snakemake@params$subsets

read_summary <- function(path) {
  build <- path %>%
    dirname() %>%
    dirname() %>%
    dirname() %>%
    basename()
  readr::read_csv(
    path,
    col_types = cols(
      Type = "c",
      Subtype = "c",
      Subset = "c",
      Filter = "c",
      METRIC.Recall = "d",
      METRIC.Precision = "d",
      METRIC.Frac_NA = "d",
      TRUTH.TP = "d",
      TRUTH.TP.het = "d",
      TRUTH.TP.homalt = "d",
      .default = "-"
    )
  ) %>%
    filter(Filter == "PASS" &
             Subtype %in% c("*", "I16_PLUS", "D16_PLUS"),
             Subset %in% subsets)  %>%
    rename(Recall = METRIC.Recall,
           Precision = METRIC.Precision,
           Frac_NA = METRIC.Frac_NA) %>%
    mutate("-log10(FN_Rate)" = -log10(1 - Recall),
           "-log10(FP_Rate)" = -log10(1 - Precision),
           build = build) %>%
    mutate(across(contains("TRUTH.TP"), log10)) %>%
    rename_with(~ sprintf("log10(%s)", .x), contains("TRUTH.TP")) %>%
    select(-Filter, -starts_with("TRUTH.TP")) %>%
    pivot_longer(cols = c(-Type, -Subtype, -Subset, -build),
                 names_to = "metric",
                 values_to = "value") %>%
    mutate(metric = fct_relevel(
      factor(metric),
      "Precision",
      "Recall",
      "Frac_NA",
      after = 0
    )) %>%
    filter(!is.na(value))
}

df <- snakemake@input %>%
  keep(~ file.size(.x) > 0) %>%
  map_dfr(read_summary)

nsubs <- df %>%
  pull(Subset) %>%
  unique() %>%
  length()

nbuilds <- df %>%
  pull(build) %>%
  unique() %>%
  length()

knitr::opts_chunk$set(
  echo = TRUE,
  fig.width=10,
  fig.height=(3 + nsubs * nbuilds * 0.5)
)

plot_metric <- function(df) {
  p <- ggplot(df, aes(value, fct_rev(Subset), fill = build)) +
    geom_col(position = "dodge") +
    facet_wrap(c("metric"), scales = "free_x", ncol = 2) +
    labs(x = NULL,
         y = NULL,
         fill = "Build") +
    theme(legend.position = "top")
  print(p)
  cat("\n\n")
}
df %>%
  filter(Type == "SNP") %>%
  plot_metric()
df %>%
  filter(Type == "INDEL" & Subtype == "*") %>%
  plot_metric()
df %>%
  filter(Type == "INDEL" & Subtype == "I16_PLUS") %>%
  plot_metric()
df %>%
  filter(Type == "INDEL" & Subtype == "D16_PLUS") %>%
  plot_metric()
library(tidyverse)

knitr::opts_chunk$set(fig.width=25, fig.height=6, fig_caption=TRUE)

chr_mapper <- readr::read_tsv(
  snakemake@input$chr_mapper,
  col_names = c("chrom_index", "build_key", "chrom"),
  col_types = "icc"
) %>%
  arrange(chrom_index, build_key)

bed_cols <- c("chrom", "start", "end")

read_bed <- function(path) {
  readr::read_tsv(path,
                  comment = "#",
                  col_types = "cii",
                  col_names = bed_cols) %>%
  mutate(region_size = end - start)
}

read_nonN <- function(path) {
  read_bed(path) %>%
    mutate(build_key = path %>% dirname() %>% dirname() %>% basename()) %>%
    left_join(chr_mapper, by = c("chrom", "build_key")) %>%
    select(-chrom)
}

read_strat <- function(path) {
  strat_filename <- basename(path)
  read_bed(path) %>%
    group_by(chrom) %>%
    summarise(total_region = sum(region_size), .groups = "drop") %>%
    mutate(build_key = path %>% dirname() %>% dirname() %>% basename(),
           ## genome = str_extract(strat_filename, "^[^_]+_(HG[0-9]+)_.bed.gz", 1),
           ## strat_level = str_extract(strat_filename, 
           ##                           if_else(is.na(genome),
           ##                                   "^[^_]+_(.*).bed.gz",
           ##                                   "^[^_]+_HG[0-9]+_(.*).bed.gz"),
           ##                           1),
           strat_name = str_extract(strat_filename, "^[^_]+_(.*).bed.gz", 1)) %>%
    left_join(chr_mapper, by = c("chrom", "build_key")) %>%
    select(-chrom)
}

df_chr_sizes <- snakemake@input$nonN %>%
  map_dfr(read_nonN) %>%
  group_by(chrom_index, build_key) %>%
  summarise(nonN_ref_len = sum(region_size), .groups = "drop")

chr_build_combinations <- chr_mapper %>%
  expand(build_key, chrom_index)

file_list <- snakemake@input$strats %>%
  map(readLines) %>%
  flatten()

chr_name <- function(i) {
  case_when(i == 23 ~ "X",
            i == 24 ~ "Y",
            0 < i & i < 23 ~ as.character(i),
            TRUE ~ "oops")
}

chr_levels <- chr_name(1:24)

read_strats <- function(stype, gaps) {
  fs <- keep(file_list, ~ basename(dirname(.x)) == stype)
  if (length(fs) > 0) {
    df <- map_dfr(fs, read_strat)
    all_combos <- chr_build_combinations %>%
      expand(nesting(chrom_index, build_key), strat_name = df$strat_name)
    df %>%
      # join the full cartesian product of all chromosomes and paths, then fill
      # missing with 0 so that we get an empty space on the graphs indicating
      # zero coverage
      full_join(all_combos, by = c("chrom_index", "strat_name", "build_key")) %>%
      left_join(df_chr_sizes, by = c("chrom_index", "build_key")) %>%
      mutate(strat_type = stype,
             coverage = total_region / nonN_ref_len,
             chrom = factor(chr_name(chrom_index), levels = chr_levels)) %>%
      replace_na(list(coverage = 0))
  } else {
    NULL
  }
}

coverage_plot <- function(df, level) {
  cat(sprintf("#### %s\n\n", level))
  p <- df %>%
    ggplot(aes(chrom, coverage, fill = build_key)) +
    geom_col(position = "dodge") +
    scale_y_continuous(labels = scales::percent) +
    theme(legend.position = "top",
          text = element_text(size = 30),
          axis.text.x = element_text(angle = 90),
          strip.text.y = element_text(angle = 0),
          panel.spacing.y = unit(2, "lines")) +
    labs(x = NULL,
         fill = "build")
  print(p)
  cat("\n\n")
}

coverage_plots <- function(stype, gaps) {
  cat(sprintf("## %s\n\n", stype))
  df <- read_strats(stype, gaps)
  if (is.null(df)) {
    cat("None\n\n")
  } else {
    df %>%
      group_by(strat_name) %>%
      group_walk(coverage_plot)
  }
}
snakemake@params$core_levels %>%
  union(snakemake@params$other_levels) %>%
  walk(coverage_plots)
Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/ndwarshuis/giab-stratifications
Name: giab-stratifications
Version: 1
Copyright: Public Domain
License: None