Snakemake Workflow: Impute HLA and/or KIR alleles from SNPs

public public 1yr ago Version: v1.0 0 bookmarks

N.B: As of 21st of February, 2020 shapeit=v2.r837 has been removed from the bioconda channel. Hence, I have removed it from the environment file. However, shapeit=2.r904 is available from https://anaconda.org/dranew/shapeit and if you want to use it add it to the environment file with the appropriate dranew channel. Runs with shapeit4 will still work as it is still available on bioconda.

HLA and KIR Imputation from SNP

Authors

Table of Contents

  1. Usage

    1. Workflow Overview

    2. Quickstart

    3. Running Specific Parts or Running Workflow in Steps

      1. Running Only Liftover

      2. Encode VCF Without Liftover or Perform Liftover and Encoding Without Phasing With SHAPEIT

      3. Run SHAPEIT After Checking VCF Statistics

    4. Additional and Default Parameters

    5. Workflow Running Options

    6. Investigate Results

  2. Standalone Script

Usage

Workflow Overview

The full workflow consists of creating datasets ready for imputation:

alt text

The steps can be summarized as:

1. Convert input PLINK bed files to reference b37
2. Encode REF/ALT alleles the same as the KIR*IMP reference panel
3. Phase the encoded VCF using shapeit4 and shapeit.

Quickstart

To run the pipeline edit config.yaml to point to your input PLINK bed file by placing the following yaml directives:

project:
 <project name>:
 liftover:
 plink: <path to PLINK bed file>

where <project name> is the name you want to be associated with your run and output and <path to PLINK bed file> is the full or relative path to your input PLINK bed file.

To run the pipeline run the command:

snakemake -j<number of parallel jobs> --use-conda

The default PLINK bed file is assumed to be on hg18 reference but additional reference types are possible as hg16 , hg17 , hg38 . To specify for example reference hg17 edit config.yaml with the following:

project:
 <project name>:
 liftover:
 plink: <path to PLINK bed file>
 reference: hg17

All output will produced inside the output/<project name> directory.

Configure the workflow according to your needs via editing the file config.yaml . To see the list of default configurations settings run the command snakemake print_defaults .

Running Specific Parts or Running Workflow in Steps

The workflow can be ran from a specific part. In addition, the workflow can be run in steps allowing to explore the data before phasing and filtering.

To see the list of specific rules/steps that are possible run snakemake --list-target-rules . Each of the target rules listed can be run with the command snakemake -j<number of parallel jobs> [OPTIONS] <target rule name> .

Running Only Liftover

alt text

It is possible to only use the workflow to perform liftover of PLINK bed files from specific reference to b37 . This will produce a VCF file with the reference being b37 .

To perform only liftover with the same config.yaml from Quickstart run the command:

snakemake -j<number of parallel jobs> --use-conda liftover

Encode VCF Without Liftover or Perform Liftover and Encoding Without Phasing With SHAPEIT

alt text

If you have VCF file based on reference b37 it is possible to only run encoding without liftover. To encode the VCF file using KIR*IMP reference panel edit config.yaml :

project:
 <project name>:
 freq_encode_snps:
 vcf: <path to VCF file>

then execute snakemake using the command:

snakemake -j<number of parallel jobs> --use-conda kirimp_encode

If the above command is executed with the config.yaml settings from Quickstart only liftover and encoding will be performed. This allows inspection of the variants in the VCF by looking at the images produced in output/{project}/kirimp/01_freq_encode_snps/{project}.png . The images might help in deducing what threshold to use during SHAPEIT for filtering troublesome variants with missing genotypes.

Run SHAPEIT After Checking VCF Statistics

alt text

To run SHAPEIT and prepare data for KIR*IMP using the file produced by running snakemake -j<number of parallel jobs> --use-conda kirimp_encode run the command:

snakemake -j<number of parallel jobs> --use-conda kirimp_ready

In addition, SHAPEIT can be run on a specific VCF (not necessarily produced by the workflow) by specifying the vcf file under the shapeit directive in config.yaml :

project:
 <project name>:
 shapeit
 vcf: <Path to VCF file>

Additional and Default Parameters

The default parameters settings and there descriptions are listed below and can be modified by placing it in config.yaml with the desired values different from defaults:

BCFTOOLS_THREADS: 4
KIRIMP_PANEL_URL: http://imp.science.unimelb.edu.au/kir/static/kirimp.uk1.snp.info.csv
PLINK_THREADS: 4
SHAPEIT_GENMAP_URL: https://github.com/odelaneau/shapeit4/blob/master/maps/genetic_maps.b37.tar.gz?raw=true
SHAPEIT_THREADS: 4
project:
 <project name>:
 freq_encode_snps:
 vcf: output/{project}/liftover/04_hg19_vcf/{project}.{reference}ToHg19.vcf.gz
 additional: "--outlier-threshold 0.1" # additional parameters to be passed to the script in 'scripts/frequency_encode_snps.py'.
 liftover:
 reference: hg18 # Input PLINK bed reference can be either hg16, hg17, hg18 or hg38
 shapeit:
 vcf: output/{project}/kirimp/01_freq_encode_snps/{project}.vcf.gz
 gmap: input/meta/shapeit/kirimp.chr19.gmap.txt.gz
 pbwt: 8 # shapeit4: determines the number of conditioning neighbours, higher number should produce better accuracy but slower runtimes
 pbwt-modulo: 8 # shapeit4: determines variants for storing pbwt indexes
 regions: [19] # chromosomes for phasing default is 19 for KIR*IMP
 states: 500 # shapeit2: determines number of haplotypes for conditioning, higher number should produce better accuracy but slower runtimes
 min_missing: 0.25 # variants with missing genotype rate larger than this will be discarded before phasing with SHAPEIT. Inspect the image from kirimp_encode to determine threshold
 v2_additional: "" # Additional parameters for shapeit version 2. Description of possible settings can be found at https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html
 v4_additional: "" # Additional parameters for shapeit version 4. Description of possible settings can be found at https://odelaneau.github.io/shapeit4/

Workflow Running Options

Test your configuration by performing a dry-run via

snakemake --use-conda -npr

Execute the workflow locally via

snakemake --use-conda -j $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

or if DRMAA is available

snakemake --use-conda --drmaa --jobs 100

If you not only want to fix the software stack but also the underlying OS, use

snakemake --use-conda --use-singularity

in combination with any of the modes above. See the Snakemake documentation for further details.

Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

Standalone Script

The workflow includes a standalone script

scripts/frequency_encode_snps.py

which can be used to encode a VCF file to a reference panel of variants. To setup the environment for the script create the environment:

conda env create -f envs/freq_encode_snps.yml

then activate the environment with

conda activate freq_encode_snps

The parameters to run the standalone script are:

usage: frequency_encode_snps.py [-h] -v VCF_FILE -r REFERENCE_PANEL
 [-rt {kirimp,custom}] [--separator SEPARATOR]
 -o OUTPUT [-chr [CHROMOSOMES]]
 [--chromosome-annotation {ucsc,ensembl,genbank}]
 [-a] [-c] [-min MIN_AMBIGIOUS_THRESHOLD]
 [-max MAX_AMBIGIOUS_THRESHOLD]
 [--outlier-threshold OUTLIER_THRESHOLD]
 [-t THREADS]

For usage with KIR*IMP the VCF has to be with hg18/b37 reference.

The script in addition to encoding the VCF file will produce a diagnosis plot (same name as output vcf file but with .png extension) as well as a TOML file containing all the metadata of the encoding. The metadata TOML can be read into R using configr

require(configr)
read.config(file = "metadata.toml")

or into Python using

import toml
toml.load("metadata.toml")

the resulting vcf will contain statistics from the encoding inside the INFO fields:

##INFO=<ID=UPD,Number=A,Type=Flag,Description="REF and ALT updated based on reference panel frequency">
##INFO=<ID=PFD,Number=A,Type=Float,Description="Alternate frequency difference to reference panel frequency">
##INFO=<ID=MISS,Number=A,Type=Float,Description="Missing Genotype Frequency">
##INFO=<ID=MAF,Number=A,Type=Float,Description="Minor Allele Frequency">

Code Snippets

23
24
25
26
shell: 
    "scripts/frequency_encode_snps.py -t {threads} -v {input.vcf} "
    "-r {input.panel} -o {output.vcf} {params} 2> {log} && "
    "bcftools index -f {output.vcf}"
9
shell: "mv {input} {output}"
17
shell: "mv {input} {output}"
43
shell: "liftOver {input.bed} {input.chain} {output.lifted} {output.unlifted} &> {log}"
77
78
79
80
81
82
83
    shell: 
        """
	plink --allow-extra-chr --allow-no-sex --real-ref-alleles --keep-allele-order --snps-only just-acgt --bfile {params.basein} --recode vcf --out {params.baseout} &> {log}
        bcftools view -U {output.tmp_vcf} | bcftools norm --threads {threads} --rm-dup snps | bcftools +fill-tags -Oz  -- -d  2> {log} > {output.vcf} 2>> {log}
        bcftools index -f {output.vcf} &>> {log} 
        bcftools stats {output.vcf} > {output.stats} 2>> {log}
        """
12
shell: "cat {input} | tar xzfO - --wildcards 'chr19.b37.gmap.gz' 2> {log}  > {output}"
35
36
37
38
39
40
41
42
43
shell: 
    """
        bcftools filter -r {wildcards.region} -Oz -e 'MISS > {params.missing_threshold}' {input.vcf} > {output.filtered_vcf} 2> {log.bcftools}
        bcftools index -f {output.filtered_vcf} 2>> {log.bcftools}
        shapeit4 --thread {threads} --input {output.filtered_vcf} --map {input.gmap} --output {output.phased_vcf} \
        --region {wildcards.region} --pbwt-depth {params.pbwt} --pbwt-modulo {params.pbwt_modulo} --log {output.log} {params.additional} 2> {log.shapeit4}
        bcftools convert --hapsample {params.out} {output.phased_vcf} 2>> {log.bcftools}
        gunzip {output.hap}.gz 
    """
64
65
66
67
68
69
shell: 
    "bcftools filter -r {wildcards.region} -Oz -e 'MISS > {params.missing_threshold}' {input.vcf} > {output.filtered_vcf} 2> {log.bcftools} && "
    "bcftools index -f {output.filtered_vcf} 2> {log.bcftools}  && "
    "shapeit --thread {threads} --input-vcf {output.filtered_vcf} -M {input.gmap} --states {params.states} "
    "-O {output.haps} {output.sample} --output-graph {output.graph} "
    "--output-log {output.log} {params.additional} 2> {log.shapeit}"
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
import argparse
import re
import sys
import textwrap
import matplotlib
import numpy as np
from os import path
from pathlib import Path
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.patches import Patch

from cyvcf2 import VCF, Writer

plt.rcParams["figure.figsize"] = (18, 14)
matplotlib.use("Agg")

CHROMOSOME_19_ANNOTATION = {"ucsc": "chr19", "ensembl": "19", "genbank": "CM000681.2"}

# Required headers for panel input files, either kirimp (kirimp.uk1.snp.info.csv) or custom with the required fields
KIRIMP_HEADER = ["id", "position", "allele0", "allele1", "allele1_frequency"]
CUSTOM_HEADER = ["chrom", "pos", "a0", "a1", "freq"]

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", ".": "."}

""" CUSTOM HEADERS """
UPDATED = {
    "ID": "UPD",
    "Description": "REF and ALT updated based on reference panel frequency",
    "Type": "Flag",
    "Number": "A",
}

PANEL_FREQ_DIFF = {
    "ID": "PFD",
    "Description": "Alternate frequency difference to reference panel frequency",
    "Type": "Float",
    "Number": "A",
}

MISSINGNES = {
    "ID": "MISS",
    "Description": "Missing Genotype Frequency",
    "Type": "Float",
    "Number": "A",
}

MAF = {
    "ID": "MAF",
    "Description": "Minor Allele Frequency",
    "Type": "Float",
    "Number": "A",
}

AF = {
    "ID": "AF",
    "Description": "Alternate Allel Frequency",
    "Type": "Float",
    "Number": "A",
}
toml_header = """
[encoder_settings]
outlier_threshold=%s
minimum_ambigious_threshold=%f
maximum_ambigious_threshold=%f
reference_panel_type=%s
chromosome_annotation=%s
ambigious_dropped=%r
correct_complement=%r


[variant]


"""

variant_toml = """
    [variant.%s]
    status = %s
    reason = %s
    frequency = %f
    panel_frequency=%r
    updated_frequency=%r
    minor_allele_frequency=%r
    missing_genotype_frequency=%r
    panel_frequency_difference=%r


"""

""" Class to represent genotypes in 0/1 format might not be necessary as I can flip from there"""


class Genotype(object):
    __slots__ = ("alleles", "phased")
    FLIP_DICT = {0: 1, 1: 0, -1: -1}

    def __init__(self, li):
        self.alleles = li[:-1]
        self.phased = li[-1]

    def __str__(self):
        sep = "/|"[int(self.phased)]
        return sep.join("0123."[a] for a in self.alleles)

    def flip(self):
        self.alleles = [Genotype.FLIP_DICT[allele] for allele in self.alleles]

    def genotype(self):
        return self.alleles + [self.phased]

    __repr__ = __str__


""" Class to keep track of VCF file summaries for variants """


class VCFSummary(object):
    __slots__ = (
        "ambigious",
        "unknown_alt",
        "updated",
        "flipped",
        "n_in_panel",
        "kept",
        "__freqs",
        "VARIANTS",
    )

    def __init__(self):
        self.VARIANTS = {}
        self.ambigious = 0
        self.unknown_alt = 0
        self.updated = 0
        self.flipped = 0
        self.n_in_panel = 0
        self.kept = 0
        self.__freqs = None

    def __str__(self):
        summary_toml = """
        [summary]
        n_ambigious_variants=%d
        n_unknown_alt_or_monomorphic=%d
        n_updated=%d
        n_flipped_strand=%d
        vcf_variants_in_panel=%d
        vcf_variants_in_panel_after_encoding_snps=%d
        """.rstrip()
        return textwrap.dedent(summary_toml) % (
            self.ambigious,
            self.unknown_alt,
            self.updated,
            self.flipped,
            self.n_in_panel,
            self.kept,
        )

    def add_variant(self, v_id):
        self.VARIANTS[v_id] = {"freq": None, "updated_freq": None, "panel_freq": None}

    def add_variant_dict(self, vdict):
        self.VARIANTS.update(vdict)

    def freqs(self):
        if not self.__freqs:
            self.__freqs = np.array(
                [
                    [
                        v["freq"],
                        v["updated_freq"],
                        v.get("MAF", None),
                        v.get("MISS", None),
                        v.get("PFD", None),
                        v["panel_freq"],
                    ]
                    for k, v in sorted(self.VARIANTS.items())
                ]
            )
        return self.__freqs

    def v_ids(self, original=True):
        if original:
            vids = np.array([v["v_id"] for k, v in sorted(self.VARIANTS.items())])
        else:
            vids = np.array(sorted(self.VARIANTS.keys()))
        return vids

    def updates(self):
        return np.array([v["updated"] == 1 for k, v in sorted(self.VARIANTS.items())])

    __repr__ = __str__


def main(arguments=None):
    args = parse_arguments()
    vcf = VCF(args["vcf_file"], threads=args["threads"])
    vcf.add_info_to_header(UPDATED)
    vcf.add_info_to_header(PANEL_FREQ_DIFF)
    vcf.add_info_to_header(MISSINGNES)
    vcf.add_info_to_header(MAF)
    if not vcf.contains("AF"):
        vcf.add_info_to_header(AF)
    w = Writer(args["output"], vcf)

    panel = generate_panel_data(
        panel_file=args["reference_panel"],
        chr=args["chromosomes"],
        annotation=args["chromosome_annotation"],
        panel_type=args["reference_panel_type"],
        separator=args["separator"],
    )

    vcf_summary = VCFSummary()

    print(
        toml_header
        % (
            args["outlier_threshold"],
            args["min_ambigious_threshold"],
            args["max_ambigious_threshold"],
            args["reference_panel_type"],
            args["chromosome_annotation"],
            args["ambigious"],
            args["fix_complement_ref_alt"],
        ),
        file=sys.stderr,
    )
    for variant in vcf:
        status = "unchanged"
        reason = "None"
        panel_variant_freq = None
        variant_id_end = str(variant.CHROM) + "_" + str(variant.end)
        if not variant.INFO.get("AF"):
            variant.INFO["AF"] = variant.aaf
        if variant_id_end in panel:
            variant.INFO["UPD"] = 0
            panel_variant = panel[variant_id_end]
            panel_variant_freq = panel_variant["freq"]
            vcf_summary.n_in_panel += 1
            if not variant.ALT:
                print_variant_toml(
                    variant, panel_variant["freq"], "removed", "unknown_alt/monomorphic"
                )
                vcf_summary.unknown_alt += 1
                continue
            if (
                args["ambigious"]
                and variant.aaf > args["min_ambigious_threshold"]
                and variant.aaf < args["max_ambigious_threshold"]
            ):
                vcf_summary.ambigious += 1
                print_variant_toml(
                    variant, panel_variant["freq"], "removed", "ambigious_frequency"
                )
                continue
            if should_recode(variant, panel_variant):
                swap_ref_alt(variant)
                variant.INFO["UPD"] = 1
                vcf_summary.updated += 1
                status = "updated"
                reason = "ref_alt_swapped"
            if (
                should_flipstrand(variant, panel_variant)
                and args["fix_complement_ref_alt"]
            ):
                flipstrand(variant, panel_variant["freq"])
                variant.INFO["UPD"] = 1
                vcf_summary.flipped += 1
                status = "strand_flipped"
                reason = "ref/alt_not_in_panel_nucleotides"

            vcf_summary.add_variant(variant_id_end)
            v_freq = variant.INFO.get("AF")

            variant.INFO["PFD"] = abs(variant.INFO.get("AF") - panel_variant["freq"])
            variant.INFO["MISS"] = np.sum(variant.gt_types == 2) / len(variant.gt_types)
            variant.INFO["MAF"] = v_freq if v_freq < 0.5 else 1 - v_freq

            vcf_summary.VARIANTS[variant_id_end].update(
                {
                    "freq": variant.aaf,
                    "panel_freq": panel_variant["freq"],
                    "updated_freq": v_freq,
                    "MAF": variant.INFO.get("MAF"),
                    "MISS": variant.INFO.get("MISS"),
                    "PFD": variant.INFO.get("PFD"),
                    "v_id": variant.ID,
                    "updated": variant.INFO.get("UPD"),
                }
            )
            print_variant_toml(variant, panel_variant_freq, status, reason)
            vcf_summary.kept += 1
        w.write_record(variant)
    w.close()
    vcf.close()
    plot_file = re.sub(r"(vcf|bcf)(\.gz)*$", "png", args["output"])
    if not vcf_summary.n_in_panel == 0:
        create_summary_plot(
            vcf_summary, outfile=plot_file, threshold=args["outlier_threshold"]
        )
    print(vcf_summary, file=sys.stderr)
    print("n_reference_panel_size=%d" % len(panel.keys()), file=sys.stderr)


def print_variant_toml(
    variant, panel_variant_freq, status, reason, variant_toml=variant_toml
):

    variant_tup = (
        variant.INFO.get("AF", None),
        variant.INFO.get("MAF", None),
        variant.INFO.get("MISS", None),
        variant.INFO.get("PFD", None),
    )

    print(
        variant_toml
        % ((variant.ID, status, reason, variant.aaf, panel_variant_freq) + variant_tup),
        file=sys.stderr,
    )


def swap_ref_alt(variant):
    gts = variant.genotypes
    gts = [Genotype(li) for li in gts]
    for gt in gts:
        gt.flip()
    variant.genotypes = [gt.genotype() for gt in gts]
    updated_frequency = sum([gt.alleles.count(1) for gt in gts]) / (
        2 * len([gt for gt in gts if -1 not in gt.alleles])
    )
    temp_nuc = variant.REF
    variant.REF = variant.ALT[0]
    variant.ALT = [temp_nuc]
    variant.INFO["AF"] = updated_frequency


def flipstrand(variant, panel_freq, COMPLEMENT=COMPLEMENT):
    variant.REF = COMPLEMENT[variant.REF]
    variant.ALT = COMPLEMENT[variant.ALT[0]]
    frequency_synced = (panel_freq > 0.5 and variant.INFO.get("AF") > 0.5) or (
        panel_freq < 0.5 and variant.INFO.get("AF") < 0.5
    )
    if not frequency_synced:
        swap_ref_alt(variant)


def should_recode(variant, panel_variant):
    frequency_synced = (
        panel_variant["freq"] > 0.5 and variant.INFO.get("AF") > 0.5
    ) or (panel_variant["freq"] < 0.5 and variant.INFO.get("AF") < 0.5)
    nucleotides_synced = (
        variant.REF == panel_variant["A0"] and variant.ALT[0] == panel_variant["A1"]
    )
    return not nucleotides_synced and not frequency_synced


def should_flipstrand(variant, panel_variant, COMPLEMENT=COMPLEMENT):
    nucleotides_synced = (
        variant.REF == panel_variant["A0"] and variant.ALT[0] == panel_variant["A1"]
    )
    frequency_synced = (
        panel_variant["freq"] > 0.5 and variant.INFO.get("AF") > 0.5
    ) or (panel_variant["freq"] < 0.5 and variant.INFO.get("AF") < 0.5)
    alt_is_complement = COMPLEMENT[variant.REF] == variant.ALT[0]
    return (not nucleotides_synced or not frequency_synced) and alt_is_complement


def create_summary_plot(v_summary, outfile, threshold=None):
    freqs = v_summary.freqs()
    default_color = plt.rcParams["axes.prop_cycle"].by_key()["color"][0]

    fig = plt.figure()

    ax1 = fig.add_subplot(321)
    ax2 = fig.add_subplot(323, sharex=ax1)
    ax3 = fig.add_subplot(325)
    ax4 = fig.add_subplot(222)
    ax5 = fig.add_subplot(224, sharex=ax4)

    titles = (
        "Original VCF Frequencies Compared to Panel Frequencies",
        "Updated VCF Frequencies Compared to Panel Frequencies",
        "Minor Allele Frequency Compared to Difference in Frequency Between Panel and VCF",
        "Genotype Missingness Compared to Difference in Frequency Between Panel and VCF",
    )

    x_labs = (
        "VCF Alternate Frequency",
        "VCF Alternate Frequency",
        "Minor Allele Frequency",
        "Missing Genotype Frequency",
    )

    y_labs = (
        "Panel Allele Frequency",
        "Panel Allele Frequency",
        "Panel vs VCF Frequency Difference",
        "Panel vs VCF Frequency Difference",
    )

    coefs = np.corrcoef(freqs.T)[:, [4, 5]]

    for i, ax in enumerate([ax1, ax2, ax4, ax5]):
        coef, comparison_freq = (
            (coefs[i, 0], freqs[:, 4]) if i > 1 else (coefs[i, 1], freqs[:, 5])
        )
        ax.set_title(titles[i], fontsize=9)
        ax.scatter(freqs[:, i], comparison_freq, s=10, alpha=0.7)
        ax.annotate(
            "corr = %.2f" % coef,
            (
                max(freqs[:, i]) - max(freqs[:, i]) / 20,
                max(comparison_freq) - max(comparison_freq) / 20,
            ),
            ha="center",
            fontsize=10,
        )
        ax.set_ylabel(y_labs[i], fontsize=9)
        ax.set_xlabel(x_labs[i], fontsize=9)
        ax.plot(ax.get_xlim(), ax.get_ylim(), ls="--", c=".3")

        if threshold:
            idxs = freqs[:, 4] > threshold
            for f, cf, vid in zip(
                freqs[idxs, i], comparison_freq[idxs], v_summary.v_ids()[idxs]
            ):
                ax.annotate(vid, (f, cf), ha="center", fontsize=8)

    v_types = ["REF --> ALT", "Strand Flipped", "Ambigious Variants", "ALT Missing"]
    counts = [
        v_summary.updated,
        v_summary.flipped,
        v_summary.ambigious,
        v_summary.unknown_alt,
    ]
    bar_width = 0.75
    idx = np.arange(len(counts))
    barlist = ax3.bar(idx, counts, width=bar_width, align="center")
    ax3.set_xticks(idx)
    ax3.set_xticklabels(v_types, rotation=45, minor=False, fontsize=8)
    [bar.set_color("r") for bar in barlist[2:]]
    for i, count in enumerate(counts):
        col = "r" if i > 1 else "black"
        ax3.text(i, count, " " + str(count), ha="center", color=col)
    ax3.set_ylabel("Counts", fontsize=9)
    ax3.set_title("Variant Modification Type and Excluded Counts", fontsize=9)

    leg_elements = [
        Patch(facecolor=default_color, label="Updated"),
        Patch(facecolor="red", label="Removed"),
    ]
    ax3.legend(handles=leg_elements, loc="upper right")

    plt.savefig(outfile)


def generate_panel_data(
    panel_file, chr=None, annotation="ensembl", panel_type="kirimp", separator=None
):
    f = open(panel_file, "r")
    header_line = next(f).strip()
    sep = get_separator(header_line, separator)
    header_line = [cell.replace('"', "") for cell in header_line.split(sep)]
    if panel_type == "kirimp":
        chromosome = CHROMOSOME_19_ANNOTATION[annotation]
        if header_line != KIRIMP_HEADER:
            raise TypeError(
                "If input panel type is kirimp, the panel needs to contain a comma-separated header:\n%s"
                % ",".join(KIRIMP_HEADER)
            )
    else:
        if header_line != CUSTOM_HEADER:
            raise TypeError(
                "If input panel type is custom, the panel needs to contain a comma-separated header:\n%s"
                % ",".join(CUSTOM_HEADER)
            )
    snp_dict = {
        chromosome + "_" + cells[1]
        if panel_type == "kirimp"
        else cells[0]
        + "_"
        + cells[1]: {"A0": cells[2], "A1": cells[3], "freq": float(cells[4].strip())}
        if panel_type == "kirimp"
        else {"A0": cells[2], "A1": cells[3], "freq": float(cells[4])}
        for cells in [line.strip().replace('"', "").split(sep) for line in f]
    }
    f.close()
    return snp_dict


def get_separator(line, passed_separator=None):
    tabs = line.count(r"\t")
    commas = line.count(r",")
    if passed_separator:
        sep = passed_separator
    elif tabs == 4 and commas != 4:
        sep = r"\t"
    elif tabs != 4 and commas == 4:
        sep = ","
    else:
        raise TypeError(
            "Cannot determine separator from file please specify separator directly as an argument [--reference-panel-col-separator]"
        )
    return sep


def parse_arguments(arguments=None):
    parser = argparse.ArgumentParser(
        description="This script encodes SNPs in a VCF to a reference panel based on allele frequencies",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser._action_groups.pop()
    required = parser.add_argument_group("required arguments")
    optional = parser.add_argument_group("optional arguments")
    required.add_argument(
        "-v",
        "--vcf-file",
        help="VCF/BCF file to re-encode (can be compressed with bgzip)",
        required=True,
        type=str,
    )
    required.add_argument(
        "-r",
        "--reference-panel",
        help="Reference panel file  either in format of KIR*IMP reference panel  or a custom data format [chrom pos ref alt freq]",
        required=True,
        type=str,
    )
    optional.add_argument(
        "-rt",
        "--reference-panel-type",
        help="Reference panel file type",
        choices=["kirimp", "custom"],
        default="kirimp",
        type=str,
    )
    optional.add_argument(
        "--separator", help="Custom reference panel column separator", type=str
    )
    optional.add_argument(
        "-o", "--output", help="Output vcf file", required=True, type=str
    )
    optional.add_argument(
        "-chr",
        "--chromosomes",
        help="Chromosome over which to encode SNPs ",
        required=False,
        nargs="?",
        type=str,
    )
    optional.add_argument(
        "--chromosome-annotation",
        help="Chromosome annotation type in the VCF",
        choices=["ucsc", "ensembl", "genbank"],
        default="ensembl",
        type=str,
    )
    optional.add_argument(
        "-a",
        "--ambigious",
        help="Determines whether ambigious alternate alleles should be dropped",
        action="store_false",
    )
    optional.add_argument(
        "-c",
        "--fix-complement-ref-alt",
        help="Should ref/alt that are complements be fixed with respect to frequency",
        action="store_false",
    )
    optional.add_argument(
        "-min",
        "--min-ambigious-threshold",
        help="Alternate alleles above this frequency and below the max ambigious frequency will be flagged as ambigious",
        default=0.495,
        type=float,
    )
    optional.add_argument(
        "-max",
        "--max-ambigious-threshold",
        help="Alternate alleles above this frequency and below the max ambigious frequency will be flagged as ambigious",
        default=0.505,
        type=float,
    )
    optional.add_argument(
        "--outlier-threshold",
        help="Threshold to use to label variant frequency differences between alternate and panel frequencis that are significant",
        default=None,
        type=float,
    )
    optional.add_argument(
        "-t", "--threads", help="Number of threads to use for compression", type=int
    )
    args = vars(parser.parse_args())
    if (
        args["reference_panel_type"] == "custom"
        and args["reference_panel_format"] is None
    ):
        parser.error(
            "custom --reference-panel-type requires --reference-panel-format to be set"
        )
    return args


if __name__ == "__main__":
    main()
43
shell: "snakemake --dag | dot -Tsvg > {output}"
ShowHide 3 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/bjohnnyd/hla-kir-imputation
Name: hla-kir-imputation
Version: v1.0
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...