Local Ancestry Tract Calling Pipeline with RFMix

public public 1yr ago 0 bookmarks

Roshni Patel ([email protected])

Description

Calling local ancestry tracts with RFMix v1.5.4 and (optionally) validating global ancestry fractions with ADMIXTURE.

NCBI build

This pipeline was designed to call local ancestry tracts on hg38 VCFs. In practice, it can be used to call local ancestry tracts on hg19/GRCh37 VCFs, but you will need to obtain hg19/GRCh37 versions of the files in maps/ . hg19/GRCh37 genetic maps in the same format can be found here . chrom_centromere.map will probably have to be recreated manually starting from the file centromeres_hg19.bed here .

Cluster

This pipeline is designed to be run on a cluster that uses the Slurm workload manager. If this is not the case, sm_script.sh and sm_slurm_config.json will need to be reworked.

Required files

See scripts/snakemake_variables.py for more details on naming conventions and directory structure.

Programs

Data

  • Phased VCF containing admixed data

  • Phased VCF containing reference data

  • Metadata file for admixed individuals (mapping ancestry to VCF sample ID)

  • Metadata file for reference population (mapping ancestry to VCF sample ID)

Workflow

  1. Obtain required files

  2. Determine whether your admixed and reference VCFs have the same chromosome naming system; otherwise you might want to uncomment rule annotate_admix in the Snakefile and modify accordingly to suit your purposes.

  3. Edit variables in scripts/snakemake_variables.py as needed. (In all likelihood, only the variables under Data and Programs will need to be edited.)

  4. Edit sm_slurm_config.json with the partitions you'll be running jobs on.

  5. Edit the top of sm_script.sh and the top of the Snakefile to reflect the location/type of shell you're using.

  6. Determine desired output and edit rule all in Snakefile as needed.

  • To generate local ancestry tracts with RFMix:

    rule all: input: expand(DATA_DIR + "bed/em{em}.chr{chr}.{ind}.{hapl}.bed", hapl=HAPL, ind=INDIV, em=EM_ITER, chr=CHROMS)
    
  • To plot karyograms with RFMix local ancestry tracts:

    rule all: input: expand("plots/{ind}.em{em}.png", ind=INDIV, em=EM_ITER)
    
  • To generate per-chromosome global ancestry fractions from RFMix:

    rule all: input: expand(DATA_DIR + "chr{chr}/chr{chr}.em{em}.lai_global.txt", chr=CHROMS, ind=INDIV)
    
  • To generate genome-wide global ancestry fractions from both RFMix and ADMIXTURE:

    rule all: input: DATA_DIR + "combined_global_anc_frac.txt"
    
  • (Note: there is currently no way to generate genome-wide global ancestry fractions for RFMix alone, but you can edit scripts/combine_chrs.py to allow for that.

  1. Make the conda environments in envs/

  2. Unzip maps/plink.GRCh38.genetic_map.zip

  3. Run the pipeline with ./sm_script.sh

Code Snippets

  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
import pandas as pd

### Directories ###

# Directory in which to find admixed VCFs and store output files
DATA_DIR = "data/"

# Directory in which to find reference VCFs
REF_DIR = "data/1KGenomes_hg38/"

# Directory in which to find genetic maps
MAP_DIR = "data/maps/"



### Data ###

# Basename of phased VCF for admixed and reference data
ADMIX_DATA = "genotype_freeze.6a.pass_only.phased.mesa_1331samples.maf01.biallelic.vcf.gz"
REF_DATA = "GRCh38.filt.1kg.phase3.v5a.biSNPs.vcf.gz"

# Files containing metadata for admixed and reference data. Must be tab-delimited, and
# first column should be sample ID in VCF.
ADMIX_METADATA = "metadata-TOPMed_MESA_RNAseq_2973samples_metadata.combined_Black_samples.tsv"
REF_METADATA = "20130606_sample_info.tsv"

# Column specifying sample ID in metadata file
REF_ID_COL = "Sample"
ADMIX_ID_COL = "nwd_id"

# Column specifying race/ancestry in metadata file
REF_POP_COL = "Population"
ADMIX_POP_COL = "race"

# Race/ancestry values in metadata file to filter for
REF_POP_VAL = "CEU,YRI"
ADMIX_POP_VAL = "Black"

# Colors for plotting scripts
HEX_COLORS = "455A8E,E48671,22223A"


### Scripts ###

# Script for filtering sample IDs based on a specific ancestry
FILTER_SCRIPT = "scripts/ancestry_filter.py"

# Script for creating RFMix input files (classes, alleles, snp_locations)
RFMIX_INPUT_SCRIPT = "scripts/make_rfmix_input.py"

# Script for collapsing RFMix output into local ancestry tracts; adapted from
# armartin on GitHub
RFMIX_OUTPUT_SCRIPT = "scripts/collapse_ancestry.py"

# Script for plotting karyograms from local ancestry tracts; adapted from Alicia
# Martin's GitHub
KARYOGRAM_SCRIPT = "scripts/plot_karyogram.py"

# Script for inferring global ancestry proportion from local ancestry tracts;
# adapted from Alicia Martin's GitHub
GLOBAL_INF_SCRIPT = "scripts/lai_global.py"

# Script for combining ancestry calls between RFMix and ADMIXTURE for each
# individual
COMB_SCRIPT = "scripts/combine_ancestry_calls.py"

# Script for combining global ancestry estimates across chromosomes
COMB_CHR_SCRIPT = "scripts/combine_chrs.py"

# Script for checking RFMix phasing against original VCF
PHASING_SCRIPT = "scripts/check_phasing.py"

# Script for computing distribution of European tract lengths
TRACT_SCRIPT = "scripts/compute_eur_dist.py"


### Programs ###

# Path to RFMix script
RFMIX = "bin/rfmix/RFMix_v1.5.4/RunRFMix.py"

# Path to ADMIXTURE binary
ADMIXTURE = "bin/admixture"



### Variables ###

CHROMS = [str(i) for i in range(1, 23)]

# Partition chromosomes into acrocentric (no centromere) and non-acrocentric.
# For the latter, we infer local ancestry separately on each arm of the
# chromosome to decrease computational time - this is possible because there is
# no recombination at centromeres.
ACROCENTRIC = ['13', '14', '15', '21', '22']
NON_ACRO = [c for c in CHROMS if c not in ACROCENTRIC]

# Number of EM iterations to run with RFMix; can optionally be a list (will take
# much longer, though!)
EM_ITER = 0

# MAF threshold used for filtering input
ADMIX_MAF = 0.1
RFMIX_MAF = 0.05

# r2 value used for pruning input
ADMIX_R2 = 0.1
RFMIX_R2 = 0.5

# Number of reference populations to use
NPOP = len(REF_POP_VAL.split(','))

# Arms of chromosome
ARMS = ['p', 'q']

# Sets of haplotype
HAPL = ['A', 'B']

# Sample IDs of all individuals for which local ancestry analysis should be
# performed (generally all admixed individuals)
INDIV = ['NWD163203', 'NWD838454', 'NWD447217', 'NWD506902', 'NWD516805', \
        'NWD598769', 'NWD276731', 'NWD857305', 'NWD573145', 'NWD378807', \
        'NWD773408', 'NWD862131', 'NWD696182', 'NWD601219', 'NWD445484', \
        'NWD245122', 'NWD574025', 'NWD962792', 'NWD140824', 'NWD347140', \
        'NWD388542', 'NWD409874', 'NWD641240', 'NWD620567', 'NWD754070', \
        'NWD338378', 'NWD660071', 'NWD452926', 'NWD233967', 'NWD878543', \
        'NWD484275', 'NWD913991', 'NWD643775', 'NWD453328', 'NWD865962', \
        'NWD524336', 'NWD728777', 'NWD652755', 'NWD346975', 'NWD795057', \
        'NWD630602', 'NWD467083', 'NWD610801', 'NWD158237', 'NWD600992', \
        'NWD381518', 'NWD710489', 'NWD220560', 'NWD263184', 'NWD343349', \
        'NWD998816', 'NWD897019', 'NWD612175', 'NWD606337', 'NWD404517', \
        'NWD486871', 'NWD170091', 'NWD902298', 'NWD402569', 'NWD251680', \
        'NWD256042', 'NWD242199', 'NWD141713', 'NWD816583', 'NWD487726', \
        'NWD462712', 'NWD916599', 'NWD856383', 'NWD576327', 'NWD571531', \
        'NWD890143', 'NWD901710', 'NWD171365', 'NWD782813', 'NWD993431', \
        'NWD847600', 'NWD716639', 'NWD293024', 'NWD574916', 'NWD862869', \
        'NWD557722', 'NWD299581', 'NWD135441', 'NWD345273', 'NWD546564', \
        'NWD774561', 'NWD969537', 'NWD449948', 'NWD716470', 'NWD945886', \
        'NWD417668', 'NWD362144', 'NWD671182', 'NWD281909', 'NWD210347', \
        'NWD998609', 'NWD353250', 'NWD781750', 'NWD426008', 'NWD587844', \
        'NWD133198', 'NWD149502', 'NWD975921', 'NWD181781', 'NWD661828', \
        'NWD712674', 'NWD478709', 'NWD337903', 'NWD288321', 'NWD158184', \
        'NWD559886', 'NWD737180', 'NWD419692', 'NWD479130', 'NWD797979', \
        'NWD352494', 'NWD752087', 'NWD390054', 'NWD479314', 'NWD626329', \
        'NWD521616', 'NWD878668', 'NWD268401', 'NWD138795', 'NWD159679', \
        'NWD609750', 'NWD392992', 'NWD730299', 'NWD993659', 'NWD108116', \
        'NWD787458', 'NWD657624', 'NWD105648', 'NWD311932', 'NWD640412', \
        'NWD866219', 'NWD687607', 'NWD788130', 'NWD519039', 'NWD678520', \
        'NWD212434', 'NWD687323', 'NWD523161', 'NWD730873', 'NWD828733', \
        'NWD168541', 'NWD164446', 'NWD930303', 'NWD457909', 'NWD363273', \
        'NWD547021', 'NWD314678', 'NWD950794', 'NWD677387', 'NWD542871', \
        'NWD879683', 'NWD795363', 'NWD346624', 'NWD262638', 'NWD419335', \
        'NWD177635', 'NWD462567', 'NWD553021', 'NWD563349', 'NWD822723', \
        'NWD944827', 'NWD959607', 'NWD681170', 'NWD490768', 'NWD685703', \
        'NWD907433', 'NWD712032', 'NWD677776', 'NWD599306', 'NWD684484', \
        'NWD826684', 'NWD428381', 'NWD174966', 'NWD938526', 'NWD328083', \
        'NWD318497', 'NWD731252', 'NWD412116', 'NWD640560', 'NWD639242', \
        'NWD599371', 'NWD720391', 'NWD431959', 'NWD584394', 'NWD344310', \
        'NWD517673', 'NWD527323', 'NWD418773', 'NWD857221', 'NWD836765', \
        'NWD396724', 'NWD125656', 'NWD912070', 'NWD597815', 'NWD512818']
21
22
23
24
25
26
27
28
29
30
31
shell:
    """
    conda activate py36
    python {FILTER_SCRIPT} \
        --sample_data {input} \
        --id_col {ADMIX_ID_COL} \
        --pop_col {ADMIX_POP_COL} \
        --pop_val {ADMIX_POP_VAL} \
        --out {output}
    conda deactivate
    """
40
41
42
43
44
45
46
47
48
shell:
    """
    conda activate bcftools-env
    bcftools view -S {input.filter} --force-samples -Ob {input.dataset} | \
        bcftools view -i 'MAF[0] > 0.05' -Ob | \
        bcftools view --genotype ^miss --phased -Ob -o {output.bcf}
    bcftools index -c {output.bcf}
    conda deactivate
    """
59
60
61
62
63
64
65
66
shell:
    """
    conda activate bcftools-env
    bcftools annotate --rename-chrs {input.map} -Ob -o {output.bcf} \
        {input.bcf}
    bcftools index -c {output.bcf}
    conda deactivate
    """
77
78
79
80
81
82
83
84
shell:
    """
    conda activate bcftools-env
    mkdir -p {params.out_dir}
    bcftools view -r {wildcards.chr} -Ob -o {output.bcf} {input.bcf}
    bcftools index -c {output.bcf}
    conda deactivate
    """
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
shell:
    """
    conda activate py36
    python {FILTER_SCRIPT} \
        --sample_data {input} \
        --id_col {REF_ID_COL} \
        --pop_col {REF_POP_COL} \
        --pop_val {REF_POP_VAL} \
        --out {output}
    conda deactivate
    """
112
113
114
115
116
117
118
119
120
121
shell:
    """
    conda activate bcftools-env
    mkdir -p {params.out_dir}
    bcftools view -S {input.filter} --force-samples -Ou {input.ref} | \
        bcftools view --genotype ^miss --phased -Ou | \
        bcftools view -i 'MAF[0] > .05' -Ob -o {output.bcf}
    bcftools index -c {output.bcf}
    conda deactivate
    """
135
136
137
138
139
140
shell:
    """
    conda activate bcftools-env
    bcftools isec -p {params.out_dir} -n=2 --collapse none -Ob {input.admix} {input.ref}
    conda deactivate
    """
149
150
151
152
153
154
155
shell:
    """
    conda activate bcftools-env
    bcftools merge -m none -Ou {input.bcf} | \
        bcftools view -i 'MAF[0] > {wildcards.maf}' -m2 -M2 -v snps -Oz -o {output}
    conda deactivate
    """
169
170
171
172
173
174
175
shell:
    """
    conda activate plink-env
    plink --vcf {input} --vcf-half-call m --allow-no-sex \
        --keep-allele-order --indep-pairwise 50 5 {wildcards.r2} --out {params.out_file}
    conda deactivate
    """
183
184
185
186
187
188
shell:
    """
    conda activate bcftools-env
    bcftools view -i 'ID=@{input.snps}' -Oz -o {output} {input.vcf}
    conda deactivate
    """
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
shell:
    """
    conda activate py36
    CENT=0
    zcat {input.vcf} | python {RFMIX_INPUT_SCRIPT} \
        --admix {input.admix_info} \
        --admix_id_col {ADMIX_ID_COL} \
        --admix_pop_col {ADMIX_POP_COL} \
        --admix_pop_val {ADMIX_POP_VAL} \
        --ref {input.ref_info} \
        --ref_id_col {REF_ID_COL} \
        --ref_pop_col {REF_POP_COL} \
        --ref_pop_val {REF_POP_VAL} \
        --genetic_map {input.genetic_map} \
        --out {params.out_file} \
        --cent $CENT
    conda deactivate
    """
SnakeMake From line 207 of master/Snakefile
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
shell:
    """
    conda activate py36
    CENT=$(grep '^{wildcards.chr}[^0-9]' {input.centromeres} | cut -f3 -d$'\t')
    zcat {input.vcf} | python {RFMIX_INPUT_SCRIPT} \
        --admix {input.admix_info} \
        --admix_id_col {ADMIX_ID_COL} \
        --admix_pop_col {ADMIX_POP_COL} \
        --admix_pop_val {ADMIX_POP_VAL} \
        --ref {input.ref_info} \
        --ref_id_col {REF_ID_COL} \
        --ref_pop_col {REF_POP_COL} \
        --ref_pop_val {REF_POP_VAL} \
        --genetic_map {input.genetic_map} \
        --out {params.out_file} \
        --cent $CENT
    conda deactivate
    """
SnakeMake From line 244 of master/Snakefile
273
274
275
276
277
278
279
280
281
282
shell:
    """
    conda activate py27
    python {RFMIX} PopPhased {input.alleles} \
        {input.classes} {input.snp_locations} \
        -e {EM_ITER} \
        -o {params.out_file} \
        --generations 8
    conda deactivate
    """
SnakeMake From line 273 of master/Snakefile
302
303
304
305
306
307
shell:
    """
    cat {input.viterbi} > {output.viterbi}
    cat {input.snp_locations} > {output.snp_locations}
    paste {input.pos_map} > {output.pos_map}
    """
SnakeMake From line 302 of master/Snakefile
318
319
320
321
322
323
324
325
326
327
328
329
330
shell:
    """
    mkdir -p {params.out_dir}
    conda activate py27
    python {RFMIX_OUTPUT_SCRIPT} \
        --rfmix {input.viterbi} \
        --snp_map {input.pos_map} \
        --pop_map {input.pop_map} \
        --pop_labels {REF_POP_VAL} \
        --out {params.out_dir} \
        --chrom {wildcards.chr}
    conda deactivate
    """
SnakeMake From line 318 of master/Snakefile
346
347
348
349
350
351
352
353
354
shell:
    """
    conda activate py36
    python {GLOBAL_INF_SCRIPT} \
        --bed {params.bed} \
        --pops {REF_POP_VAL} \
        --out {output}
    conda deactivate
    """
SnakeMake From line 346 of master/Snakefile
362
363
364
365
366
367
368
369
370
371
372
373
shell:
    """
    conda activate py36
    python {KARYOGRAM_SCRIPT} \
        --bed_path {input.bed} \
        --ind {wildcards.ind} \
        --centromeres {input.cent} \
        --pop_order {REF_POP_VAL} \
        --colors {HEX_COLORS} \
        --out {output}
    conda deactivate
    """
SnakeMake From line 362 of master/Snakefile
380
381
382
383
shell:
    """
    tr '\t' '\n' < <(head -n 2 {input.map} | tail -n 1) | sed 's/ADMIX/-/' > {output.pop}
    """
SnakeMake From line 380 of master/Snakefile
395
396
397
398
399
400
shell:
    """
    conda activate plink-env
    plink --vcf {input.vcf} --make-bed --keep-allele-order --out {params.out_file}
    conda deactivate
    """
409
410
411
412
413
shell:
    """
    {ADMIXTURE} {input.bed} 2 --supervised
    mv chr{wildcards.chr}* {DATA_DIR}chr{wildcards.chr}/
    """
SnakeMake From line 409 of master/Snakefile
423
424
425
426
427
428
429
430
431
432
433
shell:
    """
    conda activate py36
    python {COMB_SCRIPT} \
        --rfmix {input.rfmix} \
        --admix {input.admix} \
        --map {input.map} \
        --pops {REF_POP_VAL} \
        --out {output}
    conda deactivate
    """
SnakeMake From line 423 of master/Snakefile
442
443
444
445
446
447
448
449
450
451
shell:
    """
    conda activate py36
    python {COMB_CHR_SCRIPT} \
        --anc {input.anc} \
        --map {input.map} \
        --frac_out {output.frac} \
        --idv_out {output.idv} 
    conda deactivate
    """
SnakeMake From line 442 of master/Snakefile
461
462
463
464
shell:
    """
    cat {input} > {output}
    """
SnakeMake From line 461 of master/Snakefile
473
474
475
476
477
478
479
480
481
shell:
    """
    conda activate py36
    python {PHASING_SCRIPT} \
        --vcf {input.vcf} \
        --rfmix_phased {input.phasing} \
        --out {output}
    conda deactivate
    """
SnakeMake From line 473 of master/Snakefile
490
491
492
493
494
495
496
497
498
shell:
    """
    conda activate py36
    python {TRACT_SCRIPT} \
        --bed {params.bed} \
        --tract_out {output.tract} \
        --idv_out {output.idv}
    conda deactivate
    """
SnakeMake From line 490 of master/Snakefile
ShowHide 23 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/Monia234/LocalAncestry_RFmix_admixture
Name: localancestry_rfmix_admixture
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...