Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Roshni Patel ([email protected])
Description
Calling local ancestry tracts with RFMix v1.5.4 and (optionally) validating global ancestry fractions with ADMIXTURE.
NCBI build
This pipeline was designed to call local ancestry tracts on hg38 VCFs. In practice, it can be used to call local ancestry tracts on hg19/GRCh37 VCFs, but you will need to obtain hg19/GRCh37 versions of the files in
maps/
. hg19/GRCh37 genetic maps in the same format can be found
here
.
chrom_centromere.map
will probably have to be recreated manually starting from the file
centromeres_hg19.bed
here
.
Cluster
This pipeline is designed to be run on a cluster that uses the Slurm workload manager. If this is not the case,
sm_script.sh
and
sm_slurm_config.json
will need to be reworked.
Required files
See
scripts/snakemake_variables.py
for more details on naming conventions and directory structure.
Programs
-
ADMIXTURE (optional)
Data
-
Phased VCF containing admixed data
-
Phased VCF containing reference data
-
Metadata file for admixed individuals (mapping ancestry to VCF sample ID)
-
Metadata file for reference population (mapping ancestry to VCF sample ID)
Workflow
-
Obtain required files
-
Determine whether your admixed and reference VCFs have the same chromosome naming system; otherwise you might want to uncomment
rule annotate_admix
in the Snakefile and modify accordingly to suit your purposes. -
Edit variables in
scripts/snakemake_variables.py
as needed. (In all likelihood, only the variables under Data and Programs will need to be edited.) -
Edit
sm_slurm_config.json
with the partitions you'll be running jobs on. -
Edit the top of
sm_script.sh
and the top of the Snakefile to reflect the location/type of shell you're using. -
Determine desired output and edit
rule all
in Snakefile as needed.
-
To generate local ancestry tracts with RFMix:
rule all: input: expand(DATA_DIR + "bed/em{em}.chr{chr}.{ind}.{hapl}.bed", hapl=HAPL, ind=INDIV, em=EM_ITER, chr=CHROMS)
-
To plot karyograms with RFMix local ancestry tracts:
rule all: input: expand("plots/{ind}.em{em}.png", ind=INDIV, em=EM_ITER)
-
To generate per-chromosome global ancestry fractions from RFMix:
rule all: input: expand(DATA_DIR + "chr{chr}/chr{chr}.em{em}.lai_global.txt", chr=CHROMS, ind=INDIV)
-
To generate genome-wide global ancestry fractions from both RFMix and ADMIXTURE:
rule all: input: DATA_DIR + "combined_global_anc_frac.txt"
-
(Note: there is currently no way to generate genome-wide global ancestry fractions for RFMix alone, but you can edit
scripts/combine_chrs.py
to allow for that.
-
Make the conda environments in
envs/
-
Unzip
maps/plink.GRCh38.genetic_map.zip
-
Run the pipeline with
./sm_script.sh
Code Snippets
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 | import pandas as pd ### Directories ### # Directory in which to find admixed VCFs and store output files DATA_DIR = "data/" # Directory in which to find reference VCFs REF_DIR = "data/1KGenomes_hg38/" # Directory in which to find genetic maps MAP_DIR = "data/maps/" ### Data ### # Basename of phased VCF for admixed and reference data ADMIX_DATA = "genotype_freeze.6a.pass_only.phased.mesa_1331samples.maf01.biallelic.vcf.gz" REF_DATA = "GRCh38.filt.1kg.phase3.v5a.biSNPs.vcf.gz" # Files containing metadata for admixed and reference data. Must be tab-delimited, and # first column should be sample ID in VCF. ADMIX_METADATA = "metadata-TOPMed_MESA_RNAseq_2973samples_metadata.combined_Black_samples.tsv" REF_METADATA = "20130606_sample_info.tsv" # Column specifying sample ID in metadata file REF_ID_COL = "Sample" ADMIX_ID_COL = "nwd_id" # Column specifying race/ancestry in metadata file REF_POP_COL = "Population" ADMIX_POP_COL = "race" # Race/ancestry values in metadata file to filter for REF_POP_VAL = "CEU,YRI" ADMIX_POP_VAL = "Black" # Colors for plotting scripts HEX_COLORS = "455A8E,E48671,22223A" ### Scripts ### # Script for filtering sample IDs based on a specific ancestry FILTER_SCRIPT = "scripts/ancestry_filter.py" # Script for creating RFMix input files (classes, alleles, snp_locations) RFMIX_INPUT_SCRIPT = "scripts/make_rfmix_input.py" # Script for collapsing RFMix output into local ancestry tracts; adapted from # armartin on GitHub RFMIX_OUTPUT_SCRIPT = "scripts/collapse_ancestry.py" # Script for plotting karyograms from local ancestry tracts; adapted from Alicia # Martin's GitHub KARYOGRAM_SCRIPT = "scripts/plot_karyogram.py" # Script for inferring global ancestry proportion from local ancestry tracts; # adapted from Alicia Martin's GitHub GLOBAL_INF_SCRIPT = "scripts/lai_global.py" # Script for combining ancestry calls between RFMix and ADMIXTURE for each # individual COMB_SCRIPT = "scripts/combine_ancestry_calls.py" # Script for combining global ancestry estimates across chromosomes COMB_CHR_SCRIPT = "scripts/combine_chrs.py" # Script for checking RFMix phasing against original VCF PHASING_SCRIPT = "scripts/check_phasing.py" # Script for computing distribution of European tract lengths TRACT_SCRIPT = "scripts/compute_eur_dist.py" ### Programs ### # Path to RFMix script RFMIX = "bin/rfmix/RFMix_v1.5.4/RunRFMix.py" # Path to ADMIXTURE binary ADMIXTURE = "bin/admixture" ### Variables ### CHROMS = [str(i) for i in range(1, 23)] # Partition chromosomes into acrocentric (no centromere) and non-acrocentric. # For the latter, we infer local ancestry separately on each arm of the # chromosome to decrease computational time - this is possible because there is # no recombination at centromeres. ACROCENTRIC = ['13', '14', '15', '21', '22'] NON_ACRO = [c for c in CHROMS if c not in ACROCENTRIC] # Number of EM iterations to run with RFMix; can optionally be a list (will take # much longer, though!) EM_ITER = 0 # MAF threshold used for filtering input ADMIX_MAF = 0.1 RFMIX_MAF = 0.05 # r2 value used for pruning input ADMIX_R2 = 0.1 RFMIX_R2 = 0.5 # Number of reference populations to use NPOP = len(REF_POP_VAL.split(',')) # Arms of chromosome ARMS = ['p', 'q'] # Sets of haplotype HAPL = ['A', 'B'] # Sample IDs of all individuals for which local ancestry analysis should be # performed (generally all admixed individuals) INDIV = ['NWD163203', 'NWD838454', 'NWD447217', 'NWD506902', 'NWD516805', \ 'NWD598769', 'NWD276731', 'NWD857305', 'NWD573145', 'NWD378807', \ 'NWD773408', 'NWD862131', 'NWD696182', 'NWD601219', 'NWD445484', \ 'NWD245122', 'NWD574025', 'NWD962792', 'NWD140824', 'NWD347140', \ 'NWD388542', 'NWD409874', 'NWD641240', 'NWD620567', 'NWD754070', \ 'NWD338378', 'NWD660071', 'NWD452926', 'NWD233967', 'NWD878543', \ 'NWD484275', 'NWD913991', 'NWD643775', 'NWD453328', 'NWD865962', \ 'NWD524336', 'NWD728777', 'NWD652755', 'NWD346975', 'NWD795057', \ 'NWD630602', 'NWD467083', 'NWD610801', 'NWD158237', 'NWD600992', \ 'NWD381518', 'NWD710489', 'NWD220560', 'NWD263184', 'NWD343349', \ 'NWD998816', 'NWD897019', 'NWD612175', 'NWD606337', 'NWD404517', \ 'NWD486871', 'NWD170091', 'NWD902298', 'NWD402569', 'NWD251680', \ 'NWD256042', 'NWD242199', 'NWD141713', 'NWD816583', 'NWD487726', \ 'NWD462712', 'NWD916599', 'NWD856383', 'NWD576327', 'NWD571531', \ 'NWD890143', 'NWD901710', 'NWD171365', 'NWD782813', 'NWD993431', \ 'NWD847600', 'NWD716639', 'NWD293024', 'NWD574916', 'NWD862869', \ 'NWD557722', 'NWD299581', 'NWD135441', 'NWD345273', 'NWD546564', \ 'NWD774561', 'NWD969537', 'NWD449948', 'NWD716470', 'NWD945886', \ 'NWD417668', 'NWD362144', 'NWD671182', 'NWD281909', 'NWD210347', \ 'NWD998609', 'NWD353250', 'NWD781750', 'NWD426008', 'NWD587844', \ 'NWD133198', 'NWD149502', 'NWD975921', 'NWD181781', 'NWD661828', \ 'NWD712674', 'NWD478709', 'NWD337903', 'NWD288321', 'NWD158184', \ 'NWD559886', 'NWD737180', 'NWD419692', 'NWD479130', 'NWD797979', \ 'NWD352494', 'NWD752087', 'NWD390054', 'NWD479314', 'NWD626329', \ 'NWD521616', 'NWD878668', 'NWD268401', 'NWD138795', 'NWD159679', \ 'NWD609750', 'NWD392992', 'NWD730299', 'NWD993659', 'NWD108116', \ 'NWD787458', 'NWD657624', 'NWD105648', 'NWD311932', 'NWD640412', \ 'NWD866219', 'NWD687607', 'NWD788130', 'NWD519039', 'NWD678520', \ 'NWD212434', 'NWD687323', 'NWD523161', 'NWD730873', 'NWD828733', \ 'NWD168541', 'NWD164446', 'NWD930303', 'NWD457909', 'NWD363273', \ 'NWD547021', 'NWD314678', 'NWD950794', 'NWD677387', 'NWD542871', \ 'NWD879683', 'NWD795363', 'NWD346624', 'NWD262638', 'NWD419335', \ 'NWD177635', 'NWD462567', 'NWD553021', 'NWD563349', 'NWD822723', \ 'NWD944827', 'NWD959607', 'NWD681170', 'NWD490768', 'NWD685703', \ 'NWD907433', 'NWD712032', 'NWD677776', 'NWD599306', 'NWD684484', \ 'NWD826684', 'NWD428381', 'NWD174966', 'NWD938526', 'NWD328083', \ 'NWD318497', 'NWD731252', 'NWD412116', 'NWD640560', 'NWD639242', \ 'NWD599371', 'NWD720391', 'NWD431959', 'NWD584394', 'NWD344310', \ 'NWD517673', 'NWD527323', 'NWD418773', 'NWD857221', 'NWD836765', \ 'NWD396724', 'NWD125656', 'NWD912070', 'NWD597815', 'NWD512818'] |
21 22 23 24 25 26 27 28 29 30 31 | shell: """ conda activate py36 python {FILTER_SCRIPT} \ --sample_data {input} \ --id_col {ADMIX_ID_COL} \ --pop_col {ADMIX_POP_COL} \ --pop_val {ADMIX_POP_VAL} \ --out {output} conda deactivate """ |
40 41 42 43 44 45 46 47 48 | shell: """ conda activate bcftools-env bcftools view -S {input.filter} --force-samples -Ob {input.dataset} | \ bcftools view -i 'MAF[0] > 0.05' -Ob | \ bcftools view --genotype ^miss --phased -Ob -o {output.bcf} bcftools index -c {output.bcf} conda deactivate """ |
59 60 61 62 63 64 65 66 | shell: """ conda activate bcftools-env bcftools annotate --rename-chrs {input.map} -Ob -o {output.bcf} \ {input.bcf} bcftools index -c {output.bcf} conda deactivate """ |
77 78 79 80 81 82 83 84 | shell: """ conda activate bcftools-env mkdir -p {params.out_dir} bcftools view -r {wildcards.chr} -Ob -o {output.bcf} {input.bcf} bcftools index -c {output.bcf} conda deactivate """ |
91 92 93 94 95 96 97 98 99 100 101 | shell: """ conda activate py36 python {FILTER_SCRIPT} \ --sample_data {input} \ --id_col {REF_ID_COL} \ --pop_col {REF_POP_COL} \ --pop_val {REF_POP_VAL} \ --out {output} conda deactivate """ |
112 113 114 115 116 117 118 119 120 121 | shell: """ conda activate bcftools-env mkdir -p {params.out_dir} bcftools view -S {input.filter} --force-samples -Ou {input.ref} | \ bcftools view --genotype ^miss --phased -Ou | \ bcftools view -i 'MAF[0] > .05' -Ob -o {output.bcf} bcftools index -c {output.bcf} conda deactivate """ |
135 136 137 138 139 140 | shell: """ conda activate bcftools-env bcftools isec -p {params.out_dir} -n=2 --collapse none -Ob {input.admix} {input.ref} conda deactivate """ |
149 150 151 152 153 154 155 | shell: """ conda activate bcftools-env bcftools merge -m none -Ou {input.bcf} | \ bcftools view -i 'MAF[0] > {wildcards.maf}' -m2 -M2 -v snps -Oz -o {output} conda deactivate """ |
169 170 171 172 173 174 175 | shell: """ conda activate plink-env plink --vcf {input} --vcf-half-call m --allow-no-sex \ --keep-allele-order --indep-pairwise 50 5 {wildcards.r2} --out {params.out_file} conda deactivate """ |
183 184 185 186 187 188 | shell: """ conda activate bcftools-env bcftools view -i 'ID=@{input.snps}' -Oz -o {output} {input.vcf} conda deactivate """ |
207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 | shell: """ conda activate py36 CENT=0 zcat {input.vcf} | python {RFMIX_INPUT_SCRIPT} \ --admix {input.admix_info} \ --admix_id_col {ADMIX_ID_COL} \ --admix_pop_col {ADMIX_POP_COL} \ --admix_pop_val {ADMIX_POP_VAL} \ --ref {input.ref_info} \ --ref_id_col {REF_ID_COL} \ --ref_pop_col {REF_POP_COL} \ --ref_pop_val {REF_POP_VAL} \ --genetic_map {input.genetic_map} \ --out {params.out_file} \ --cent $CENT conda deactivate """ |
244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 | shell: """ conda activate py36 CENT=$(grep '^{wildcards.chr}[^0-9]' {input.centromeres} | cut -f3 -d$'\t') zcat {input.vcf} | python {RFMIX_INPUT_SCRIPT} \ --admix {input.admix_info} \ --admix_id_col {ADMIX_ID_COL} \ --admix_pop_col {ADMIX_POP_COL} \ --admix_pop_val {ADMIX_POP_VAL} \ --ref {input.ref_info} \ --ref_id_col {REF_ID_COL} \ --ref_pop_col {REF_POP_COL} \ --ref_pop_val {REF_POP_VAL} \ --genetic_map {input.genetic_map} \ --out {params.out_file} \ --cent $CENT conda deactivate """ |
273 274 275 276 277 278 279 280 281 282 | shell: """ conda activate py27 python {RFMIX} PopPhased {input.alleles} \ {input.classes} {input.snp_locations} \ -e {EM_ITER} \ -o {params.out_file} \ --generations 8 conda deactivate """ |
302 303 304 305 306 307 | shell: """ cat {input.viterbi} > {output.viterbi} cat {input.snp_locations} > {output.snp_locations} paste {input.pos_map} > {output.pos_map} """ |
318 319 320 321 322 323 324 325 326 327 328 329 330 | shell: """ mkdir -p {params.out_dir} conda activate py27 python {RFMIX_OUTPUT_SCRIPT} \ --rfmix {input.viterbi} \ --snp_map {input.pos_map} \ --pop_map {input.pop_map} \ --pop_labels {REF_POP_VAL} \ --out {params.out_dir} \ --chrom {wildcards.chr} conda deactivate """ |
346 347 348 349 350 351 352 353 354 | shell: """ conda activate py36 python {GLOBAL_INF_SCRIPT} \ --bed {params.bed} \ --pops {REF_POP_VAL} \ --out {output} conda deactivate """ |
362 363 364 365 366 367 368 369 370 371 372 373 | shell: """ conda activate py36 python {KARYOGRAM_SCRIPT} \ --bed_path {input.bed} \ --ind {wildcards.ind} \ --centromeres {input.cent} \ --pop_order {REF_POP_VAL} \ --colors {HEX_COLORS} \ --out {output} conda deactivate """ |
380 381 382 383 | shell: """ tr '\t' '\n' < <(head -n 2 {input.map} | tail -n 1) | sed 's/ADMIX/-/' > {output.pop} """ |
395 396 397 398 399 400 | shell: """ conda activate plink-env plink --vcf {input.vcf} --make-bed --keep-allele-order --out {params.out_file} conda deactivate """ |
409 410 411 412 413 | shell: """ {ADMIXTURE} {input.bed} 2 --supervised mv chr{wildcards.chr}* {DATA_DIR}chr{wildcards.chr}/ """ |
423 424 425 426 427 428 429 430 431 432 433 | shell: """ conda activate py36 python {COMB_SCRIPT} \ --rfmix {input.rfmix} \ --admix {input.admix} \ --map {input.map} \ --pops {REF_POP_VAL} \ --out {output} conda deactivate """ |
442 443 444 445 446 447 448 449 450 451 | shell: """ conda activate py36 python {COMB_CHR_SCRIPT} \ --anc {input.anc} \ --map {input.map} \ --frac_out {output.frac} \ --idv_out {output.idv} conda deactivate """ |
461 462 463 464 | shell: """ cat {input} > {output} """ |
473 474 475 476 477 478 479 480 481 | shell: """ conda activate py36 python {PHASING_SCRIPT} \ --vcf {input.vcf} \ --rfmix_phased {input.phasing} \ --out {output} conda deactivate """ |
490 491 492 493 494 495 496 497 498 | shell: """ conda activate py36 python {TRACT_SCRIPT} \ --bed {params.bed} \ --tract_out {output.tract} \ --idv_out {output.idv} conda deactivate """ |
Support
- Future updates