Admixture and association mapping pipeline


This pipeline implements admixture mapping on the output of the AncInf pipeline (i.e. RFMix output) and/or association mapping on the output of the pre-imp
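How the workflow itself is launched is not shown in the snippets below. As a minimal launch sketch only (the Snakefile path and core count are assumptions; workflow/config.yml is the config path the report rule further down passes to R):

import subprocess

# Hedged launch sketch; adjust paths to the actual repository layout.
subprocess.run(
    ["snakemake", "--snakefile", "workflow/Snakefile",
     "--configfile", "workflow/config.yml", "--cores", "4"],
    check=True,
)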

Code Snippets

suppressPackageStartupMessages(library(optparse))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(magrittr))
suppressPackageStartupMessages(library(tidyverse))

######################### Read in arguments #########################

option_list <- list( 
  make_option(c("-r", "--relateds"),
              help="Output of PLINK IBS calculation; i.e. XXX.genome. Filtered for related individuals"),
  make_option(c("-o", "--output"), help="output_file"))

opt <- parse_args(OptionParser(option_list=option_list))

######################### Main Body #########################

rel = read.table(opt$relateds, header = T, comment.char = "") #Read in IBD results from PLINK

rel %<>% mutate(ID1=paste(FID1,IID1,sep="]"), ID2=paste(FID2,IID2,sep="]")) #Concatenate IDs for ease of comparison
multi_rel = rel %>% select(ID1,ID2) %>% pivot_longer(c(ID1,ID2)) %>% count(value) %>% filter(n>1) #Identify individuals with more than one related individual
sing_rel = rel %>% filter(!(ID1 %in% multi_rel$value) & !(ID2 %in% multi_rel$value)) #Identify individuals with just a single related individual
sing_exclude = sing_rel %>% mutate(choice = rbinom(n(),1,0.5)) %>% mutate(Choice=case_when(choice==1~as.character(ID1), TRUE~as.character(ID2))) %>% select(Choice) #Randomly select one of the two individuals from the single relative pairs.  
multi_exclude = multi_rel %>% mutate(Choice=value) %>% select(Choice) #Remove all individuals that are related to multiple others

exclude = rbind(sing_exclude,multi_exclude)
exclude %<>% separate(Choice, c("FID","IID"), "]") %>% select(FID,IID)

write.table(exclude, opt$output, quote = F, row.names = F, col.names = F)
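For readers following the Python/Snakemake side of the pipeline, the same selection logic can be restated with pandas. This is an equivalent sketch, not part of the pipeline; the column names are PLINK's .genome fields and the file names mirror the IBD rules further down.

import numpy as np
import pandas as pd

rel = pd.read_table("accessory/related_IBS_results.txt", sep=r"\s+")  # one row per related pair
rel["ID1"] = rel["FID1"].astype(str) + "]" + rel["IID1"].astype(str)
rel["ID2"] = rel["FID2"].astype(str) + "]" + rel["IID2"].astype(str)

# Samples appearing in more than one pair are excluded outright.
counts = pd.concat([rel["ID1"], rel["ID2"]]).value_counts()
multi = set(counts[counts > 1].index)

# For pairs whose members are otherwise unrelated, drop one member at random.
single = rel[~rel["ID1"].isin(multi) & ~rel["ID2"].isin(multi)]
rng = np.random.default_rng()
picks = np.where(rng.integers(0, 2, len(single)) == 1, single["ID1"], single["ID2"])

exclude = pd.DataFrame({"ID": list(multi) + list(picks)})
exclude[["FID", "IID"]] = exclude["ID"].str.split("]", expand=True)
exclude[["FID", "IID"]].to_csv("accessory/excluded_related_samples.txt",
                               sep=" ", header=False, index=False)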
shell:
    "rm -r input; rm -r plink/*; rm -r data; rm *txt; rm accessory/*"
run:
    cmd = f"{CMD_PREFIX} plink --bfile {QUERY} --keep-allele-order --make-bed --maf {config['gwas']['maf']} --allow-no-sex --out plink/{BASE}_sub {plnk_rsrc(rule, plink_run_specs)}"
    if os.path.exists(config['samples']): cmd += f" --keep {config['samples']}"
    if os.path.exists(config['sites']): cmd += f" --exclude {config['sites']}"
    cmd += f"; sed -i 's/#//g' plink/{BASE}_sub.fam" #This removes the # character from any sample names which will trip up R in downstream steps
    shell(cmd)
    shell(f"{CMD_PREFIX} awk {{params.awk_string}}; {CMD_PREFIX} plink --bfile plink/{BASE}_sub --keep accessory/control_samples.txt --make-bed --out plink/{BASE}_sub_ctrls {plnk_rsrc(rule, plink_run_specs)}")
run:
    import pandas as pd
    shell(f"{CMD_PREFIX} plink --bfile plink/{BASE}_sub --test-missing --allow-no-sex --out plink/{BASE}_sub") #Use plink to calculate stats
    shell(f"{CMD_PREFIX} plink --bfile plink/{BASE}_sub --missing --allow-no-sex --out plink/{BASE}_sub")
    mbc = pd.read_table(f"plink/{BASE}_sub.missing", sep = r"\s+") #Read output of PLINK
    mis = pd.read_table(f"plink/{BASE}_sub.lmiss", sep = r"\s+")
    dat = pd.concat([mbc[mbc['P'] < float(config['gwas']['mbc_pval'])]['SNP'], mis[mis['F_MISS'] > float(config['gwas']['missingness'])]['SNP']]).drop_duplicates() #Extract SNPs to exclude from PLINK results
    dat.to_csv('accessory/missing_filter.txt', header=False,index=False) #Write SNPs to be filtered
    shell(f"{CMD_PREFIX} plink --bfile plink/{BASE}_sub --exclude accessory/missing_filter.txt --keep-allele-order --make-bed --out plink/{BASE}_sub_mflt") #Filter sites from full dataset
    shell(f"{CMD_PREFIX} plink --bfile plink/{BASE}_sub_ctrls --exclude accessory/missing_filter.txt --keep-allele-order --make-bed --out plink/{BASE}_sub_ctrls_mflt") #Filter sites from control
run:
    shell(f"{CMD_PREFIX} cat plink/{BASE}.genome.* | gzip > {BASE}.genome.gz; mv {BASE}.genome.gz plink")
run:
    shell(f"zcat {{input[1]}} | awk {{params.awk_string}}; " #Extract related individuals
    f"{CMD_PREFIX} Rscript scripts/filter_relateds.R -r accessory/related_IBS_ctrls.txt -o accessory/excluded_related_ctrls.txt; " #This script cycles through related pairs and keeps one 
    f"{CMD_PREFIX} plink --bfile plink/{BASE}_LDprune_ctrls --remove accessory/excluded_related_ctrls.txt --keep-allele-order --make-bed --out plink/{BASE}_LDp_ctrls_IBDflt {plnk_rsrc(rule, plink_run_specs)}")
run:
    shell(f"{CMD_PREFIX} plink --bfile plink/{BASE}_LDp_ctrls_IBDflt --keep plink/DS/{BASE}_LDprune_ctrls_DS{{wildcards.round}}.txt --maf {config['sHWE']['maf']} "
          f"--geno 0.0 --keep-allele-order --make-bed --out plink/DS/{BASE}_LDprune_ctrls_DS{{wildcards.round}} {plnk_rsrc(rule, plink_run_specs)}")
run:
    with open('accessory/sHWE_downsampled_snps.txt', 'r') as tested: #count up the SNPs that were combined in the previous rule.
        num_tested = sum(1 for _ in tested)
    if num_tested < int(config['sHWE']['test_threshold']): #If not enough SNPs will be tested, print an error message and DO NOT create the output file.  The documentation of the pipeline encourages the user to adjust the downsampling parameters (more rounds, fewer individuals, lower threshold, etc.)
        print(f"Failed Checkpoint\nReason: Fewer than {config['sHWE']['test_threshold']} unique SNPs downsampled for sHWE (number downsampled = {num_tested})\n"
              f"Increase the value of the \'downsample_rounds\' entry in the \'sHWE\' section of the config file or reduce the sHWE \'test_threshold\'")
        shell("cp accessory/sHWE_downsampled_snps.txt accessory/.sHWE_downsampled_snps.txt; rm accessory/sHWE_downsampled_snps.txt") #This removes the output of the prior rule, prompting re-creation of accessory/sHWE_downsampled_snps.txt.  This is necessary to re-trigger the downsampling process when the user changes the downsampling parameters in the config file.
    else:
        shell('touch accessory/.sHWE_downsample_pass.txt') #If pass, then create the hidden file required by the next rule
run:
    import pandas as pd #Local imports used by this block (pandas is imported per-rule elsewhere in this Snakefile; scipy supplies the entropy function)
    import scipy.stats
    max_entropy = -9
    best_popnum = -9
    with open("sHWE/entropy.txt", 'w') as ent_file:
        for file in input: #Loop over input files and calculate entropy of each
            popnum = file.split("_")[-1] #Extract population number from filename (needed in both branches below)
            if not os.stat(file).st_size == 0: #Make sure file is not empty
                dat = pd.read_table(file, header=None) # read in vector of p-values from sHWE
                binned = pd.cut(dat.iloc[:,0], params.bins).value_counts(sort=False) # bin and count vector values
                entropy = scipy.stats.entropy(binned.iloc[1:,]) # Calculate entropy of bin counts, excluding the first bin, which contains the true positives (i.e. sites truly out of HWE)
                ent_file.write(f"{popnum}\t{entropy}\n")
                if entropy > max_entropy: #Store current value if better than prior best result.
                    best_popnum = popnum
                    max_entropy = entropy
            else:
                print(f"sHWE calculation failed for {popnum}; debug or remove this value from sHWE -> pop_nums in the config.yml file")
    assert(max_entropy != -9 and best_popnum != -9) #Make sure that we have successfully identified an optimum population number
    with open("sHWE/.optimum_popnum.txt", 'w') as outfile: outfile.write(str(best_popnum)) #Write the optimum population number to hidden file.
run:
    with open("sHWE/.optimum_popnum.txt", 'r') as popfile: #Grab the estimated population number from the hidden file
        popnum = popfile.readline()
    shell(f"{CMD_PREFIX} Rscript {CODE}/run_sHWE.R -p plink/DS/{BASE}_LDprune_ctrls_DS{{wildcards.f}} -d {popnum} -t {{params.threshold}} -o sHWE/{BASE}_sHWE-DS{{wildcards.f}} || touch sHWE/{BASE}_sHWE-DS{{wildcards.f}}")
run:
    file_list = [x for x in glob.glob(f"plink/DS/{BASE}_LDprune_ctrls_DS*.bim")
                 if os.path.exists(f"sHWE/{BASE}_sHWE-DS{int(x.strip('.bim').replace('DS', '').split('_')[-1])}.rmv.bim")] #Create file list of all successful runs of sHWE (the .rmv.bim file will be present if sHWE completed successfully)
    with open('accessory/.tested_file_list.txt', 'w') as tested: #Write file list to hidden file
        for file in file_list: tested.write(f"{file}\n")
    if file_list: shell(f"find plink/DS/ -name \'{BASE}_LDprune_ctrls_DS*.bim\' | grep -v -w accessory/.tested_file_list.txt | xargs cat | cut -d \" \" -f 2 | sort -T {os.getcwd()} | uniq > accessory/sHWE_tested_snps.txt") #This command prints markers from the input .bim files IFF they were successfully tested (in our file list)
run:
    with open('accessory/sHWE_tested_snps.txt', 'r') as tested: #Count up tested SNPs in file
        num_tested = sum(1 for _ in tested)
    if num_tested < int(config['sHWE']['test_threshold']):
        print(f"Failed Checkpoint\nReason: Only {num_tested} markers were tested for sHWE, which is fewer than requested threshold ({config['sHWE']['test_threshold']})\n"
              f"Increase the value set for \'downsample_rounds\' entry in the \'sHWE\' section of the config file")
        shell("cp accessory/sHWE_tested_snps.txt accessory/.sHWE_tested_snps.txt; rm accessory/sHWE_tested_snps.txt; rm accessory/sHWE_downsampled_snps.txt")
    else:
        shell('touch accessory/.sHWE_pass.txt')
run:
    if config['IBD']['filter_relateds'] == 'true':
        shell(f"zcat {{input[1]}} | awk {{params.awk_string}}; {CMD_PREFIX} Rscript {CODE}/filter_relateds.R -r accessory/related_IBS_results.txt -o accessory/excluded_related_samples.txt; "
              f"{CMD_PREFIX} plink --bfile plink/{BASE}_LDp_sHWE --remove accessory/excluded_related_samples.txt --keep-allele-order --make-bed --out plink/{BASE}_LDp_sHWE_IBDflt {plnk_rsrc(rule, plink_run_specs)}")
    else:
        shell(f"zcat {{input[1]}} | awk {{params.awk_string}}; {CMD_PREFIX} Rscript {CODE}/filter_relateds.R -r accessory/related_IBS_results.txt -o accessory/excluded_related_samples.txt; "
              f"{CMD_PREFIX} plink --bfile plink/{BASE}_LDp_sHWE --keep-allele-order --make-bed --out plink/{BASE}_LDp_sHWE_IBDflt {plnk_rsrc(rule, plink_run_specs)}")
run:
    if config['match_controls'] == 'true': #Run control-matching and parse results into a format that PLINK can use to extract matched samples.
        shell(f"{CMD_PREFIX} plink --bfile plink/{BASE}_LDp_sHWE_IBDflt --allow-no-sex --read-genome {{input[1]}} --cluster cc --mcc 1 {config['control_number']} --out plink/{BASE} {plnk_rsrc(rule, plink_run_specs)}; "
              f"{CMD_PREFIX} " + "awk '{{if(NF!=2) for (i = 2; i <= NF; i++) print $i}}'" + f" plink/{BASE}.cluster1 " + f"| cut -d \"(\" -f 1 | sed 's/_/\\t/' > accessory/samples2.txt; "
              "awk '{{split($2,a,\"_\"); printf(\"%s\\n%s\\n\",$2,a[1])}}' accessory/samples2.txt | sed 's/#//g' > accessory/samples.txt") #The sed command is a holdover from dealing with pound signs in sample names in StJude data
        shell(f"{CMD_PREFIX} plink --bfile plink/{BASE}_sub_mflt --keep accessory/samples2.txt --make-bed --out input/{BASE}_CtlMat {plnk_rsrc(rule, plink_run_specs)}; "
              f"{CMD_PREFIX} plink --bfile plink/{BASE}_LDp_sHWE_IBDflt --keep accessory/samples2.txt --make-bed --out input/{BASE}_PCA {plnk_rsrc(rule, plink_run_specs)}")
    else:
        shell("{CMD_PREFIX} awk '{{split($2,a,\"_\"); printf(\"%s\\n%s\\n\",$2,a[1])}}' plink/" + BASE + "_LDp_sHWE_IBDflt.fam | sed 's/#//g' > accessory/samples.txt")
        shell(f"{CMD_PREFIX} plink --bfile plink/{BASE}_sub_mflt --make-bed --out input/{BASE}_CtlMat {plnk_rsrc(rule, plink_run_specs)}; "
              f"{CMD_PREFIX} plink --bfile plink/{BASE}_LDp_sHWE_IBDflt --make-bed --out input/{BASE}_PCA {plnk_rsrc(rule, plink_run_specs)}")
run:
    shell(f"head -n 1 {BASE}_chr1.admixmap.txt > header.txt; tail -q -n +2  {BASE}_chr*.admixmap.txt > body.txt; cat header.txt body.txt > {BASE}.admixmap.txt; rm header.txt; rm body.txt")
run:
    cmd = f"{CMD_PREFIX} Rscript {CODE}/genesis_prep.R -p plink/{BASE}_sub_mflt -k {{input[1]}} -n {config['gwas']['pc_num']} -o input/{BASE} -t {config['kinship_threshold']}"
    if os.path.exists(config['covariate_file']): #Not tested
        cmd += f" -C {config['covariate_file']}"
    shell(cmd)
run:
    cmd = f"{CMD_PREFIX} Rscript {CODE}/genesis_gwas.R -g {{input[0]}} -p {{input[1]}} -k {{input[2]}} -n {config['gwas']['pc_num']} -C {config['gwas']['other_predictors']} -O {os.getcwd()} -o {BASE}.genesis.txt -c {{threads}}"
    shell(cmd)
run:
    if os.path.exists(VCF):
        cmd = f"{CMD_PREFIX} plink2 --vcf {VCF} dosage=HDS --extract plink/{BASE}_sub_mflt.bim --const-fid 0 "
        if os.path.exists(config['samples']):
            cmd += f" --keep {config['samples']} "
        cmd += f" --make-pgen --out input/{BASE}_sub {plnk_rsrc(rule, plink_run_specs)}; " \
               f"{CMD_PREFIX} plink2 --pfile input/{BASE}_sub --export vcf vcf-dosage=HDS-force --out input/{BASE}_sub;" \
               f"{CMD_PREFIX} bgzip input/{BASE}_sub.vcf; {CMD_PREFIX} tabix input/{BASE}_sub.vcf.gz"
        print(cmd)
        shell(cmd)
    else: print(f"Did not find VCF file at provided path: {VCF}")
run:
    with open("scripts/gather_report_data.sh", 'w') as report_cmds:
        report_line = f"echo \'rmarkdown::render(\"scripts/admixMap-report.Rmd\", output_file=\"{BASE}-Mapping-report.pdf\", " \
                      f"params=list(genesis_rslt=\"{BASE}.genesis.txt\", " \
                      f"fam_file=\"plink/{BASE}_sub_mflt.fam\", " \
                      f"input_directory=\"input/\", " \
                      f"blink_rslt=\"-9\", " \
                      f"gen_since_admix=\"{config['admixMapping']['gen_since_admixture']}\", " \
                      f"gwas_threshold=\"{config['gwas']['sig_threshold']}\", " \
                      f"ancestry_comps=\"{config['admixMapping']['ancestry_predictors']}\", " \ 
                      f"pruned_bim=\"plink/{BASE}_LDp_sHWE.bim\", " \
                      f"sHWE_entropy=\"sHWE/entropy.txt\", " \
                      f"sHWE_markers=\"accessory/sHWE_tested_snps.txt\", " \
                      f"ibd_ctrls=\"plink/{BASE}_LDp_ctrls_IBDflt.fam\", " \
                      f"eigenval=\"input/{BASE}.pcair.varprop\", " \
                      f"phenodat=\"input/{BASE}.genesis.pdat\", " \
                      f"kinplot=\"input/{BASE}.kinplot.png\", " \
                      f"mflt=\"accessory/missing_filter.txt\", " \
                      f"rulegraph_file=\"{BASE}-rulegraph.png\", "
        if f"{BASE}.admixmap.txt" in input:
            report_line += f"admixMap_rslt=\"{BASE}.admixmap.txt\", global_ancestry=\"{BASE}.globalancestry.txt\","
        else:
            report_line += f"admixMap_rslt=\"-9\", global_ancestry=\"-9\","
            shell(f"touch {BASE}.admixmap.sig.txt")
        if f"{BASE}.dos.genesis.txt" in input: report_line += f"dosage_rslt=\"{BASE}.dos.genesis.txt\","
        else:
            report_line += f"dosage_rslt=\"-9\","
            shell(f"touch {BASE}.dos.genesis.sig.txt")
        report_line += f"config_file=\"workflow/config.yml\"))\' | R --vanilla"
        report_cmds.write(report_line)
    shell(f"{CMD_PREFIX} sh scripts/gather_report_data.sh; mv scripts/{BASE}-Mapping-report.pdf {BASE}-Mapping-report.pdf; mv input/*kinplot.png figures")
run:
    shell(f"cut -f 1 {BASE}.genesis.sig.txt > {BASE}.genesis.sigID.txt")
    shell(f"{CMD_PREFIX} plink --bfile plink/{BASE}_sub_mflt --extract {BASE}.genesis.sigID.txt --keep-allele-order --freq case-control --allow-no-sex --recode vcf --out {BASE}.genesis.sig {plnk_rsrc(rule, plink_run_specs)}")
    cmd = f"{CMD_PREFIX} /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx8G -jar /snpEff/snpEff.jar ann -t GRCh37.75 {BASE}.genesis.sig.vcf | " \
          "awk '{{split($8,a,\"|\"); split(a[1],b,\"=\"); print $1,$2,$4,$5,a[4],a[3],a[2],b[2]}}' | grep -v \"#\" > {BASE}.genesis.sig.preannt.txt"
    shell(cmd)
    for soft in ['genesis']:
        cmd = f"{CMD_PREFIX} Rscript {CODE}/merge_annt.R -i {BASE}.{soft}.txt -s {BASE}.{soft}.sig.preannt.txt -r {config['annotation']['rsIDs']} -f {BASE}.{soft}.sig.frq.cc -o {BASE}.{soft}.sig.annt.txt"
        if os.path.exists(config['annotation']['typed_key']): cmd += f" -t {config['annotation']['typed_key']}"
        shell(cmd)
run:
    dat = pd.read_table(input[0], sep = r"\s+")
    dat = dat[dat['ANC.p.value'] < float(config['admixMapping']['sig_threshold'])]
    dat[['chm', 'spos', 'epos', 'ANC.estimate', 'ANC.std.error', 'ANC.statistic', 'ANC.p.value', 'anc']].to_csv(f"{BASE}.admixmap.tmp.bed", header=False, sep = "\t", index=False)
    gff = f"{os.getcwd()}/accessory/{os.path.basename(config['annotation']['gff'])}"
    genome = f"{os.getcwd()}/accessory/{os.path.basename(config['annotation']['genome'])}" #File containing chromosome sizes
    if config['annotation']['download_refs'] == 'true':
        shell(f" wget {config['annotation']['gff']} -O {gff}")
        shell(f" wget {config['annotation']['genome']} -O {genome}")
    elif os.path.exists(config['annotation']['gff']) and os.path.exists(config['annotation']['genome']):
        shell(f"cp {config['annotation']['gff']} {gff}")
        shell(f"cp {config['annotation']['genome']} {genome}")
    else:
        print("Could not find reference .gff and/or .genome file.  Check config file that paths are correctly specified and/or download requested.")
    shell(f"{CMD_PREFIX} bedtools slop -i {BASE}.admixmap.tmp.bed -g {genome} -b {config['admixMapping']['annotation_buffer']} > {BASE}.admixmap.bed")
    cmd = f"{CMD_PREFIX} bedtools intersect -a {BASE}.admixmap.bed -b {gff} -loj | "  \
    "awk '{{if($11==\"gene\") print $0}}' | awk '{{split($NF,a,\";\"); $NF=\"\"; for(x in a) if(a[x] ~ /Name=/) print $0,a[x]}}' | sed 's/Name=//' > {BASE}.admixmap.preannt.txt"
    shell(cmd)
    shell(f"{CMD_PREFIX} Rscript {CODE}/admixAnnt.R -i input/ -p {BASE}.admixmap.preannt.txt -b {config['admixMapping']['annotation_buffer']} -o {{output}}")
run:
    if os.path.exists(config['conditional_analysis']['snp_list']):
        if not os.path.exists('conditional-analysis'): os.mkdir('conditional-analysis')
        chrom, pos = f"{wildcards.marker}".split("-")
        shell(f"{CMD_PREFIX} plink --bfile plink/{BASE}_sub_mflt --snp {wildcards.marker.replace('-',':')} --recodeAD --out conditional-analysis/{wildcards.marker} {plnk_rsrc(rule, plink_run_specs)} || true")
        if os.path.exists(f"conditional-analysis/{wildcards.marker}.raw"):
            shell(f"{CMD_PREFIX} Rscript {CODE}/genesis_conditional_analysis.R -a {chrom} -b {pos} -d {config['conditional_analysis']['distance']} "
                  f"-m conditional-analysis/{wildcards.marker}.raw "
                  f"-g {{input[0]}} -p {{input[1]}} -k {{input[2]}} -n {config['gwas']['pc_num']} -C {config['gwas']['other_predictors']} "
                  f"-o conditional-analysis/{BASE}-{wildcards.marker}.genesis.txt")
        else:
            print(f"Did not find {wildcards.marker} in plink/{BASE}_sub_mflt")
            shell(f"touch conditional-analysis/{BASE}-{wildcards.marker}.genesis.txt")
run:
    if not os.path.exists('fine-mapping'): os.mkdir('fine-mapping')
    chrom, pos = f"{wildcards.marker}".split("-")
    shell(f"{CMD_PREFIX} plink2 --pfile input/{BASE}_sub --snp {wildcards.marker.replace('-',':')} --window {int(config['fine_mapping']['distance'])} "
          f"--export A --out fine-mapping/{wildcards.marker} {plnk_rsrc(rule, plink_run_specs)} || true")
    #NOTE: phenotype data is not specified in dosage input.  Need to grab and merge from data in plink folder.
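On the note above: the .raw file written by plink2 --export A starts with FID, IID, PAT, MAT, SEX and PHENOTYPE columns followed by one dosage column per variant, and the PHENOTYPE column carries no real phenotype here, so a downstream step would need to join the dosages against the phenotype table written by genesis_prep.R. A hedged sketch of that join (the marker name, the placeholder for {BASE}, and the .pdat sample-ID column are assumptions):

import pandas as pd

dos = pd.read_table("fine-mapping/1-123456.raw", sep=r"\s+")     # hypothetical marker output
pheno = pd.read_table("input/example.genesis.pdat", sep=r"\s+")  # written by genesis_prep.R

# Drop PLINK's empty PHENOTYPE column and merge on sample ID.
dos = dos.drop(columns=["PHENOTYPE"])
merged = dos.merge(pheno, left_on="IID", right_on="sample.id", how="inner")  # assumed ID column
merged.to_csv("fine-mapping/1-123456.with_pheno.txt", sep="\t", index=False)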
URL: https://github.com/Monia234/admixMap
Name: admixmap
Version: 1
Copyright: Public Domain
License: None