Evaluating the robustness of polygenic adaptation to the choice of the GWAS cohort


A - Neutrality test for polygenic scores

Workflow


Step 0: download software and packages

This workflow is built in Snakemake. Please download Snakemake following the instructions here: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html

Install glactools to calculate allele frequencies for a given population panel. Follow the instructions here: https://github.com/grenaud/glactools

Step 1: download all files

You should modify the values for all keys in config.yaml with your own paths and directories (do not modify the key names unless you change them accordingly in all Snakemake files). If you are working with GRCh37, download files from:

# EPO file
wget ftp://ftp.healthtech.dtu.dk/public/EPO/all_hg19.epo.gz; wget ftp://ftp.healthtech.dtu.dk/public/EPO/all_hg19.epo.gz.tbi
# FAI file
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
# LD blocks
# Download according to the ancestry of the populations you are working with, e.g. EUR
wget https://bitbucket.org/nygcresearch/ldetect-data/src/master/EUR/fourier_ls-all.bed
# Summary stats
# You can find a list of traits available for the UKBB by Neale lab at: https://docs.google.com/spreadsheets/d/1kvPoupSzsSFBNSztMzl04xMoSC3Kcx3CrjVf4yBmESU/edit#gid=178908679. For instance, for height: 
wget https://broad-ukb-sumstats-us-east-1.s3.amazonaws.com/round2/additive-tsvs/50_irnt.gwas.imputed_v3.both_sexes.tsv.bgz -O 50_irnt.gwas.imputed_v3.both_sexes.tsv.bgz
# VCF file - 1000 Genomes can be downloaded from here: 
# http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ 

Most files are also available for GRCh38; however, you will need to lift over the LD blocks files.
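For orientation, a minimal config.yaml could look like the sketch below. Only 'pops_panel' and 'acf_file' are key names referenced in this text; the remaining key names are illustrative and must match whatever your .smk files expect:

# Illustrative config.yaml; adapt paths and keep key names in sync with the .smk files
pops_panel: /path/to/pops_panel_example_1000GP.txt   # sample-to-population panel (see step 2)
acf_file: /path/to/allpops.acf.gz                    # final output of vcf2acf.smk
epo: /path/to/all_hg19.epo.gz                        # hypothetical key name
fai: /path/to/human_g1k_v37.fasta.fai                # hypothetical key name
lbd: /path/to/fourier_ls-all.bed                     # hypothetical key name
sumstats: /path/to/50_irnt.gwas.imputed_v3.both_sexes.tsv.bgz   # hypothetical key name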

Step 2: Genomic data and allele frequencies

Get the allele frequencies of the populations of interest (e.g. populations from the 1000 Genomes Project) by running vcf2acf.smk with Snakemake. The first rule prepares your VCF file (filters at the variant and/or individual level).

In the config file, add the path/filename under 'pops_panel'. This is a two-column, tab-separated file with no header; the first column corresponds to the sample ID and the second to the population it belongs to. Example:

head -n1 pops_panel_example_1000GP.txt
HG00096 GBR
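If you are working with the 1000 Genomes Project, such a panel can be built from the sample list shipped with the release (assuming the standard integrated_call_samples_v3.20130502.ALL.panel layout, with a header line and sample ID/population in the first two columns):

awk 'BEGIN{OFS="\t"} NR>1 {print $1,$2}' integrated_call_samples_v3.20130502.ALL.panel > pops_panel_example_1000GP.txt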

Once the config file is ready, run Snakemake as follows:

snakemake --snakefile path/to/vcf2acf.smk --cores XX

The final output of step 2 is the file under the key 'acf_file' in config.yaml. It contains the allele frequencies for each site in each population. Intermediate files will be deleted (remove the temp() calls within the rules to keep them). Output example (use glactools view to check the output):

#chr coord REF,ALT root anc GBR
1 19276348 G,A 0,0:0 0,0:0 192,14:0
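This view can be reproduced with, e.g. (ACF file name illustrative):

glactools view allpops.acf.gz | head -2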

More info on how to use glactools is here: https://github.com/grenaud/glactools

Compute Qx statistic

Before computing polygenic scores, if you are working with more than one GWAS, make sure their effect sizes are polarized by the same allele (e.g. the derived allele). The Qx statistic then tests whether the resulting polygenic scores are overdispersed across populations relative to neutral drift (under neutrality it follows a chi-squared distribution).
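As a minimal illustration (not part of the pipeline; file and column names are hypothetical), polarizing betas by the derived allele amounts to flipping the sign wherever the effect allele is not the derived one:

import pandas as pd

# Hypothetical summary stats with columns: effect_allele, derived_allele, beta
df = pd.read_csv("sumstats.tsv", sep="\t")

# Flip the sign of beta where the effect allele is not the derived allele
flip = df["effect_allele"] != df["derived_allele"]
df.loc[flip, "beta"] *= -1

df.to_csv("sumstats_derived.tsv", sep="\t", index=False)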

A. rule polyAdapt_freqs

Step 1: Run scripts/acf2ukbfreq_byP.py to get a tabulated file with all the info needed. The output contains the population allele frequencies and some extra information about each variant (chromosome, position, alleles, beta, etc.). It currently only works for the UK Biobank; however, you can easily modify it if you are using different summary stats: where 'gwasfile' is read, make sure the column numbers and names of the variables of interest correspond to those in your summary stats (e.g. the beta value). ONLY LINES 3-43 WILL NEED MODIFICATIONS! The output contains a new column, DEREFFECT, as we polarized the betas by the derived allele for comparison reasons (by default the UK Biobank is polarized by allele 'a1').

#CHROM POS SNPID REF ALT ANC DER DEREFFECT SE PVAL GBR XXX
1 889638 1:889638 G C G C -2.18042e-04 3.36098e-03 9.48274e-01 14,192 XXX,XXX
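For reference, the Snakemake rule invokes the script and then sorts, compresses and indexes its output like this (taken from the rule's shell section below; file names are illustrative):

python scripts/acf2ukbfreq_byP_Alba.py -a pops_panel_example_1000GP.txt -g allpops.acf.gz -o gwasfreqs_pheno.tmp
cat <(head -1 gwasfreqs_pheno.tmp) <(tail -n+2 gwasfreqs_pheno.tmp | sort -k1,1 -k2,2g) | bgzip -c > gwasfreqs_pheno.tsv.gz
tabix -s 1 -b 2 -e 2 gwasfreqs_pheno.tsv.gz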

Step 2: Run scripts/partitionUKB_byP.py to get the "candidate variants". It uses the output from step 1 and the LD blocks from Berisa and Pickrell (2016). Within each block it extracts the variant with the lowest association p-value, provided at least one variant passes the genome-wide p-value threshold of 5e-8. The threshold can be changed with -p (e.g. -p 5e-08). Output example:

#CHROM POS SNPID DEREFFECT GBR XXX
1 930751 1:930751 5.22061e-03 34,172
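The corresponding command, mirroring the Snakemake rule (file names are illustrative):

python scripts/partitionUKB_byP.py -i gwasfreqs_pheno.tsv.gz -b fourier_ls-all.bed -o gwasfreqs_candidates_pheno.tsv -p 5e-08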

Step 3: Run scripts/extractneutral_byP.py to get the "neutral variants". As in step 2, you can change the p-value threshold defining "non-associated" variants with -p (e.g. -p 1e-5). The output file has the same format as in step 2.

#CHROM POS SNPID DEREFFECT GBR XXX
1 958953 1:958953 1.01310e-02 54,152
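Corresponding command (the rule uses -s 20, i.e. at least 20 valid SNPs between consecutive printed SNPs; file names are illustrative):

python scripts/extractneutral_byP.py -i gwasfreqs_pheno.tsv.gz -o gwasfreqs_neutral_pheno.tsv -s 20 -p 1e-5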

B. rule polyAdapt_qx

Step 4: Run scripts/CalcQX_edit4parallel_Alba.R

It generates two outputs:

  • QX_report: contains the polygenic score (PRS) for each population, the Qx value, the chi-squared p-value, the number of trait-associated SNPs and the sign-randomized p-value.

  • PRS: two-column text file

POP SCORE
GBR 0.55
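Alternatively, run the script directly; this is the example invocation from the script's own header (file names are illustrative):

Rscript scripts/CalcQX_edit4parallel_Alba.R -w gwasfreqs_candidates_pheno.tsv -e gwasfreqs_neutral_pheno.tsv -o qxreport.txt -s genscores.txt -n 1000 -j 1000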

To run the whole rule:

snakemake --snakefile path/to/polyadapt.smk --cores XX

NB! CURRENTLY UNDER DEVELOPMENT

B - Assessing different association methods

Quality control filters (a PLINK 2.0 sketch applying them follows the list):

  • Genotyped SNPs (~ 800,000)

  • MAF > 0.001

  • HWE p-value > 1e-10

  • Autosomes
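These filters could be applied with PLINK 2.0 along these lines (a sketch; file names are illustrative, and --chr 1-22 restricts the data to the autosomes):

plink2 --bfile genotyped \
       --maf 0.001 --hwe 1e-10 --chr 1-22 \
       --make-bed --out genotyped_qc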

Association model examples:

  • Linear model implemented in PLINK 2.0:

WHITE="/home/white"
BRI="/home/bri"
HEIGHT="/home/id_height.tsv" # tab-separated file with columns #FID, IID, HEIGHT
cd /home/white
plink2 --bgen allind.bgen --sample allind.sample --hwe 1e-10 --maf 0.001 --keep-fam filtered_ind.list.txt \
  --linear hide-covar --pheno $HEIGHT --covar allcov.tsv --covar-col-nums 3-4,8-27 --variance-standardize --out $WHITE

  • Linear mixed model implemented in BOLT-LMM (example not shown).

  • Meta-analysis implemented in METAL, either inverse-variance-based or sample-size-based; see the sketch below.
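A sketch of a METAL script covering both schemes (column and file names are illustrative):

# Run as: metal < metal_script.txt
# Inverse-variance-based scheme; use "SCHEME SAMPLESIZE" for the sample-size-based one
SCHEME STDERR
MARKER SNP
ALLELE A1 A2
EFFECT BETA
STDERR SE
# PVALUE and WEIGHT are the columns used by the sample-size scheme
PVALUE P
WEIGHT N
PROCESS cohort1_sumstats.tsv
PROCESS cohort2_sumstats.tsv
ANALYZE
QUIT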

Code Snippets

polyadapt.smk, rule polyAdapt_freqs (shell section, lines 35-42):
shell:
    """
    python scripts/acf2ukbfreq_byP_Alba.py  -a {input.popfile} -g {input.infile} -o {output.outmp}
    cat <(head -1 {output.outmp}) <(tail -n+2 {output.outmp} | sort -k1,1 -k2,2g) | bgzip -c > {output.outfile}
    tabix -s 1 -b 2 -e 2 {output.outfile}
    python scripts/partitionUKB_byP.py -i {output.outfile} -b {input.lbd} -o {output.candi} -p {params.pvalCan}
    python scripts/extractneutral_byP.py -i {output.outfile} -o {output.neut} -s 20 -p {params.pvalNeut}
    """
polyadapt.smk, rule polyAdapt_qx (shell section, lines 51-54):
shell:
    """
    Rscript scripts/CalcQX_edit4parallel_Alba.R -w {input.candi} -e {input.neut} -o {output.qx} -s {output.scores} -n 1000 -j 1000
    """
scripts/CalcQX_edit4parallel_Alba.R:
library("optparse")
# Example command line:
# Rscript CalcQX.R -w gwasfreqs_candidates_pheno.tsv -e gwasfreqs_neutral_pheno.tsv -o qxreport.txt -n 1000 -j 1000 -s genscores.txt 

option_list = list(
    make_option(c("-w", "--gwasfile"), type="character", default=NULL, help="GWAS input file name"),
    make_option(c("-e", "--neutfile"), type="character", default=NULL, help="Neutral input file name"),
    make_option(c("-o", "--outfile"), type="character", default="output.txt", help="Output file name"),
    make_option(c("-p", "--pops"), type="character", default="ALL", help="Populations to be tested"),
    make_option(c("-n", "--numrep"), type="numeric", default=NULL, help="Number of sign-randomized replicates"),
    make_option(c("-j", "--emprep"), type="numeric", default=NULL, help="Number of frequency-matched replicates"),
    make_option(c("-s", "--scorefile"), type="character", default=NULL, help="Score file name"),
    make_option(c("-l", "--leeway"), type="numeric", default=0.01, help="Leeway for empirical frequency matching"),
    make_option(c("-f", "--minorfreq"), type="numeric", default=0.05, help="Minor allele frequency cutoff for covariance matrix")
);

opt_parser = OptionParser(option_list=option_list);
opt = parse_args(opt_parser);

if (is.null(opt$gwasfile)){
    print_help(opt_parser)
    stop("GWAS file name must be supplied.\n", call.=FALSE)
}
if (is.null(opt$neutfile)){
    print_help(opt_parser)
    stop("Neutral file name must be supplied.\n", call.=FALSE)
}
if (is.null(opt$outfile)){
    print_help(opt_parser)
    stop("Output file name must be supplied.\n", call.=FALSE)
}

if (is.null(opt$scorefile)){
    print_help(opt_parser)
    stop("Score file name must be supplied.\n", call.=FALSE)
}


gwasfile <- opt$gwasfile
neutfile <- opt$neutfile
outfile <- opt$outfile
scorefile <- opt$scorefile
pops <- opt$pops
#pops <- strsplit(opt$pops,",")[[1]]
pseudorep <- opt$numrep
emprep <- opt$emprep

# Multiple-testing
popmatch <- opt$popmatch # NOTE: no --popmatch option is defined in option_list above, so this is always NULL
leeway <- opt$leeway
minorfreq <- opt$minorfreq


#pops <- strsplit(readLines(pops), ",")[[1]]
#pops <- c("ESN","MSL", "CEU", "TSI", "CDX", "JPT", "PEL")

# Parse "ancestral,derived" allele-count strings into per-population derived-allele
# frequencies, the overall derived-allele frequency, and the raw count matrices.
ObtainFreqs <- function(countdat){
    dercounts <- apply(countdat,c(1,2),function(x){splitted <- strsplit(x,",")[[1]]; return( as.numeric(splitted[2]) )})
    totalcounts <- apply(countdat,c(1,2),function(x){splitted <- strsplit(x,",")[[1]]; return( as.numeric(splitted[2])+as.numeric(splitted[1]) )})
    dersum <- apply(dercounts,1,function(x){sum(x)})
    totalsum <- apply(totalcounts,1,function(x){sum(x)})
    totalfreq <- dersum/totalsum
    freqs <- apply(countdat,c(1,2),function(x){splitted <- strsplit(x,",")[[1]];
        return( as.numeric(splitted[2]) / (as.numeric(splitted[2])+as.numeric(splitted[1])) )})
    return(list(as.matrix(freqs), totalfreq, as.matrix(dercounts),as.matrix(totalcounts)))
}

# Trim out white space
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

LoadCounts <- function(filename,pops){
    table <- as.matrix(read.table(filename,header=TRUE,sep="\t",strip.white=TRUE,comment.char="!"))
    if(paste(pops,collapse=",") == "ALL"){ tokeep <- seq(5,length(colnames(table)))
    } else{ tokeep <- which(colnames(table) %in% pops)}
    tokeep <- c(1,2,3,4,tokeep)
    table <- table[,tokeep]
    table[,1] <- trim(table[,1])
    table[,2] <- trim(table[,2])
    return(table)
}

# Compute the neutral covariance (F) matrix across populations from the
# non-associated SNPs, scaling frequencies by the mean heterozygosity.
ComputeFmat <- function(neut_leaves_freqs, neut_total_freqs){
    leaves <- colnames(neut_leaves_freqs)

    #checksegneut <- which( apply(neut_leaves_freqs,1,sum)/length(leaves) < 0.95  & apply(neut_leaves_freqs,1,sum)/length(leaves) > 0.05 )
    #neut_leaves_freqs <- neut_leaves_freqs[checksegneut,]
    checksegneut <- which(neut_total_freqs < 0.95  & neut_total_freqs > 0.05 )
    neut_leaves_freqs <- neut_leaves_freqs[checksegneut,]

    neut_leaves_freqs_means <- apply(neut_leaves_freqs, 1, mean)
    mean_hetero <- neut_leaves_freqs_means*(1-neut_leaves_freqs_means)
    numSNPs <- length(neut_leaves_freqs_means)

    Fmat <- sapply(seq(1,dim(neut_leaves_freqs)[2]),function(x){
        sapply(seq(1,dim(neut_leaves_freqs)[2]),function(y){
            cov(neut_leaves_freqs[,x]/sqrt(mean_hetero), neut_leaves_freqs[,y]/sqrt(mean_hetero))
        })
    })

    colnames(Fmat) <- colnames(neut_leaves_freqs)
    rownames(Fmat) <- colnames(neut_leaves_freqs)

    return(Fmat)
}


# Compute population genetic scores and the Q_X statistic, whose null
# distribution is chi-squared with rank(F_mat) degrees of freedom.
ChiSquared <- function(leaves_freqs,total_freqs,effects,F_mat, dercounts, totalcounts, randomize=FALSE){
    leaves <- colnames(leaves_freqs)
    checkseg <- which(total_freqs < 0.95  & total_freqs > 0.05 )
    leaves_freqs <- leaves_freqs[c(checkseg),]
    effects <- effects[c(checkseg)]
    leaves_freqs <- leaves_freqs[! is.infinite(effects),]
    effects <- effects[! is.infinite(effects)]

    # Randomize effects if necessary
    if(randomize == TRUE){effects <- effects * sample(c(-1,1),length(effects),replace=TRUE)}

    # Compute mean genetic values
    meangen <- apply(leaves_freqs * effects, 2, function(x){sum(x)})

    # Scale by average genetic value
    meangen <- (meangen - mean(meangen))

    # Compute the estimated ancestral genetic variance over all populations
    meanfreqs <- apply(leaves_freqs,1,mean)

    varmean <- sum(sapply(seq(1,length(meanfreqs)),function(i){
        score = meanfreqs[i]*(1-meanfreqs[i])*effects[i]^2
        return(score)
    }))

    meangenvec <- 2*meangen / sqrt( 4*varmean)

    # Compute the error of the PRS: variance of a Beta(alpha, beta)
    # posterior on each derived-allele frequency
    alpha <- dercounts + 1
    beta <- 1 + totalcounts - dercounts
    num <- alpha * beta
    denom <- (alpha + beta)^2 * (alpha + beta + 1)
    varpl <- num/denom

    # NOTE: assumes no SNPs were dropped above for infinite effect sizes
    varpl <- varpl[c(checkseg),]
    varprs <- effects^2 * varpl

    varprs <- apply(varprs, 2, function(x) sum(x))
    se <- sqrt(varprs)

    # meangen already has the mean subtracted
    normcilow <- 2*(meangen - 1.96*se)/sqrt(4*varmean)
    normcihi <- 2*(meangen + 1.96*se)/sqrt(4*varmean)

    # Compute Q_X statistic
    numerator <- t(meangen) %*% solve(F_mat) %*% meangen
    denominator <- varmean
    Qteststat <- numerator / denominator
    Pval <- 1 - pchisq(Qteststat, qr(F_mat)$rank)
    allstats <- c(Qteststat,Pval)

    return(list(allstats, meangenvec, normcihi, normcilow))
}


# Load GWAS data  
print("Loading data...")
data <- LoadCounts(gwasfile, pops)
leaves_counts <- as.data.frame(data[,seq(5,dim(data)[2])])
raw_leaves_freqs <- ObtainFreqs(leaves_counts)
leaves_freqs <- raw_leaves_freqs[[1]]
total_freqs <- raw_leaves_freqs[[2]]
dercounts <- raw_leaves_freqs[[3]]
totalcounts <- raw_leaves_freqs[[4]]
effects <- as.numeric(data[,4])


if (sum(apply(leaves_freqs, 1, function(x) sum(is.na(x)))) != 0){
    leaves_freqs <- na.omit(leaves_freqs)
    total_freqs <- na.omit(total_freqs)
    print(paste('Omitting', length(leaves_counts[,1]) - length(leaves_freqs[,1]),
                'out of', length(leaves_counts[,1]), 'SNPs with missing data', sep = ' '))
}

# Load Neutral data
neutdata <- LoadCounts(neutfile, pops)
neut_leaves_counts <- as.data.frame(neutdata[,seq(5,dim(neutdata)[2])])
raw_neut_leaves_freqs <- ObtainFreqs(neut_leaves_counts)
neut_leaves_freqs <- raw_neut_leaves_freqs[[1]]
neut_total_freqs <- raw_neut_leaves_freqs[[2]]

# Calculate covariance matrix
print("Computing covariance matrix...")
Fmat <- ComputeFmat(neut_leaves_freqs, neut_total_freqs)


# Calculate chi-squared statistics
print("Computing Q_X statistic...")
totaltest <- ChiSquared(leaves_freqs,total_freqs,effects,Fmat, dercounts, totalcounts, randomize=FALSE)
totalstat <- totaltest[[1]]
meangenvec <- totaltest[[2]]
meangenvechi <- totaltest[[3]]
meangenveclow <- totaltest[[4]]


qtab <- cbind(round(totalstat[1],3),totalstat[2])
colnames(qtab) <- c("Q_X","Pval")

#### Only calculate pval and genetic scores

write("Genetic scores",file=outfile,append=FALSE)
write(paste(names(meangenvec),collapse="\t"),file=outfile,append=TRUE)
write(paste(meangenvec,collapse="\t"),file=outfile,append=TRUE)
write("",file=outfile,append=TRUE)

printgenvec <- cbind(names(meangenvec),meangenvec, meangenvechi, meangenveclow)
rownames(printgenvec) <- c()
colnames(printgenvec) <- c("#POP","SCORE", "upperlim", "lowerlim")
write.table(printgenvec,file=scorefile,sep="\t",row.names=FALSE,col.names=TRUE, quote=FALSE,append=FALSE)

write(paste("Q_X:",qtab[1],sep="\t"),file=outfile,append=TRUE)
write("",file=outfile,append=TRUE)

write(paste("Chi-squared distribution P-value:",qtab[2],sep="\t"),file=outfile,append=TRUE)
write("",file=outfile,append=TRUE)

# Add the number of trait-associated SNPs
write(paste("Number of trait-associated SNPs:", length(effects), sep="\t"), file=outfile, append=TRUE)
write("",file=outfile,append=TRUE)

# Calculate sign-randomized chi-squared statistics
if( !is.null(pseudorep) ){
    print(paste("Computing sign-randomized P-values, using ",pseudorep," pseudo-replicates...",sep=""))
    totaltest <- ChiSquared(leaves_freqs,total_freqs,effects,Fmat, dercounts, totalcounts, randomize=TRUE)
    totalstat <- totaltest[[1]]
    randqtab <- round(totalstat[1],3)
    for(i in seq(2,pseudorep)){
        totaltest <- ChiSquared(leaves_freqs,total_freqs,effects,Fmat, dercounts, totalcounts, randomize=TRUE)
        totalstat <- totaltest[[1]]
        randqtab <- c( randqtab, round(totalstat[1],3) )
    }
    # Add 1 to numerator and denominator so the empirical p-value is never exactly zero
    randpval <- (1+sum( as.numeric(randqtab)  > as.numeric(qtab[1]) )) / (1+length(randqtab))

    write(paste("Sign-randomized P-value:",randpval,sep="\t"),file=outfile,append=TRUE)
    write("",file=outfile,append=TRUE)

}
scripts/extractneutral_byP.py:
import pysam
from optparse import OptionParser

parser = OptionParser("%prog [options]")
parser.add_option("-i", "--infile", dest="infile", help="Input GWAS+freq file", default=None, type="string")
parser.add_option("-p", "--minp", dest="minp", help="Minimum p-value allowed", default=1, type="float")
parser.add_option("-o", "--outfile", dest="outfile", help="Output file", default=None, type="string")
parser.add_option("-s", "--sep", dest="sep", help="Number of valid SNPs separating each printed SNP", default=None, type="int")
(options,args) = parser.parse_args()


infile = pysam.Tabixfile(options.infile, mode='r')
outfile = open(options.outfile,"w")

# The tabix header may span several lines; keep the last one (the column names)
readheader = infile.header
for x in readheader:
    header = x

# Drop the columns that are not needed downstream
header = header.split("\t")
header = list(filter(lambda a: a not in ("REF", "ALT", "ANC", "DER", "SE", "PVAL"), header))
outfile.write("\t".join(header) + '\n')

# Print every `sep`-th SNP whose association p-value is above the cutoff
i = 0
for line in infile.fetch():
    fields = line.strip().split("\t")
    Pval = float(fields[9])
    if Pval > options.minp:
        if i == options.sep:
            i = 0
            finalvec = fields[0:3] + [fields[7]] + fields[10:]
            outfile.write("\t".join(finalvec) + '\n')
        i += 1

outfile.close()
scripts/partitionUKB_byP.py:
import pysam
import numpy as np
from optparse import OptionParser

parser = OptionParser("%prog [options]")
parser.add_option("-i", "--infile", dest="infile", help="Input GWAS+freq file", default=None, type="string")
parser.add_option("-b", "--bedfile", dest="bedfile", help="Bed file with LD partitions (def None)", default=None, type="string")
parser.add_option("-p", "--maxp", dest="maxp", help="Maximum p-value allowed", default=1, type="float")
parser.add_option("-o", "--outfile", dest="outfile", help="Output file", default=None, type="string")
(options,args) = parser.parse_args()


bedfile = open(options.bedfile,"r")
infile = pysam.Tabixfile(options.infile, mode='r')
outfile = open(options.outfile,"w")
logmaxp = np.log10(options.maxp)

# The tabix header may span several lines; keep the last one (the column names)
readheader = infile.header
for x in readheader:
    header = x

# Drop the columns that are not needed downstream
header = header.split("\t")
header = list(filter(lambda a: a not in ("REF", "ALT", "ANC", "DER", "SE", "PVAL"), header))
outfile.write("\t".join(header) + '\n')

# Iterate over the LD blocks, keeping the top SNP per block
for line in bedfile:
    regfields = line.strip("\n").split("\t")
    regchr = regfields[0].strip()
    regstart = int(regfields[1])
    regend = int(regfields[2])

    CurrLogPval = 0
    CurrPval = None
    CurrBest = None
    try:
        for elem in infile.fetch(regchr, regstart - 1, regend):
            fields = elem.strip().split("\t")
            Pval = float(fields[9])
            if Pval == 0.0:
                Pval = 1e-20  # cap to avoid log10(0)
            LogPval = np.log10(Pval)
            if LogPval < CurrLogPval:
                CurrBest = fields[0:3] + [fields[7]] + fields[10:]
                CurrLogPval = LogPval
                CurrPval = Pval
    except ValueError:
        # Region absent from the tabix index; skip this block
        continue

    # Keep the block's top SNP only if it passes the p-value threshold
    if CurrLogPval < logmaxp and CurrBest is not None:
        outfile.write("\t".join(CurrBest) + '\n')

outfile.close()

Repository: https://github.com/albarema/GWAS_choice