Causal effect of modifiable risk factors on COVID

public public 1yr ago Version: v6 0 bookmarks

Esimating the shared genetic eitology and the causal association of potentially modifiable risk factors on SARS-CoV-2 infection and COVID-19 severity using genetic correlation (rg) and Mendelian randomization (MR).

Citation: The COVID-19 Host Genetics Initative. 2021. Mapping the human genetic architecture of COVID-19 by worldwide meta-analysis. MedRxiv . doi:10.1101/2021.03.10.21252820.

Data Avaliability

Exposures

A set of 38 disease, health and neuropsychiatric phenotypes were selected as potential COVID-19 risk factors based on their putative relevance to the disease susceptibility, severity, or mortality. Initial risk factors were selected based on guidelines from the CDC and OpenSafely .

  • docs/mrcovid_exposures.xlsx : excel file listing the traits and respective GWAS used as exposures.

Outcomes

Genome-wide assocations studies for COIVD-19 susceptibility and severity were obtained from Release 5 of the COVID-19 host genetics initiative .

Outcomes included:

  • SARS-CoV-2 reported infection

  • COVID-19 Hospitalization

  • COVID-19 Critical Illness

Data Analysis

The Snakemake workflow management system was used to implement pipelines for the rg and MR anlaysis. Rules defineding the snakemake workflows for the rg and MR analyses are avliable in workflow/rules/rg.smk and workflow/rules/mr.smk . Scripts for implementing polygenic risk score analyses were not utlized in this analysis. This pipeline is a adapted from Andrews et al (2020) Annals of Neurology .

Final output files used in the current publication are avaliable in docs/20210224 .

Genetic Correlation

Genetic correlations between the exposures and outcomes was estimated using LD Score (LDSC) regression.

Snakefile:

  • workflow/rules/rg.smk : workflow for implementing MR anlaysis

Configuration files:

  • config_rg_covid.yaml : parameters for implementing rg analysis

Script files:

  • workflow/scripts/rg_* : script files for implementing spefific rules in the rg workflow

MR

Mendelian randomization analysis was conducted using the TwoSampleMR package. Primary analysis was conducted using IVW to estimated causal relationships, while senstivity analysis were conducted using WME, WMBE, MR-Egger and MR-PRESSO.

Snakefile

  • workflow/rules/mr.smk : workflow for implementing MR anlaysis

Configuration files

  • config_covid_eur.yaml : analysis parameters for MR analysis COIVD-19 ouctomes using EUR only popultions

  • config_covid_eur_wo_ukbb.yaml : analysis parameters for MR analysis COIVD-19 ouctomes using EUR only popultions, without samples from UKBB

Script files

  • workflow/scripts/mr_* : script files for implementing spefific rules in MR workflow

Results

Figure 4

Genetic correlations and Mendelian randomization causal estimates between 38 traits and COVID-19 Critical Illness, COVID-19 Hospitalization and SARS-CoV-2 reported infection. Blue, negative genetic correlation and protective Mendelian randomization (MR) causal estimates; red, positive genetic correlation and risk MR causal estimates. Larger squares correspond to more significant P values, with genetic correlations or MR causal estimates significantly different from zero at a P < 0.05 shown as a full-sized square. Genetic correlations or causal estimates that are significantly different from zero at a false discovery rate (FDR) of 5% are marked with an asterisk. Forest plots display the causal estimates for each of the sensitivity analyses used in the MR analysis for trait pairs that were significant at an FDR of 5%. Individual scatter and funnel plots for each pair of traits are available in docs/20210224/MRcovideur_MRScatterFunnelPlots.pdf .

IVW: Inverse variance weighted analysis; WME: Weighted median estimator; WMBE: weighted mode based estimator; MR-PRESSO: Mendelian Randomization Pleiotropy RESidual Sum and Outlier. RBC: Red blood cell count

Genetic correlation results:

  • data/RGcovid contains intermediatry files generated during rg workflow

  • results/RGcovid contains final results for rg analysis

MR results

  • data/MRcovideur , and data/MRcovideurwoukbb contain intermediatry files generated during MR workflow

  • results/MRcovideur , and results/MRcovideurwoukbb contain final results for MR analysis

Code Snippets

83
84
script:
    '../scripts/mr_FormatGwas.R'
SnakeMake From line 83 of rules/mr.smk
107
108
script:
    '../scripts/mr_FormatGwas.R'
SnakeMake From line 107 of rules/mr.smk
124
125
126
127
128
129
130
shell:
    """
    plink --bfile {params.ref} --keep-allele-order --allow-no-sex --clump {input.ss}  --clump-r2 {params.r2} --clump-kb {params.kb} --clump-p1 1 --clump-p2 1 --out {params.out};
    #gzip -k {output.exp_clumped} --keep not working now??
    gzip {output.exp_clumped};
    zcat {output.exp_clumped_zipped} > {output.exp_clumped}
    """
142
143
script:
    '../scripts/mr_ExposureData.R'
SnakeMake From line 142 of rules/mr.smk
155
156
script:
    "../scripts/manhattan_plot.R"
SnakeMake From line 155 of rules/mr.smk
169
170
script:
    '../scripts/mr_OutcomeData.R'
SnakeMake From line 169 of rules/mr.smk
182
183
184
185
186
187
188
189
190
191
192
193
    shell:
        """
        if [ $(wc -l < {input.MissingSNPs}) -eq 0 ]; then
            touch {output.ProxyList}
          else
           plink --bfile {params.ref} \
           --keep-allele-order \
           --r2 dprime in-phase with-freqs \
           --ld-snp-list {input.MissingSNPs} \
           --ld-window-r2 0.8 --ld-window-kb 500 --ld-window 1000 --out {params.Outcome}
          fi
"""
207
208
script:
    '../scripts/mr_ExtractProxySNPs.R'
SnakeMake From line 207 of rules/mr.smk
226
script: "../scripts/mr_DataHarmonization.R"
SnakeMake From line 226 of rules/mr.smk
241
242
script:
    '../scripts/mr_MRPRESSO.R'
SnakeMake From line 241 of rules/mr.smk
254
255
script:
    '../scripts/mr_MRPRESSO_wo_outliers.R'
SnakeMake From line 254 of rules/mr.smk
268
script: '../scripts/mr_analysis.R'
SnakeMake From line 268 of rules/mr.smk
278
script: '../scripts/mr_SteigerTest.R'
SnakeMake From line 278 of rules/mr.smk
292
shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mr_steiger} > {output}"
SnakeMake From line 292 of rules/mr.smk
298
script: '../scripts/mr_PowerEstimates.R'
SnakeMake From line 298 of rules/mr.smk
312
shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mr_power} > {output}"
SnakeMake From line 312 of rules/mr.smk
329
330
script:
    '../scripts/mr_ConcatMRdat.R'
SnakeMake From line 329 of rules/mr.smk
353
354
shell:
    "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mrpresso_global} > {output}"
SnakeMake From line 353 of rules/mr.smk
361
362
shell:
    "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mrpresso_global_wo_outliers} > {output}"
SnakeMake From line 361 of rules/mr.smk
377
378
shell:
    "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mrpresso_res} > {output}"
SnakeMake From line 377 of rules/mr.smk
394
395
shell:
    "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mr_heterogenity} > {output}"
SnakeMake From line 394 of rules/mr.smk
411
412
shell:
    "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mr_pleiotropy} > {output}"
SnakeMake From line 411 of rules/mr.smk
428
429
shell:
    "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input.mr_results} > {output}"
SnakeMake From line 428 of rules/mr.smk
458
script: "../scripts/mr_RenderReport.R"
SnakeMake From line 458 of rules/mr.smk
488
script: "../scripts/mr_RenderFinalReport.R"
SnakeMake From line 488 of rules/mr.smk
35
shell: "zcat {input} | sed '/^[[:blank:]]*#/d;s/#.*//' > {output}"
SnakeMake From line 35 of rules/prs.smk
42
43
script:
    '../scripts/prs_sample_exclude.R'
SnakeMake From line 42 of rules/prs.smk
54
55
shell:
    "plink --bfile {params.input} --keep-allele-order --keep {input.samp_keep} --geno 0.05 --maf 0.01 --hwe 1e-10 midp include-nonctrl --make-bed --out {params.out}"
66
67
68
69
70
shell:
    """
    plink --bfile {params.input} --keep-allele-order --mind 0.05 --make-bed --out {params.out};
    awk '{{ print $2 }}\' {params.out}.bim > {params.out}.snplist.txt
    """
82
83
84
85
86
87
88
89
shell:
    """
    plink --bfile {params.ref} --keep-allele-order --allow-no-sex \
    --extract {input.snplist} \
    --clump {input.ss} --clump-snp-field ID --clump-field P \
    --clump-r2 {params.r2} --clump-kb {params.kb} --clump-p1 1 --clump-p2 1 \
    --out {params.out}
    """
98
script: "../scripts/prs_region_exclude.R"
SnakeMake From line 98 of rules/prs.smk
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
shell:
    """
    ./src/PRSice \
    --base {input.base} \
    --target {params.target} \
    --pheno-file {input.pheno} \
    --pheno-col {outcome} \
    --cov-file {input.covar} \
    --cov-col {covar} \
    --cov-factor {covar_factors} \
    --no-clump \
    --extract {input.snps} \
    --beta --A1 ALT --A2 REF --bp POS --chr CHROM --pvalue P --snp ID --stat BETA \
    --all-score \
    --binary-target T \
    --print-snp \
    --bar-levels {bar_levels} \
    --fastscore \
    --perm 1000 \
    --out {params.out}
    """
SnakeMake From line 112 of rules/prs.smk
137
script: "../scripts/prs_wrangle.R"
SnakeMake From line 137 of rules/prs.smk
150
151
script:
    '../scripts/prs_glm.R'
SnakeMake From line 150 of rules/prs.smk
160
script: '../scripts/prs_concat.R'
SnakeMake From line 160 of rules/prs.smk
166
script: '../scripts/prs_concat.R'
SnakeMake From line 166 of rules/prs.smk
172
script: '../scripts/prs_concat.R'
SnakeMake From line 172 of rules/prs.smk
185
script: '../scripts/prs_output.Rmd'
SnakeMake From line 185 of rules/prs.smk
61
62
63
64
65
66
67
68
69
shell:
    """
    Rscript workflow/src/HDL/HDL.data.wrangling.R \
    gwas.file={input} \
    LD.path=data/raw/UKB_imputed_SVD_eigen99_extraction \
    SNP={params.SNP} A1={params.A1} A2={params.A2} N={params.N} Z={params.Z} \
    output.file={params.outfile} \
    log.file={params.log}
    """
SnakeMake From line 61 of rules/rg.smk
123
shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input} > {output}"
SnakeMake From line 123 of rules/rg.smk
136
script: '../scripts/rg_plot_ldsc.R'
SnakeMake From line 136 of rules/rg.smk
147
148
149
150
151
152
153
154
155
shell:
    """
    Rscript workflow/src/HDL/HDL.run.R \
        gwas1.df={input.g1} \
        gwas2.df={input.g2} \
        LD.path={input.ld} \
        output.file={output};
    Rscript workflow/scripts/rg_extract_hdl.R {output.log}
    """
SnakeMake From line 147 of rules/rg.smk
160
shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input} > {output}"
SnakeMake From line 160 of rules/rg.smk
173
174
175
176
177
178
179
shell:
    """
    python2.7 workflow/src/GNOVA/gnova.py {input.g1} {input.g2} \
        --bfile data/raw/bfiles/eur_chr@_SNPmaf5 \
        --out {output.res};
    Rscript workflow/scripts/rg_extract_gnova.R {output.res} {params.g1} {params.g2}
    """
SnakeMake From line 173 of rules/rg.smk
184
shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input} > {output}"
SnakeMake From line 184 of rules/rg.smk
195
196
197
198
199
200
201
202
203
shell:
    """
    python3 workflow/src/SUPERGNOVA/supergnova.py {input.g1} {input.g2} \
        --bfile data/raw/bfiles/eur_chr@_SNPmaf5 \
        --partition workflow/src/SUPERGNOVA/partition/[email protected] \
        --thread 8 \
        --out {output};
    Rscript workflow/scripts/rg_extract_supergnova.R {output} {params.g1} {params.g2}
    """
SnakeMake From line 195 of rules/rg.smk
208
shell: "awk 'FNR==1 && NR!=1{{next;}}{{print}}' {input} > {output}"
SnakeMake From line 208 of rules/rg.smk
45
script: '../scripts/rg_region_exclude.R'
SnakeMake From line 45 of rules/test.smk
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
if(any(grepl("conda", .libPaths(), fixed = TRUE))){
  message("Setting libPaths")
  df = .libPaths()
  conda_i = which(grepl("conda", df, fixed = TRUE))
  .libPaths(c(df[conda_i], df[-conda_i]))
}
.libPaths(c(.libPaths(), "/hpc/users/harern01/miniconda3/envs/py38/lib/R/library"))
## Load Packages
library(tidyverse)
library(ggplot2)
library(ggman)
library(gridExtra)

snp.r2 <- function(eaf, beta){
  2*eaf*(1 - eaf)*beta^2
}

## Read in arguments
infile_gwas = snakemake@input[["ingwas"]]
infile_clump = snakemake@input[["inclump"]]
outfile_plot = snakemake@output[["out"]]
PlotTitle = snakemake@params[["PlotTitle"]]

message(paste(PlotTitle, '\n'))

## Read in GWAS and Plink Clumped File
trait.gwas <- vroom::vroom(infile_gwas, comment = '##',
                           col_select = c(SNP, CHROM, POS, BETA, AF, P),
                       col_types = list(SNP = col_character(),
                                        CHROM = col_double(),
                                        POS = col_double(),
                                        AF = col_number(),
                                        BETA = col_double(),
                                        P = col_double())) %>%
  mutate(r2 = snp.r2(AF, BETA))

trait.clump <- suppressMessages(read_table2(infile_clump)) %>%
  filter(!is.na(CHR)) %>%
  select(CHR, F, SNP, BP, P, TOTAL, NSIG)

## Count number of SNPs in clumped dataset
clump.n <- trait.clump %>%
  mutate(p.cut = cut(P, breaks = c(Inf, 0.05, 5e-5, 5e-6, 5e-7, 5e-8, -Inf), lablels = F)) %>%
  left_join(select(trait.gwas, SNP, r2)) %>%
  mutate(r2 = as.numeric(r2)) %>%
  group_by(p.cut) %>% summarise(n = n(), r2 = round(sum(r2, na.rm = T), 4))

## Count number of SNPs in total dataset
all.n <- trait.gwas %>%
  mutate(p.cut = cut(P, breaks = c(Inf, 0.05, 5e-5, 5e-6, 5e-7, 5e-8, -Inf), lablels = F)) %>%
  group_by(p.cut) %>% summarise(n = n())

## Merge Summary files
tab.TRAIT <- full_join(all.n, clump.n, by = 'p.cut', suffix = c('.all', '.clump'))

## Calculate Maximum Pvalue
max.p <- max(-log10(filter(trait.clump, P > 0)$P)) + 5

## Index SNPs
IndexSnps1 <- filter(trait.clump, P <= 5e-8)$SNP
IndexSnps2 <- filter(trait.clump, P <= 5e-6 & P > 5e-8)$SNP

## Plot GWAS
p.TRAIT <- ggman(filter(trait.gwas, P < 0.05), snp = 'SNP', chrom = 'CHROM', bp = 'POS', pvalue = 'P', ymin = 0, ymax = max.p,
                 title = PlotTitle, sigLine = -log10(5e-8), relative.positions = TRUE) +
  theme_classic() + theme(text = element_text(size=10)) +
  geom_hline(yintercept = -log10(5e-5), colour = 'blue', size = 0.25) +
  geom_hline(yintercept = -log10(5e-6), colour = 'red', size = 0.25, linetype = 2)

if(length(IndexSnps1) >= 1){
  p.TRAIT <- ggmanHighlight(p.TRAIT, highlight = IndexSnps1, shape = 18, colour = 'red')
}
if(length(IndexSnps2) >= 1){
  p.TRAIT <- ggmanHighlight(p.TRAIT, highlight = IndexSnps2, shape = 18, colour = 'blue')
}

out.TRAIT <- arrangeGrob(p.TRAIT,  tableGrob(tab.TRAIT, theme = ttheme_minimal(base_size = 8)),
                          ncol = 2, widths = 2:1, layout_matrix = rbind(c(1,2), c(1, NA)))

ggsave(outfile_plot, device = 'png', units = 'in', width = 10, height = 5, plot = out.TRAIT)
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
if(any(grepl("conda", .libPaths(), fixed = TRUE))){
  message("Setting libPaths")
  df = .libPaths()
  conda_i = which(grepl("conda", df, fixed = TRUE))
  .libPaths(c(df[conda_i], df[-conda_i]))
}

## LOAD packages
library(tidyr)
library(readr)
library(dplyr)
library(TwoSampleMR) ## For conducting MR https://mrcieu.github.io/TwoSampleMR/
source("workflow/scripts/miscfunctions.R", chdir = TRUE)

## Path to input/output
input = snakemake@input[["mrdat"]] # Harmonized MR data
output = snakemake@params[["out"]] # Output

##
mrdat <- read_csv(input) %>%
  filter(mr_keep == TRUE) %>%
  filter(pleitropy_keep == TRUE)

pt = mrdat %>% slice(1) %>% pull(pt)
n_outliers <- nrow(mrdat %>% filter(mr_keep == TRUE)) - nrow(mrdat %>% filter(mr_keep == TRUE) %>% filter(mrpresso_keep == TRUE))

## ================= w/ outliers ================= ##
## MR analysis
mr_res <- mr(mrdat, method_list = c("mr_ivw_fe", "mr_weighted_median", "mr_weighted_mode", "mr_egger_regression")) %>%
  as_tibble() %>%
  mutate(outliers_removed = FALSE) %>%
  mutate(n_outliers = n_outliers) %>%
  mutate(pt = pt) %>%
  select(id.exposure, id.outcome, outcome, exposure, pt, outliers_removed, method, nsnp, n_outliers, b, se, pval)

## Cochrans Q heterogeneity test
mr_heterogenity <- mr_heterogeneity(mrdat, method_list=c("mr_egger_regression", "mr_ivw")) %>%
  as_tibble() %>%
  mutate(outliers_removed = FALSE) %>%
  mutate(pt = pt) %>%
  select(id.exposure, id.outcome, outcome, exposure, pt, outliers_removed, method, Q, Q_df, Q_pval)

## MR Egger I2_Gx
isq <- Isq_gx(mrdat)

## MR Egger Test of Pliotropy
mr_plei <- mr_pleiotropy_test(mrdat) %>%
  as_tibble() %>%
  mutate(outliers_removed = FALSE) %>%
  mutate(pt = pt,
         Isq = isq) %>%
  select(id.exposure, id.outcome, outcome, exposure, pt, outliers_removed, Isq, egger_intercept, se, pval)

## ================= w/o outliers ================= ##
if(n_outliers >= 1){
  ## Remove outliers
  mrdat_mrpresso <- filter(mrdat, mrpresso_keep == T)

  ## MR Egger I2_Gx
  mrpresso_isq <- Isq_gx(mrdat_mrpresso)

  ## MR, Heterogenity, and Pleitorpy Tests
  mrpresso_res <- mr(mrdat_mrpresso, method_list = c("mr_ivw_fe", "mr_weighted_median", "mr_weighted_mode", "mr_egger_regression")) %>%
    as_tibble() %>%
    mutate(outliers_removed = TRUE) %>%
    mutate(n_outliers = n_outliers) %>%
    mutate(pt = pt) %>%
    select(id.exposure, id.outcome, outcome, exposure, pt, outliers_removed, method, nsnp, n_outliers, b, se, pval)
  mrpresso_heterogenity <- mr_heterogeneity(mrdat_mrpresso, method_list=c("mr_egger_regression", "mr_ivw")) %>%
    as_tibble() %>%
    mutate(outliers_removed = TRUE) %>%
    mutate(pt = pt) %>%
    select(id.exposure, id.outcome, outcome, exposure, pt, outliers_removed, method, Q, Q_df, Q_pval)
  mrpresso_plei <- mr_pleiotropy_test(mrdat_mrpresso) %>%
    as_tibble() %>%
    mutate(outliers_removed = TRUE) %>%
    mutate(pt = pt,
           Isq = mrpresso_isq) %>%
    select(id.exposure, id.outcome, outcome, exposure, pt, outliers_removed, Isq, egger_intercept, se, pval)

}else{
  ## If no outliers are removed, make empty dataframes
  mrpresso_res <- data.frame(id.exposure = as.character(mrdat[1,'id.exposure']),
                             id.outcome = as.character(mrdat[1,'id.outcome']),
                             outcome = as.character(mrdat[1,'outcome']),
                             exposure = as.character(mrdat[1,'exposure']),
                             pt = pt,
                             outliers_removed = NA,
                             method = 'mrpresso',
                             nsnp = NA,
                             n_outliers = NA,
                             b = NA,
                             se = NA,
                             pval = NA)
  mrpresso_heterogenity <- data.frame(
    id.exposure = as.character(mrdat[1,'id.exposure']),
    id.outcome = as.character(mrdat[1,'id.outcome']),
    outcome = as.character(mrdat[1,'outcome']),
    exposure = as.character(mrdat[1,'exposure']),
    pt = pt,
    outliers_removed = NA,
    method = 'mrpresso',
    Q = NA,
    Q_df = NA,
    Q_pval = NA)
  mrpresso_plei <- data.frame(
    id.exposure = as.character(mrdat[1,'id.exposure']),
    id.outcome = as.character(mrdat[1,'id.outcome']),
    outcome = as.character(mrdat[1,'outcome']),
    exposure = as.character(mrdat[1,'exposure']),
    pt = pt,
    outliers_removed = NA,
    Isq = NA,
    egger_intercept = NA,
    se = NA,
    pval = NA)
  }

## ================= bind results ================= ##
res_mr <- bind_rows(mr_res, mrpresso_res) %>%
  filter(!is.na(outliers_removed))
res_heterogenity <- bind_rows(mr_heterogenity, mrpresso_heterogenity) %>%
  filter(!is.na(outliers_removed))
res_plei <- bind_rows(mr_plei, mrpresso_plei) %>%
  filter(!is.na(outliers_removed))

## ================= Write Out ================= ##
res_heterogenity %>% write_tsv(paste0(output, '_MR_heterogenity.txt'))
res_plei %>% write_tsv(paste0(output, '_MR_egger_plei.txt'))
res_mr %>% write_tsv(paste0(output, '_MR_Results.txt'))
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
if(any(grepl("conda", .libPaths(), fixed = TRUE))){
  message("Setting libPaths")
  df = .libPaths()
  conda_i = which(grepl("conda", df, fixed = TRUE))
  .libPaths(c(df[conda_i], df[-conda_i]))
}

library(tidyverse)
library(plyr)

input = snakemake@input[["dat"]] # Outcome Summary statistics
output = snakemake@output[["out"]]

mrpresso_MRdat <- input %>%
  map(., function(x){
    datin <- read_csv(x, col_types = list(target_snp.outcome = col_character(),
                                          proxy.outcome = col_character(),
                                          proxy_snp.outcome = col_character(),
                                          target_a1.outcome = col_character(),
                                          target_a2.outcome = col_character(),
                                          proxy_a1.outcome = col_character(),
                                          proxy_a2.outcome = col_character(),
                                          effect_allele.outcome = col_character(),
                                          mrpresso_RSSobs= col_character(),
                                          mrpresso_pval= col_character()))
  }) %>%
  bind_rows()
write_csv(mrpresso_MRdat, output)
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
if(any(grepl("conda", .libPaths(), fixed = TRUE))){
  message("Setting libPaths")
  df = .libPaths()
  conda_i = which(grepl("conda", df, fixed = TRUE))
  .libPaths(c(df[conda_i], df[-conda_i]))
}

message("Begining Harmonization \n")
### ===== Command Line Arguments ===== ##
## Infiles
exposure.summary= snakemake@input[["ExposureSummary"]]
outcome.summary = snakemake@input[["OutcomeSummary"]]
proxy.snps = snakemake@input[["ProxySNPs"]]
## Outfile
out.harmonized = snakemake@output[["Harmonized"]]
## Params for output
pt = as.numeric(snakemake@params[["Pthreshold"]])
ExposureCode = snakemake@params[["excposurecode"]]
OutcomeCode = snakemake@params[["outcomecode"]]
## Regions for exclusion
regions_chrm = snakemake@params[["regions_chrm"]]
regions_start = snakemake@params[["regions_start"]]
regions_stop = snakemake@params[["regions_stop"]]


### ===== Load packages ===== ###
message("Loading packages  \n")
library(tidyr)
library(glue)
library(readr)
library(dplyr)
library(TwoSampleMR) ## For conducting MR https://mrcieu.github.io/TwoSampleMR/


### ===== Read In Data ===== ###
message("READING IN EXPOSURE \n")
exposure.dat <- read_tsv(exposure.summary)

message("READING IN OUTCOME \n")
outcome.dat <- read_tsv(outcome.summary)

message("READING IN PROXY SNPs \n")
proxy.dat <- read_csv(proxy.snps) %>%
  filter(proxy.outcome == TRUE) %>%
  select(proxy.outcome, target_snp, proxy_snp, ALT, REF, ALT.proxy, REF.proxy) %>%
  mutate(SNP = target_snp) %>%
  rename(target_snp.outcome = target_snp, proxy_snp.outcome = proxy_snp, target_a1.outcome = ALT, target_a2.outcome = REF, proxy_a1.outcome = ALT.proxy, proxy_a2.outcome = REF.proxy)


### ===== Harmonization ===== ###
message("Formating Exposure and Outcome \n")
# Format Exposure
mr_exposure.dat <- format_data(exposure.dat, type = 'exposure',
                            snp_col = 'SNP',
                            chr_col = 'CHROM',
                            pos_col = 'POS',
                            beta_col = "BETA",
                            se_col = "SE",
                            eaf_col = "AF",
                            effect_allele_col = "ALT",
                            other_allele_col = "REF",
                            pval_col = "P",
                            z_col = "Z",
                            samplesize_col = "N",
                            phenotype_col = 'TRAIT')
mr_exposure.dat$exposure =  ExposureCode

# Format Outcome
mr_outcome.dat <- format_data(outcome.dat, type = 'outcome',
                                 snp_col = 'SNP',
                                 chr_col = 'CHROM',
                                 pos_col = 'POS',
                                 beta_col = "BETA",
                                 se_col = "SE",
                                 eaf_col = "AF",
                                 effect_allele_col = "ALT",
                                 other_allele_col = "REF",
                                 pval_col = "P",
                                z_col = "Z",
                                samplesize_col = "N",
                                phenotype_col = 'TRAIT')
mr_outcome.dat$outcome =  OutcomeCode

# Harmonize Datasets
# Remove Plietropic variatns
## Exclude variants that are significant for outcome at a given Pt
## Exclude variants within genomic regions

# gws_pt = 5e-8
gws_pt = 5e-300

if(is.null(regions_chrm)){
  message("Harmonizing data \n")
  harmonized.MRdat <- harmonise_data(mr_exposure.dat, mr_outcome.dat) %>%
    as_tibble() %>%
    mutate(pt = pt,
      pleitropy_keep = case_when(pval.outcome <= gws_pt ~ FALSE,
                                  TRUE ~ TRUE))
} else {
  harmonized.MRdat <- harmonise_data(mr_exposure.dat, mr_outcome.dat) %>%
    as_tibble() %>%
    mutate(pt = pt,
      pleitropy_keep = case_when(pval.outcome <= gws_pt ~ FALSE, TRUE ~ TRUE))

  for(i in 1:length(regions_chrm)){
  message(glue("Harmonizing data and excluding regions: ", regions_chrm[i], ":", regions_start[i], "-", regions_stop[i]))
   harmonized.MRdat <- harmonized.MRdat %>%
    mutate(pleitropy_keep = case_when(
      pleitropy_keep == FALSE ~ FALSE,
      chr.exposure == regions_chrm[i] & between(pos.exposure, regions_start[i], regions_stop[i]) ~ FALSE,
      TRUE ~ TRUE))
    }
  }

print(filter(harmonized.MRdat, pleitropy_keep == FALSE) %>%
select(SNP, chr.exposure, pos.exposure, pval.exposure, pval.outcome))
print(count(harmonized.MRdat, pleitropy_keep) )
## Write out Harmonized data
message("Exporting Data to ", out.harmonized, "\n")
write_csv(harmonized.MRdat, out.harmonized)
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
if(any(grepl("conda", .libPaths(), fixed = TRUE))){
  message("Setting libPaths")
  df = .libPaths()
  conda_i = which(grepl("conda", df, fixed = TRUE))
  .libPaths(c(df[conda_i], df[-conda_i]))
}

### ===== Command Line Arguments ===== ##
exposure.summary = snakemake@input[["summary"]] # Exposure summary statistics
exposure.clump = snakemake@input[["ExposureClump"]]
p.threshold = as.numeric(snakemake@params[["Pthreshold"]])
out.file = snakemake@output[["out"]] # SPECIFY THE OUTPUT FILE

### ===== Load packages ===== ###
suppressMessages(library(tidyverse))   ## For data wrangling
library(here)
source(here("workflow", "scripts", "miscfunctions.R"), chdir = TRUE)

### ===== Read in Data ===== ###
message("\n READING IN EXPOSURE \n")
exposure.dat <- read_tsv(exposure.summary, comment = '#', guess_max = 15000000) %>%
  filter(P < p.threshold) %>%
  distinct(SNP, .keep_all = TRUE)

### ===== Clump Exposure ===== ###
message("\n CLUMPING EXPOSURE SNPS \n")

## Plink Pre-clumped
mr_exposure.dat_ld <- read_table2(exposure.clump) %>%
  filter(!is.na(CHR)) %>%
  select(CHR, F, SNP, BP, P, TOTAL, NSIG)

# Filter exposure data for clumped SNPs
exposure.dat <- exposure.dat %>% filter(SNP %in% mr_exposure.dat_ld$SNP)

# message("Distanced based clumping on lead SNP \n")
# # Distanced based clumping on lead SNP
# ls.p <- exposure.dat
# ls.mat <- ls.p %>%
#   split(., .$CHROM) %>%
#   map(., select, SNP,POS) %>%
#   map(., column_to_rownames, var = "SNP") %>%
#   map(., dist, method = "manhattan", upper = TRUE) %>%
#   map_dfr(., tidy) %>%
#   mutate(clump = case_when(distance <= 250000 ~ 1, distance > 250000 ~ 0))
#
# ls.clump <- list()
# i = 1
#
# while(!plyr::empty(ls.p)){
#   snp1 <- ls.p %>% slice(1) %>% pull(SNP)
#   message("Distance Clumping: ", snp1)
#   ls.clump[[i]] <- snp1
#
#   snprm <- ls.mat %>%
#     filter(item1 == snp1 & clump == 1) %>%
#     pull(item2)
#
#   ls.p <- ls.p %>% filter(SNP %nin% snprm) %>% filter(SNP %nin% snp1)
#   i <- i + 1
# }
#
# ls.clump <- unlist(ls.clump)
#
# exposure.dat <- exposure.dat %>%
#   mutate(LSclump = case_when(SNP %nin% ls.clump ~ FALSE,
#                            TRUE ~ TRUE)

### ===== Write Out Exposure ===== ###
message("\n Writing Out Exposure \n")

write_tsv(exposure.dat, out.file)
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
if(any(grepl("conda", .libPaths(), fixed = TRUE))){
  message("Setting libPaths")
  df = .libPaths()
  conda_i = which(grepl("conda", df, fixed = TRUE))
  .libPaths(c(df[conda_i], df[-conda_i]))
}

## ==== Load required packages ==== ##
suppressMessages(library(plyr))
suppressMessages(library(tidyverse))
`%nin%` = Negate(`%in%`)

summary.path = snakemake@input[["OutcomeSummary"]] # Outcome Summary statistics
proxy.path = snakemake@input[["OutcomeProxys"]] # prox snps
snps.path = snakemake@input[["OutcomeSNPs"]]
out = snakemake@params[["Outcome"]]

message("READING IN OUTCOME AND PROXY's \n")
summary.dat <- read_tsv(summary.path, comment = '#', guess_max = 15000000)
proxy.dat <- read_table2(proxy.path)
outcome.raw <- read_tsv(snps.path)

if(empty(proxy.dat)){
  message("NO PROXY SNPS AVALIABLE \n")
  outcome.dat <- outcome.raw %>% filter(!is.na(CHROM))
} else if (empty(filter(proxy.dat, SNP_A != SNP_B)) |
           empty(filter(summary.dat, SNP %in% (filter(proxy.dat, SNP_A != SNP_B) %>% pull(SNP_B)))) ){
  message("NO PROXY SNPS AVALIABLE \n")
  outcome.dat <- outcome.raw %>% filter(!is.na(CHROM))
  query_snps <- proxy.dat %>%
    filter(SNP_A == SNP_B)
} else {
  message("WRANGLING TARGET AND PROXY SNPs \n")
  ## Filter Query SNPs
  query_snps <- proxy.dat %>%
    filter(SNP_A == SNP_B) %>%
    select(CHR_A, BP_A, SNP_A, PHASE)

  ## Filter Proxy SNPs
  proxy.snps <- proxy.dat %>%
    filter(SNP_A != SNP_B) %>%                        ## remove query snps
    left_join(summary.dat, by = c('SNP_B' = 'SNP')) %>%
    filter(!is.na(ALT)) %>%                 ## remove snps with missing information
    group_by(SNP_A) %>%                      ## by query snp
    arrange(-R2) %>%                                  ## arrange by ld
    slice(1) %>%                                      ## select top proxy snp
    ungroup() %>%
    rename(ALT.proxy = ALT,
           REF.proxy = REF) %>%
    select(-CHR_A, -CHR_B, -BP_A, -BP_B, -MAF_A, -MAF_B, -R2, -DP)            ## remove uneeded columns

  ## Select correlated alleles
  alleles <- proxy.snps %>% select(PHASE) %>%
    mutate(PHASE = str_replace(PHASE, "/", ""))
  alleles <- str_split(alleles$PHASE, "", n = 4, simplify = T)
  colnames(alleles) <- c('ref', 'ref.proxy', 'alt', 'alt.proxy')
  alleles <- as_tibble(alleles)

  ## Bind Proxy SNPs and correlated alleles
  proxy.out <- proxy.snps %>%
    bind_cols(alleles) %>%
    rename(SNP = SNP_A) %>%
    mutate(ALT = ifelse(ALT.proxy == ref.proxy, ref, alt)) %>%
    mutate(REF = ifelse(REF.proxy == ref.proxy, ref, alt)) %>%
    mutate(CHROM = as.numeric(CHROM)) %>%
    mutate(AF = as.numeric(AF))

  ## Outcome data
  outcome.dat <- outcome.raw %>%
    filter(SNP %nin% proxy.out$SNP) %>%
    bind_rows(select(proxy.out, SNP, CHROM, POS, REF, ALT, AF, BETA, SE, Z, P, N, TRAIT)) %>%
    arrange(CHROM, POS) %>%
    filter(!is.na(CHROM))

}

message("\n EXPORTING \n")
## Write out outcomes SNPs
write_tsv(outcome.dat, paste0(out, '_ProxySNPs.txt'))

## Write out Proxy SNPs
if(empty(proxy.dat)){
  tibble(proxy.outcome = NA, target_snp = NA, proxy_snp = NA, ld.r2 = NA, Dprime = NA, ref.proxy = NA, alt.proxy = NA,
         CHROM = NA, POS = NA, ALT.proxy = NA, REF.proxy = NA,
         AF = NA, BETA = NA, SE = NA, P = NA, N = NA, ref = NA, alt = NA,
         ALT = NA, REF = NA, PHASE = NA) %>%
    write_csv(paste0(out, '_MatchedProxys.csv'))
} else if (empty(filter(proxy.dat, SNP_A != SNP_B)) |
           empty(filter(summary.dat, SNP %in% (filter(proxy.dat, SNP_A != SNP_B) %>% pull(SNP_B)))) ){
  tibble(proxy.outcome = NA, target_snp = query_snps$SNP_A, proxy_snp = NA, ld.r2 = NA, Dprime = NA, ref.proxy = NA, alt.proxy = NA,
         CHROM = NA, POS = NA, ALT.proxy = NA, REF.proxy = NA,
         AF = NA, BETA = NA, SE = NA, P = NA, N = NA, ref = NA, alt = NA,
         ALT = NA, REF = NA, PHASE = NA) %>%
    write_csv(paste0(out, '_MatchedProxys.csv'))
}else{
  proxy.dat %>%
    select(SNP_A, SNP_B, R2, DP) %>%
    inner_join(proxy.out, by = c('SNP_A' = 'SNP', 'SNP_B')) %>%
    rename(target_snp = SNP_A, proxy_snp = SNP_B, ld.r2 = R2, Dprime = DP) %>%
    mutate(proxy.outcome = TRUE) %>%
    full_join(select(query_snps, SNP_A), by = c('target_snp' = 'SNP_A')) %>%
    write_csv(paste0(out, '_MatchedProxys.csv'))
}
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
infile_gwas = snakemake@input[["ss"]]
outfile = snakemake@output[["formated_ss"]]
snp_col = snakemake@params[["snp_col"]]
chrom_col = snakemake@params[["chrom_col"]]
pos_col = snakemake@params[["pos_col"]]
ref_col = snakemake@params[["ref_col"]]
alt_col = snakemake@params[["alt_col"]]
af_col = snakemake@params[["af_col"]]
beta_col = snakemake@params[["beta_col"]]
se_col = snakemake@params[["se_col"]]
p_col = snakemake@params[["p_col"]]
z_col = snakemake@params[["z_col"]]
n_col = snakemake@params[["n_col"]]
trait_col = snakemake@params[["trait_col"]]

if(any(grepl("conda", .libPaths(), fixed = TRUE))){
  message("Setting libPaths")
  df = .libPaths()
  conda_i = which(grepl("conda", df, fixed = TRUE))
  .libPaths(c(df[conda_i], df[-conda_i]))
}

# library(gwasvcf)
# library(VariantAnnotation)
library(tidyverse)
library(magrittr)

message('\n', 'Columns names are: ', 'SNP: ', snp_col, ', CHROM: ', chrom_col,
        ', POS: ', pos_col, ', REF: ', ref_col, ', ALT: ', alt_col, ', AF: ', af_col,
        ', BETA: ', beta_col, ', SE: ', se_col, ', P: ', p_col, ', Z: ', z_col,
        ', N: ', n_col, ', TRAIT: ', trait_col, '\n')

if(str_detect(infile_gwas, ".vcf.gz")){
  ## Formating for IEU OPEN GWAS .vcf files
  message("IEU OPEN GWAS FORMAT")

  vcf <- VariantAnnotation::readVcf(infile_gwas)
  trait.gwas <- gwasvcf::vcf_to_tibble(vcf)
  traitID = tolower(samples(header(vcf)))

  ieugwas <- read_csv("data/raw/ieugwas_201020.csv")
  trait_meta <- select(ieugwas, id, trait, sample_size, ncase, ncontrol) %>% filter(id == traitID)

  out <- trait.gwas %>%
    mutate(EZ = ES/SE,
           P = 10^-LP,
           TRAIT = pull(trait_meta, trait),
           N = pull(trait_meta, sample_size),
           NCASE = pull(trait_meta, ncase),
           NCTRL = pull(trait_meta, ncontrol)) %>%
    filter(nchar(REF) == 1) %>%
    filter(nchar(ALT) == 1) %>%
    mutate(Z = ifelse(BETA == 0 & SE == 0, 0, Z)) %>%
    select(SNP = ID, CHROM = seqnames, POS = start, REF = REF, ALT = ALT, AF = AF,
           BETA = ES, SE, Z = EZ, P, N, TRAIT, NCASE, NCTRL) %>%
    distinct(SNP, .keep_all = TRUE)

  out %>%
    write_tsv(gzfile(outfile))

} else if(str_detect(infile_gwas, ".chrall.CPRA_b37.tsv.gz")){
  ## Formating for MSSM files
  message("MSSM GWAS FORMAT")

  trait.gwas <- suppressMessages(vroom::vroom(infile_gwas, comment = '#', guess_max = 11000000))

  out <- trait.gwas %>%
    rename(SNP = snp_col, CHROM = chrom_col, POS = pos_col, REF = ref_col,
           ALT = alt_col, AF = af_col, BETA = beta_col, SE = se_col, Z = z_col,
           P = p_col, N = n_col, TRAIT = trait_col) %>%
    filter(nchar(REF) == 1) %>%
    filter(nchar(ALT) == 1) %>%
    mutate(Z = ifelse(BETA == 0 & SE == 0, 0, Z)) %>%
    select(SNP, CHROM, POS, REF, ALT, AF, BETA, SE, Z, P, N, TRAIT) %>%
    drop_na %>%
    distinct(SNP, .keep_all = TRUE)

  out %>%
    write_tsv(gzfile(outfile))

}

message('\n', 'SNPs in orginal file: ', nrow(trait.gwas), '; SNPs in formated file: ', pos_col, ', REF: ', nrow(out), '\n')
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
if(any(grepl("conda", .libPaths(), fixed = TRUE))){
  message("Setting libPaths")
  df = .libPaths()
  conda_i = which(grepl("conda", df, fixed = TRUE))
  .libPaths(c(df[conda_i], df[-conda_i]))
}
.libPaths(c(.libPaths(), "/hpc/users/harern01/miniconda3/envs/py38/lib/R/library"))
### ===== Command Line Arguments ===== ##
infile = snakemake@input[["mrdat"]] # Exposure summary statistics
out = snakemake@params[["out"]]

### ===== Load packages ===== ###
library(tidyr)   ## For data wrangling
library(dplyr)  ## For data transformation
library(magrittr)   ## For data wrangling
library(readr)
library(stringi)
library(stringr)
library(MRPRESSO) ## For detecting pleitropy

### ===== READ IN DATA ===== ###
message("\n READING IN HARMONIZED MR DATA \n")
mrdat.raw <- read_csv(infile)
mrdat <- mrdat.raw %>%
  filter(mr_keep == TRUE) %>%
  filter(pleitropy_keep == TRUE)

## Data Frame of nsnps and number of iterations
df.NbD <- data.frame(n = c(10, 50, 100, 500, 1000, 1500, 2000),
                     NbDistribution = c(10000, 10000, 10000, 25000, 50000, 75000, 100000))

nsnps <- nrow(mrdat)
SignifThreshold <- 0.05
NbDistribution <- df.NbD[which.min(abs(df.NbD$n - nsnps)), 2]
## Bonfernoni correction, needs a crazy amount of nbdistributions for larger nsnps.
# SignifThreshold <- 0.05/nsnps
# NbDistribution <- (nsnps+100)/SignifThreshold

### ===== MR-PRESSO ===== ###
message("\n CALCULATING PLEITROPY \n")

set.seed(333)
mrpresso.out <- mr_presso(BetaOutcome = "beta.outcome",
                               BetaExposure = "beta.exposure",
                               SdOutcome = "se.outcome",
                               SdExposure = "se.exposure",
                               OUTLIERtest = TRUE,
                               DISTORTIONtest = TRUE,
                               data = as.data.frame(mrdat),
                               NbDistribution = NbDistribution,
                               SignifThreshold = SignifThreshold)

### ===== FORMAT DATA ===== ###
## extract RSSobs and Pvalue
if("Global Test" %in% names(mrpresso.out$`MR-PRESSO results`)){
  mrpresso.p <- mrpresso.out$`MR-PRESSO results`$`Global Test`$Pvalue
  RSSobs <- mrpresso.out$`MR-PRESSO results`$`Global Test`$RSSobs
} else {
  mrpresso.p <- mrpresso.out$`MR-PRESSO results`$Pvalue
  RSSobs <- mrpresso.out$`MR-PRESSO results`$RSSobs
}

## If Global test is significant, append outlier tests to mrdat
if("Outlier Test" %in% names(mrpresso.out$`MR-PRESSO results`)){
  outliers <- mrdat %>%
    bind_cols(mrpresso.out$`MR-PRESSO results`$`Outlier Test`) %>%
    rename(mrpresso_RSSobs = RSSobs, mrpresso_pval = Pvalue) %>%
    mutate(mrpresso_keep = as.numeric(str_replace_all(mrpresso_pval, pattern="<", repl="")) > SignifThreshold) %>%
    select(SNP, mrpresso_RSSobs, mrpresso_pval, mrpresso_keep)
  mrdat.out <- mrdat.raw %>%
    left_join(outliers, by = 'SNP')
} else {
  mrdat.out <- mrdat.raw %>%
    mutate(mrpresso_RSSobs = NA, mrpresso_pval = NA) %>%
    mutate(mrpresso_keep = ifelse(mr_keep == TRUE, TRUE, NA))
}

# Write n outliers, RSSobs and Pvalue to tibble
mrpresso.dat <- tibble(id.exposure = as.character(mrdat[1,'id.exposure']),
                       id.outcome = as.character(mrdat[1,'id.outcome']),
                       outcome = as.character(mrdat[1,'outcome']),
                       exposure = as.character(mrdat[1,'exposure']),
                       pt = mrdat.raw %>% slice(1) %>% pull(pt),
                       outliers_removed = FALSE,
                       n_outliers = sum(mrdat.out$mrpresso_keep == F, na.rm = T),
                       RSSobs = RSSobs,
                       pval = mrpresso.p)

# Write MR-PRESSO results out
mrpresso.res <- extract2(mrpresso.out, 1) %>%
  as_tibble() %>%
  mutate(id.exposure = as.character(mrdat[1,'id.exposure']),
         id.outcome = as.character(mrdat[1,'id.outcome']),
         outcome = as.character(mrdat[1,'outcome']),
         exposure = as.character(mrdat[1,'exposure']),
         pt = mrdat.raw %>% slice(1) %>% pull(pt),
         method = "MR-PRESSO",
         nsnp = c(nrow(mrdat), nrow(filter(mrdat.out, mr_keep == T, pleitropy_keep == T, mrpresso_keep == T)))
  ) %>%
  select(id.exposure, id.outcome, outcome, exposure,	pt, method, nsnp,
         outliers_removed = `MR Analysis`, b = `Causal Estimate`, sd = Sd, t.stat = `T-stat`, pval = `P-value`)

### ===== EXPORTING ===== ###
message("\n EXPORTING REPORTS \n")
sink(paste0(out, '.txt'), append=FALSE, split=FALSE)
mrpresso.out
sink()

write_tsv(mrpresso.dat, paste0(out, '_global.txt'))
write_tsv(mrpresso.res, paste0(out, '_res.txt'))
write_csv(mrdat.out, paste0(out, '_MRdat.csv'))
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
if(any(grepl("conda", .libPaths(), fixed = TRUE))){
  message("Setting libPaths")
  df = .libPaths()
  conda_i = which(grepl("conda", df, fixed = TRUE))
  .libPaths(c(df[conda_i], df[-conda_i]))
}
.libPaths(c(.libPaths(), "/hpc/users/harern01/miniconda3/envs/py38/lib/R/library"))
### ===== Command Line Arguments ===== ##
infile = snakemake@input[["mrdat"]] # Exposure summary statistics
out = snakemake@params[["out"]]

# infile = "data/MR_ADbidir/Grasby2020thickness/Willer2013hdl/Grasby2020thickness_5e-6_Willer2013hdl_mrpresso_MRdat.csv"
# out = "/Users/sheaandrews/Dropbox/Research/PostDoc-MSSM/2_MR/2_DerivedData/mvpa/load/mvpa_5e-6_load"

### ===== Load packages ===== ###
library(tidyr)   ## For data wrangling
library(dplyr)  ## For data transformation
library(magrittr)   ## For data wrangling
library(readr)
library(stringi)
library(stringr)
library(MRPRESSO) ## For detecting pleitropy


### ===== READ IN DATA ===== ###
message("\n READING IN HARMONIZED MR DATA \n")
mrdat.raw <- read_csv(infile, col_types = list(mrpresso_pval = col_character())) %>%
  filter(mr_keep == TRUE) %>%
  filter(pleitropy_keep == TRUE)

mrdat <- filter(mrdat.raw, mrpresso_keep == TRUE)

if(nrow(mrdat) < nrow(mrdat.raw)){

  if(nrow(mrdat) > 3){
    ## Data Frame of nsnps and number of iterations
    df.NbD <- data.frame(n = c(10, 50, 100, 500, 1000, 1500, 2000),
                         NbDistribution = c(10000, 10000, 10000, 25000, 50000, 75000, 100000))

    nsnps <- nrow(mrdat)
    SignifThreshold <- 0.05
    NbDistribution <- df.NbD[which.min(abs(df.NbD$n - nsnps)), 2]

    ### ===== MR-PRESSO ===== ###
    message("\n CALCULATING PLEITROPY \n")
    set.seed(333)
    mrpresso.out <- mr_presso(BetaOutcome = "beta.outcome",
                              BetaExposure = "beta.exposure",
                              SdOutcome = "se.outcome",
                              SdExposure = "se.exposure",
                              OUTLIERtest = FALSE,
                              DISTORTIONtest = FALSE,
                              data = as.data.frame(mrdat),
                              NbDistribution = NbDistribution,
                              SignifThreshold = SignifThreshold)

    ### ===== FORMAT DATA ===== ###
    ## extract RSSobs and Pvalue
    message("\n Extracting RSSobs and Pvalue \n")

    mrpresso.p <- mrpresso.out$`MR-PRESSO results`$`Global Test`$Pvalue
    RSSobs <- mrpresso.out$`MR-PRESSO results`$`Global Test`$RSSobs

    if("Outlier Test" %in% names(mrpresso.out$`MR-PRESSO results`)){
      n_outliers = length(mrpresso.out$`MR-PRESSO results`$`Distortion Test`$`Outliers Indices`)
  } else {
      n_outliers = 0
  }

    # Write n outliers, RSSobs and Pvalue to tibble
    message("\n Writing Results \n")

    mrpresso.dat <- tibble(id.exposure = as.character(mrdat[1,'id.exposure']),
                           id.outcome = as.character(mrdat[1,'id.outcome']),
                           outcome = as.character(mrdat[1,'outcome']),
                           exposure = as.character(mrdat[1,'exposure']),
                           pt = mrdat.raw %>% slice(1) %>% pull(pt),
                           outliers_removed = TRUE,
                           n_outliers = (nrow(mrdat.raw) - nrow(mrdat)),
                           RSSobs = RSSobs,
                           pval = mrpresso.p)

    mrpresso.res <- extract2(mrpresso.out, 1) %>%
      as_tibble() %>%
      mutate(id.exposure = as.character(mrdat[1,'id.exposure']),
             id.outcome = as.character(mrdat[1,'id.outcome']),
             outcome = as.character(mrdat[1,'outcome']),
             exposure = as.character(mrdat[1,'exposure']),
             pt = mrdat.raw %>% slice(1) %>% pull(pt),
             method = "MR-PRESSO_wo_outliers",
             nsnp	= c(nrow(mrdat), nrow(mrdat) - n_outliers)
      ) %>%
      select(id.exposure, id.outcome, outcome, exposure,	pt, method, nsnp,
             outliers_removed = `MR Analysis`, b = `Causal Estimate`, sd = Sd, t.stat = `T-stat`, pval = `P-value`)


  } else {
    message("\n Not enough intrumental variables \n")

    mrpresso.dat <- tibble(id.exposure = as.character(mrdat[1,'id.exposure']),
                           id.outcome = as.character(mrdat[1,'id.outcome']),
                           outcome = as.character(mrdat[1,'outcome']),
                           exposure = as.character(mrdat[1,'exposure']),
                           pt = mrdat.raw %>% slice(1) %>% pull(pt),
                           outliers_removed = TRUE,
                           n_outliers = (nrow(mrdat.raw) - nrow(mrdat)),
                           RSSobs = NA,
                           pval = NA)

    mrpresso.res <- tibble(
        id.exposure = as.character(mrdat[1,'id.exposure']),
        id.outcome = as.character(mrdat[1,'id.outcome']),
        outcome = as.character(mrdat[1,'outcome']),
        exposure = as.character(mrdat[1,'exposure']),
        pt = mrdat.raw %>% slice(1) %>% pull(pt),
        method = "MR-PRESSO_wo_outliers",
        nsnp	= NA,
        outliers_removed = NA,
        b = NA,
        sd = NA,
        t.stat = NA,
        pval = NA)
  }

} else {
  message("\n NO OUTLIER VARIANTS \n")

  mrpresso.dat <- tibble(id.exposure = as.character(mrdat[1,'id.exposure']),
                         id.outcome = as.character(mrdat[1,'id.outcome']),
                         outcome = as.character(mrdat[1,'outcome']),
                         exposure = as.character(mrdat[1,'exposure']),
                         pt = mrdat.raw %>% slice(1) %>% pull(pt),
                         outliers_removed = TRUE,
                         n_outliers = (nrow(mrdat.raw) - nrow(mrdat)),
                         RSSobs = NA,
                         pval = NA)

  mrpresso.res <- tibble(
      id.exposure = as.character(mrdat[1,'id.exposure']),
      id.outcome = as.character(mrdat[1,'id.outcome']),
      outcome = as.character(mrdat[1,'outcome']),
      exposure = as.character(mrdat[1,'exposure']),
      pt = mrdat.raw %>% slice(1) %>% pull(pt),
      method = "MR-PRESSO_wo_outliers",
      nsnp	= NA,
      outliers_removed = NA,
      b = NA,
      sd = NA,
      t.stat = NA,
      pval = NA)
}

write_tsv(mrpresso.dat, paste0(out, '_global_wo_outliers.txt'))
write_tsv(mrpresso.res, paste0(out, '_res_wo_outliers.txt'))
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
if(any(grepl("conda", .libPaths(), fixed = TRUE))){
  message("Setting libPaths")
  df = .libPaths()
  conda_i = which(grepl("conda", df, fixed = TRUE))
  .libPaths(c(df[conda_i], df[-conda_i]))
}

### ===== Command Line Arguments ===== ##
exposure.summary = snakemake@input[["ExposureSummary"]] # Exposure summary statistics
outcome.summary = snakemake@input[["OutcomeSummary"]] # Outcome Summary statistics
out = snakemake@params[["Outcome"]]

### ===== Load packages ===== ###
library(tidyverse)## For data wrangling
library(dplyr)
library(readr)
#suppressMessages(library(Hmisc))       ## Contains miscillaneous funtions

### ===== READ IN SNPs ===== ###
message("READING IN EXPOSURE \n")
exposure.dat <- read_tsv(exposure.summary)

message("\n READING IN OUTCOME \n")
outcome.dat.raw <- read_tsv(outcome.summary, comment = '##', guess_max = 15000000)

### ===== EXTACT SNPS ===== ###
message("\n EXTRACTING SNP EFFECTS FROM OUTCOME GWAS  \n")
outcome.dat <- outcome.dat.raw %>%
  right_join(select(exposure.dat, SNP), by = 'SNP')


### ===== MISSING SNPS SNPS ===== ###

outcome.dat %>%
  filter(is.na(CHROM)) %>%
  select(SNP) %>%
  write_tsv(paste0(out, '_MissingSNPs.txt'), col_names = F)

message("\n EXPORTING \n")
## Write out outcomes SNPs
write_tsv(outcome.dat, paste0(out, '_SNPs.txt'))
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
if(any(grepl("conda", .libPaths(), fixed = TRUE))){
  message("Setting libPaths")
  df = .libPaths()
  conda_i = which(grepl("conda", df, fixed = TRUE))
  .libPaths(c(df[conda_i], df[-conda_i]))
}

## Load librarys and functions
.libPaths(c(.libPaths(), "/hpc/users/harern01/miniconda3/envs/py38/lib/R/library"))
library(tidyr)
library(dplyr)
library(readr)
library(purrr)
library(TwoSampleMR)
source("workflow/scripts/miscfunctions.R", chdir = TRUE)
source("workflow/scripts/traits.R", chdir = TRUE)
source("workflow/scripts/mr_PowerFunctions.R", chdir = TRUE)
infile = snakemake@input[["infile"]]
outfile = snakemake@output[["outfile"]]

## -------------------------------------------------------------------------- ##
## Read in Harmonized datasets

MRdat.raw <- infile %>%
  read_csv(., guess_max = 1000)

n_outliers <- MRdat.raw %>%
  summarise(., n_outliers = sum(!mrpresso_keep, na.rm = TRUE)) %>%
  pull(n_outliers)

std.MRdat <- MRdat.raw %>%
  filter(pleitropy_keep == TRUE) %>%
  select(-samplesize.outcome, -samplesize.exposure) %>%
  left_join(samplesize, by = c('exposure' = 'code')) %>%
  rename(samplesize.exposure = samplesize,
         ncase.exposure = ncase,
         ncontrol.exposure = ncontrol,
         logistic.exposure = logistic,
         exposure.name = trait) %>%
  left_join(samplesize, by = c('outcome' = 'code')) %>%
  rename(samplesize.outcome = samplesize,
         ncase.outcome = ncase,
         ncontrol.outcome = ncontrol,
         logistic.outcome = logistic,
         outcome.name = trait) %>%
  ## Standardize continous outcomes
  mutate(st_beta.outcome = ifelse(logistic.outcome == FALSE,
                                  std_beta(z.outcome, eaf.outcome, samplesize.outcome),
                                  beta.outcome),
         st_se.outcome = ifelse(logistic.outcome == FALSE,
                                std_se(z.outcome, eaf.outcome, samplesize.outcome),
                                se.outcome)) %>%
  ## Standardize all exposures
  mutate(st_beta.exposure = std_beta(z.exposure, eaf.exposure, samplesize.exposure),
         st_se.exposure = std_se(z.exposure, eaf.exposure, samplesize.exposure)) %>%
  mutate(beta.outcome = st_beta.outcome,
         se.outcome = st_se.outcome,
         beta.exposure = st_beta.exposure,
         se.exposure = st_se.exposure)

## -------------------------------------------------------------------------- ##
##            Rerun IVW MR analysis using standardized beta estimates
##            Outliers retained
std.res <- mr(std.MRdat, method_list = c("mr_ivw_fe")) %>%
  as_tibble(.) %>%
  mutate(outliers_removed = FALSE) %>%
  left_join(select(samplesize, code, logistic), by = c('outcome' = 'code')) %>%
  mutate(or = ifelse(logistic == TRUE, exp(b), NA)) %>%
  mutate(beta = ifelse(logistic == FALSE, b, NA))

## Join std MR results with summary data for power analysis
mrdat.power <- std.MRdat %>%
  filter(mr_keep == TRUE) %>%
  mutate(pve.exposure = snp.pve(eaf.exposure, beta.exposure, se.exposure, samplesize.exposure)) %>%
  summarise(exposure = first(exposure),
            outcome = first(outcome),
            pt = first(pt),
            pve.exposure = sum(pve.exposure),
            samplesize.exposure = max(samplesize.exposure),
            samplesize.outcome = max(samplesize.outcome),
            nsnps = n(),
            k.outcome = max(ncase.outcome)/max(samplesize.outcome)) %>%
  left_join(std.res, by = c('exposure', 'outcome', "nsnps" = 'nsnp')) %>%
  select(-id.exposure, -id.outcome)


## -------------------------------------------------------------------------- ##
##            Rerun IVW MR analysis using standardized beta estimates
##            Outliers retained

if(n_outliers > 0){
  std.res_wo_outliers <- std.MRdat %>%
    filter(mrpresso_keep == TRUE) %>%
    mr(., method_list = c("mr_ivw_fe")) %>%
    as_tibble(.) %>%
    mutate(outliers_removed = TRUE) %>%
    left_join(select(samplesize, code, logistic), by = c('outcome' = 'code')) %>%
    mutate(or = ifelse(logistic == TRUE, exp(b), NA)) %>%
    mutate(beta = ifelse(logistic == FALSE, b, NA))

  mrdat.power_wo_outliers <- std.MRdat %>%
    filter(mr_keep == TRUE) %>%
    filter(mrpresso_keep == TRUE) %>%
    mutate(pve.exposure = snp.pve(eaf.exposure, beta.exposure, se.exposure, samplesize.exposure)) %>%
    summarise(exposure = first(exposure),
              outcome = first(outcome),
              pt = first(pt),
              pve.exposure = sum(pve.exposure),
              samplesize.exposure = max(samplesize.exposure),
              samplesize.outcome = max(samplesize.outcome),
              nsnps = n(),
              k.outcome = max(ncase.outcome)/max(samplesize.outcome)) %>%
    left_join(std.res_wo_outliers, by = c('exposure', 'outcome', "nsnps" = 'nsnp')) %>%
    select(-id.exposure, -id.outcome)

}

## -------------------------------------------------------------------------- ##
##                        Power to detect observed effect                     ##

observed_power.df <- mrdat.power %>%
  {if(n_outliers > 0) bind_rows(., mrdat.power_wo_outliers) else .} %>%
  split(., as.factor(1:nrow(.))) %>%
  map_dfr(., function(dat){
    message('Processing: ', dat$exposure, ' - ', dat$outcome, ' - ', dat$pt)
    if(dat$logistic == TRUE){
      results_binary(dat)
    } else {
      results_beta_based(dat)
    }
  })

write_tsv(observed_power.df, outfile)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
if(any(grepl("conda", .libPaths(), fixed = TRUE))){
  message("Setting libPaths")
  df = .libPaths()
  conda_i = which(grepl("conda", df, fixed = TRUE))
  .libPaths(c(df[conda_i], df[-conda_i]))
}
.libPaths(c(.libPaths(), "/hpc/users/harern01/.Rlib"))
# log_path = snakemake@log[[1]]
#
# ## Logging messages
# con <- file(log_path, open = "wt")
# sink(con, type = "message")

message("Render Final Report",
        "\n markdown: ", snakemake@input[["markdown"]],
        "\n intermediates_dir: ", snakemake@params[["output_dir"]],
        "\n output_file: ", snakemake@output[["outfile"]],
        "\n output_dir: ", snakemake@params[["output_dir"]],
        "\n model: ", snakemake@params[["project"]],
        "\n exposures: ", snakemake@params[["exposures"]],
        "\n outcomes: ", snakemake@params[["outcomes"]],
        "\n rwd: ", snakemake@params[["rwd"]],
        "\n rlib: ", snakemake@params[["rlib"]],
        "\n DATE: ", snakemake@params[["today_date"]],
        "\n MR_results.path: ", snakemake@input[["mrresults"]],
        "\n MRdat.path: ", snakemake@input[["mrdat"]],
        "\n mrpresso_res.path: ", snakemake@input[["mrpresso_res"]],
        "\n mrpresso_global.path: ", snakemake@input[["mrpresso_global"]],
        "\n mrpresso_global_wo_outliers.path: ", snakemake@input[["mrpresso_global_wo_outliers"]],
        "\n egger_comb.path: ", snakemake@input[["egger"]],
        "\n steiger.path: ", snakemake@input[["steiger"]],
        "\n power.path: ", snakemake@input[["power"]])


message("\n", getwd())

rmarkdown::render(
  input = snakemake@input[["markdown"]],
  clean = TRUE,
  intermediates_dir = snakemake@params[["output_dir"]],
  output_file = snakemake@output[["outfile"]],
  output_dir = snakemake@params[["output_dir"]],
  output_format = "all",
  params = list(
    model = snakemake@params[["project"]],
    rwd = snakemake@params[["rwd"]],
    exposures = snakemake@params[["exposures"]],
    outcomes = snakemake@params[["outcomes"]],
    rlib = snakemake@params[["rlib"]],
    DATE = snakemake@params[["today_date"]],
    MR_results.path = snakemake@input[["mrresults"]],
    MRdat.path = snakemake@input[["mrdat"]],
    mrpresso_res.path = snakemake@input[["mrpresso_res"]],
    mrpresso_global.path = snakemake@input[["mrpresso_global"]],
    mrpresso_global_wo_outliers.path = snakemake@input[["mrpresso_global_wo_outliers"]],
    egger_comb.path = snakemake@input[["egger"]],
    steiger.path = snakemake@input[["steiger"]],
    power.path = snakemake@input[["power"]]
  )
)
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
if(any(grepl("conda", .libPaths(), fixed = TRUE))){
  message("Setting libPaths")
  df = .libPaths()
  conda_i = which(grepl("conda", df, fixed = TRUE))
  .libPaths(c(df[conda_i], df[-conda_i]))
}

# log_path = snakemake@log[[1]]
#
# ## Logging messages
# con <- file(log_path, open = "wt")
# sink(con, type = "message")

rmarkdown::render(
  input = snakemake@input[["markdown"]],
  clean = TRUE,
  intermediates_dir = snakemake@params[["output_dir"]],
  output_file = snakemake@output[["outfile"]],
  output_dir = snakemake@params[["output_dir"]],
  output_format = "all",
  params = list(
    rwd = snakemake@params[["rwd"]],
    Rlib = snakemake@params[["Rlib"]],
    exposure.snps = snakemake@input[["ExposureSnps"]],
    outcome.snps = snakemake@input[["OutcomeSnps"]],
    proxy.snps = snakemake@input[["ProxySnps"]],
    harmonized.dat = snakemake@input[["HarmonizedDat"]],
    mrpresso_global = snakemake@input[["mrpresso_global"]],
    power = snakemake@input[["power"]],
    outcome.code = snakemake@params[["OutcomeCode"]],
    exposure.code = snakemake@params[["ExposureCode"]],
    Outcome = snakemake@params[["Outcome"]],
    Exposure = snakemake@params[["Exposure"]],
    p.threshold = snakemake@params[["Pthreshold"]],
    r2.threshold = snakemake@params[["r2threshold"]],
    kb.threshold = snakemake@params[["kbthreshold"]],
    out = snakemake@params[["output_name"]])
)
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
if(any(grepl("conda", .libPaths(), fixed = TRUE))){
  message("Setting libPaths")
  df = .libPaths()
  conda_i = which(grepl("conda", df, fixed = TRUE))
  .libPaths(c(df[conda_i], df[-conda_i]))
}
.libPaths(c(.libPaths(), "/hpc/users/harern01/miniconda3/envs/py38/lib/R/library"))
## Load Packages
library(tidyr)
library(dplyr)
library(readr)
library(TwoSampleMR)
source('workflow/scripts/miscfunctions.R', chdir = TRUE)
source("workflow/scripts/traits.R", chdir = TRUE)

## estimate r for binary or continouse exposures/outcomes
r_func <- function(x){
  x$r.exposure <- if(logistic.exposure == TRUE){
    x %>%
      mutate(r.exposure = get_r_from_lor(.$beta.exposure, .$eaf.exposure, .$ncase.exposure, .$ncontrol.exposure, .$prevelance.exposure)) %>% pull(r.exposure)
  } else if(logistic.exposure == FALSE){
    x %>% mutate(r.exposure = get_r_from_pn(.$pval.exposure, .$samplesize.exposure)) %>% pull(r.exposure)
  }

  x$r.outcome <- if(logistic.outcome == TRUE){
    x %>% mutate(r.outcome = get_r_from_lor(.$beta.outcome, .$eaf.outcome, .$ncase.outcome, .$ncontrol.outcome, .$prevelance.outcome)) %>% pull(r.outcome)
  } else if(logistic.outcome == FALSE){
    x %>% mutate(r.outcome = get_r_from_pn(.$pval.outcome, .$samplesize.outcome)) %>% pull(r.outcome)
  }
  x
}

## -------------------- TEST -------------------------##
# exposure.code = "Kunkle2019load"
# outcome.code = "Yengo2018bmi"
# p.threshold = "5e-8"
# dataout = "data/MR_ADbidir/"
#
# harmonized.dat = glue(dataout, "{exposure.code}/{outcome.code}/{exposure.code}_{p.threshold}_{outcome.code}_mrpresso_MRdat.csv")

## -------------------- Load/Wrangle Data -------------------------##
harmonized.dat = snakemake@input[["mrdat"]] # Harmonized MR data
pt = snakemake@params[["pt"]]
outfile = snakemake@output[["outfile"]] # Output

mrdat.raw <- read_csv(harmonized.dat)
mrdat <- mrdat.raw %>%
  filter(pleitropy_keep == TRUE, mrkeep = TRUE) %>%
  select(-samplesize.outcome, -samplesize.exposure) %>%
  left_join(samplesize, by = c('exposure' = 'code')) %>%
  rename(samplesize.exposure = samplesize,
         ncase.exposure = ncase,
         ncontrol.exposure = ncontrol,
         logistic.exposure = logistic,
         exposure.name = trait,
         prevelance.exposure = prevelance) %>%
  left_join(samplesize, by = c('outcome' = 'code')) %>%
  rename(samplesize.outcome = samplesize,
         ncase.outcome = ncase,
         ncontrol.outcome = ncontrol,
         logistic.outcome = logistic,
         outcome.name = trait,
         prevelance.outcome = prevelance) %>%
  select(-domain.y, -domain.x, -pmid.x, -pmid.y)

logistic.exposure <- mrdat %>% slice(1) %>% pull(logistic.exposure)
logistic.outcome <- mrdat %>% slice(1) %>% pull(logistic.outcome)
mrdat <- r_func(mrdat)

## -------------------- Directionality Test -------------------------##
steiger_outliers <- directionality_test(mrdat) %>%
  mutate(outliers_removed = FALSE,
         pt = pt)

steiger_out <- steiger_outliers

if(sum(!mrdat$mrpresso_keep, na.rm = TRUE) > 0){

  steiger_wo_outliers <- mrdat %>%
    filter(mrpresso_keep == TRUE) %>%
    directionality_test(.) %>%
    mutate(outliers_removed = TRUE,
           pt = pt) %>%
    bind_rows(steiger_outliers)
  steiger_out <- steiger_wo_outliers
}

## -------------------- Write -------------------------##

write_tsv(steiger_out, outfile)
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
library(tidyverse)
library(qvalue)

## File Paths
res.files = snakemake@input[["res"]]
out <- snakemake@output[["summ"]]
file_type <- snakemake@params[["file_type"]]


if(file_type == "association"){
  message("\n Concat GLM assocation statistics: ", res.files, "\n")
  # GLM assocations stats
  ## res.files <- list.files(pattern = 'association.tsv', recursive = T)
  res <- map_dfr(res.files, read_tsv) %>%
    mutate(OR = exp(estimate),
           LCI = exp(estimate-std.error*1.96),
           UCI = exp(estimate+std.error*1.96))
  write_csv(res, out)

} else if(file_type == "prsice"){
  message("\n Concat PRSice files: ", res.files, "\n")
  ##Summary files of output from PRSice
  ## prsice.files <- list.files(pattern = '.prsice', recursive = T)
  prsice.files <- map_dfr(res.files, function(x){
    read_tsv(x) %>% mutate(Phenotype = str_extract(x, pattern = '(?<=prs_)[A-Za-z,0-9]+(?=_)'))
  }) %>%
    filter(Threshold == 5e-8)
  write_csv(prsice.files, out)

} else if(file_type == "summary"){
  message("\n Concat Summary files of output from PRSice: ", res.files, "\n")
  # Summary files of output from PRSice
  ## sum.files <- list.files(pattern = '.summary', recursive = T)
  sum.files <-  map_dfr(res.files, function(x){
    read_tsv(x) %>%
      mutate(Phenotype = str_extract(x, pattern = '(?<=prs_)[A-Za-z,0-9]+(?=_)'))
  })
  write_csv(sum.files, out)

} else {

  stop("In compatiable file type used as input")

}
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
library(tidyverse)
library(broom)
zscore = function(x){(x - mean(x)) / sd(x)}

# setwd('/Users/sheaandrews/LOAD_minerva/dummy/shea/Projects/AD_PRS')

## File Paths
prs.file.path <- snakemake@input[["prs"]]
sum.file.path <- snakemake@input[["summary"]]
cohort.pheno.path <- snakemake@input[["pheno_file"]]

## params
exposure = snakemake@params[["pheno"]]
r2 <- snakemake@params[["r2"]]
window <- snakemake@params[["window"]]

# Read in summary file and extract best Pt
message('READ IN SUMMARY FILE \n')
best.prs <- read_tsv(sum.file.path) %>%
  pull(Threshold) %>%
  as.character()

# Read in Pheno file
message('READ IN PHENOTYPES \n')
pheno <- read_tsv(cohort.pheno.path, col_names = T)

# Read in all PRS file
message('READ IN PRS \n')
prs <- read_csv(prs.file.path) %>%
#  rename_at(vars(-FID, -IID), function(x) paste0('prs.pt_', x))  %>%
  mutate_at(vars(matches("prs.pt_")), zscore) %>%
  select(-prs.pc2)

# extract Pt used
pt <- select(prs, starts_with("prs")) %>% colnames(.)

## Join PRS and Phenotype data
message('Join PRS and Phenotype data \n')
dat <- left_join(prs, pheno) %>%
  select(FID:cohort, sex, status, aaoaae2, apoe4dose, jointPC1:jointPC10) %>%
  mutate(status = as.factor(status),
         sex = as.factor(sex),
         cohort = fct_infreq(cohort))

## Peform logistic regression for each Pvalue threshold
message('RUNNING REGRESSIONS \n')
res <- map(pt, function(x){
  select(dat, status, x, aaoaae2, sex, apoe4dose, cohort, jointPC1:jointPC10) %>%
    glm(status ~ ., family = 'binomial', data = .) %>%
    tidy(.) %>%
    mutate(pt = x,
           pt = case_when(
            str_detect(pt, 'prs.pt_') ~ str_replace(pt, 'prs.pt_', "") %>% as.numeric() %>% as.character(),
            str_detect(pt, 'prs.pc') ~ str_replace(pt, 'prs.', ""),
            TRUE ~ "NA"
            ),
           exposure = exposure)
}) %>%
  bind_rows() %>%
#  mutate(pt = as.numeric(pt)) %>%
  mutate(model = case_when(
          pt == '5e-08' & pt == best.prs ~ 'GWS_best',
          pt == '5e-08' ~ "GWS",
          pt == best.prs ~ 'best',
          pt == "pc1" ~ 'PCA',
          TRUE ~ "other"
        ),
         r2 = r2,
         window = window) %>%
  select(exposure, pt, model, everything())

message('EXPORT \n')
write_tsv(res, snakemake@output[["out"]])
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
res.path = snakemake@input[["association"]]
prsice.path = snakemake@input[["prsice"]]
sum.path = snakemake@input[["summary"]]
model = snakemake@params[["model"]]
rwd = snakemake@params[["rwd"]]
r2 = snakemake@params[["r2"]]
window = snakemake@params[["window"]]
todaydate = Sys.Date()


knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = rwd)
# knitr::opts_knit$set(root.dir = "/sc/arion/projects/LOAD/shea/Projects/AD_PRS")

library(tidyverse)
library(broom)
library(glue)
library(scales)

## Function for spreading multiple columns
# https://community.rstudio.com/t/spread-with-multiple-value-columns/5378
myspread <- function(df, key, value) {
  # quote key
  keyq <- rlang::enquo(key)
  # break value vector into quotes
  valueq <- rlang::enquo(value)
  s <- rlang::quos(!!valueq)
  df %>% gather(variable, value, !!!s) %>%
    unite(temp, !!keyq, variable) %>%
    spread(temp, value)
}
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
traits <- tibble(
  code = c('Liu2019drnkwk23andMe', 'Liu2019smkint23andMe', 'Liu2019smkcpd23andMe', 'SanchezRoige2019auditt23andMe',
           "Niarchou2020meat", "Niarchou2020fish", 'Wells2019hdiff', 'Xue2018diab',
           'Yengo2018bmi', 'Willer2013tc', 'Willer2013ldl', 'Willer2013hdl',
           'Willer2013tg', 'Evangelou2018dbp', 'Evangelou2018sbp', 'Evangelou2018pp',
           'Howard2019dep23andMe', 'Jansen2018insomnia23andMe', 'Dashti2019slepdur',
           'Day2018sociso', 'Lee2018education23andMe', 'Huang2017aaos', 'Deming2017ab42',
           'Hilbar2017hipv', 'Hilbar2015hipv', 'Lambert2013load', 'Kunkle2019load',
           'Beecham2014braak4', 'Beecham2014npany', 'Deming2017ptau', 'Deming2017tau',
           'Beecham2014vbiany', 'Klimentidis2018mvpa'),
  trait = c("Alcohol Consumption", "Smoking Initiation", "Cigarettes per Day",
            "AUDIT", "Meat diet", "Fish and Plant diet", "Hearing Difficulties",
            "Type 2 Diabetes", 'BMI', "Total Cholesterol", "Low-density lipoproteins",
            "High-density lipoproteins", "Triglycerides", "Diastolic Blood Pressure",
            "Systolic Blood Pressure", "Pulse Pressure", "Depressive Symptoms",
            "Insomnia Symptoms", "Sleep Duration",
            "Social Isolation", "Educational Attainment", "AAOS", "AB42",
            "Hippocampal Volume", "Hippocampal Volume", "LOAD", "LOAD",
            "Neurofibrillary Tangles", "Neuritic Plaques", "Ptau181", "Tau",
            "Vascular Brain Injury", "Moderate-to-vigorous PA"),
  abb = c("Alcohol Consumption", "Smoking", "Cigarettes per Day",
            "AUDIT", "Meat diet", "Fish & Plant diet", "Hearing Difficulties",
            "Diabetes", 'BMI', "Total Cholesterol", "LDL-cholesterol",
            "HDL-cholesterol", "Triglycerides", "DBP",
            "SBP", "PP", "Depression",
            "Insomnia", "Sleep",
            "Social Isolation", "Education", "AAOS", "AB42",
            "Hippocampal Volume", "Hippocampal Volume", "LOAD", "LOAD",
            "Neurofibrillary Tangles", "Neuritic Plaques", "Ptau181", "Tau",
            "Vascular Brain Injury", "Moderate-to-vigorous PA"))

traits %>% knitr::kable()
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
## GLM association statistics
res.files <- list.files(pattern = 'association.tsv', recursive = T)
res <- map_dfr(res.files, read_tsv) %>%
  mutate(OR = exp(estimate),
         LCI = exp(estimate-std.error*1.96),
         UCI = exp(estimate+std.error*1.96))

## Summary files of output from PRSice
prsice.files <- list.files(pattern = '.prsice', recursive = T) %>%
  map_dfr(., function(x){read_tsv(x) %>% mutate(Phenotype = str_extract(x, pattern = '(?<=prs_)[A-Za-z,0-9]+(?=_)'))}) %>%
  filter(Threshold == 5e-8)

## Summary files of output from PRSice
sum.files <- list.files(pattern = '.summary', recursive = T) %>%
  map_dfr(., function(x){read_tsv(x) %>% mutate(Phenotype = str_extract(x, pattern = '(?<=prs_)[A-Za-z,0-9]+(?=_)'))})
117
118
119
res <- read_csv(res.path)
prsice.files <- read_csv(prsice.path)
sum.files <- read_csv(sum.path)
126
127
128
129
130
131
132
133
message("GWS")
# Genome-wide Significant
res.gws <- res %>%
  filter(str_detect(model, 'GWS'), str_detect(term, 'prs')) %>%
  mutate(fdr = p.adjust(p.value, method = 'BH')) %>%
  mutate(model = 'GWS') %>%
  left_join(select(prsice.files, Phenotype, Num_SNP), by = c('exposure' = 'Phenotype')) %>%
  arrange(fdr)
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
res.gws %>%
  left_join(traits, by = c('exposure' = 'code')) %>%
  mutate(estimate = round(estimate, 3),
         std.error = round(std.error, 3),
         b_se = paste0(estimate, ' (', std.error, ')'),
         OR = round(OR, 2),
         LCI = round(LCI, 2),
         UCI = round(UCI, 2),
         out = paste0(OR, ' [', LCI, ', ', UCI, ']'),
         pt = ifelse(pt < 0.001, scientific(pt , digits = 2), pt),
         fdr = ifelse(fdr < 0.001, scientific(fdr , digits = 2), round(fdr, 3)),
         p.value = ifelse(p.value < 0.001, scientific(p.value , digits = 2), round(p.value, 3))) %>%
  select(trait, model, pt, Num_SNP, b_se, p.value, fdr, out) %>%
  arrange(as.numeric(p.value)) %>%
  select(trait, "n SNP" = Num_SNP, "b (se)" = b_se, "OR (95% CI)" = out, p.value, fdr) %>%
  knitr::kable(caption = "Assoction of GWS PRS with AD", format = "html") %>%
  kableExtra::kable_styling()
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
## Genome Wide Significant Results
p.gws <- res.gws %>%
  arrange(OR) %>%
  left_join(traits, by = c('exposure' = 'code')) %>%
  mutate(exposure = fct_inorder(exposure),
         trait = fct_inorder(trait)) %>%
  ggplot(., aes(x = OR, y = trait, colour = fdr < 0.05)) +
  geom_vline(xintercept = 1, linetype = 2, colour = 'grey75') +
  geom_point() +
  geom_errorbarh(aes(xmax = UCI, xmin = LCI), height = 0) +
  scale_color_manual(values = c('black', 'red')) +
  theme_bw() + theme(legend.position = 'none', axis.title.y = element_blank())

p.gws

ggsave(glue("results/{model}_coeffplot_gws_{todaydate}_rsq{r2}_{window}kb.jpg"),
       plot = p.gws, height = 3, width = 3, units = 'in')
178
179
180
181
182
183
184
# Best Pt
res.best <- res %>%
  filter(str_detect(model, 'best'), str_detect(term, 'prs')) %>%
  mutate(fdr = p.adjust(p.value, method = 'BH'))  %>%
  mutate(model = 'best') %>%
  arrange(fdr) %>%
  left_join(select(sum.files, Phenotype, Num_SNP), by = c('exposure' = 'Phenotype'))
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
res.best %>%
  left_join(traits, by = c('exposure' = 'code')) %>%
  mutate(estimate = round(estimate, 3),
         std.error = round(std.error, 3),
         b_se = paste0(estimate, ' (', std.error, ')'),
         OR = round(OR, 2),
         LCI = round(LCI, 2),
         UCI = round(UCI, 2),
         out = paste0(OR, ' [', LCI, ', ', UCI, ']'),
         pt = ifelse(pt < 0.001, scientific(pt , digits = 2), pt),
         fdr = ifelse(fdr < 0.001, scientific(fdr , digits = 2), round(fdr, 3)),
         p.value = ifelse(p.value < 0.001, scientific(p.value , digits = 2), round(p.value, 3))) %>%
  select(trait, model, pt, Num_SNP, b_se, p.value, fdr, out) %>%
  arrange(as.numeric(p.value)) %>%
  select(trait, "n SNP" = Num_SNP, "b (se)" = b_se, "OR (95% CI)" = out, p.value, fdr) %>%
  knitr::kable(caption = "Assoction of best PRS threshold with AD", format = "html") %>%
  kableExtra::kable_styling()
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
## Plot
p.best <- res.best %>%
  arrange(OR) %>%
  left_join(traits, by = c('exposure' = 'code')) %>%
  mutate(exposure = fct_inorder(exposure),
         trait = fct_inorder(trait)) %>%
  ggplot(., aes(x = OR, y = trait, colour = fdr < 0.05)) +
  geom_vline(xintercept = 1, linetype = 2, colour = 'grey75') +
  geom_point() +
  geom_errorbarh(aes(xmax = UCI, xmin = LCI), height = 0) +
  scale_color_manual(values = c('black', 'red')) +
  theme_bw() + theme(legend.position = 'none',
                     axis.title.y = element_blank(),
                     text = element_text(size=10))

p.best

ggsave(glue("results/{model}_coeffplot_best_{todaydate}_rsq{r2}_{window}kb.jpg"),
       plot = p.best, height = 3, width = 3, units = 'in')
234
235
236
237
238
239
240
# Best Pt
res.pca <- res %>%
  filter(str_detect(model, 'PCA'), str_detect(term, 'prs')) %>%
  mutate(fdr = p.adjust(p.value, method = 'BH'))  %>%
  mutate(model = 'PCA') %>%
  arrange(fdr) %>%
  left_join(select(sum.files, Phenotype, Num_SNP), by = c('exposure' = 'Phenotype'))
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
res.pca %>%
  left_join(traits, by = c('exposure' = 'code')) %>%
  mutate(estimate = round(estimate, 3),
         std.error = round(std.error, 3),
         b_se = paste0(estimate, ' (', std.error, ')'),
         OR = round(OR, 2),
         LCI = round(LCI, 2),
         UCI = round(UCI, 2),
         out = paste0(OR, ' [', LCI, ', ', UCI, ']'),
         pt = ifelse(pt < 0.001, scientific(pt , digits = 2), pt),
         fdr = ifelse(fdr < 0.001, scientific(fdr , digits = 2), round(fdr, 3)),
         p.value = ifelse(p.value < 0.001, scientific(p.value , digits = 2), round(p.value, 3))) %>%
  select(trait, model, pt, Num_SNP, b_se, p.value, fdr, out) %>%
  arrange(as.numeric(p.value)) %>%
  select(trait, "n SNP" = Num_SNP, "b (se)" = b_se, "OR (95% CI)" = out, p.value, fdr) %>%
  knitr::kable(caption = "Assoction of best PRS threshold with AD", format = "html") %>%
  kableExtra::kable_styling()
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
## Plot
p.pca <- res.pca %>%
  arrange(OR) %>%
  left_join(traits, by = c('exposure' = 'code')) %>%
  mutate(exposure = fct_inorder(exposure),
         trait = fct_inorder(trait)) %>%
  ggplot(., aes(x = OR, y = trait, colour = fdr < 0.05)) +
  geom_vline(xintercept = 1, linetype = 2, colour = 'grey75') +
  geom_point() +
  geom_errorbarh(aes(xmax = UCI, xmin = LCI), height = 0) +
  scale_color_manual(values = c('black', 'red')) +
  theme_bw() + theme(legend.position = 'none',
                     axis.title.y = element_blank(),
                     text = element_text(size=10))

p.pca

ggsave(glue("results/{model}_coeffplot_best_{todaydate}_rsq{r2}_{window}kb.jpg"),
       plot = p.pca, height = 3, width = 3, units = 'in')
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
## Output results
out <- bind_rows(res.gws, res.best, res.pca) %>%
  left_join(traits, by = c('exposure' = 'code')) %>%
  mutate(estimate = round(estimate, 3),
         std.error = round(std.error, 3),
         b_se = paste0(estimate, ' (', std.error, ')'),
         OR = round(OR, 2),
         LCI = round(LCI, 2),
         UCI = round(UCI, 2),
         out = paste0(OR, ' [', LCI, ', ', UCI, ']'),
         pt = ifelse(pt < 0.001, scientific(pt , digits = 2), pt),
         fdr = ifelse(fdr < 0.001, scientific(fdr , digits = 2), round(fdr, 3)),
         p.value = ifelse(p.value < 0.001, scientific(p.value , digits = 2), round(p.value, 3))) %>%
  select(trait, model, pt, Num_SNP, b_se, p.value, fdr, out) %>%
  pivot_wider(names_from = model, values_from = c(pt, Num_SNP, b_se, p.value, fdr, out)) %>%
  arrange(as.numeric(p.value_GWS)) %>%
  select(trait, Num_SNP_GWS, b_se_GWS, p.value_GWS, fdr_GWS, out_GWS,
         pt_best, Num_SNP_best, b_se_best, p.value_best, fdr_best, out_best,
         b_se_PCA, p.value_PCA, fdr_PCA, out_PCA)

write_csv(out, glue("results/{model}_rsq{r2}_{window}kb_table1_{todaydate}.csv"))
315
316
317
318
319
320
321
322
# pt = best_pt, "n SNP" = best_Num_SNP, "b (se)" = best_b_se, "OR (95% CI)" = best_out, p = best_p.value, fdr = GWS_fdr

out %>%
  rename( "n SNP" = Num_SNP_GWS, "b (se)" = b_se_GWS, "OR (95% CI)" = out_GWS, p = p.value_GWS, fdr = fdr_GWS) %>%
  knitr::kable(caption = "Comparison of GWS and Best pt Results", format = "html") %>%
  kableExtra::kable_styling() %>%
  kableExtra::add_header_above(c(" " = 1, "GWS" = 5, "Best Pt" = 6, "PRS-PCA" = 4)) %>%
  kableExtra::scroll_box(width = "100%")
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
## Joint Plot
p.joint <- bind_rows(res.gws, res.best, res.pca) %>%
  left_join(traits, by = c('exposure' = 'code')) %>%
  mutate(model = fct_relevel(model, 'GWS', 'best', 'PCA')) %>%
  arrange(model, OR) %>%
  mutate(exposure = fct_inorder(exposure),
         trait = fct_inorder(trait),
         signif = case_when(fdr < 0.05 ~ "*",
                            fdr < 0.1 ~ ".",
                            fdr >= 0.1 ~ " ")) %>%
  ggplot(., aes(x = OR, y = trait, colour = signif)) +
  geom_vline(xintercept = 1, linetype = 2, colour = 'grey75') +
  geom_point() +
  geom_errorbarh(aes(xmax = UCI, xmin = LCI), height = 0) +
  scale_color_manual(values = c(" " = 'black', "." = "orange", "*" = 'red')) +
  facet_grid(cols = vars(model)) +
  theme_bw() +
  theme(legend.position = 'none', axis.title.y = element_blank())

p.joint

ggsave(glue("results/{model}_coeffplot_joint_{todaydate}_rsq{r2}_{window}kb.jpg"),
            plot = p.joint, height = 5, width = 8, units = 'in')
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
suppressMessages(library(tidyverse))
suppressMessages(library(glue))

# Read in Data
clumped = snakemake@input[[1]]
out.snps = snakemake@output[[1]]

# Regions for exclusion
regions_chrm = snakemake@params[["regions_chrm"]]
regions_start = snakemake@params[["regions_start"]]
regions_stop = snakemake@params[["regions_stop"]]

message("Reading in Data \n")
dat_ld <- read_table2(clumped) %>%
  filter(!is.na(CHR)) %>%
  select(CHR, F, SNP, BP, P, TOTAL, NSIG)

if(is.null(regions_chrm)){
  message("No exclusions \n")
  out <- dat_ld %>%
    select(SNP)
} else {
  out <- dat_ld
  for(i in 1:length(regions_chrm)){
    message(glue("Excluding regions: ", regions_chrm[i], ":", regions_start[i], "-", regions_stop[i], " (n = {n})", 
                 n = filter(out, (CHR == regions_chrm[i] & between(BP, regions_start[i], regions_stop[i]) )) %>% nrow()))
    out <- filter(out, !(CHR == regions_chrm[i] & between(BP, regions_start[i], regions_stop[i]) ))
  }
  out <- select(out, SNP)
}

write_tsv(out, out.snps)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
library(tidyverse)
library(fastDummies)
`%nin%` = Negate(`%in%`)

## Read in ADGC pheno file
cohort.pheno.path <- snakemake@input[["pheno_file"]]
keep_out <- snakemake@output[["keep"]]
pheno_out <- snakemake@output[["pheno"]]

## Exclude samples
message('\n READ AND FILTER DATA \n')
dat <- read_tsv(cohort.pheno.path)
df <- dat %>%
  filter(!QC_omit) %>%
  filter(!ADGC_omit) %>%
  filter(!rel_omit) %>%
  filter(status %in% c(1,2)) %>%
  filter(!is.na(jointPC1)) %>%
  filter(!is.na(aaoaae2)) %>%
  filter(!is.na(apoe4dose)) %>%
  filter(cohort %nin% c('GSK')) %>%
  mutate(cohort = fct_infreq(cohort)) %>%
  dummy_cols(., select_columns = "cohort", remove_first_dummy = TRUE)

#  filter(cohort %nin% c('ACT2', 'BIOCARD', 'EAS', 'UKS', 'GSK'))

## Write out list of sample to keep
message('\n EXPORT \n')
df %>%
  select(FID, IID) %>%
  write_tsv(keep_out)

## Males Only
df %>%
  filter(sex == 1) %>%
  select(FID, IID) %>%
  write_tsv(paste0(keep_out, '_male'))

## Females Only
df %>%
  filter(sex == 2) %>%
  select(FID, IID) %>%
  write_tsv(paste0(keep_out, '_female'))

# ## Write out dataframe of phenotypes
write_tsv(df, pheno_out)

message('\n TABLE OF EXCLUDED \n')
exclusions <- summarise(dat,
          total = nrow(dat) - nrow(df),
          QC_omit = sum(QC_omit),
          ADGC_omit = sum(ADGC_omit, na.rm = T),
          rel_omit = sum(rel_omit, na.rm = T),
          status = sum(status %nin% c(1,2)),
          miss.pc = sum(is.na(jointPC1)),
          miss.aaoaae2 = sum(is.na(aaoaae2)),
          miss.apoe4dose = sum(is.na(apoe4dose)))

exclusions
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
library(tidyverse)
library(here)
source(here("workflow", "scripts", "prs_pca.R"), chdir = TRUE)

# Define Input
prs.file.path <- snakemake@input[[1]]
outfile = snakemake@output[[1]]

# Read in all PRS file
message('READ IN PRS \n')
prs <- read_table2(prs.file.path)  %>%
  rename_at(vars(-FID, -IID), function(x) paste0('prs.pt_', x))

prspca <- prs.pc(prs)

prs.out <- left_join(prs, prspca$data)

write_csv(prs.out, outfile)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
library(dplyr)
library(readr)
library(stringr)

args = commandArgs(trailingOnly = TRUE) # Set arguments from the command line
input = args[1]
g1 = args[2]
g2 = args[3]


dat <- read_table2(input) %>%
  mutate(trait1 = g1,
         trait2 = g2) %>%
  relocate(annot_name, trait1, trait2)

write_tsv(dat, str_replace(input, ".txt", ".tsv"))
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
library(tidyr)
library(readr)
library(dplyr)
library(plyr)
library(magrittr)

args = commandArgs(trailingOnly = TRUE) # Set arguments from the command line
input = args[1]

## Extract rg results from HDL log output
extract_hdl <- function(x){
  hdl_res <- read_lines(x)
  tribble(
    ~term, ~var,
    "h2_trait1", extract(hdl_res, str_detect(hdl_res, "Heritability of phenotype 1:")),
    "h2_trait2", extract(hdl_res, str_detect(hdl_res, "Heritability of phenotype 2:")),
    "rho", extract(hdl_res, str_detect(hdl_res, "Genetic Covariance:")),
    "rg", extract(hdl_res, str_detect(hdl_res, "Genetic Correlation:")),
    "p", extract(hdl_res, str_detect(hdl_res, "P:")),
  ) %>%
    separate(var, into = c("var", "est"), sep = ":") %>%
    separate(est, into = c("b", "se"), sep = "\\(") %>%
    mutate(se = str_remove(se, "\\)"),
           b = str_trim(b) %>% as.numeric(),
           se = str_trim(se) %>% as.numeric()) %>%
    select(-var) %>%
    pivot_wider(names_from = term, values_from = c(b, se)) %>%
    mutate(trait1 = extract(hdl_res, str_detect(hdl_res, "gwas1.df")) %>%
             str_extract(., pattern = '(?<=input/hdl/)[A-Za-z,0-9]+(?=_formated)'),
           trait2 = extract(hdl_res, str_detect(hdl_res, "gwas2.df")) %>%
             str_extract(., pattern = '(?<=input/hdl/)[A-Za-z,0-9]+(?=_formated)')) %>%
    select(trait1, trait2, b_h2_trait1, se_h2_trait1, b_h2_trait2, se_h2_trait2,
           b_rho, se_rho, b_rg, se_rg, p = b_p, -se_p)
}

dat = read_lines(input)

if(str_detect(dat, "Genetic Covariance:") %>% any()){
  # If HDL finished suscesucully
  message("HDL output present: Extracting")
  extract_hdl(input) %>%
    mutate(success = ifelse(b_rg == "Inf", FALSE, TRUE)) %>%
    write_tsv(., str_replace(input, ".Rout", ".tsv"))

} else {
  # If HDL threw an error
  message("HDL output not present")
  tibble(
    trait1 = str_extract(input, pattern = '(?<=res/hdl/)[A-Za-z,0-9]+(?=_)'),
    trait2 = str_extract(input, pattern = '(?<=_)[A-Za-z,0-9]+(?=_HDL)'),
    b_h2_trait1 = NA,
    se_h2_trait1 = NA,
    b_h2_trait2 = NA,
    se_h2_trait2 = NA,
    b_rho = NA,
    se_rho = NA,
    b_rg = NA,
    se_rg = NA,
    p  = NA,
    success = FALSE
  ) %>%
    write_tsv(., str_replace(input, ".Rout", ".tsv"))
}
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
library(dplyr)
library(readr)
library(stringr)

args = commandArgs(trailingOnly = TRUE) # Set arguments from the command line
input = args[1]
g1 = args[2]
g2 = args[3]


dat <- read_table2(input) %>%
  mutate(trait1 = g1,
         trait2 = g2) %>%
  relocate(trait1, trait2)

write_tsv(dat, input)
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
library(tidyverse)
library(here)
library(gtools)
source(here("workflow", "scripts", "miscfunctions.R"), chdir = TRUE)
# setwd("/sc/arion/projects/LOAD/shea/Projects/MRcovid")
# rg.path = 'results/RGcovid/ldsc/all_results_ldsc_2021-01-12.tsv'

rg.path = snakemake@input[["rg"]]
DATE = snakemake@params[["date"]]
out.path = snakemake@params[["out"]]

## Munge datasets
exposures_outcomes <- expand_grid(exposures, outcomes) %>%
  magrittr::set_colnames(c("exposures.name", "outcomes.name")) %>%
  left_join(select(samplesize, code, trait, domain), by = c('exposures.name' = 'code')) %>%
  rename(exposure.name = trait, domain.exposure = domain) %>%
  left_join(select(samplesize, code, trait, domain), by = c('outcomes.name' = 'code')) %>%
  rename(outcome.name = trait, domain.outcome = domain) %>% 
  mutate(UKB = !str_detect(outcome.name, "UKB"))

rg <- read_tsv(rg.path) %>% 
  mutate(p1 = str_replace(p1, "\\.sumstats.gz", ""), 
         p1 = str_replace(p1, "data/formated/ldsc/", ""), 
         p2 = str_replace(p2, "data/formated/ldsc/", ""), 
         p2 = str_replace(p2, ".sumstats.gz", "")) %>%
  filter(p1 %in% exposures) %>% 
  filter(p2 %in% outcomes) %>% 
  left_join(select(exposures_outcomes, exposures.name, outcomes.name, exposure.name, outcome.name, UKB), 
            by = c("p1" = "exposures.name", "p2" = "outcomes.name")) %>%
  group_by(UKB) %>%
  mutate(fdr = p.adjust(p, "fdr")) %>% 
  ungroup() %>% 
  mutate(lci = rg - (1.96 * se), 
         uci = rg + (1.96 * se), 
         signif = case_when(p > 0.05 ~ "NS", 
                               p < 0.05 & fdr > 0.05 ~ "Nominal", 
                               fdr < 0.05 ~ "Significant"), 
         signif = replace_na(signif, "NS"), 
         signif = fct_relevel(signif, "NS", "Nominal", "Significant"), 
         stars = stars.pval(fdr)) %>%
  arrange(UKB, p2, fdr) 

## Functions of heatmap and forest plots  
forrest_rg <- function(dat){
  ggplot(dat, aes(y = exposure.name, x = rg, colour = signif)) + 
    geom_vline(xintercept = 0, linetype = 2) + 
    facet_grid(. ~ outcome.name, scales = "free_x") + 
    geom_point() + 
    geom_linerange(aes(xmin = lci, xmax = uci)) + 
    scale_color_manual(values = c("grey75", "black", "red")) + 
    theme_bw() + 
    labs(title = "Genetic correlation", x = "rg (95%CI)") + 
    theme(legend.position = "bottom",
          legend.title = element_blank(),
          axis.title.y=element_blank(), 
          strip.background = element_blank(),
          strip.text.y = element_blank(), 
          panel.grid.minor.x  = element_blank())

} 

heatmap_rg <- function(dat){
  ggplot(dat) +
    geom_raster(aes(x = exposure.name, y = outcome.name, fill = rg)) +
    geom_text(data = dat, size = 4, aes(label = stars, x = exposure.name, y = outcome.name)) +
    scale_fill_gradient2(low="steelblue", high="firebrick", mid = "white", na.value = "grey75", name = "rg", limits = c(-1,1)) +
    geom_vline(xintercept=seq(0.5, 23.5, 1),color="white") +
    geom_hline(yintercept=seq(0.5, 11.5, 1),color="white") +
    coord_equal() +
    theme_classic() +
    theme(legend.position = 'right', legend.key.height = unit(1, "line"),
          axis.text.x = element_text(angle = 35, hjust = 0),
          legend.text = element_text(hjust = 1.5), text = element_text(size=10),
          axis.title.x = element_blank(), axis.title.y = element_blank()) +
    scale_x_discrete(position = "top")
}

## Print Forrest Plots
rg_forrest_eur.p <- rg %>% 
  filter(UKB == TRUE) %>%
  arrange(p2, rg) %>% 
  mutate(exposure.name = fct_inorder(exposure.name)) %>% 
  forrest_rg(.)
ggsave(glue(out.path, "forrest_eur_{DATE}.png"), rg_forrest_eur.p ,width = 9, height = 6, units = "in")

rg_forrest_eurLeaveUKB.p <- rg %>% 
  filter(UKB == FALSE) %>%
  arrange(p2, rg) %>% 
  mutate(exposure.name = fct_inorder(exposure.name)) %>% 
  forrest_rg(.)
ggsave(glue(out.path, "forrest_eurLeaveUKB_{DATE}.png"), rg_forrest_eurLeaveUKB.p ,width = 9, height = 6, units = "in")

## Print Heatmaps 
rg_heatmap_eur.p <- rg %>% 
  filter(UKB == TRUE) %>%
  arrange(p2, exposure.name) %>% 
  mutate(exposure.name = fct_inorder(exposure.name), 
         outcome.name = fct_rev(outcome.name)) %>% 
  heatmap_rg()
ggsave(glue(out.path, "heatmap_eur_{DATE}.png"), rg_heatmap_eur.p, width = 11, height = 4, units = "in")

rg_heatmap_eurLeaveUKB.p <- rg %>% 
  filter(UKB == FALSE) %>%
  arrange(p2, exposure.name) %>% 
  mutate(exposure.name = fct_inorder(exposure.name), 
         outcome.name = fct_rev(outcome.name)) %>% 
  heatmap_rg()
ggsave(glue(out.path, "heatmap_eurLeaveUKB_{DATE}.png"), rg_heatmap_eurLeaveUKB.p, width = 11, height = 4, units = "in")
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
if(any(grepl("conda", .libPaths(), fixed = TRUE))){
  message("Setting libPaths")
  df = .libPaths()
  conda_i = which(grepl("conda", df, fixed = TRUE))
  .libPaths(c(df[conda_i], df[-conda_i]))
}
.libPaths(c(.libPaths(), "/hpc/users/harern01/miniconda3/envs/py38/lib/R/library"))
library(tidyr)
library(readr)
library(dplyr)
library(plyr)
library(glue)
#suppressMessages(library(tidyverse))
#suppressMessages(library(glue))

infile_gwas = snakemake@input[["ss"]]
outfile = snakemake@output[["out"]]
## Regions for exclusion
regions_chrm = snakemake@params[["regions_chrm"]]
regions_start = snakemake@params[["regions_start"]]
regions_stop = snakemake@params[["regions_stop"]]

trait.gwas <- suppressMessages(read_tsv(infile_gwas, comment = '#', guess_max = 11000000))


if(is.null(regions_chrm)){

  message("No genomic regions to exclude")

} else {

  for(i in 1:length(regions_chrm)){
    message(glue("Excluding regions: ", regions_chrm[i], ":", regions_start[i], "-", regions_stop[i]))
    trait.gwas <- trait.gwas %>%
      filter(!(CHROM == regions_chrm[i] & between(POS, regions_start[i], regions_stop[i])))
  }
}

write_tsv(trait.gwas , gzfile(outfile))
ShowHide 74 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/marcoralab/MRcovid
Name: mrcovid
Version: v6
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...