Improving pathogenicity prediction of missense variants by using AlphaFold-derived features

public public 1yr ago Version: 0.9 0 bookmarks

This code belongs to the project "Predicting the pathogenicity of missense variants using features derived from AlphaFold2". This is the source code of the project; precalculated scores are deposited under the DOI 10.5281/zenodo.6288139 .

If you would like to retrieve individual scores for your variants you can visit the website: http://holzwerk-schmidt.de/AlphScore/

Workflow

In the first part of the workflow features of each amino acid from AlphaFold structures ( https://alphafold.ebi.ac.uk/ ) are extracted. For feature extraction the tools DSSP, FEATURE, protinter and the python modules biopython, biopandas and networkx are used. The extracted features are then added to the data of dbNSFP 4.2a and dbNSFP is subsequently used to form variant sets (gnomAD, ClinVar). Then these variants are used to train tree-based machine learning classifiers.

Workflow: alt text

Install the requiered dependencies

First clone or download the repository and the modified version of the tool protinter to a directory with plenty (~500Gb) of free storage by. Protinter was modified from its original version to increase its speed and to be able to obtain residues that are spatially close.

git clone https://github.com/Ax-Sch/AlphScore
cd AlphScore/tools
git clone https://github.com/Ax-Sch/protinter.git
chmod +x protinter/protinter
cd ../

If you do not have conda installed, please now obtain it e.g. from https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html ) and set up the main conda environment and the conda environment in which the FEATURE framework will run:

conda env update --file workflow/envs/AlphScore.yaml
conda env update --file workflow/envs/dssp.yaml

Additionally the FEATURE framework version 3.1.0 needs to be obtained from the following website: https://simtk.org . To install the feature framework place the file feature-3.1.0-src.tar.gz (obtained from simtk.org) to the folder tools/feature. Then execute the following commands:

cd tools/feature
conda activate DSSP
tar -xf feature-3.1.0-src.tar.gz 
cd feature-3.1.0/
make
cd ../../..

After this, the executable of FEATURE should be located in the following path: tools/feature/feature-3.1.0/bin/featurize

Run the snakemake pipeline

The snakemake pipeline can then be run by executing the following command.

conda activate AlphScore
snakemake --cores [number of cores you would like to use] --use-conda --conda-frontend conda

At the top of the file workflow/Snakefile the variable testing is set to True (testing=True). This reduces the computation to a set of about 210 proteins, to test if everything works as expected without the need of processing all proteins. You can set the variable testing to False (change testing=True to testing=False) to run all proteins, which will result in >100,000 jobs. This code was tested on a HPC cluster with CentOS Linux 7, miniconda3 and the job scheduler Slurm.

If you have any questions do not hesitate to contact me.

Code Snippets

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
library(tidyverse)
library(readxl)
library(gridExtra)
library(optparse)

option_list = list(
  make_option(c("-r", "--input_recalibrated"), type="character", default="C:/Users/Karo.PC-Karo/Master/Semester_3/Lab_rotation_Ludwig/for_Karola/SSH/gnomad_extracted_prepro_rec.csv.gz", 
              help="location of csv.gz file"),
  make_option(c("-e", "--excel_location"), type="character", default="C:/Users/Karo.PC-Karo/Master/Semester_3/Lab_rotation_Ludwig/for_Karola/local_files/available_colnames_regular.xlsx", 
              help="location of excel file"),
  make_option(c("-o", "--out_folder"), type="character", default="C:/Users/Karo.PC-Karo/Master/Semester_3/Lab_rotation_Ludwig/for_Karola/local_files", 
              help="name of folder to store output")

)

opt = parse_args(OptionParser(option_list=option_list))

variants<-read_csv(opt$input_recalibrated, na=c(".","NA"))
colnames_usage <- read_excel(opt$excel_location)

dir.create(opt$out_folder, recursive=TRUE)
setwd(opt$out_folder)

write_tsv(as_tibble(colnames(variants)), file="available_colnames_W_surr.tsv")

for_prediction<-(colnames_usage %>% filter(!is.na(for_prediction)))$value

mean_no_na_no_round<-function(x) {(mean(x, na.rm=TRUE))}


mean_properties<-variants %>% filter(b_factor>80) %>% 
  group_by(from_AS, outcome)%>% 
  dplyr::select(for_prediction) %>%
  summarise_all(mean_no_na_no_round)%>%
  ungroup()%>%
  dplyr::select(-starts_with("RESIDUE_NAME"))

mean_properties<-mean_properties %>% 
  gather(variable, value, -outcome, -from_AS) %>%
  group_by(variable,from_AS) %>%
  summarise(value = (value[outcome == 1] - value[outcome == 0]))

####Both Outcomes, total count####
mean_properties_freq<-variants %>%  
  group_by(from_AS, to_AS) %>%
  select(from_AS, to_AS) %>%
  filter(!to_AS=="%3D") %>%
  filter(!to_AS=="???") %>%
  count(from_AS,to_AS,sort=TRUE)

plot_tot_AA<-ggplot(mean_properties_freq, aes(x = from_AS, y = to_AS, fill = n)) +
  geom_tile()+
  theme_minimal()+  
  scale_fill_viridis_c()+
  theme(plot.title= element_text(size = 6),
        axis.text.x = element_text(size=5,angle = 90, vjust = 0.5, hjust=1),
        axis.text.y = element_text(size=5, vjust = 0.5, hjust=1),
        legend.key.height = unit(3, 'mm'),
        legend.key.width = unit(2.5, 'mm'),
        legend.title = element_text(size=5), 
        legend.text = element_text(size=5),
        legend.box.spacing = unit(0.1,"mm"),
        axis.title.y=element_text(size=6),
        axis.title.x=element_text(size=6),
        plot.margin=unit(c(1,1,1,1), "mm"))+
  ggtitle("AA exchanges, raw count")+
  geom_text(aes(from_AS, to_AS, label=n), colour = "white", check_overlap = TRUE, size=1.5)

#ggsave("AA_ex_tot.pdf", width=28.575, height=19.05, units="cm", dpi=300)

####Relative fraction outcome1/outcome0####
mean_properties_freq_outcome<-variants %>% 
  filter(!to_AS=="%3D") %>%
  filter(!to_AS=="???") %>%
  group_by(from_AS, to_AS) %>%
  select(from_AS, to_AS, outcome) %>%
  summarise(frac=mean(outcome))

frac_round<- round(mean_properties_freq_outcome$frac, digits = 2)

plot_AA_freq<-ggplot(mean_properties_freq_outcome, aes(x = from_AS, y = to_AS, fill = frac)) +
  geom_tile()+
  theme_minimal()+  
  scale_fill_viridis_c()+
  theme(plot.title= element_text(size = 6),
        axis.text.x = element_text(size=5,angle = 90, vjust = 0.5, hjust=1),
        axis.text.y = element_text(size=5, vjust = 0.5, hjust=1),
        legend.key.height = unit(3, 'mm'),
        legend.key.width = unit(2.5, 'mm'),
        legend.title = element_text(size=5), 
        legend.text = element_text(size=5),
        legend.box.spacing = unit(0.1,"mm"),
        axis.title.y=element_text(size=6),
        axis.title.x=element_text(size=6),
        plot.margin=unit(c(1,1,1,1), "mm"))+
  ggtitle("Fraction of AA exchanges with outcome = 1")+
  geom_text(aes(from_AS, to_AS, label=frac_round), colour = "white", check_overlap = TRUE, size=1.5)

#ggsave("AA_ex_frac.pdf",width=28.575, height=19.05, units="cm", dpi=300)

g<-arrangeGrob(plot_tot_AA, plot_AA_freq, ncol=1) #combine both plots in one
ggsave("AA_ex_compared.pdf", width=95, height=120,units="mm", dpi=300,g)
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
library(tidyverse)
library(pROC)
library(gridExtra)
library(optparse)
library(boot)
source("workflow/scripts/existing_scores_glm_functions.R")
BOOT_REPETITIONS=1000
N_CPU=3
COL_ORDER=c("Alph_null","AlphScore","CADD_raw","DEOGEN2_score_med","REVEL_score","glm_AlphCadd","glm_AlphDeogen",
  "glm_AlphRevel","glm_CaddDeogen","glm_DeogenRevel","glm_RevelCadd","glm_AlphDeogenRevel",
  "glm_AlphCaddDeogen","glm_AlphCaddRevel","glm_CaddDeogenRevel")

set.seed(1)


option_list = list(
  make_option(c("-t", "--variants"), type="character", default="results/prediction_final/pre_final_model_regular_variants.csv.gz", 
              help="csv.gz file, test dataset"),
  make_option(c("-v", "--validation_set"), type="character", default="results/validation_set/validation_set_w_AlphScore.csv.gz", 
              help="Validation set"),
  make_option(c("-o", "--out_folder"), type="character", default="results/analyse_score", 
              help="Output folder"))

opt = parse_args(OptionParser(option_list=option_list))

variants<-read_csv(opt$variants)
validation_dataset<-read_csv(opt$validation_set)

dir.create(opt$out_folder, recursive=TRUE)
setwd(opt$out_folder)

# get DEOGEN2 Scores:
variants$pos_in_VEP_and_Uniprot<-get_index_col(variants)
variants$DEOGEN2_score_med<-unlist_score(variants$DEOGEN2_score, variants$pos_in_VEP_and_Uniprot)

validation_dataset$pos_in_VEP_and_Uniprot<-get_index_col(validation_dataset)
validation_dataset$DEOGEN2_score_med<-unlist_score(validation_dataset$DEOGEN2_score, validation_dataset$pos_in_VEP_and_Uniprot)

variants$AlphScore<-variants$predicted_Alph

variants<-variants%>% 
  mutate(ID=paste(`#chr`, `pos(1-based)`,ref, alt, sep=":"))

### combine scores on interim dataset:
interim_dataset<-variants %>% 
  filter(clinvar_interim_test==TRUE)
gnomad_train_dataset<-variants %>% 
  filter(gnomad_train==TRUE)

validation_dataset<-validation_dataset %>%
  mutate(ID=paste(`#chr`, `pos(1-based)`,ref, alt, sep=":"))%>%
  filter(!ID %in% interim_dataset$ID)%>%
  filter(!ID %in% gnomad_train_dataset$ID)


set_of_models<-fit_set_of_models(interim_dataset)
validation_dataset<-predict_set_of_models(set_of_models, validation_dataset)


selected_scores<-c("HRAS_DMS_g12v",
                   "P53_DMS_WT_Nutlin",
                   "ADRB2_DMS_0.625",          
                   "BRCA1_DMS_E3_high_(a)",    
                   "BRCA1_DMS_(b)",            
                   "MSH2_DMS",      
                   "TPMT_DMS",                 
                   "PTEN_DMS_(a)",             
                   "PTEN_DMS_highqual_(b)",    
                   "SUMO1_DMS_flipped",        
                   "UBE2I_DMS_flipped",       
                   "VKOR1_DMS_abundance_score",
                   "TPK1_DMS_flipped")


validation_dataset<-validation_dataset%>%
  mutate(scoreID=paste(gene_dms, DMS, sep="_"))%>%
  filter(scoreID %in% selected_scores)



###### validation Dataset

prot_info_for_join<-validation_dataset %>% 
  distinct(Uniprot_acc_split, protein_length, protein_mean_b_factor)

SCORES<-c("Alph_null","AlphScore", "CADD_raw", "glm_AlphCadd", 
          "REVEL_score", "glm_AlphRevel", "glm_RevelCadd", 
          "DEOGEN2_score_med", "glm_AlphDeogen", "glm_CaddDeogen", 
          "glm_AlphCaddDeogen", "glm_DeogenRevel","glm_AlphDeogenRevel",
          "glm_AlphCaddRevel",    "glm_CaddDeogenRevel")

get_correlations<-function(val_dat){


  spearman_joined_private<-tibble()

for (un_ID in unique(val_dat$Uniprot_acc_split)){
  temp_values_joined<-val_dat %>% 
    filter(Uniprot_acc_split==un_ID)
  temp_values_joined <- temp_values_joined[complete.cases(temp_values_joined[, SCORES]),]
  for (dms in unique(temp_values_joined$DMS)){
    temp_inner_loop_values_joined<-temp_values_joined %>% 
      filter(DMS==dms)
    for (score in SCORES){
     score_DMS_tibble<-tibble(score_val=unlist(temp_inner_loop_values_joined %>% select(one_of(score))),
                              DMS_val=temp_inner_loop_values_joined$DMS_val)
     score_DMS_tibble<-score_DMS_tibble %>% filter(complete.cases(.))
      spearmans<-tibble(
      gene=unique(temp_inner_loop_values_joined$gene_dms), 
      UP_ID=un_ID, 
      DMS=dms, 
      spearm=ifelse(nrow(score_DMS_tibble)>2,
      cor.test(score_DMS_tibble$DMS_val, score_DMS_tibble$score_val,
               method = "spearman", na.action="na.ommit")$estimate,NA),
      method=score,
      number_of_vals=nrow(score_DMS_tibble))
      spearman_joined_private=rbind(spearman_joined_private,spearmans)
      }
  }
}

bad_UP_ids<-(spearman_joined_private %>% 
               filter(number_of_vals<21))$UP_ID

spearman_joined_private<-spearman_joined_private %>% 
  mutate(scoreID=paste(gene, DMS, sep="_"))%>%
  mutate(method=factor(method, levels= SCORES))%>%
  filter(! UP_ID %in% bad_UP_ids)%>%
  left_join(prot_info_for_join, by=c("UP_ID"="Uniprot_acc_split"))%>%
  mutate(abs_spear=abs(spearm)) %>%
  mutate(alphInclude=grepl("Alph", method))

return(spearman_joined_private)
}

spearmans_joined<-get_correlations(validation_dataset)

scatter_dms<-ggplot(data=validation_dataset)+
  geom_point(aes(x=DMS_val, y=AlphScore), color="darkblue", alpha=0.1)+
  geom_text(data=spearmans_joined %>% filter(method=="AlphScore"), aes(label=round(abs_spear, 3)),x = Inf, y = Inf, hjust = 1, vjust = 1)+
  labs(x="DMS score")+
  facet_wrap(~ scoreID, scales = "free", ncol = 3)+
  theme_minimal()

ggsave(filename = "scatter_dms_alphscore.png",
       plot = scatter_dms,
       height = 10,
       width=8, 
       dpi=300)

scatter_dms<-ggplot(data=validation_dataset)+
  geom_point(aes(x=DMS_val, y=glm_AlphDeogenRevel), color="darkblue", alpha=0.1)+
  geom_text(data=spearmans_joined %>% filter(method=="glm_AlphDeogenRevel"), aes(label=round(abs_spear, 3)),x = Inf, y = Inf, hjust = 1, vjust = 1)+
  labs(y="AlphScore + DEOGEN2 + REVEL", x="DMS score")+
  facet_wrap(~ scoreID, scales = "free", ncol = 3)+
  theme_minimal()

ggsave(filename = "scatter_dms_alph_doegen_revel.png",
       plot = scatter_dms,
       height = 10,
       width=8, 
       dpi=300)



pairwise.t.test(x = spearmans_joined$abs_spear, g = spearmans_joined$method, paired = TRUE, 
                p.adjust.method = "none", alternative = "greater")
# has been performed first; 
spearmans_joined %>% 
  group_by(gene, UP_ID, DMS) %>% 
  summarise(mean_abs_spear=mean(abs(spearm)))
## selected the scores with the highest overall correlation in mean_corr_per_score

spearmans_joined<-spearmans_joined%>%
  mutate(method=factor(method, levels=COL_ORDER))

plot_spearmans<-ggplot(spearmans_joined, aes(x=method, y=abs(spearm)))+
  stat_summary(fun.y = mean, geom = "bar", width=0.8, color="black", fill="grey") + 
  stat_summary(fun.data = mean_se, geom = "errorbar", width=0.4)+
  geom_jitter(width=0.15, color="blue4", alpha=1, size=1)+
  theme_minimal()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1.0, hjust=1, color="black", size=10),
                          axis.text.y = element_text(color="black", size=10))+
  labs(x = "")+
  labs(y = "absolute Spearman correlation", size=12)

plot_spearmans
ggsave(filename= "spearman_plot.pdf", plot=plot_spearmans, height=5, width=4)

getmeansSD_spearman<-function(spearmans_joined_private){
  table_spearm_private<-spearmans_joined_private %>% 
    mutate(abs_spearman=abs(spearm))%>%
    group_by(method)%>%
    summarise(mean_spearm=mean(abs_spearman), sd=sd(abs_spearman), num=n(), sem=sd(abs_spearman)/sqrt(n())) %>%
    arrange(-mean_spearm)
  return(table_spearm_private)
}

table_spearm<-getmeansSD_spearman(spearmans_joined)
print(table_spearm)

write_tsv(x = table_spearm, file = "table_spearm.tsv")

getMeanAbsCorr<-function(data_priv, i){
  spearman_joined_priv<-get_correlations(data_priv[i,])
  table_spearm_private<-getmeansSD_spearman(spearman_joined_priv)
  return(unlist(table_spearm_private[,"mean_spearm"]))
}

getMeanAbsCorr_scoreids<-function(data_priv, i){
  filtered_data<-tibble()

  for (score_id in unlist(data_priv[i,]) ){
  filtered_data<-rbind(filtered_data,
    validation_dataset %>%
      filter(scoreID %in% score_id) )
  }

  spearman_joined_priv<-get_correlations(filtered_data)
  table_spearm_private<-getmeansSD_spearman(spearman_joined_priv)
  table_spearm_private$used_scores<-as.numeric(paste(i, collapse = ""))
  # for check print
  #print(data_priv[i,])
  #print(nrow(filtered_data))

  return(table_spearm_private)
}

getMeanAbsCorr_scoreids_return_vector<-function(data_priv, i){
  table_spearm_private<-getMeanAbsCorr_scoreids(data_priv, i)
  correlations_used_scores<-c(unlist(table_spearm_private[,"mean_spearm"]),
                              unlist(table_spearm_private[1,"used_scores"]))

  return(correlations_used_scores)
}

compare_scores<-function(booted_values_private){
  booted_values_private_summarised<-booted_values_private %>%
    rowwise%>%
    select(-starts_with("used_cols_vector"))%>%
    mutate(rowmax=max(across()))%>%
    mutate(glm_AlphDeogenRevel_biggest=glm_AlphDeogenRevel==rowmax)%>%
    mutate(glm_AlphCadd_biggerThanCadd=glm_AlphCadd>CADD_raw)%>%
    mutate(glm_AlphDeogen_biggerThanDeogen=glm_AlphDeogen>DEOGEN2_score_med)%>%
    mutate(glm_AlphRevel_biggerThanRevel=glm_AlphRevel>REVEL_score)
  return(booted_values_private_summarised)
}

get_p_value_table<-function(booted_values_summarised_private){
  relevant_scores<-c("glm_AlphDeogenRevel_biggest","glm_AlphDeogen_biggerThanDeogen","glm_AlphRevel_biggerThanRevel","glm_AlphCadd_biggerThanCadd")
  p_val_tabl<-tibble()
  for (columname in relevant_scores){
    p_val_tabl<- rbind(p_val_tabl, 
                       tibble(name=columname, num_true=sum(unlist(booted_values_summarised_private[,columname])), total_num=nrow(booted_values_summarised_private)))
  }
  return(p_val_tabl)
}



# do bootstrapping for means
mean_original_mat<-getmeansSD_spearman(get_correlations(validation_dataset))
set.seed(1)

boot_meanAbsCor<-boot(data=validation_dataset, statistic=getMeanAbsCorr, R=BOOT_REPETITIONS, parallel="multicore", ncpus=N_CPU)
booted_meanAbsCor_values<-as_tibble(boot_meanAbsCor$t)
colnames(booted_meanAbsCor_values)<-mean_original_mat$method
booted_meanAbsCor_values<-compare_scores(booted_meanAbsCor_values)
pValTable_AbsCor<-get_p_value_table(booted_meanAbsCor_values)
pValTable_AbsCor<-pValTable_AbsCor%>% 
  mutate(pval=1-num_true/total_num)
write_tsv(x=pValTable_AbsCor, "pValTable_AbsCor.tsv")


# bootstrap the choice of scores
score_ids<-as_tibble(unique(validation_dataset$scoreID))
mean_original_mat<-getMeanAbsCorr_scoreids(score_ids,1:13)
set.seed(1)
boot_meanAbsCor<-boot(data=score_ids, statistic=getMeanAbsCorr_scoreids_return_vector, R=BOOT_REPETITIONS, parallel="multicore", ncpus=N_CPU)
booted_meanAbsCor_values<-as_tibble(boot_meanAbsCor$t)
colnames(booted_meanAbsCor_values)<-c(as.character(mean_original_mat$method), "used_cols_vector")
booted_meanAbsCor_values<-compare_scores(booted_meanAbsCor_values)
pValTable_AbsCor<-get_p_value_table(booted_meanAbsCor_values)
pValTable_AbsCor<-pValTable_AbsCor%>% 
  mutate(pval=1-num_true/total_num)
write_tsv(x=pValTable_AbsCor, "pValTable_AbsCor_choice_of_experiments.tsv")
pValTable_AbsCor

rank_score_spearm<-function(spearm_joined_private){
  rank_scored_priv<-spearm_joined_private%>%group_by(scoreID)%>%
    mutate(minCor=min(abs_spear), maxCor=max(abs_spear))%>%
    ungroup()%>%
    mutate(norm_cor=(abs_spear-minCor)/(maxCor-minCor))%>%
    group_by(method,gene)%>%
    summarise(gene_level_cor=mean(norm_cor))%>%
    group_by(method)%>%
    summarise(rank_score=sum(gene_level_cor)/n() , sd_score=sd(gene_level_cor)) %>%
    mutate(sd_score_norm=sd_score/rank_score)%>%
    arrange(-method)
  return(rank_scored_priv)
}

getSpearmanRank<-function(data_priv, i){
  spearman_joined_priv<-get_correlations(data_priv[i,])# %>%
    #filter(method!="Alph_null")

  rank_scored<-rank_score_spearm(spearman_joined_priv)
  return(unlist(rank_scored[,2]))
}


# do bootstrapping for rankscores
rank_scored_original_mat<-rank_score_spearm(get_correlations(validation_dataset))
set.seed(1)
boot_rank_rankscore<-boot(data=validation_dataset, statistic=getSpearmanRank, R=BOOT_REPETITIONS, parallel="multicore", ncpus=N_CPU) # parallel="multicore", ncpus=3
booted_rankscores_values<-as_tibble(boot_rank_rankscore$t)
colnames(booted_rankscores_values)<-rank_scored_original_mat$method
booted_rankscores_values<-compare_scores(booted_rankscores_values)
vals<-booted_rankscores_values %>% pivot_longer(cols=1:length(unique(rank_scored_original_mat$method)))

vals<-vals%>%
  mutate(name=factor(name, levels=COL_ORDER))



rankscore_plot<-ggplot(vals)+
  geom_boxplot(aes(x=name,y=value))+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color="black", size=10))

ggsave(filename = "rankscore_plot.pdf", plot = rankscore_plot, width=5, height=5.5)

pValTable_RankScore<-get_p_value_table(booted_rankscores_values)
pValTable_RankScore<-pValTable_RankScore%>% 
  mutate(pval=1-num_true/total_num)
pValTable_RankScore
write_tsv(x=pValTable_RankScore, "pValTable_RankScore.tsv")


ggplot(spearmans_joined, aes(x=scoreID, y=abs_spear, color=method, label=method))+
  geom_jitter(position = position_jitter(seed = 1))+
  #geom_text(position = position_jitter(seed = 1))+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color="black", size=10),
        axis.text.y = element_text(color="black", size=10))
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
library(tidyverse)
library(readxl)
library(optparse)

option_list = list(
  make_option(c("-c", "--csv_location"), type="character", default="C:/Users/Karo.PC-Karo/Master/Semester_3/Lab_rotation_Ludwig/for_Karola/SSH/variants_preprocessed_samp.csv", 
              help="location of csv.gz file"),
  make_option(c("-o", "--out_folder"), type="character", default="C:/Users/Karo.PC-Karo/Master/Semester_3/Lab_rotation_Ludwig/for_Karola/local_files", 
              help="name of folder to store output")

)

opt = parse_args(OptionParser(option_list=option_list))

variants<-read_csv(opt$csv_location, na=c(".","NA"))

dir.create(opt$out_folder, recursive=TRUE)
setwd(opt$out_folder)

variants_gnomad<- variants %>% filter(gnomadSet==1)
variants_cv<- variants %>% filter(gnomadSet==0)
variants_unique_genes<-unique(variants$var_id_genomic)
variants_unique_proteins<-unique(variants$var_id_prot)
gnomad_outc1<-variants%>%filter(gnomadSet==1 & outcome==1)
gnomad_outc0<-variants%>%filter(gnomadSet==1 & outcome==0)
cv_outc1<-variants%>%filter(gnomadSet==0 & outcome==1)
cv_outc0<-variants%>%filter(gnomadSet==0 & outcome==0)
cv18_to_21_CV<- variants%>%filter(clinvar_holdout_test==1)
gene_no<-unique(variants$gene)
hold_out_genes<- variants%>%filter(hold_out_genes==1)
train1<-variants%>%filter(!(hold_out_genes==TRUE) & gnomadSet==TRUE)
test1<-variants%>%filter(clinvar_interim_test==TRUE)
test2<-variants %>% filter(clinvar_holdout_test==TRUE)
rest_train1<-variants%>%filter((hold_out_genes==TRUE) & gnomadSet==TRUE)
ensemblID_gene<-unique(variants$Ensembl_geneid)
ensemblID_protein<-unique(variants$Ensembl_proteinid)
uniprot<-unique(variants$Uniprot_entry)
uniprot_acc<-unique(variants$Uniprot_acc)
ensemblID_in_genes<-ensemblID_gene %in% variants_unique_genes == TRUE

var_tibble<-tibble(total_var=nrow(variants),
                        total_gnomad=nrow(variants_gnomad),
                        total_cv=nrow(variants_cv),
                        total_genes=length(variants_unique_genes),
                        total_proteins=length(variants_unique_proteins),                        
                        gnomad_outcome1=nrow(gnomad_outc1),
                        gmoad_outcome0=nrow(gnomad_outc0),
                        cv_outcome1=nrow(cv_outc1),
                        cv_outcome0=nrow(cv_outc0),
                        cv18_to_21_CV=nrow(cv18_to_21_CV),
                        gene_no=length(gene_no),
                        train1=nrow(train1),
                        test1=nrow(test1),
                        test2=nrow(test2),
                        rest_train1= nrow(rest_train1),
                        ensemblIDgene=length(ensemblID_gene),
                        uniprot=length(uniprot),
                        uniprot_acc=length(uniprot_acc),
                        ensemblID_in_genes=length(ensemblID_in_genes))
write_tsv(var_tibble,"characteristics_variants.tsv")
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
director_combined_files=$1
out_dir=$2
first_file="$director_combined_files/"$(ls $director_combined_files | head -n1)

echo "using header of: $first_file"

mkdir -p $out_dir
mkdir -p $out_dir"/tmp"
zcat $first_file | head -n1 > $out_dir"/header.csv"
zcat_sting=""

# loop over directory content and extract collect non empty files with TRUE in filename 
for file in $(ls $director_combined_files)
do
if [[ $file == *"TRUE"* ]]; then
f_size=$(stat --printf="%s" $director_combined_files"/"$file)
if [ $f_size -gt 100 ]
then
zcat_string+=" ${director_combined_files}/${file}"
fi
fi
done

#relevant_cols=$(zcat results/validation_set/validation_set_w_AlphScore.csv.gz | awk -v RS=',' '/^b_factor$/{print NR} /^SOLVENT_ACCESSIBILITY_core$/{print NR} /^Uniprot_acc_split$/{print NR} /^gnomad_train$/{print NR} /AlphScore/{print NR; exit}' | tr "\n" ",")"2,3,4,5,6,7,8,9,10"


echo $zcat_string

cat $out_dir"/header.csv" > $out_dir"/header_short.csv"

zcat $zcat_string  | grep -v "chr,pos(1-based)," | sort -V -t, -T $out_dir"/tmp/" -k1 -k2 - | \
cat $out_dir"/header_short.csv" - | tr "," "\t" | bgzip -c > $out_dir"/all_possible_values_concat.csv.gz"

tabix -s 1 -b 2 -e 2 $out_dir"/all_possible_values_concat.csv.gz"

rm -rf $out_dir"/tmp"
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
import sys
import pandas as pd
import numpy as np
from operator import itemgetter


### for debuging
class testclass:
	def __init__(self):
		self.wildcards=["AF-A0A087WUL8-F11-model_v1"]
		self.output=["results/AF-A0A087WUL8-F11-model_v1__combined_features.csv.gz"]
		self.params=["results/pdb_features/"]

#snakemake=testclass()
offset=0
offset=( int(snakemake.wildcards[0].split("-")[2][1:] ) - 1) * 200

# generate a list with the number of interactions
possible_interactions=["aroaro", "arosul", "cationpi", "disulphide", "hbond_main_side", "hbond_side_side", "hydrophobic", "ionic", "within_radius"]
interaction_matrix=np.zeros((40000,len(possible_interactions)),np.int) # 40000 - longer than Titin
i=0
edges=[]
for interaction in possible_interactions:
	interactions_pd=pd.read_csv(snakemake.params[0] + "/result_" + interaction + ".csv", low_memory=False)
	interactions_np=interactions_pd.to_numpy()
	interactions_np[:,1] = interactions_np[:,1] + offset
	interactions_np[:,3] = interactions_np[:,3] + offset
	interactions_joined=np.append(interactions_np[:,1], interactions_np[:,3])
	interactions_np[:,0]=np.array([s.strip() for s in interactions_np[:,0]])
	interactions_np[:,2]=np.array([s.strip() for s in interactions_np[:,2]])
	for ele in interactions_joined:
		interaction_matrix[ele,i]=interaction_matrix[ele,i]+1
	i=i+1


# Read features from FEATURIZE Framework
header=pd.read_csv(snakemake.params[1],skipinitialspace=True).columns.to_list()

to_remove=["RAUTE","coord1", "coord2", "coord3", "RAUTE2", "SECONDARY_STRUCTURE2_IS_UNKNOWN", "SECONDARY_STRUCTURE1_IS_UNKNOWN", "RESIDUE_CLASS2_IS_UNKNOWN","RESIDUE_CLASS1_IS_UNKNOWN", "RESIDUE_NAME_IS_OTHER", "ATOM_TYPE_IS_OTHER", "ELEMENT_IS_OTHER", "Atom_name"]

for suff in ["_core", "_3A", "_6A", "_9A", "_12A"]:
	header_surround=[ el + suff  if el not in ["RESN_RESI_AT"] + to_remove else el for el in header] 
	features=pd.read_csv(snakemake.params[0] + "/" + snakemake.wildcards[0] + suff + ".txt", delimiter="\t", names=header_surround,skipinitialspace=True , low_memory=False)
	features["RESN_RESI"]=features.RESN_RESI_AT.str.split(":",expand=True).iloc[:,0]
	features["RESN"]=features.RESN_RESI.str[:3]
	features["RESI"]=features.RESN_RESI.str[3:]
	features=features.astype({"RESI": int})
	features.RESI=features.RESI + offset
	features["RESN_RESI"]=features["RESN"]+features["RESI"].astype(str)
	features.index=features.RESI
	features=features.drop(to_remove, axis=1)
	print(suff)
	if suff!="_core":
		features_joined=features_joined.join(features.drop(['RESN_RESI_AT', 'RESN_RESI', 'RESI','RESN'], axis=1))
	else:
		features_joined=features

# join interactions and FEATURE based values
interactions_for_join=pd.DataFrame(interaction_matrix, columns=possible_interactions) # offset already inserted
interactions_for_join=interactions_for_join.astype(int)
interactions_for_join.index.name="RESI" 
features_joined2=features_joined.join(interactions_for_join)

# join the bfactors
b_factors=pd.read_csv(snakemake.params[0] + "/b_factors.csv", delimiter=" ", names=["residue_number","atom_name","b_factor"], skipinitialspace=True, low_memory=False)
b_factors["residue_number"] = pd.to_numeric(b_factors["residue_number"]) + offset
b_factors.index=b_factors["residue_number"] ########### +
features_joined3=features_joined2.join(b_factors)

# join the HSEs
HSE=pd.read_csv(snakemake.params[0] + "/HSEs.csv", delimiter=",", skipinitialspace=True, low_memory=False)
print(HSE.columns)
HSE["RESI"] = pd.to_numeric(HSE["RESI"]) + offset
HSE.index=HSE["RESI"]
HSE=HSE.drop("RESI", axis=1) ########### +
features_joined4=features_joined3.join(HSE)


features_joined4[["pdb_file"]]=snakemake.wildcards[0]
features_joined4.to_csv(snakemake.output[0])
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
library(tidyverse)
library(readxl)
library(data.table)
library(optparse)

option_list = list(
  make_option(c("-m", "--mapping"), type="character", default="resources/mapping_genename_uniprot.tsv", 
              help="location of file that maps gene name in assay to UniProt ID"),
  make_option(c("-p", "--prot_folder"), type="character", default="results/predicted_prots_eval", 
              help="Folder with protein files with AlphScore predictions"),
  make_option(c("-a", "--suffix"), type="character", default="_w_AlphScore_red_FALSE.csv.gz", 
              help="suffix that protein files with AlphScore predictions have")
)
opt = parse_args(OptionParser(option_list=option_list))

list_uniprot_ids<-read_tsv(opt$mapping)
PREDICTED_PROT_SUFFIX=opt$suffix
folder_of_prots=opt$prot_folder

# MAPK1	/ Uniprot P28482 not in mapping file, as this was not mapped between Uniprot and dbNSFP

sheets<-c("ADRB2","BRCA1","CALM1", "HRAS", "MAPK1", "P53",  "PTEN", "SUMO1", "TPK1", "TPMT", "UBE2I")
scores_conc<-tibble()

for (gene in sheets){
  scores <- read_excel("resources/Source.xlsx", sheet=gene) %>% 
    select(variant, starts_with("DMS"))%>%
    mutate(gene=gene)%>%
    gather(key="DMS", value="DMS_val", -variant, -gene)
  scores_conc=rbind(scores_conc, scores)
}

#MSH2
scores <- read_excel("resources/MSH2_Jia_2020/1-s2.0-S0002929720304390-mmc2.xlsx", sheet="TableS5_AllVariants")%>%
  rename(variant=Variant, DMS=`LOF score`)%>%
  select(variant, starts_with("DMS"))%>%
  mutate(gene="MSH2")%>%
  gather(key="DMS", value="DMS_val", -variant, -gene)
scores_conc=rbind(scores_conc, scores)

#Abeta
scores <- read_excel("resources/Abeta_Seuma_2020/elife-63364-supp4-v2.xlsx", sheet="1 aa change")%>%
  mutate(variant=paste0(WT_AA, Pos, Mut))%>%
    rename( DMS_nscore=`nscore`)%>%
    select(variant, starts_with("DMS"))%>%
    mutate(gene="Abeta")%>%
    gather(key="DMS", value="DMS_val", -variant, -gene)
scores_conc=rbind(scores_conc, scores)


#VKOR1
scores <- read_excel("resources/VKOR1_Chiasson_2020_activity/elife-58026-fig1-data1-v1.xlsx")%>%
  rename( DMS_abundance_score=`abundance_score`)%>%
  rename( DMS_activity_score=`activity_score`)%>%
  select(variant, starts_with("DMS"))%>%
  mutate(gene="VKOR1")%>%
  gather(key="DMS", value="DMS_val", -variant, -gene)
scores_conc=rbind(scores_conc, scores)

scores_conc<-scores_conc %>% 
  filter(!is.na(DMS_val))%>%
  left_join(list_uniprot_ids, by=c("gene"="gene_name"))%>%
  rename(gene_dms=gene)%>% 
  filter(!is.na(uniprot_id))


files<-paste0(folder_of_prots, list_uniprot_ids$uniprot_id, PREDICTED_PROT_SUFFIX) 

values_joined<-lapply(files, function(x) { #  mc.cores = 1,
  return(fread(x, na.strings = c("NA","."))  ) } )

values_joined<-rbindlist(values_joined)

values_joined<-values_joined%>%
  mutate(variant=paste0(
str_replace_all(RESN_RESI, c( 
 "ALA"="A", "ARG"="R", "ASN"="N", "ASP"="D",
 "CYS"="C", "GLU"="E", "GLN"="Q", "GLY"="G",
 "HIS"="H", "ILE"="I", "LEU"="L", "LYS"="K",
 "MET"="M", "PHE"="F", "PRO"="P", "SER"="S",
 "THR"="T", "TRP"="W", "TYR"="Y", "VAL"="V")),
str_replace_all(to_AS, c( 
  "ALA"="A", "ARG"="R", "ASN"="N", "ASP"="D",
  "CYS"="C", "GLU"="E", "GLN"="Q", "GLY"="G",
  "HIS"="H", "ILE"="I", "LEU"="L", "LYS"="K",
  "MET"="M", "PHE"="F", "PRO"="P", "SER"="S",
  "THR"="T", "TRP"="W", "TYR"="Y", "VAL"="V"))
))

values_joined<-values_joined %>% 
  left_join(scores_conc, by=c("variant"="variant", "Uniprot_acc_split"="uniprot_id"))%>%
  filter(!is.na(DMS_val))%>%
  mutate(DMS_val=as.double(DMS_val))

dir.create("results/validation_set")
setwd("results/validation_set")

fwrite(x=values_joined, file="validation_set_w_AlphScore.csv.gz")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
library(readr)
library(dplyr)
library(optparse)
library(stringr)
source("workflow/scripts/existing_scores_glm_functions.R")
set.seed(1)

option_list = list(
  make_option(c("-t", "--training_dataset"), type="character", default="results/prediction_final/pre_final_model_regular_variants.csv.gz", 
              help="File containing the training dataset"),
  make_option(c("-c", "--combined_model"), type="character", default="results/prediction_final/combined_model.RData", 
              help="File where the final combined logistic regression model should be saved to"),
  make_option(c("-i", "--training_var_ids"), type="character", default="results/prediction_final/training_var_ids.tsv", 
              help="IDs of variants that were in the training / test set")
)

opt = parse_args(OptionParser(option_list=option_list))

variants<-read_csv(opt$training_dataset, na=c(".","NA", NA))
variants$AlphScore<-variants$predicted_Alph

index_col<-get_index_col(variants)

variants$DEOGEN2_score_med<- unlist_score(variants$DEOGEN2_score, index_col)

gnomad_vars<-variants %>% 
  filter(gnomadSet==1)

clinvar_vars_not_gnomad <-variants%>% 
  filter(gnomadSet==0) %>%
  filter(! var_id_genomic %in% gnomad_vars$var_id_genomic )

set_of_models<-fit_set_of_models(clinvar_vars_not_gnomad)

gnomad_set_position<-gnomad_vars$var_id_genomic
clinvar_set_position<-clinvar_vars_not_gnomad$var_id_genomic

training_vars<-rbind(
tibble(var_id_genomic=gnomad_set_position,
       gnomadSet=1),
tibble(var_id_genomic=clinvar_set_position,
       gnomadSet=0)
)
# write to disk
write_tsv(x=training_vars,
          file=opt$training_var_ids)

write_rds(x=set_of_models,
          file=opt$combined_model)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
library(data.table)

files <- list.files(path=snakemake@params[["loaction_csvgzs"]], pattern=".csv.gz", full.names=TRUE, recursive=FALSE)
for (x_files in 1:length(files)){
  if (file.info(files[x_files])$size < 1000){
    files[x_files]<-NA
  }
}
print(files)
files<-files[!is.na(files)]

values_joined<-lapply(files, function(x) { #  mc.cores = 1,
  training_data<-fread(x, na.strings = c("NA","."))[!is.na(CADD_raw_sur), ] # load data

  singletons <- training_data[
    ((is.na(gnomAD_exomes_AC) | (gnomAD_exomes_AC<2 & (gnomAD_exomes_NFE_AC == gnomAD_exomes_AC) ) )  & 
    gnomAD_genomes_AN>90000 &
    (!gnomAD_genomes_flag %in% c("lcr","segdup"))&
    gnomAD_genomes_AC==1 & 
    gnomAD_genomes_NFE_AC==1) & 
    (is.na(`1000Gp3_AC`) | `1000Gp3_AC`==0)  & 
    (is.na(ESP6500_AA_AC) | ESP6500_AA_AC==0) & 
    (is.na(ESP6500_EA_AC) | ESP6500_AA_AC==0) 
    ]

  frequents<- training_data[
    gnomAD_genomes_AF> 0.001 & 
    gnomAD_genomes_AN>90000 & 
    (!gnomAD_genomes_flag %in% c("lcr","segdup"))
    ]

  clinvars<- training_data[
    clinvar_clnsig %in% c("Likely_pathogenic","Pathogenic","Benign", "Likely_benign")
    ]

  return(rbindlist(list(singletons[,`:=`(outcome=1, gnomadSet=1)],
                        frequents[,`:=` (outcome=0, gnomadSet=1)], 
                        clinvars[,`:=` (outcome=ifelse(clinvar_clnsig %in% c("Likely_pathogenic","Pathogenic"),1,0), gnomadSet=0)]), 
                   fill=TRUE, use.names=TRUE))
})

values_joined<-rbindlist(values_joined)

fwrite(values_joined, file=snakemake@output[["csv_gz_out"]])
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
library(tidyverse)
library(readxl)
library(dplyr)
library(optparse)

option_list = list(
  make_option(c("-t", "--tsv_location"), type="character", default="../results/prediction/pre_final_model_k_fold_results.tsv", 
              help="location of .tsv file"),
  make_option(c("-o", "--out_folder"), type="character", default="../results/prediction/plot_k/", 
              help="name of folder to store output")
)

opt = parse_args(OptionParser(option_list=option_list))

output_loop<-read.table(opt$tsv_location, sep="\t",header=TRUE)

dir.create(opt$out_folder, recursive=TRUE)
setwd(opt$out_folder)

# rename?
output_loop_num<-select_if(output_loop, is.numeric)


#output_loop_mean<-data.frame()
output_loop_mean<-output_loop %>%
  summarise_if(is.numeric,mean)%>%
  t()

output_loop_sd<-data.frame()
output_loop_sd<-output_loop%>%
  summarise_if(is.numeric,sd)%>%
  t()

df<-data.frame(colnames_cv=colnames(output_loop_num),mean=output_loop_mean,sd=output_loop_sd) 

df_sel<-df%>%
  filter(str_detect(colnames_cv, "^auc"))

df_sel$colnames_cv<-sub("auc_", "", df_sel$colnames_cv)

barplot_cv<-ggplot(df_sel, aes(x=reorder(colnames_cv, -mean), y=mean))+
  geom_bar(stat = "identity",width = 0.75, position="dodge")+
  geom_errorbar(aes(ymin=mean - sd,ymax=mean + sd), width=0.4)+
  geom_text(aes(x=reorder(colnames_cv, -mean), y=mean+0.014, label=round(mean, digits=3)), 
            position = position_dodge(0.9),
            vjust = -0.5, 
            size = 5, colour = "black", check_overlap = FALSE)+
  coord_cartesian(ylim=c(0,1))+
  theme_minimal()+
  labs(x="prediction method",y="mean AUC")

barplot_cv

ggsave(filename= "barplot_mean_auc_crossval.pdf",plot=barplot_cv, width=190, height=100, units="mm", dpi=300)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
library(tidyverse)
library(pROC)
library(readxl)
library(ranger)
library(optparse)

option_list = list(
  make_option(c("-i", "--input_impurity"), type="character", default="results/prediction_final/final_regular_impurity_importance.tsv", 
              help="location of .tsv file when importance set to impurity"),
  make_option(c("-m", "--input_permutation"), type="character", default="results/prediction_final/final_regular_permutation_importance.tsv", 
              help="location of .tsv file when importance set to impurity"),
  make_option(c("-o", "--out_folder"), type="character", default="results/plot_k/", 
              help="name of folder to store output"),
  make_option(c("-p", "--prefix"), type="character", default="base_tree_2000_10", 
              help="Prefix for output")
  )

opt = parse_args(OptionParser(option_list=option_list))

importance_table_imp<-read.table(opt$input_impurity, sep="\t",header=TRUE)
importance_table_per<-read.table(opt$input_permutation, sep="\t",header=TRUE)

dir.create(opt$out_folder, recursive=TRUE)
setwd(opt$out_folder)


plot_variable_importance<-function(var_imp_ds){
var_importance_plot<-ggplot(var_imp_ds, aes(x=reorder(variable,-importance), y=importance))+ 
  geom_bar(stat="identity", position="dodge")+ 
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color="black", size=10),
        axis.text.y = element_text(color="black", size=10))+
  labs(y = "Variable Importance", size=12)+
  labs(x = "")
return(var_importance_plot)
}

var_imp_imp<-importance_table_imp[order(-importance_table_imp$importance),]%>%
  slice(1:25)

var_importance_imp_plot<-plot_variable_importance(var_imp_imp)

ggsave(filename=paste0(opt$prefix,"_importance_impurity.pdf"), 
       plot=var_importance_imp_plot, width = 6, height = 6)


#Variable importance mode = permutation
var_imp_per<-importance_table_per[order(-importance_table_per$importance),]%>%
  slice(1:25)

var_importance_per_plot<-plot_variable_importance(var_imp_per)

ggsave(filename=paste0(opt$prefix,"_importance_permutation.pdf"), 
       plot=var_importance_per_plot, width = 6, height = 6)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
library(tidyverse)
library(readxl)
library(optparse)

option_list = list(
  make_option(c("-p", "--input_preprocessed"), type="character", default="C:/Users/Karo.PC-Karo/Master/Semester_3/Lab_rotation_Ludwig/for_Karola/SSH/gnomad_extracted_prepro.csv.gz", 
              help="location of csv.gz file with preprocessed, uncalibrated variants"),
  make_option(c("-r", "--input_recalibrated"), type="character", default="C:/Users/Karo.PC-Karo/Master/Semester_3/Lab_rotation_Ludwig/for_Karola/SSH/gnomad_extracted_prepro_recal.csv.gz", 
              help="location of csv.gz file with preprocessed, recalibrated variants"),
  make_option(c("-o", "--out_folder"), type="character", default="C:/Users/Karo.PC-Karo/Master/Semester_3/Lab_rotation_Ludwig/for_Karola/local_files", 
              help="name of folder to store output")
)

opt = parse_args(OptionParser(option_list=option_list))

variants_pre<-read_csv(opt$input_preprocessed, na=c(".","NA"))
variants_recal<-read_csv(opt$input_recalibrated, na=c(".","NA"))

dir.create(opt$out_folder, recursive=TRUE)
setwd(opt$out_folder)

from_AS_pre<-variants_pre %>%
  filter(!(hold_out_genes==TRUE) & gnomadSet==TRUE)%>%
  group_by(from_AS) %>%
  summarise(frac=mean(outcome), num=n())

plot_pre<- ggplot(from_AS_pre,aes(x=reorder(from_AS, -frac), y=frac)) + 
  geom_bar(stat = "identity")+
  geom_text(aes(x=reorder(from_AS, -frac), y=frac, label=num),
            position = position_dodge(width = 1),
            vjust = -0.5, 
            size = 2.5, colour = "black", check_overlap = TRUE)+
  theme_minimal()+
  labs(x="reference AA",y="proportion of proxy-pathogenic AAs")

ggsave("barplot_preprocessed.pdf", width=200, height=100, units="mm", dpi=300)

order_AA_pre<- from_AS_pre[order(-from_AS_pre$frac),]
order_AA_pre<-order_AA_pre$from_AS

####recalibrated

from_AS_recal<-variants_recal %>% 
  filter(!(hold_out_genes==TRUE) & gnomadSet==TRUE)%>%
  group_by(from_AS) %>%
  summarise(frac=mean(outcome), num=n())

from_AS_recal<-from_AS_recal[match(order_AA_pre, from_AS_recal$from_AS),]
from_AS_recal$from_AS<-factor(from_AS_recal$from_AS, levels=from_AS_recal$from_AS)

plot_recal<-ggplot(from_AS_recal,aes(x=from_AS, y=frac)) + 
  geom_bar(stat = "identity")+
  geom_text(aes(x=from_AS, y=frac, label=num),
            position = position_dodge(width = 1),
            vjust = -0.5, 
            size = 2.5, colour = "black", check_overlap = TRUE)+
  theme_minimal()+
  labs(x="reference AA",y="proportion of proxy-pathogenic AAs")

# get stats of training / testing data set:


ggsave("barplot_recalibrated.pdf", width=200, height=100, units="mm", dpi=300)
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
library(tidyverse)
library(optparse)
library(pROC)
library(PRROC)
library(VennDiagram)
library(boot)

source("workflow/scripts/existing_scores_glm_functions.R")
source("workflow/scripts/precision_recall_resource.R")

BOOT_REPETITIONS=1000
set.seed(1)

option_list = list(
  make_option(c("-b", "--clinvar_benign"), type="character", default="results/clinvar2022/clinvar_2022_benign.vcf.gz", 
              help="location of ClinVar benign missense variants in vcf like format"),
  make_option(c("-p", "--clinvar_pathogenic"), type="character", default="results/clinvar2022/clinvar_2022_pathogenic.vcf.gz", 
              help="location of ClinVar benign missense variants in vcf like format"),
  make_option(c("-a", "--AlphaFold_scores"), type="character", default="results/clinvar2022/values_of_clinvar_variants.tsv.gz", 
              help="pre calculated Alphafold scores"),
  make_option(c("-v", "--variants"), type="character", default="results/prediction_final/pre_final_model_regular_variants.csv.gz", 
              help="variants with calculated AlphScore"),
  make_option(c("-c", "--CV_training"), type="logical", default=FALSE, 
              help="Were ClinVar variants used as training set?"),
  make_option(c("-s", "--protein_subset"), type="character", default="resources/membrane_proteins.txt", 
              help="Subset of proteins to evaluate separately"),
  make_option(c("-o", "--out_folder"), type="character", default="results/clinvar2022_alphafold/", 
              help="name of folder to store output")
              )
opt = parse_args(OptionParser(option_list=option_list))

# load data, add variant ID (chrom:pos:ref:alt)
clinvar_benign<-read_tsv(opt$clinvar_benign, col_names=TRUE, col_types = cols(.default = "?",  CHROM = col_character()) ) %>%
  mutate(ID=paste(CHROM,POS,REF,ALT, sep=":"))
clinvar_pathogenic<-read_tsv(opt$clinvar_pathogenic, col_names=TRUE, col_types = cols(.default = "?",  CHROM = col_character()) ) %>%
  mutate(ID=paste(CHROM,POS,REF,ALT, sep=":"))
alphafold_pre_calculated<-read_tsv(opt$AlphaFold_scores, col_names=TRUE, 
                                   col_types = cols(.default = "?", REVEL_score = "d", "#chr" = col_character()))%>%
  mutate(ID=paste(`#chr`, `pos(1-based)`, ref, alt, sep=":"))
variants<-read_csv(opt$variants, 
                   col_types = cols(.default = "?", REVEL_score = "d", "#chr"= col_character() ))

protein_subset<-scan(file = opt$protein_subset, what=character())


dir.create(opt$out_folder, recursive=TRUE)
setwd(opt$out_folder)

head(alphafold_pre_calculated)

alphafold_pre_calculated$pos_in_VEP_and_Uniprot<-get_index_col(alphafold_pre_calculated)
alphafold_pre_calculated$DEOGEN2_score_med<-unlist_score(alphafold_pre_calculated$DEOGEN2_score, alphafold_pre_calculated$pos_in_VEP_and_Uniprot)


all_clinvar_ids<-c(clinvar_benign$ID, clinvar_pathogenic$ID)

alphafold_pre_calculated_w_CV2022<-alphafold_pre_calculated %>%
  filter(ID %in% all_clinvar_ids)%>%
  mutate(outcome=(ID %in% clinvar_pathogenic$ID))%>%
  filter(!is.na(AlphScore))


variants$AlphScore<-variants$predicted_Alph

variants$pos_in_VEP_and_Uniprot<-get_index_col(variants)
variants$DEOGEN2_score_med<-unlist_score(variants$DEOGEN2_score, variants$pos_in_VEP_and_Uniprot)

variants<-variants%>% 
  mutate(ID=paste(`#chr`, `pos(1-based)`,ref, sep=":"))




non_test_dataset<-variants %>% 
  filter(clinvar_holdout_test==FALSE)
test_dataset<-variants %>% 
  filter(clinvar_holdout_test==TRUE)

alphafold_pre_calculated_w_CV2022<-alphafold_pre_calculated_w_CV2022 %>%
  mutate(ID=paste(`#chr`, `pos(1-based)`,ref,  sep=":"))%>%
  mutate(prot_subset=Uniprot_acc_split %in% protein_subset)

### combine scores on interim dataset:
if (opt$CV_training){
  train_ds<-variants %>% 
    filter(clinvar_train==TRUE)
  interim_dataset<-variants %>% 
    filter(gnomad_interim_test==TRUE)

}else{
  interim_dataset<-variants %>% 
    filter(clinvar_interim_test==TRUE)
  train_ds<-variants %>% 
    filter(gnomad_train==TRUE)
}

alphafold_pre_calculated_w_CV2022<-alphafold_pre_calculated_w_CV2022 %>%
  filter(!ID %in% interim_dataset$ID)%>%
  filter(!ID %in% non_test_dataset$ID)%>%
  filter(!ID %in% train_ds$ID)

set_of_models<-fit_set_of_models(interim_dataset)
testSet<-predict_set_of_models(set_of_models, alphafold_pre_calculated_w_CV2022)

print("Number of protein-Variants in testSet:")
nrow(testSet)


# ensure that no variant is multiple times in the data set; if so, take the mean of the predictors, ensure, that the variant is rated consistently benign / pathogenic
testSet<-testSet %>% 
  group_by(ID) %>%
  summarise(AlphScore=mean(AlphScore), 
            REVEL_score=mean(REVEL_score), 
            glm_AlphRevel=mean(glm_AlphRevel), 
            Alph_null=mean(Alph_null),
            CADD_raw=mean(CADD_raw), 
            glm_AlphCadd=mean(glm_AlphCadd), 
            DEOGEN2_score_med=mean(DEOGEN2_score_med), 
            glm_AlphDeogen=mean(glm_AlphDeogen), 
            outcome=mean(outcome),
            prot_subset=max(prot_subset))%>%
  filter(outcome %in% c(0,1))%>%
  mutate(outcome=as.logical(outcome))

nrow(testSet)
sum(complete.cases(testSet))

sum(testSet$ID %in% test_dataset$ID)
sum(testSet$ID %in% variants$ID)

testSet<-testSet[complete.cases(testSet),]

table(testSet$outcome)



score_performance_tbl<-tibble()

pdf(file="ClinVar_val_REVEL.pdf")
plot(roc(as.integer(testSet$outcome), testSet$AlphScore), print.auc = TRUE, 
       col = "black", print.auc.y = .35)
  plot(roc(testSet$outcome, testSet$REVEL_score), print.auc = TRUE, 
       col = "blue", print.auc.y = .25, add = TRUE)
  plot(roc(testSet$outcome, testSet$glm_AlphRevel), print.auc = TRUE, 
       col = "red", print.auc.y = .15, add = TRUE)
  plot(roc(testSet$outcome, testSet$Alph_null), print.auc = TRUE, 
       col = "grey", print.auc.y = .05, add = TRUE)
  legend(0.2, 0.3, legend=c("AlphScore", "REVEL", "AlphScore + REVEL", "NullModel"),
         col=c("black", "blue","red", "grey"), lty=1, cex=0.8)
dev.off()

pdf(file="ClinVar_val_CADD.pdf")
  plot(roc(as.integer(testSet$outcome), testSet$AlphScore), print.auc = TRUE, 
       col = "black", print.auc.y = .35)
  plot(roc(testSet$outcome, testSet$CADD_raw), print.auc = TRUE, 
       col = "blue", print.auc.y = .25, add = TRUE)
  plot(roc(testSet$outcome, testSet$glm_AlphCadd), print.auc = TRUE, 
       col = "red", print.auc.y = .15, add = TRUE)
  plot(roc(testSet$outcome, testSet$Alph_null), print.auc = TRUE, 
       col = "grey", print.auc.y = .05, add = TRUE)
  legend(0.2, 0.3, legend=c("AlphScore", "CADD", "AlphScore + CADD", "NullModel"),
         col=c("black", "blue","red", "grey"), lty=1, cex=0.8)
dev.off()

pdf(file="ClinVar_val_DEOGEN2.pdf")
  plot(roc(as.integer(testSet$outcome), testSet$AlphScore), print.auc = TRUE, 
       col = "black", print.auc.y = .35)
  plot(roc(testSet$outcome, testSet$DEOGEN2_score_med), print.auc = TRUE, 
       col = "blue", print.auc.y = .25, add = TRUE)
  plot(roc(testSet$outcome, testSet$glm_AlphDeogen), print.auc = TRUE, 
       col = "red", print.auc.y = .15, add = TRUE)
  plot(roc(testSet$outcome, testSet$Alph_null), print.auc = TRUE, 
       col = "grey", print.auc.y = .05, add = TRUE)
  legend(0.2, 0.3, legend=c("AlphScore", "DEOGEN2", "AlphScore + DEOGEN2", "NullModel"),
         col=c("black", "blue","red", "grey"), lty=1, cex=0.8)
dev.off()





AlphNullPlot<-ggplot(testSet)+
  geom_density(aes(x=Alph_null, fill=outcome), alpha=0.5)+
  theme_bw()+
  labs(fill = "(likely)\npathogenic:")+
  scale_fill_manual(values=c("darkblue", "red"))
ggsave(filename="AlphNullPlot.pdf", 
       plot=AlphNullPlot,
       width=5, height=3)
alph_null_table<-generate_table("Alph_null")
write_tsv(x=alph_null_table,
          file="alph_null_table.tsv")

AlphScorePlot<-ggplot(testSet)+
  geom_density(aes(x=AlphScore, fill=outcome), alpha=0.5)+
  theme_bw()+
  labs(fill = "(likely)\npathogenic:")+
  scale_fill_manual(values=c("darkblue", "red"))
ggsave(filename="AlphScorePlot.pdf", 
       plot=AlphScorePlot,
       width=5, height=3)
AlphScore_table<-generate_table("AlphScore")
write_tsv(x=alph_null_table,
          file="alph_null_table.tsv")

glm_AlphRevel_plot<-ggplot(testSet)+
  geom_density(aes(x=glm_AlphRevel, fill=outcome), alpha=0.5)+
  theme_bw()+
  labs(fill = "(likely)\npathogenic:")+
  scale_fill_manual(values=c("darkblue", "red"))
ggsave(filename="glm_AlphRevel_plot.pdf", 
       plot=glm_AlphRevel_plot,
       width=5, height=3)
glm_AlphRevel_table<-generate_table("glm_AlphRevel")
write_tsv(x=glm_AlphRevel_table,
          file="glm_AlphRevel_table.tsv")

REVEL_score_plot<-ggplot(testSet)+
  geom_density(aes(x=REVEL_score, fill=outcome), alpha=0.5)+
  theme_bw()+
  labs(fill = "(likely)\npathogenic:")+
  scale_fill_manual(values=c("darkblue", "red"))
ggsave(filename="REVEL_score_plot.pdf", 
       plot=REVEL_score_plot,
       width=5, height=3)
REVEL_score_table<-generate_table("REVEL_score")
write_tsv(x=REVEL_score_table,
          file="REVEL_score_table.tsv")

CADD_raw_plot<-ggplot(testSet)+
  geom_density(aes(x=CADD_raw, fill=outcome), alpha=0.5)+
  theme_bw()+
  labs(fill = "(likely)\npathogenic:")+
  scale_fill_manual(values=c("darkblue", "red"))
ggsave(filename="CADD_raw_plot.pdf", 
       plot=CADD_raw_plot,
       width=6, height=4)
CADD_score_table<-generate_table("CADD_raw")
write_tsv(x=CADD_score_table,
          file="CADD_score_table.tsv")

glm_AlphCadd_plot<-ggplot(testSet)+
  geom_density(aes(x=glm_AlphCadd, fill=outcome), alpha=0.5)+
  theme_bw()+
  labs(fill = "(likely)\npathogenic:")+
  scale_fill_manual(values=c("darkblue", "red"))
ggsave(filename="glm_AlphCadd_plot.pdf", 
       plot=glm_AlphCadd_plot,
       width=5, height=3)
glm_AlphCadd_table<-generate_table("glm_AlphCadd")
write_tsv(x=glm_AlphCadd_table,
          file="glm_AlphCadd_table.tsv")


pr_cadd<-pr.curve(scores.class0=testSet$CADD_raw, weights.class0=testSet$outcome, curve=T)
pr_alphCadd<-pr.curve(scores.class0=testSet$glm_AlphCadd, weights.class0=testSet$outcome, curve=T)
pr_alph<-pr.curve(scores.class0=testSet$AlphScore, weights.class0=testSet$outcome, curve=T)
pr_alph_null<-pr.curve(scores.class0=testSet$Alph_null, weights.class0=testSet$outcome, curve=T)

pdf(file="proc_auc_CADD.pdf")
  plot(pr_alphCadd, color="red")
  plot(pr_cadd, add=TRUE, color="blue")
  plot(pr_alph, add=TRUE, color="black")
  plot(pr_alph_null, add=TRUE, color="grey")
  legend(0.2, 0.3, legend=c("AlphScore", "CADD", "AlphScore + CADD", "NullModel"),
         col=c("black", "blue","red", "grey"), lty=1, cex=0.8)
dev.off()


pr_revel<-pr.curve(scores.class0=testSet$REVEL_score, weights.class0=testSet$outcome, curve=T)
pr_alphRevel<-pr.curve(scores.class0=testSet$glm_AlphRevel, weights.class0=testSet$outcome, curve=T)

pdf(file="proc_auc_REVEL.pdf")
  plot(pr_alphRevel, color="red")
  plot(pr_revel, add=TRUE, color="blue")
  plot(pr_alph, add=TRUE, color="black")
  plot(pr_alph_null, add=TRUE, color="grey")
  legend(0.2, 0.3, legend=c("AlphScore", "REVEL", "AlphScore + REVEL", "NullModel"),
         col=c("black", "blue","red", "grey"), lty=1, cex=0.8)
dev.off()

pr_deogen<-pr.curve(scores.class0=testSet$DEOGEN2_score_med, weights.class0=testSet$outcome, curve=T)
pr_alphDeogen<-pr.curve(scores.class0=testSet$glm_AlphDeogen, weights.class0=testSet$outcome, curve=T)

pdf(file="proc_auc_DEOGEN2.pdf")
  plot(pr_alphDeogen, color="red")
  plot(pr_deogen, add=TRUE, color="blue")
  plot(pr_alph, add=TRUE, color="black")
  plot(pr_alph_null, add=TRUE, color="grey")
  legend(0.2, 0.3, legend=c("AlphScore", "DEOGEN2", "AlphScore + DEOGEN2", "NullModel"),
         col=c("black", "blue","red", "grey"), lty=1, cex=0.8)
dev.off()



# get aucs and do bootstrapping
get_score_performance_table<-function(testSet_private, i){
  testSet_private<-testSet_private[i,]
  score_performance_tbl_private <-
    tibble(
      Alph_ROC=roc(testSet_private$outcome, testSet_private$AlphScore)$auc, 
      Alph_null_ROC=roc(testSet_private$outcome, testSet_private$Alph_null)$auc, 
      CADD_ROC=roc(testSet_private$outcome, testSet_private$CADD_raw)$auc, 
      REVEL_ROC=roc(testSet_private$outcome, testSet_private$REVEL_score)$auc,
      DEOGEN2_ROC=roc(testSet_private$outcome, testSet_private$DEOGEN2_score_med)$auc, 
      AlphCadd_ROC=roc(testSet_private$outcome, testSet_private$glm_AlphCadd)$auc, 
      AlphDeogen_ROC=roc(testSet_private$outcome, testSet_private$glm_AlphDeogen)$auc, 
      AlphRevel_ROC=roc(testSet_private$outcome, testSet_private$glm_AlphRevel)$auc, 

      Alph_PROC=pr.curve(weights.class0=testSet_private$outcome, scores.class0=testSet_private$AlphScore)$auc.integral, 
      Alph_null_PROC=pr.curve(weights.class0=testSet_private$outcome, scores.class0=testSet_private$Alph_null)$auc.integral, 
      CADD_PROC=pr.curve(weights.class0=testSet_private$outcome, scores.class0=testSet_private$CADD_raw)$auc.integral, 
      REVEL_PROC=pr.curve(weights.class0=testSet_private$outcome, scores.class0=testSet_private$REVEL_score)$auc.integral,
      DEOGEN2_PROC=pr.curve(weights.class0=testSet_private$outcome, scores.class0=testSet_private$DEOGEN2_score_med)$auc.integral, 
      AlphCadd_PROC=pr.curve(weights.class0=testSet_private$outcome, scores.class0=testSet_private$glm_AlphCadd)$auc.integral, 
      AlphDeogen_PROC=pr.curve(weights.class0=testSet_private$outcome, scores.class0=testSet_private$glm_AlphDeogen)$auc.integral, 
      AlphRevel_PROC=pr.curve(weights.class0=testSet_private$outcome, scores.class0=testSet_private$glm_AlphRevel)$auc.integral,             

      num_test=nrow(testSet_private) )

  return(score_performance_tbl_private)
}


get_score_performance_table_boot<-function(testSet_private,i){
  return(unlist(get_score_performance_table(testSet_private,i)))
}

score_performance_tbl<-get_score_performance_table(testSet, TRUE)

write_tsv(x=score_performance_tbl, 
          file="score_performance_tbl.tsv")


score_performance_tbl_subset<-get_score_performance_table(testSet %>% filter(prot_subset==1), TRUE)
score_performance_tbl_NONsubset<-get_score_performance_table(testSet %>% filter(prot_subset==0), TRUE)

counts_test_set<-testSet %>% 
  group_by(outcome, prot_subset)%>%
  summarise(n=n())

write_tsv(x=counts_test_set,
          file="counts_test_set.tsv")

write_tsv(x=score_performance_tbl_subset, 
          file="score_performance_tbl_subset.tsv")

write_tsv(x=score_performance_tbl_NONsubset, 
          file="score_performance_tbl_NONsubset.tsv")



compare_scores<-function(booted_values_private){
  booted_values_private_summarised<-booted_values_private %>%
    rowwise%>%
    mutate(rowmax=max(across()))%>%
    mutate(AlphCadd_biggerThanCadd_ROC=AlphCadd_ROC>CADD_ROC)%>%
    mutate(AlphDeogen_biggerThanDeogen_ROC=AlphDeogen_ROC>DEOGEN2_ROC)%>%
    mutate(AlphRevel_biggerThanRevel_ROC=AlphRevel_ROC>REVEL_ROC)%>%
    mutate(AlphCadd_biggerThanCadd_PROC=AlphCadd_PROC>CADD_PROC)%>%
    mutate(AlphDeogen_biggerThanDeogen_PROC=AlphDeogen_PROC>DEOGEN2_PROC)%>%
    mutate(AlphRevel_biggerThanRevel_PROC=AlphRevel_PROC>REVEL_PROC)
  return(booted_values_private_summarised)
}

get_p_value_table<-function(booted_values_summarised_private){
  relevant_scores<-c("AlphDeogen_biggerThanDeogen_ROC","AlphRevel_biggerThanRevel_ROC","AlphCadd_biggerThanCadd_ROC",
                     "AlphDeogen_biggerThanDeogen_PROC","AlphRevel_biggerThanRevel_PROC","AlphCadd_biggerThanCadd_PROC")
  p_val_tabl<-tibble()
  for (columname in relevant_scores){
    p_val_tabl<- rbind(p_val_tabl, 
                       tibble(name=columname, num_true=sum(unlist(booted_values_summarised_private[,columname])), total_num=nrow(booted_values_summarised_private)))
  }
  return(p_val_tabl)
}

boot_aucs<-boot(data=testSet, statistic=get_score_performance_table_boot, R=BOOT_REPETITIONS, parallel="multicore", ncpus=3)
booted_aucs_table<-as_tibble(boot_aucs$t)
colnames(booted_aucs_table)<-colnames(score_performance_tbl)
booted_aucs_table<-compare_scores(booted_aucs_table)
pValTable_AbsCor<-get_p_value_table(booted_aucs_table)
pValTable_AbsCor<-pValTable_AbsCor%>% 
  mutate(pval=1-num_true/total_num)
pValTable_AbsCor

write_tsv(x = pValTable_AbsCor,
          file = "pValTable_AbsCor.tsv")



# Diagram with ROCs
plot_aucs_clinvar_bar<-function(perf_table, out_file){
  score_performance_tbl_spread<-perf_table%>%
    select(-num_test)%>%
    gather(key="method")%>%
    arrange(value)

  score_performance_tbl_spread<-score_performance_tbl_spread%>%
    mutate(method=factor(method, levels=c("Alph_null_ROC","Alph_ROC","CADD_ROC","AlphCadd_ROC","DEOGEN2_ROC","AlphDeogen_ROC","REVEL_ROC","AlphRevel_ROC")))%>%
    filter(!is.na(method))

  plot_aucs_ClinVar<-ggplot(score_performance_tbl_spread, aes(x=method, y=value))+
    stat_summary(fun.y = mean, geom = "bar") + 
    theme_minimal()+
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, color="black", size=10),
          axis.text.y = element_text(color="black", size=10))+
    coord_cartesian(ylim=c(0.5,1)) + 
    labs(x = "")+
    labs(y = "AUC (ClinVar test set)", size=12)

  plot_aucs_ClinVar
  ggsave(filename= out_file, plot=plot_aucs_ClinVar, height=5, width=4)
}

plot_aucs_clinvar_bar(score_performance_tbl,"plot_aucs_ClinVar.pdf")
plot_aucs_clinvar_bar(score_performance_tbl_subset,"plot_aucs_ClinVar_subset.pdf")
plot_aucs_clinvar_bar(score_performance_tbl_NONsubset,"plot_aucs_ClinVar_NOsubset.pdf")



datasets_venn<-venn.diagram(x=list(train_ds$ID, interim_dataset$ID, testSet$ID), 
             category.names = c("gnomAD_train" , "ClinVar_validation", "ClinVar_test/hold-out"), 
             filename = NULL)

ggsave(datasets_venn, file="datasets_venn.pdf", device = "pdf", width=6, height=6)
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
import sys
import pandas as pd
import numpy as np
from operator import itemgetter
import networkx as nx
import biographs as bg
import os

def exit_procedure(text_to_log):
	os.system('touch ' + snakemake.output[0])
	outF = open(snakemake.output[0]+".log", "w")
	outF.write(text_to_log)
	outF.close()
	print(text_to_log)
	os._exit(0)

#results/{pdb_name}_combined_features.csv.gz
class testclass:
	def __init__(self):
		self.wildcards=["AF-A0A087WUL8-F11-model_v1"]
		self.output=["results/AF-A0A087WUL8-F11-model_v1_features.tsvneighbour_vals.csv"]
		self.input=["results/AF-A0A087WUL8-F11-model_v1__combined_features.csv.gz"]
		self.params=["/media/axel/Dateien/Arbeit_Gen/alphafold2/dbnsfp_results/"]

#snakemake=testclass()

offset=0
offset=( int(snakemake.wildcards[0].split("-")[2][1:] ) - 1) * 200

#print(snakemake.wildcards, snakemake.output, snakemake.params)


pdb_path=snakemake.params[1]
os.system('gunzip -c "' + pdb_path + '".gz > ' +pdb_path)
molecule = bg.Pmolecule(pdb_path)
os.system('rm -f "' + pdb_path+ '"')

# biopython molecule structural model
mol_model = molecule.model
# networkx graph, by default 5 angstrom
network = molecule.network()
network = molecule.network(cutoff=4, weight=True)

start_resi=int(list(network.nodes)[0][1:])
network_relabeled=nx.convert_node_labels_to_integers(network, start_resi + offset)



#### read combined features

features_joined3=pd.read_csv(snakemake.input[0] ,low_memory=False)

#### read dbNSFP

dbnsfp=pd.read_csv(snakemake.params[0] + snakemake.wildcards[0].split("-")[1] + ".csv.gz", na_values=".", low_memory=False)


#### aggregate features of nodes
cons_scores=['CADD_raw','CADD_phred','BayesDel_noAF_score','REVEL_score', 'GERP++_NR', 'GERP++_RS', 'GERP++_RS_rankscore', 'phyloP100way_vertebrate', 'phyloP100way_vertebrate_rankscore', 'phyloP30way_mammalian', 'phyloP30way_mammalian_rankscore', 'phyloP17way_primate', 'phyloP17way_primate_rankscore', 'phastCons100way_vertebrate', 'phastCons100way_vertebrate_rankscore', 'phastCons30way_mammalian', 'phastCons30way_mammalian_rankscore', 'phastCons17way_primate', 'phastCons17way_primate_rankscore'] # removed:  'Polyphen2_HDIV_score', 'Polyphen2_HDIV_rankscore', 'Polyphen2_HDIV_pred', 'Polyphen2_HVAR_score', 'SIFT_score', 'SIFT_converted_rankscore', 'SIFT_pred', 'SiPhy_29way_pi'

def convert_int(num_str):
	try:
		num_int=int(num_str)
		return(num_int)
	except:
		return(0)

dbnsfp.HGVSp_VEP_split=dbnsfp.HGVSp_VEP_split.replace("p.Met1?","p.Met1???")
num_string=dbnsfp.HGVSp_VEP_split.str[5:-3]
dbnsfp["RESI"]=[convert_int(item) for item in num_string.to_list()]
dbnsfp["RESN"]=dbnsfp.HGVSp_VEP_split.str[2:5]
dbnsfp.RESN=dbnsfp.RESN.str.upper()
dbnsfp["RESN_RESI"]=dbnsfp["RESN"]  + dbnsfp["RESI"].astype(str)
dbnsfp=dbnsfp[dbnsfp.RESI!=0]
dbnsfp["to_AS"]=dbnsfp.HGVSp_VEP_split.str[-3:]
dbnsfp=dbnsfp[dbnsfp.to_AS!="Ter"]
dbnsfp_inf=dbnsfp.groupby('RESI')[cons_scores].mean()


protein_mean_CADD=dbnsfp_inf.mean()["CADD_raw"]
protein_mean_b_factor=features_joined3["b_factor"].mean()

dbnsfp_resn_resi=set(dbnsfp.RESN_RESI)

mismatch_count=0
nodes=[]

resis=list(set(features_joined3["RESI"].tolist()))

for i in resis:
	try:
		node_dict=features_joined3[features_joined3['RESI']==i].to_dict('records')[0]
		#if i in dbnsfp_inf.index:
		if not (node_dict["RESN_RESI"] in dbnsfp_resn_resi):
			print(node_dict["RESN_RESI"], "not present in dbnsfp!")
			mismatch_count=mismatch_count+1
			#import pdb; pdb.set_trace()
		node_dict={**node_dict, **dbnsfp_inf[dbnsfp_inf.index==i].to_dict('records')[0]}
		nodes.append([i, node_dict])
		#import pdb; pdb.set_trace()
	except:
		exit_procedure("1 Node construction failed")


if mismatch_count>i/2:
	exit_procedure("2 mismatched amino acids: " + str(mismatch_count))


network_relabeled.add_nodes_from(nodes)
print("Network okay!")

def create_pd_dict(diction, name_):
	return(pd.DataFrame.from_dict(diction, orient="index", columns=[name_]))

#score="HYDROPHOBICITY_3A"
physicochemical_scores=["CHARGE_3A","HYDROPHOBICITY_3A","MOBILITY_3A","SOLVENT_ACCESSIBILITY_core",'HSE1','HSE2']
dict_vals={}
list_vals=pd.DataFrame()
for score in physicochemical_scores + cons_scores:
	print(score + " ok.")
	for x in range(int(list(network_relabeled.nodes)[0]),int(list(network_relabeled.nodes)[-1])+1):
		sum_weight=0
		sum_value=0
		for neighbor in network_relabeled.neighbors(x):
			values=network_relabeled.nodes[neighbor][score]
			weight=network_relabeled[x][neighbor]["weight"]
			sum_weight=sum_weight+weight
			sum_value=sum_value+weight*values
			#print(values, weight, neighbor)
		#print(dict_vals)
		dict_vals={**dict_vals, **{x :  sum_value/sum_weight}}
	if list_vals.size==0:
		df_join=create_pd_dict(dict_vals, score+"_sur")
		list_vals=df_join.fillna(df_join.median())
	else:
		df_join=create_pd_dict(dict_vals, score+"_sur")
		list_vals=list_vals.join(df_join.fillna(df_join.median()))



deg=create_pd_dict(dict(nx.degree(network_relabeled)), "degree")
betw_cent=create_pd_dict(nx.betweenness_centrality(network_relabeled), "betweenness_centrality")
av_neigh=create_pd_dict(nx.average_neighbor_degree(network_relabeled), "av_neighbor_degree")
ev_cent=create_pd_dict(nx.eigenvector_centrality(network_relabeled, max_iter=100000, tol=0.0001),"eigenvec_centrality")

list_vals=list_vals.join([av_neigh, ev_cent, betw_cent, deg])
features_joined3.index = features_joined3.RESI

dbnsfp=dbnsfp.join(on="RESI", other=features_joined3, rsuffix="_features")
dbnsfp=dbnsfp.join(on="RESI", other=list_vals)
dbnsfp=dbnsfp[dbnsfp.residue_number.notnull()]

dbnsfp["protein_mean_CADD"]=protein_mean_CADD
dbnsfp["protein_mean_b_factor"]=protein_mean_b_factor

dbnsfp.to_csv(snakemake.output[0])
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
library(tidyverse)
library(optparse)
library(pROC)

source("workflow/scripts/existing_scores_glm_functions.R")
source("workflow/scripts/precision_recall_resource.R")

option_list = list(
  make_option(c("-b", "--clinvar_benign"), type="character", default="results/clinvar2022/clinvar_2022_benign.vcf.gz", 
              help="location of ClinVar benign missense variants in vcf like format"),
  make_option(c("-p", "--clinvar_pathogenic"), type="character", default="results/clinvar2022/clinvar_2022_pathogenic.vcf.gz", 
              help="location of ClinVar benign missense variants in vcf like format"),
  make_option(c("-a", "--AlphaFold_scores"), type="character", default="results/clinvar2022/values_of_clinvar_variants_FINAL.tsv.gz", 
              help="pre calculated Alphafold scores"),
  make_option(c("-v", "--variants"), type="character", default="results/prediction_final/final_regular_variants.csv.gz", 
              help="variants with calculated AlphScore"),
  make_option(c("-o", "--out_folder"), type="character", default="results/final_model_curves/", 
              help="name of folder to store output")
              )
opt = parse_args(OptionParser(option_list=option_list))

# load data, add variant ID (chrom:pos:ref:alt)
clinvar_benign<-read_tsv(opt$clinvar_benign, col_names=TRUE) %>%
  mutate(ID=paste(CHROM,POS,REF,ALT, sep=":"))
clinvar_pathogenic<-read_tsv(opt$clinvar_pathogenic, col_names=TRUE) %>%
  mutate(ID=paste(CHROM,POS,REF,ALT, sep=":"))
alphafold_pre_calculated<-read_tsv(opt$AlphaFold_scores, col_names=TRUE, 
                                   col_types = cols(.default = "?", REVEL_score = "d"))%>%
  mutate(ID=paste(`#chr`, `pos(1-based)`, ref, alt, sep=":"))
variants<-read_csv(opt$variants, 
                   col_types = cols(.default = "?", REVEL_score = "d"))


dir.create(opt$out_folder, recursive=TRUE)
setwd(opt$out_folder)

head(alphafold_pre_calculated)


all_clinvar_ids<-c(clinvar_benign$ID, clinvar_pathogenic$ID)

alphafold_pre_calculated_w_CV2022<-alphafold_pre_calculated %>%
  filter(ID %in% all_clinvar_ids)%>%
  mutate(outcome=(ID %in% clinvar_pathogenic$ID))%>%
  filter(!is.na(AlphScore))

variants<-variants%>% 
  mutate(ID=paste(`#chr`, `pos(1-based)`,ref, sep=":"))

alphafold_pre_calculated_w_CV2022$pos_in_VEP_and_Uniprot<-get_index_col(alphafold_pre_calculated_w_CV2022)
alphafold_pre_calculated_w_CV2022$DEOGEN2_score_med<-unlist_score(alphafold_pre_calculated_w_CV2022$DEOGEN2_score,
                                                                  alphafold_pre_calculated_w_CV2022$pos_in_VEP_and_Uniprot)

alphafold_pre_calculated_w_CV2022<-alphafold_pre_calculated_w_CV2022 %>%
  mutate(ID=paste(`#chr`, `pos(1-based)`,ref,  sep=":"))%>%
  filter(!ID %in% variants$ID)

# ensure that no variant is multiple times in the data set; if so, take the mean of the predictors, ensure, that the variant is rated consistently benign / pathogenic
testSet<-alphafold_pre_calculated_w_CV2022 %>% 
  group_by(ID) %>%
  summarise(AlphScore=mean(AlphScore), 
            REVEL_score=mean(REVEL_score), 
            CADD_raw=mean(CADD_raw), 
            DEOGEN2_score_med=mean(DEOGEN2_score_med), 
            glm_AlphRevel=mean(glm_AlphRevel),
            glm_AlphCadd=mean(glm_AlphCadd), 
            glm_AlphDeogen=mean(glm_AlphDeogen), 
            glm_AlphDeogenRevel=mean(glm_AlphDeogenRevel),
            glm_AlphCaddDeogen=mean(glm_AlphCaddDeogen),
            glm_CaddDeogenRevel=mean(glm_CaddDeogenRevel),
            outcome=mean(outcome))%>%
  filter(outcome %in% c(0,1))%>%
  mutate(outcome=as.logical(outcome))

nrow(testSet)
sum(complete.cases(testSet))

sum(testSet$ID %in% variants$ID)

testSet<-testSet[complete.cases(testSet),]
table(testSet$outcome)

dens_plot<-function(score, testSet_f=testSet){
  testSet_f_mod<-testSet_f
  testSet_f_mod$cur_score<-unlist(testSet_f[,score], use.names=FALSE)

AlphScorePlot<-ggplot(testSet_f_mod)+
  geom_density(aes(x=cur_score, fill=outcome), alpha=0.5)+
  theme_bw()+
  labs(fill = "(likely)\npathogenic:")+
  scale_fill_manual(values=c("darkblue", "red"))
ggsave(filename=paste0(score, "_FINAL.pdf"), 
       plot=AlphScorePlot,
       width=5, height=3)
AlphScore_table<-generate_table(score)
write_tsv(x=AlphScore_table,
          file=paste0(score, "_table_FINAL.tsv"))
}

dens_plot("AlphScore")
dens_plot("CADD_raw")
dens_plot("REVEL_score")
dens_plot("DEOGEN2_score_med")

dens_plot("glm_AlphRevel")
dens_plot("glm_AlphCadd")
dens_plot("glm_AlphDeogen")

dens_plot("glm_AlphDeogenRevel")
dens_plot("glm_AlphCaddDeogen")
dens_plot("glm_CaddDeogenRevel")


pdf(file="ClinVar_val_REVEL.pdf")
plot(roc(as.integer(testSet$outcome), testSet$AlphScore), print.auc = TRUE, 
     col = "black", print.auc.y = .35)
plot(roc(testSet$outcome, testSet$REVEL_score), print.auc = TRUE, 
     col = "blue", print.auc.y = .25, add = TRUE)
plot(roc(testSet$outcome, testSet$glm_AlphRevel), print.auc = TRUE, 
     col = "red", print.auc.y = .15, add = TRUE)
legend(0.2, 0.3, legend=c("AlphScore", "REVEL", "AlphScore + REVEL"),
       col=c("black", "blue","red"), lty=1, cex=0.8)
dev.off()

pdf(file="ClinVar_val_CADD.pdf")
plot(roc(as.integer(testSet$outcome), testSet$AlphScore), print.auc = TRUE, 
     col = "black", print.auc.y = .35)
plot(roc(testSet$outcome, testSet$CADD_raw), print.auc = TRUE, 
     col = "blue", print.auc.y = .25, add = TRUE)
plot(roc(testSet$outcome, testSet$glm_AlphCadd), print.auc = TRUE, 
     col = "red", print.auc.y = .15, add = TRUE)
legend(0.2, 0.3, legend=c("AlphScore", "CADD", "AlphScore + CADD"),
       col=c("black", "blue","red"), lty=1, cex=0.8)
dev.off()

pdf(file="ClinVar_val_DEOGEN2.pdf")
plot(roc(as.integer(testSet$outcome), testSet$AlphScore), print.auc = TRUE, 
     col = "black", print.auc.y = .35)
plot(roc(testSet$outcome, testSet$DEOGEN2_score_med), print.auc = TRUE, 
     col = "blue", print.auc.y = .25, add = TRUE)
plot(roc(testSet$outcome, testSet$glm_AlphDeogen), print.auc = TRUE, 
     col = "red", print.auc.y = .15, add = TRUE)
legend(0.2, 0.3, legend=c("AlphScore", "DEOGEN2", "AlphScore + DEOGEN2"),
       col=c("black", "blue","red"), lty=1, cex=0.8)
dev.off()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
library(tidyverse)
library(gridExtra)
library(optparse)

option_list = list(
  make_option(c("-t", "--variants"), type="character", default="results/prediction_final/pre_final_model_regular_variants.csv.gz", 
              help="csv.gz file, test dataset"),
  make_option(c("-o", "--out_folder"), type="character", default="results/pLDDT", 
              help="Output folder"))

opt = parse_args(OptionParser(option_list=option_list))

variants<-read_csv(opt$variants)

dir.create(opt$out_folder, recursive=TRUE)
setwd(opt$out_folder)

gnomAD_vars<-variants %>% 
  filter(gnomadSet==1)%>%
  mutate(outcome_f=ifelse(outcome==1, "proxy pathogenic", "proxy benign"))%>%
  mutate(outcome_f=as.factor(outcome_f))%>%
  mutate(pLDDT_above_70=b_factor>70)

ClinVar_vars<-variants %>% 
  filter(gnomadSet==0)%>%
  mutate(outcome_f=ifelse(outcome==1, "(likely) pathogenic", "(likely) benign"))%>%
  mutate(outcome_f=as.factor(outcome_f))%>%
  mutate(pLDDT_above_70=b_factor>70)


hist_gnomad<-ggplot(gnomAD_vars)+
  geom_histogram(aes(x=b_factor, fill = outcome_f))+
  theme_bw()+
  geom_vline(xintercept=70, color="grey")+
  xlim(15,100)

hist_CV<-ggplot(ClinVar_vars)+
  geom_histogram(aes(x=b_factor, fill = outcome_f))+
  theme_bw()+
  geom_vline(xintercept=70, color="grey")+
  xlim(15,100)

bar_gnomad<-ggplot(gnomAD_vars)+
  geom_bar(aes(fill=outcome_f, x=pLDDT_above_70), position="fill",   width = 0.6)+
  theme_bw()

bar_CV<-ggplot(ClinVar_vars)+
  geom_bar(aes(fill=outcome_f, x=pLDDT_above_70), position="fill", width = 0.6)+
  theme_bw()

arranged_plddt_plots<-grid.arrange(hist_gnomad, hist_CV,bar_gnomad, bar_CV, nrow=2, ncol=2)

ggsave(filename = "arranged_plddt_plots.pdf",  plot=arranged_plddt_plots)
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
library(tidyverse)
library(pROC)
library(readxl)
library(caret)
library(optparse)
source("workflow/scripts/existing_scores_glm_functions.R")

set.seed(1)

option_list = list(
  make_option(c("-c", "--csv_location"), type="character", default="/media/axel/Dateien/Arbeit_Gen/alphafold2/git_version/AlphScore/results/train_testset1/subsampled_vars.csv", 
              help="csv.gz file"),
  make_option(c("-e", "--excel_location"), type="character", default="resources/available_colnames_regular.xlsx", 
              help="Excel file listing columns to use"),
  make_option(c("-p", "--prefix"), type="character", default="base_model", 
              help="Prefix for output"),
  make_option(c("-m", "--method_pred"), type="character", default="randomforest", 
              help="Prediction method [randomforest, xgboost, extratree]"),
  make_option(c("-n", "--num_trees_param"), type="integer", default=200, 
              help="Number of trees to use / or number of rounds"),
  make_option(c("-x", "--max_depth_param"), type="integer", default=6, 
              help="Maximal depth of trees"),
  make_option(c("-l", "--lambda_param"), type="double", default=0, 
              help="Lambda value for xgboost"),
  make_option(c("-t", "--eta_param"), type="double", default=0.05, 
              help="Eta - learning rate, xgboost"),
  make_option(c("-s", "--subsample_param"), type="double", default=0.8, 
              help="Subsample parameter, xgboost"),
  make_option(c("-i", "--min_node_param"), type="integer", default=50, 
              help="randomforest: min.node.size, xgboost: min node parameter, extratree: nodesize"),
  make_option(c("-r", "--cor_param"), type="double", default=0.999999, 
              help="Just keep columns that have a lower correlation than this value, general"),
  make_option(c("-b", "--b_factor_param"), type="integer", default=20, 
              help="pLDDT value to filter positions that will be used to generate the average values of amino acids, general"),
  make_option(c("-d", "--min_child_weight_param"), type="integer", default=0, 
              help="min child weight, xgboost"),
  make_option(c("-g", "--gamma_param"), type="integer", default=0, 
              help="gamma value, xgboost"),
  make_option(c("-o", "--out_folder"), type="character", default="results/prediction", 
              help="name of folder to store output"),
  make_option(c("-w", "--write_dataset"), type="logical", default=FALSE, 
              help="Write predictions of test-dataset to file"),
  make_option(c("-f", "--full_model"), type="logical", default=FALSE, #results/preprocess/validation_set.csv.gzpreprocessed.csv.gz
              help="generate a full model"),
  make_option(c("-a", "--importance"), type="character", default="impurity", 
              help="variable importance mode of the ranger model"),
  make_option(c("-y", "--write_model"), type="logical", default=FALSE, #results/preprocess/validation_set.csv.gzpreprocessed.csv.gz
              help="write the model parameters to a file"),
  make_option(c("-k", "--k_fold_cross_val"), type="logical", default=FALSE,
              help="activate k-fold cross validation of the training and test data set")

)

opt = parse_args(OptionParser(option_list=option_list))

save_tibble<-data.frame()

variants<-read_csv(opt$csv_location, na=c(".","NA", NA))
colnames_usage <- read_excel(opt$excel_location, col_types="text")

dir.create(opt$out_folder, recursive=TRUE)
setwd(opt$out_folder)

pdf(file=paste0(opt$prefix, "_RPlots.pdf"))

# calculate average values of add_to_AS column for amino acids of training set
sel_vars_to<-(colnames_usage %>% filter(!is.na(add_to_AS)))$value
toAS_properties<-variants  %>%
  filter(gnomadSet == 1, b_factor>opt$b_factor_param)%>%
  dplyr::select(all_of(sel_vars_to), from_AS)%>%
  filter(complete.cases(.))%>%
  group_by(from_AS) %>%
  summarize_all(mean)
colnames(toAS_properties)<-paste0(colnames(toAS_properties), "_toAS")

# add these toAS values to the data set
variants<-variants%>%
  left_join(toAS_properties, by=c("from_AS"="from_AS_toAS"))
colnames_toAS<-colnames(toAS_properties[,names(toAS_properties) != "from_AS_toAS"])
variants[, colnames_toAS]<-  variants[, sel_vars_to] - 
  variants[, colnames_toAS]

# remove correlated columns in the data set
colnames_prediction<-c(colnames_toAS, 
                       (colnames_usage %>% 
                          filter(!is.na(for_prediction)))$value)

variants_pred<-variants %>% 
  dplyr::select(c(colnames_prediction, "outcome"))

#https://machinelearningmastery.com/feature-selection-with-the-caret-r-package/
zv <- apply(variants_pred, 2, function(x) length(unique(x)) == 1)
dfr <- variants_pred[, !zv]
n=length(colnames(dfr))
correlationMatrix <- cor(dfr,use="complete.obs")

set.seed(1)
# prune correltated variables 
df_colsCorrRemoved<-dfr[, -(findCorrelation(correlationMatrix, cutoff=opt$cor_param))]
colsCorrRemoved<-colnames(df_colsCorrRemoved)
rm(df_colsCorrRemoved)
rm(dfr)

variants<-variants[complete.cases(variants[,c(colsCorrRemoved, "outcome", "CADD_raw")]), ]
gc()


ranger_fit<-function(xtrain_dataset){
  library(ranger)
  model1<-ranger(outcome ~ ., 
                 data=xtrain_dataset, 
                 importance=opt$importance, 
                 max.depth=opt$max_depth_param,
                 num.trees = opt$num_trees_param,
                 min.node.size = opt$min_node_param
  ) 
  return(model1)
}
ranger_predict<-function(dataset, modelx){
  return(predict(modelx, dataset)$predictions) }

extraT_fit<-function(xtrain_dataset){
  library(ranger)
  model1<-ranger(outcome ~ ., 
                 data=xtrain_dataset, 
                 importance=opt$importance, 
                 splitrule="extratrees",
                 max.depth=opt$max_depth_param,
                 num.trees = opt$num_trees_param,
                 min.node.size = opt$min_node_param
  ) 
  return(model1)
}
ranger_save<-function(modelx, filename){
  saveRDS(modelx, file=filename)
}

xgboost_fit<-function(xtrain_dataset){
  library(xgboost)
  set.seed(1)
  gnomad_train<-xgb.DMatrix(data=as.matrix(xtrain_dataset %>% dplyr::select(-outcome)), label=xtrain_dataset$outcome )
  test_ds_tibble<-variants %>% 
    filter(clinvar_holdout_test==TRUE) %>% 
    dplyr::select(any_of(colnames(xtrain_dataset)))
  test_ds<-xgb.DMatrix(data=as.matrix(test_ds_tibble %>% dplyr::select(-outcome)), label=test_ds_tibble$outcome)
  watchlist=list(train=gnomad_train, test=test_ds)

  params <- list(max_depth = opt$max_depth_param, 
                 subsample=opt$subsample_param, 
                 eta=opt$eta_param, 
                 lambda=opt$lambda_param, 
                 alpha=0, 
                 gamma=opt$gamma_param, 
                 min_child_weight=opt$min_child_weight_param)
  model1<-xgb.train(
    data=gnomad_train,
    watchlist = watchlist,
    params=params,
    nrounds = opt$num_trees_param,
    early_stopping_rounds = 10,
    #eval_metric="auc",
    objective = "binary:logistic",  # for regression models
    verbose = 1)
  return(model1)
}

xgboost_predict<-function(dataset, modelx){
  data_x<-xgb.DMatrix(data=as.matrix(dataset %>% dplyr::select(-outcome)))
  return(
    predict(modelx, data_x) 
  )
}
xgboost_save<-function(modelx, filename){
  xgb.save(model=modelx, fname=filename)
}

if (opt$method_pred=="xgboost"){
  fit_model<-xgboost_fit
  predict_model<-xgboost_predict
  save_model<-xgboost_save
} else if (opt$method_pred=="randomforest") {
  fit_model<-ranger_fit
  predict_model<-ranger_predict
  save_model<-ranger_save
}else {
  fit_model<-extraT_fit
  predict_model<-ranger_predict
  save_model<-ranger_save
}


if (opt$k_fold_cross_val == TRUE){
  set.seed(1)

  # splitKFold function from existing_scores_glm_functions.R
  variants<-splitKFold(variants, 5)

  glm_cv<-data.frame()
  for (h in unique(variants$kfold_index)) {
    variants<-setTrainTestSet(variants, h)

    gnomad_train<-variants %>% filter(gnomad_train)

    gnomad_model_Alph<-fit_model(gnomad_train %>% dplyr::select(all_of(colsCorrRemoved)))
    variants$predicted_Alph<-predict_model(variants, gnomad_model_Alph) 

    glm_AlphCadd<- glm(outcome ~ . , family=binomial(link='logit'),
                            data=variants %>% filter(clinvar_holdout_test) %>% 
                            dplyr::select(outcome, predicted_Alph, CADD_raw) %>%
                            filter(complete.cases(.)))

    variants$glm_AlphCadd <-predict(glm_AlphCadd, variants)#prediction with model = non_h_cv_model_glm, test = : predicted__non_h_cv_glm_h_cv_test

    clinvar_holdout_test<-variants %>% filter(clinvar_holdout_test)
    clinvar_interim_test<-variants %>% filter(clinvar_interim_test)
    gnomad_train<-variants %>% filter(gnomad_train)

    save_tibble <- rbind(save_tibble, 
                         tibble(auc_Alph_train_gnomAD=roc(gnomad_train$outcome, gnomad_train$predicted_Alph)$auc, 
                                Alph_OOB=ifelse((opt$method_pred=="randomforest"), gnomad_model_Alph$prediction.error, NA),
                                auc_CADD_interim_CV=roc(clinvar_holdout_test$outcome, clinvar_holdout_test$CADD_raw)$auc, 
                                auc_Alph_interim_CV=roc(clinvar_holdout_test$outcome, clinvar_holdout_test$predicted_Alph)$auc,
                                auc_CADD_test_CV=roc(clinvar_interim_test$outcome, clinvar_interim_test$CADD_raw)$auc, 
                                auc_Alph_test_CV=roc(clinvar_interim_test$outcome, clinvar_interim_test$predicted_Alph)$auc, 
                                auc_glm_test_CV=roc(clinvar_interim_test$outcome, clinvar_interim_test$glm_AlphCadd)$auc, 
                                sample_num=h,
                                gnomad_train_nrow=nrow(gnomad_train),
                                interim_cv_nrow=nrow(clinvar_holdout_test),
                                test_cv_nrow=nrow(clinvar_interim_test),
                                condition=opt$prefix ))


   plot(roc(clinvar_holdout_test$outcome, clinvar_holdout_test$CADD_raw), print.auc = TRUE, col = "red")
   plot(roc(clinvar_holdout_test$outcome, clinvar_holdout_test$predicted_Alph), print.auc = TRUE, 
                               col = "green", print.auc.y = .2, add = TRUE)

   plot(roc(clinvar_interim_test$outcome, clinvar_interim_test$CADD_raw), print.auc = TRUE, col = "red", add=FALSE)
   plot(roc(clinvar_interim_test$outcome, clinvar_interim_test$glm_AlphCadd), print.auc = TRUE, 
                           col = "blue", print.auc.y = .2, add = TRUE)
   plot(roc(clinvar_interim_test$outcome, clinvar_interim_test$predicted_Alph), print.auc = TRUE, 
                           col = "green", print.auc.y = .4, add = TRUE)
  }

}else if (opt$full_model==TRUE){
  var_full_model<-variants %>% 
    filter(gnomadSet==1)%>%
    dplyr::select(all_of(colsCorrRemoved), "outcome") 

  gnomad_model_Alph<-fit_model(var_full_model)
  var_full_model$predicted_Alph<-predict_model(var_full_model, gnomad_model_Alph)
  variants$predicted_Alph<-predict_model(variants, gnomad_model_Alph)

  interim_dataset<-variants %>% 
    filter(clinvar_interim_test)

  save_tibble <- tibble(auc_Alph_train_gnomAD=roc(var_full_model$outcome, var_full_model$predicted_Alph)$auc,
                        Alph_OOB=ifelse((opt$method_pred %in% c("randomforest","extratree")), gnomad_model_Alph$prediction.error, NA),
                        gnomad_train_nrow=nrow(var_full_model),
                        gnomad_train_nProteins=train_dataset<-length(unique((variants %>% filter(gnomadSet==1))$Uniprot_acc_split)),
                        auc_CADD_interim_CV=roc(interim_dataset$outcome, interim_dataset$CADD_raw)$auc, 
                        auc_Alph_interim_CV=roc(interim_dataset$outcome, interim_dataset$predicted_Alph)$auc,
                        interim_nrow=nrow(interim_dataset),
                        interim_nProteins=length(unique(interim_dataset$Uniprot_acc_split))
                        )

}else{
  train_dataset<-variants %>% 
    filter(gnomad_train)%>%
    dplyr::select(all_of(colsCorrRemoved), "outcome")

  gnomad_model_Alph<-fit_model(train_dataset)

  train_dataset$predicted_Alph<-predict_model(train_dataset, gnomad_model_Alph)
  variants$predicted_Alph<-predict_model(variants %>% dplyr::select(all_of(colsCorrRemoved), "outcome"), gnomad_model_Alph)

  interim_dataset<-variants %>% 
    filter(clinvar_interim_test)

  test_dataset<-variants %>% 
    filter(clinvar_holdout_test)

  test_dataset_gnomad<-variants %>% 
    filter(gnomad_holdout_test) 

  roc_rose <- plot(roc(train_dataset$outcome, train_dataset$predicted_Alph), print.auc = TRUE, col = "blue")
  legend(0.3, 0.3, legend=c("Alph"),
         col=c("blue"), lty=1, cex=0.8)
  title(main = "train set")

  roc_rose <- plot(roc(interim_dataset$outcome, interim_dataset$CADD_raw), print.auc = TRUE, col = "red")
  roc_rose <- plot(roc(interim_dataset$outcome, interim_dataset$predicted_Alph), print.auc = TRUE, 
                   col = "blue", print.auc.y = .4, add = TRUE)
  legend(0.3, 0.3, legend=c("CADD", "Alph"),
         col=c("red", "blue"), lty=1, cex=0.8)
  title(main = "interim set")

  roc_rose <- plot(roc(test_dataset$outcome, test_dataset$CADD_raw), print.auc = TRUE, col = "red")
  roc_rose <- plot(roc(test_dataset$outcome, test_dataset$predicted_Alph), print.auc = TRUE, 
                   col = "blue", print.auc.y = .4, add = TRUE)
  legend(0.3, 0.3, legend=c("CADD", "Alph"),
         col=c("red", "blue"), lty=1, cex=0.8)
  title(main = "test set")

  roc_rose <- plot(roc(test_dataset_gnomad$outcome, test_dataset_gnomad$CADD_raw), print.auc = TRUE, col = "red")
  roc_rose <- plot(roc(test_dataset_gnomad$outcome, test_dataset_gnomad$predicted_Alph), print.auc = TRUE, 
                   col = "blue", print.auc.y = .4, add = TRUE)
  legend(0.3, 0.3, legend=c("CADD", "Alph"),
         col=c("red", "blue"), lty=1, cex=0.8)
  title(main = "test set genes, gnomad variants")


  save_tibble <- tibble(auc_Alph_train_gnomAD=roc(train_dataset$outcome, train_dataset$predicted_Alph)$auc, 
                        Alph_OOB=ifelse((opt$method_pred!="xgboost"), gnomad_model_Alph$prediction.error, NA),
                        auc_CADD_interim_CV=roc(interim_dataset$outcome, interim_dataset$CADD_raw)$auc, 
                        auc_Alph_interim_CV=roc(interim_dataset$outcome, interim_dataset$predicted_Alph)$auc,
                        auc_CADD_test_CV=roc(test_dataset$outcome, test_dataset$CADD_raw)$auc,
                        auc_Alph_test_CV=roc(test_dataset$outcome, test_dataset$predicted_Alph)$auc, 
                        auc_CADD_test_gnomad=roc(test_dataset_gnomad$outcome, test_dataset_gnomad$CADD_raw)$auc, 
                        auc_Alph_test_gnomad=roc(test_dataset_gnomad$outcome, test_dataset_gnomad$predicted_Alph)$auc, 
                        gnomad_train_nrow=nrow(train_dataset),
                        interim_nrow=nrow(interim_dataset),
                        test_nrow=nrow(test_dataset),
                        test_gnomad_nrow=nrow(test_dataset_gnomad),
                        gnomad_train_nProteins=train_dataset<-length(unique((variants %>% 
                                                                               filter(gnomad_train))$Uniprot_acc_split)),
                        interim_nProteins=length(unique(interim_dataset$Uniprot_acc_split)),
                        test_nProteins=length(unique(test_dataset$Uniprot_acc_split)),
                        test_gnomad_nProteins=length(unique(test_dataset_gnomad$Uniprot_acc_split)),
                        condition=opt$prefix )
}

if (opt$write_dataset){
  write_csv(x=variants, file=paste0(opt$prefix,"_variants.csv.gz"))
}

if (opt$method_pred %in% c("randomforest", "extratree")){
  var_imp<-tibble(importance=as.vector(gnomad_model_Alph$variable.importance), 
                  variable=names(gnomad_model_Alph$variable.importance))
  write_tsv(x=var_imp,
            file = paste0(opt$prefix,"_", opt$importance, "_importance.tsv"))
}

if (opt$write_model){
  save_model(gnomad_model_Alph, file=paste0(opt$prefix,"_written_full_model.RData"))
  saveRDS(colsCorrRemoved, file=paste0(opt$prefix,"_colnames_to_use.RData"))
  saveRDS(toAS_properties,file=paste0(opt$prefix,"_toAS_properties.RData") )
}

write_tsv(x=save_tibble, file=paste0(opt$prefix, "_results.tsv"))
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
library(readr)
library(dplyr)
library(stringr)
library(optparse)
library(ranger)
source("workflow/scripts/existing_scores_glm_functions.R")
set.seed(1)

option_list = list(
  make_option(c("-c", "--csv_location"), type="character", default="results/combine2_protein/A0A1B0GUS4_w_AFfeatures.csv.gz", 
              help="csv.gz file"),
  make_option(c("-m", "--model_location"), type="character", default="results/prediction/final_written_full_model.RData", 
              help="File containing the full model for prediction"),
  make_option(c("-u", "--use_cols_file"), type="character", default="results/prediction/final_colnames_to_use.RData", 
              help="File containing the colnames to use in the final model"),
  make_option(c("-t", "--toAS_properties"), type="character", default="results/prediction/final_toAS_properties.RData", 
              help="File containing the properties of the alternative amino acids"),
  make_option(c("-b", "--combined_model"), type="character", default="results/prediction_final/combined_model.RData", 
              help="File where the final combined logistic regression model should be saved to"),
  make_option(c("-v", "--training_var_ids"), type="character", default="results/prediction_final/training_var_ids.tsv", 
              help="IDs of variants that were in the training / test set"),
  make_option(c("-o", "--output_file"), type="character", default="results/prediction/predicted.csv.gz", 
              help="Excel file listing columns to use"),
  make_option(c("-r", "--reduced"), type="logical", default="FALSE", 
              help="just save essential columns")
)

opt = parse_args(OptionParser(option_list=option_list))

to_AS_table<-read_tsv("resources/to_AS_table.txt")
variants<-read_csv(opt$csv_location, na=c(".","NA", NA), col_types = cols(.default = "?", REVEL_score = "d"))

# if variant file is empty, stop
if (nrow(variants)==0){
  system(paste("touch", opt$output_file))
}else{

model_to_use<-readRDS(opt$model_location)
colnames_to_use<-readRDS(opt$use_cols_file)
toAS_properties<-readRDS(opt$toAS_properties)

set_of_models<-readRDS(opt$combined_model)

training_var_ids<-read_tsv(opt$training_var_ids)
gnomad_var_ids<-(training_var_ids %>% filter(gnomadSet==1))$var_id_genomic
ClinVar_var_ids<-(training_var_ids %>% filter(gnomadSet==0))$var_id_genomic

# function to add to AS properties
addToAS<-function(variants_par, toAsProp, colToUse){
  variants_mod<-variants
  variants_mod<-variants_mod%>%
    left_join(toAsProp, by=c("from_AS"="from_AS_toAS"))

  colnames_toAS<-colnames(toAsProp)
  colnames_toAS<-colnames_toAS[colnames_toAS!="from_AS_toAS"]

  sel_vars_to<-str_replace(colnames_toAS, fixed("_toAS"),"")

  variants_mod[, colnames_toAS]<-variants_mod[, sel_vars_to] - variants_mod[, colnames_toAS]
  colToUse_mod<-colToUse[colToUse!="outcome"]

  return(variants_mod[, colToUse_mod])
}


# external function that is also used by preprocess
variants<-prepareVariantsForPrediction(variants, to_AS_table)

#add to AS properties
variants_alph<-addToAS(variants, toAS_properties, colnames_to_use)

# just keep complete cases
variants<-variants[complete.cases(variants_alph),]
variants_alph<-variants_alph[complete.cases(variants_alph),]

# predict AlphScore
variants$AlphScore<-predict(model_to_use, variants_alph)$predictions
rm(variants_alph)

index_col<-get_index_col(variants)

variants$DEOGEN2_score_med<- unlist_score(variants$DEOGEN2_score, index_col)

variants<-predict_set_of_models(set_of_models = set_of_models, variants_to_predict = variants)
colNamesCombinedModels<-unlist(set_of_models[2])

# flag variants that are in the training data set or that are in dbNSFP ClinVar
variants<-variants %>% 
  mutate(in_gnomad_train=var_id_genomic %in% gnomad_var_ids)%>%
  mutate(in_clinvar_ds=var_id_genomic %in% ClinVar_var_ids)


# if wanted, reduce the output to have smaller files
if (opt$reduced==TRUE){
 variants<-variants %>% 
  mutate(ID=paste(`#chr`, `pos(1-based)`, ref, alt, sep=":"))%>%
  select(any_of(colnames(variants)[2:10]), 
         ID, genename, Uniprot_acc_split,Uniprot_acc,HGVSp_VEP_split, HGVSp_VEP, CADD_raw, REVEL_score, DEOGEN2_score, 
         b_factor, SOLVENT_ACCESSIBILITY_core, in_gnomad_train, in_clinvar_ds, AlphScore, any_of(colNamesCombinedModels))
}

# write to disk
write_csv(x=variants,
          file=opt$output_file)
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
library(readr)
library(dplyr)
library(stringr)
library(optparse)
library(ranger)
source("workflow/scripts/existing_scores_glm_functions.R")
set.seed(1)

option_list = list(
  make_option(c("-c", "--csv_location"), type="character", default="results/combine2_protein/P38398_w_AFfeatures.csv.gz", 
              help="csv.gz file"),
  make_option(c("-m", "--model_location"), type="character", default="results/prediction/final_written_full_model.RData", 
              help="File containing the full model for prediction"),
  make_option(c("-u", "--use_cols_file"), type="character", default="results/prediction/final_colnames_to_use.RData", 
              help="File containing the colnames to use in the final model"),
  make_option(c("-t", "--toAS_properties"), type="character", default="results/prediction/final_toAS_properties.RData", 
              help="File containing the properties of the alternative amino acids"),
  make_option(c("-n", "--model_location_null"), type="character", default="results/prediction/final_written_full_model.RData", 
              help="File containing the full model for prediction, null model"),
  make_option(c("-x", "--use_cols_file_null"), type="character", default="results/prediction/final_colnames_to_use.RData", 
              help="File containing the colnames to use in the final model, null model"),
  make_option(c("-v", "--toAS_properties_null"), type="character", default="results/prediction/final_toAS_properties.RData", 
              help="File containing the properties of the alternative amino acids, null model"),
  make_option(c("-r", "--reduced"), type="logical", default="FALSE", 
              help="just save essential columns"),
  make_option(c("-o", "--output_file"), type="character", default="results/prediction/predicted.csv.gz", 
              help="Excel file listing columns to use")
)

opt = parse_args(OptionParser(option_list=option_list))

to_AS_table<-read_tsv("resources/to_AS_table.txt")
variants<-read_csv(opt$csv_location, na=c(".","NA", NA))

# if variant file is empty, stop
if (nrow(variants)==0){
  system(paste("touch", opt$output_file))
}else{

model_to_use<-readRDS(opt$model_location)
colnames_to_use<-readRDS(opt$use_cols_file)
toAS_properties<-readRDS(opt$toAS_properties)

model_to_use_null<-readRDS(opt$model_location_null)
colnames_to_use_null<-readRDS(opt$use_cols_file_null)
toAS_properties_null<-readRDS(opt$toAS_properties_null)

# function to add to AS properties
addToAS<-function(variants_par, toAsProp, colToUse){
  variants_mod<-variants
  variants_mod<-variants_mod%>%
    left_join(toAsProp, by=c("from_AS"="from_AS_toAS"))

  colnames_toAS<-colnames(toAsProp)
  colnames_toAS<-colnames_toAS[colnames_toAS!="from_AS_toAS"]

  sel_vars_to<-str_replace(colnames_toAS, fixed("_toAS"),"")

  variants_mod[, colnames_toAS]<-variants_mod[, sel_vars_to] - variants_mod[, colnames_toAS]
  colToUse_mod<-colToUse[colToUse!="outcome"]

  return(variants_mod[, colToUse_mod])
}

# external function that is also used by preprocess
variants<-prepareVariantsForPrediction(variants, to_AS_table)

#add to AS properties
variants_alph<-addToAS(variants, toAS_properties, colnames_to_use)

# just keep complete cases
variants<-variants[complete.cases(variants_alph),]
variants_alph<-variants_alph[complete.cases(variants_alph),]

# predict AlphScore
variants$AlphScore<-predict(model_to_use, variants_alph)$predictions
rm(variants_alph)

# predict null model
variants_alph_null<-addToAS(variants, toAS_properties_null, colnames_to_use_null)
variants$Alph_null<-predict(model_to_use_null, variants_alph_null)$predictions

# if wanted, reduce the output to have smaller files
if (opt$reduced==TRUE){
 variants<-variants %>% 
  mutate(ID=paste(`#chr`, `pos(1-based)`, ref, alt, sep=":"))%>%
  select(any_of(colnames(variants)[2:10]), 
         ID, genename, Uniprot_acc_split,Uniprot_acc,HGVSp_VEP_split, HGVSp_VEP, CADD_raw, REVEL_score, DEOGEN2_score, 
         b_factor, SOLVENT_ACCESSIBILITY_core, Alph_null, AlphScore)
}

# write to disk
write_csv(x=variants,
          file=opt$output_file)
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
library(tidyverse)
library(VennDiagram)
futile.logger::flog.threshold(futile.logger::ERROR, name = "VennDiagramLogger") # prevent VennDiagram to write lots of log messages
library(data.table)
library(optparse)
source("workflow/scripts/existing_scores_glm_functions.R")
set.seed(1)

option_list = list(
  make_option(c("-i", "--input"), type="character", default="/media/axel/Dateien/Arbeit_Gen/alphafold2/data_from_xcat_v2/gnomad_extracted_v2.csv.gz", 
              help="csv.gz file"),
  make_option(c("-o", "--output"), type="character", default="results/train_testset1/gnomad_extracted_prepro.csv.gz", 
              help="csv.gz file for output")
)
opt = parse_args(OptionParser(option_list=option_list))

variants<-read_csv(opt$input)
to_AS_table<-read_tsv("resources/to_AS_table.txt")

dir.create(dirname(opt$output))
setwd(dirname(opt$output))

# preprocess variants (see sourced file)
variants<-prepareVariantsForPrediction(variants, to_AS_table)
variants<-splitKFold(variants, 5)
variants<-setTrainTestSet(variants, 1)

# check for overlaps
gnomad_train<-variants %>% filter(gnomad_train)
clinvar_interim_test<-variants %>% filter(clinvar_interim_test)
clinvar_holdout_test<-variants %>% filter(clinvar_holdout_test)

venn.diagram(x=list(gnomad_train$Uniprot_acc_split, clinvar_interim_test$Uniprot_acc_split, clinvar_holdout_test$Uniprot_acc_split),category.names = c("gnomAD train" , "ClinVar interim", "ClinVar hold out"), filename = 'Protein_id_level.png')
venn.diagram(x=list(gnomad_train$var_id_prot, clinvar_interim_test$var_id_prot, clinvar_holdout_test$var_id_prot),category.names = c("gnomAD train" , "ClinVar interim", "ClinVar hold out"), filename = 'Protein_variant_id_level.png')
venn.diagram(x=list(gnomad_train$var_id_genomic, clinvar_interim_test$var_id_genomic, clinvar_holdout_test$var_id_genomic),category.names = c("gnomAD train" , "ClinVar interim", "ClinVar hold out"), filename = 'Variant_id_level.png')

# check for duplicates
n_occur <- data.frame(table(paste(variants$var_id_genomic, variants$gnomadSet)))
double_ids<-str_split(n_occur[n_occur$Freq > 1,]$Var1, " ", simplify=TRUE)[,1]
double_variants<-variants %>% filter(var_id_genomic %in% double_ids) %>% select(var_id_genomic, Uniprot_acc_split, RESN, RESN_RESI, outcome, gnomadSet, alt)
head(double_variants)
# duplicates are different alternative alleles at one position

write_csv(variants, file=basename(opt$output))
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
library(tidyverse)
library(optparse)
set.seed(1)

option_list = list(
  make_option(c("-i", "--input"), type="character", default="results/train_testset1/gnomad_extracted_prepro.csv.gz", 
              help="csv.gz file"),
  make_option(c("-o", "--output"), type="character", default="results/preprocess/gnomad_extracted_prepro_rec.csv.gz", 
              help="csv.gz file for output"),
  make_option(c("-u", "--undersample"), type="logical", default="TRUE", 
              help="Undersample to 50% - 50 % patho / benign")
)
opt = parse_args(OptionParser(option_list=option_list))

base_file_name=tools::file_path_sans_ext(basename(opt$input))
variants_org<-read_csv(opt$input)

dir.create(dirname(opt$output))
setwd(dirname(opt$output))


proportion_patho<-variants_org %>% 
  filter(gnomadSet==TRUE)%>%
  group_by(from_AS) %>%
  summarize(mean_outcome=mean(outcome), anz=sum(outcome))%>%
  arrange(mean_outcome)
print(proportion_patho)

# set the ratio patho / beningn
if (opt$undersample){
  PROP_PATHO_FACTOR=0.5
}else{
  PROP_PATHO_FACTOR=min(proportion_patho$mean_outcome)
}

amino_acids<-unique(variants_org$from_AS)

new_variants<-data.frame()
for (amino_acid in amino_acids){
  temp_patho<-variants_org %>% 
    filter(outcome==1, from_AS==amino_acid)%>%
    filter(gnomadSet==1)

  temp_benign<-variants_org %>% 
    filter(outcome==0, from_AS==amino_acid) %>%
    filter(gnomadSet==1)

  non_training_variants<-variants_org%>%
    filter(from_AS==amino_acid) %>%
    filter(gnomadSet==0)

  number_ben<-nrow(temp_benign)
  number_patho_to_sample<-as.integer((PROP_PATHO_FACTOR/(1-PROP_PATHO_FACTOR))*number_ben)

  new_temp_patho<-sample_n(temp_patho,number_patho_to_sample, replace = FALSE)

  new_variants<-rbind(new_variants, temp_benign, new_temp_patho, non_training_variants)

}

print(new_variants %>%
  filter(gnomadSet==1)%>%
  group_by(from_AS) %>%
  summarize(mean_outcome=mean(outcome))%>%
  arrange(mean_outcome))

print("total number of variants:")
print(nrow(new_variants))



print("Number of variants in gnomAD set after recalibration:")
print(nrow(new_variants %>%
             filter(gnomadSet==1)))

print("Number of variants in gnomAD set before recalibration:")
print(nrow(variants_org%>%
             filter(gnomadSet==1)))

print("Number of variants in ClinVar set:")
print(nrow(new_variants %>%
             filter(gnomadSet==0)))

write_csv(new_variants, file=basename(opt$output))
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
library(tidyverse)
library(optparse)

option_list = list(
  make_option(c("-v", "--dup_vars"), type="character", default="all_possible_values_dups.csv.gz", 
              help="list of duplicated variants, final set"),
  make_option(c("-o", "--out_file"), type="character", default="deduplicated_vars_bind.tsv", 
              help="Output file name"))

opt = parse_args(OptionParser(option_list=option_list))


duplicted_vars<-read_tsv(opt$dup_vars)

collapse_as_text<-c("Uniprot_acc_split", "HGVSp_VEP_split", "b_factor", "SOLVENT_ACCESSIBILITY_core","in_gnomad_train","in_clinvar_ds")
scoreVals_to_combine<-c("CADD_raw", "REVEL_score", "AlphScore", "glm_AlphCadd", "glm_AlphRevel", "glm_RevelCadd", "glm_AlphRevelCadd",
                        "glm_AlphDeogen", "glm_CaddDeogen", "glm_DeogenRevel", "glm_AlphDeogenRevel", "glm_AlphCaddDeogen", "glm_CaddDeogenRevel", "glm_AlphCaddRevel")
any_action_required<-c(collapse_as_text, scoreVals_to_combine)

cols_vars<-colnames(duplicted_vars)
constant_cols<-cols_vars[!cols_vars %in% any_action_required]

deduplicate_vars<-function(varID_of_duplicate){
  duplicated_rows<-duplicted_vars %>% 
    mutate(temp_var_ID=paste(ID, genename))%>%
    filter(temp_var_ID == varID_of_duplicate)%>%
    select(-temp_var_ID)

  merged_as_text<-duplicated_rows %>% 
    select(all_of(collapse_as_text)) %>%
    summarise_all(paste ,collapse=",")

  merged_as_numeric_val<-duplicated_rows %>% 
    select(all_of(scoreVals_to_combine)) %>%
    summarise_all(mean)

  constant_vals<-duplicated_rows %>% 
    select(all_of(constant_cols)) %>% distinct()

  deduplicated<-cbind(
        constant_vals,
        merged_as_text,
        merged_as_numeric_val
        )%>%
    relocate(any_of(cols_vars))
  if (nrow(deduplicated)>1){
    print("Deduplication not successful!")
    print(constant_vals)
  }
  return(deduplicated)
}

var_ids<-unique(paste(duplicted_vars$ID, duplicted_vars$genename))
#var_ids<-var_ids[1:50]

deduplicated_vars<-lapply(var_ids, deduplicate_vars)

deduplicated_vars_bind<-do.call("rbind", deduplicated_vars)

if (sum(!colnames(deduplicated_vars_bind)==cols_vars)>0) {
  print("Order of columns does not match the requirements!")
}

write_tsv(x=deduplicated_vars_bind,
          file=opt$out_file,
          col_names = FALSE)
58
59
60
61
62
63
64
65
66
67
68
69
shell:
	"""
	homedir=$(pwd)

	mkdir -p "{params.alphafold}"

	cd "{params.alphafold}"
	rm -f UP000005640_9606_HUMAN_v2.tar
	wget {config[alphafold_download_adress]}
	tar -xf UP000005640_9606_HUMAN_v2.tar
	rm UP000005640_9606_HUMAN_v2.tar
	"""
78
79
80
81
82
83
84
85
86
87
88
89
90
91
shell:
	"""
	homedir=$(pwd)
	mkdir -p "{params.dbNSFP}"
	cd {params.dbNSFP}

	rm -f dbNSFP4.2a.zip
	wget {config[dbNSFP_download_adress]}
	unzip dbNSFP4.2a.zip
	rm dbNSFP4.2a.zip

	cd $homedir
	touch {output}
	"""
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
shell:
	"""
	homedir=$(pwd)
	mkdir -p "{params.tmp}"
	mkdir -p "{params.outdir}"
	cd "{params.tmp}"
	mkdir -p header

	if (( $(zcat $homedir"/{params.db_file}" | head -n1 > header/header.txt) )) # pipe / head combination produces non-0-exit
	then
	echo "not ok"
	fi

	zcat $homedir"/{params.db_file}" | tail -n+2 | split -l 20000 - {wildcards.chr}

	for file in $(find * -maxdepth 0 -type f)
	do
	python $homedir"/workflow/scripts/split_dbNSFP_unnest.py" $homedir"/config/pdb_ids.txt" $(pwd)"/header/header.txt" $file $homedir"/"{params.outdir}
	done

	cd $homedir
	touch "{output}"
	"""
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
shell:
	"""
	feature="$(pwd)/"{config[feature_folder]}"bin/featurize"
	export FEATURE_DIR="$(pwd)/"{config[feature_folder]}"data/"
	work_dir="results/pdb_features/{wildcards.pdb_name}/"
	temp_pdb=$work_dir"1A1A.pdb"

	export DSSP_DIR=$work_dir
	export PDB_DIR=$work_dir

	mkdir -p $work_dir
	gunzip -c "{input}" > $temp_pdb

	#run dssp
	cat $temp_pdb | {config[dssp_bin]} -i /dev/stdin > $work_dir"1A1A.dssp"
	echo 1A1A > $work_dir"pdb_id"

	$feature $temp_pdb -n1 -w0.2 | grep ":A@CA$" > $work_dir"{wildcards.pdb_name}_core.txt"
	$feature $temp_pdb -n1 -w3 | grep ":A@CA$" > $work_dir"{wildcards.pdb_name}_3A.txt"
	$feature $temp_pdb -n1 -w6 | grep ":A@CA$" > $work_dir"{wildcards.pdb_name}_6A.txt"
	$feature $temp_pdb -n1 -w9 | grep ":A@CA$" > $work_dir"{wildcards.pdb_name}_9A.txt"
	$feature $temp_pdb -n1 -w12 | grep ":A@CA$" > $work_dir"{wildcards.pdb_name}_12A.txt"

	rm $temp_pdb*
	"""
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
shell:
	"""
	protinter="$(pwd)/"{config[protinter_bin]}
	work_dir="results/pdb_features/{wildcards.pdb_name}/"
	mkdir -p $work_dir

	gunzip -c "{input}" > $work_dir"/pdbfile.pdb"
	cd $work_dir

	python3 $protinter pdbfile.pdb -hydrophobic -csv
	python3 $protinter pdbfile.pdb -disulphide -csv
	python3 $protinter pdbfile.pdb -ionic -csv
	python3 $protinter pdbfile.pdb -aroaro -csv
	python3 $protinter pdbfile.pdb -arosul -csv
	python3 $protinter pdbfile.pdb -catpi -csv
	#python3 $protinter pdbfile.pdb -hb1 -csv
	python3 $protinter pdbfile.pdb -hb2 -csv
	python3 $protinter pdbfile.pdb -hb3 -csv
	python3 $protinter pdbfile.pdb -within_radius -csv

	sed -i "s/ //g" result_*
	rm pdbfile.pdb
	"""
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
run:
	import os
	from Bio.PDB.PDBParser import PDBParser
	import Bio.PDB as bpd
	import pandas as pd
	pdb_path=params[0]
	os.system('gunzip -c "' + pdb_path + '.gz" > "' + pdb_path + 'X"' )

	parser = PDBParser()
	structure = parser.get_structure("test", pdb_path + 'X')
	hse = bpd.HSExposure
	exp_ca = hse.HSExposureCB(structure)
	os.system('rm "' + pdb_path + 'X"' )
	resi_count=len(exp_ca.keys())
	list_RESI_HSE=[ [exp_ca.keys()[i][1][1], exp_ca.property_list[i][1][0], exp_ca.property_list[i][1][1] ] for i in range(0, resi_count)]
	df_HSE=pd.DataFrame(list_RESI_HSE)
	# manual check: HSE1=upper sphere, HSE2=lower sphere
	df_HSE.columns =['RESI','HSE1','HSE2']
	df_HSE.index=df_HSE.RESI
	df_HSE.to_csv(output[0], index=False)
235
236
237
238
239
240
241
242
run:
	import pandas as pd 
	from biopandas.pdb import PandasPdb
	import numpy as np
	ppdb_load = PandasPdb()
	ppdb_load.read_pdb(input[0])
	pd_df=ppdb_load.df["ATOM"][["residue_number","atom_name","b_factor"]].to_numpy()
	np.savetxt(output[0], pd_df[pd_df[:,1]=="CA"], fmt='%s')
258
259
script:
	"scripts/combine_features_pdb_level.py"
274
275
script:
	"scripts/graph_based_analysis_pdb_level.py"
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
run:
	import pandas as pd
	import os
	li =[]
	for filename in input:
		try:
			df = pd.read_csv(filename, low_memory=False)
			li.append(df)
		except:
			print(filename + " error")
	try:
		frame = pd.concat(li, axis=0, ignore_index=True)
	except:
		print("empty df")
		os.system('touch ' + output[0])
326
327
script:
	"scripts/extract_clinvar_gnomad_sets.R"
338
339
340
341
342
343
shell:
	"""
	Rscript workflow/scripts/preprocess.R -i {input} -o {output.preprocessed}
	Rscript workflow/scripts/recalibrate_variants.R -i {output.preprocessed} -o {output.recalibrated} \
	--undersample FALSE
	"""
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
shell:
	"""
	Rscript workflow/scripts/Fig_recal.R \
	--input_preprocessed {input.preprocessed} \
	--input_recalibrated {input.recalibrated} \
	--out_folder {params.out_folder} 

	Rscript workflow/scripts/AA_freq.R \
	--input_recalibrated {input.recalibrated} \
	--excel_location {input.excel} \
	--out_folder {params.out_folder} 

	Rscript workflow/scripts/char_variants.R \
	--csv_location {input.recalibrated} \
	--out_folder {params.out_folder} 
	"""
386
387
388
389
390
391
392
393
run:
	import os
	parameters=grid_search_table[grid_search_table["prefix"]==wildcards[0]]
	os.system('Rscript workflow/scripts/prediction.R --prefix ' + 
		wildcards[0] + " --excel_location " + input[0] + 
		" --csv_location " + input[1] + 
		" --out_folder " + params[1] +
		" " + " ".join(parameters.iloc[0,1:].to_list()))
403
404
405
406
407
408
shell:
	"""
	mkdir -p "results/joined_grid"
	cat {input} | tail -n3 | grep "auc_Alph_train_gnomAD" > {output}
	cat {input} | grep -v "auc_Alph_train_gnomAD" >> {output}
	"""
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
run:
	import pandas as pd
	import os

	# uncomment for automatic selection:
	grid_results=pd.read_csv("results/joined_grid/joined_grid.tsv", sep="\t")
	grid_results=grid_results.sort_values(["auc_Alph_interim_CV"])
	best_model=grid_results.iloc[-1]
	parameters=grid_search_table[grid_search_table["prefix"]==best_model.condition]

	#manual selection:
	#parameters=grid_search_table[grid_search_table["prefix"]=="randomforest_4000_12_500_0.999999_0_0_90_0_0_regCols"] 

	base_command=('Rscript workflow/scripts/prediction.R ' +
		' --csv_location ' + input[1] + 
		' --out_folder ' + params[1] +
		' ' + ' '.join(parameters.iloc[0,1:].to_list()) )

	# overwrite excel location if NullModel should be fit:
	if wildcards[0] == "_NullModel":
		base_command = base_command + ' --excel_location resources/available_colnames_NullModel.xlsx'
	if wildcards[0] == "_nopLDDT":
		base_command = base_command + ' --excel_location resources/available_colnames_nopLDDT.xlsx'

	print("base command:")
	print(base_command)

	pre_final_k_fold_com=base_command + ' --prefix pre_final_model_k_fold'  + wildcards[0] +' --k_fold_cross_val TRUE '

	pre_final_write_dataset_imp=base_command + ' --prefix pre_final_model' + wildcards[0] +' --write_model TRUE --write_dataset TRUE --importance impurity '

	pre_final_permutation=base_command + ' --prefix pre_final_model' + wildcards[0] +' --importance permutation '

	final_write_model=base_command + ' --prefix final' + wildcards[0] + ' --full_model TRUE --write_dataset TRUE --write_model TRUE '

	print("run k-fold cross validation, command:")
	os.system("echo " + pre_final_k_fold_com)
	os.system(pre_final_k_fold_com)

	print("run pre final model importance, command:")
	os.system("echo " + pre_final_write_dataset_imp)
	os.system(pre_final_write_dataset_imp)

	print("run pre final model importance, command:")
	os.system("echo " + pre_final_permutation)
	os.system(pre_final_permutation)

	print("run final model, command:")
	os.system("echo " + final_write_model)
	os.system(final_write_model)
499
500
501
502
503
504
505
506
507
508
509
510
511
shell:
	"""
	Rscript workflow/scripts/predict_w_ranger_model_for_Evaluation.R \
	--csv_location {input.csv} \
	--model_location {input.model} \
	--use_cols_file {input.colnames} \
	--toAS_properties {input.to_AS} \
	--model_location_null {input.model_null} \
	--use_cols_file_null {input.colnames_null} \
	--toAS_properties_null {input.to_AS_null} \
	--output_file {output} \
	--reduced {wildcards.reduced}
	"""
526
527
528
529
530
531
532
533
534
535
536
537
shell:
	"""
	Rscript workflow/scripts/predict_w_ranger_model_Final.R \
	--csv_location {input.csv} \
	--model_location {input.model_Full} \
	--use_cols_file {input.colnames_Full} \
	--toAS_properties {input.to_AS_Full} \
	--combined_model {input.combined_model} \
	--training_var_ids {input.training_var_ids} \
	--output_file {output} \
	--reduced {wildcards.reduced}
	"""
548
549
550
551
552
553
554
shell:
	"""
	Rscript workflow/scripts/create_final_combined_models.R \
	--training_dataset {input.training_dataset} \
	--combined_model {output.combined_model} \
	--training_var_ids {output.training_var_ids}
	"""
565
566
567
568
569
570
571
shell:
	"""
	Rscript workflow/scripts/compile_scores.R \
	--prot_folder="results/predicted_prots_eval/" \
	--suffix="_w_AlphScore_red_FALSE.csv.gz" \
	--mapping={input.map_file}
	"""
583
584
585
586
587
588
589
shell:
	"""
	Rscript workflow/scripts/analyse_DMS_score.R \
	--variants {input.test_dataset} \
	--validation_set {input.compiled} \
	--out_folder {params.out_folder}
	"""
606
607
608
609
610
611
612
613
614
615
616
617
shell:
	"""
	Rscript workflow/scripts/Fig_feat_imp.R \
	--input_impurity {input.impurity} \
	--input_permutation {input.permutation} \
	--prefix {params.prefix}  \
	--out_folder {params.out_folder} 

	Rscript workflow/scripts/Fig_auc_cv.R \
	--tsv_location {input.tsv_location} \
	--out_folder {params.out_folder} 
	"""
631
632
shell:
	"workflow/scripts/combine_all_proteins_to_one_file.sh {params.in_folder} {params.out_folder}"
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
shell:
	"""
	old_wd=$(pwd)
	mkdir -p {params.out_folder}
	cd {params.out_folder}

	#get clinvar2022
	wget -nc {params.clinvar_adress}

	# filter for benign / pathogenic missense variants
	zcat clinvar_20220109.vcf.gz | grep "^#CHR" | sed "s/#//g" > header_clinvar_vcf.txt
	zcat clinvar_20220109.vcf.gz | grep missense_variant | egrep "CLNSIG=Pathogenic|CLNSIG=Likely_pathogenic" | cat header_clinvar_vcf.txt - | gzip > clinvar_2022_pathogenic.vcf.gz
	zcat clinvar_20220109.vcf.gz | grep missense_variant | egrep "CLNSIG=Benign|CLNSIG=Likely_benign" | cat header_clinvar_vcf.txt - | gzip > clinvar_2022_benign.vcf.gz

	# create list of these clinvar variants, use to extract predicted scores
	zcat clinvar_2022_pathogenic.vcf.gz clinvar_2022_benign.vcf.gz | grep -v "^CHROM" | awk '{{print $1, $2 }}'  >  varlist_clinvar_2022.txt

	cat $old_wd"/"{input.header} | tr "," "\t" > headerPostTabix.txt

	tabix $old_wd"/"{input.values} -R varlist_clinvar_2022.txt | cat headerPostTabix.txt - | gzip > values_of_clinvar_variants.tsv.gz
	"""
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
	shell:
		"""
		mkdir -p {params.temp}

		zcat {input.values} | cut -f10 | \
		 uniq -d | uniq > {output.duplicate_list}

		  cat {output.duplicate_list}
		cat {input.header} | tr "," "\t" > {output.tab_header}

 		lines_dupl=$(wc -l {output.duplicate_list} | cut -f1 -d" " )

		if [ $lines_dupl -gt 0 ]
		then

		echo "duplicated variants present"

		zcat {input.values} | grep -f {output.duplicate_list} | \
		 cat {output.tab_header} - | bgzip > {output.duplicated_vars}

		# deduplicate variants
		Rscript workflow/scripts/remove_duplicates_from_final_score.R \
		 --dup_vars {output.duplicated_vars} \
		 --out_file {output.deduplicated_duplicates}

		# merge unique and deduplicated vars
		zcat {input.values} | tail -n +2 | \
		 grep -v -f {output.duplicate_list} | \
		 cat - {output.deduplicated_duplicates} | \
		 sort -V -t, -T {params.temp} -k1 -k2 - | \
		 cat  {output.tab_header} - | \
		 bgzip > {output.all_deduplicated}

		else

		echo "no duplicated variants present"

		touch {output.duplicated_vars}
		touch {output.deduplicated_duplicates}

		zcat {input.values} | tail -n +2 | \
		 cat  {output.tab_header} - | \
		 bgzip > {output.all_deduplicated}

		fi

		tabix -s 1 -b 2 -e 2 {output.all_deduplicated}
		rm -rf {params.temp}				

		"""
749
750
751
752
shell:
	"""
	tabix {input.values} -R {input.varlist} | cat {input.tab_header} - | gzip > {output.cv_vars}
	"""
769
770
771
772
773
774
775
776
777
778
shell:
	"""
	Rscript workflow/scripts/get_performance_clinvar_2022.R \
	--clinvar_benign {input.cv_ben} \
	--clinvar_pathogenic {input.cv_patho} \
	--AlphaFold_scores {input.cv_scores} \
	--variants {input.variants} \
	--protein_subset {input.membrane_proteins} \
	--out_folder {params.out_folder} 
	"""
793
794
795
796
797
798
799
800
801
shell:
	"""
	Rscript workflow/scripts/histogramsFinalModel.R \
	--clinvar_benign {input.cv_ben} \
	--clinvar_pathogenic {input.cv_patho} \
	--AlphaFold_scores {input.cv_scores} \
	--variants {input.variants} \
	--out_folder {params.out_folder} 
	"""
813
814
815
816
817
818
shell:
	"""
	Rscript workflow/scripts/plot_pLDDT_vs_CV_gnomAD.R \
	--variants {input.variants} \
	--out_folder {params.out_folder}
	"""
ShowHide 38 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/Ax-Sch/AlphScore
Name: alphscore
Version: 0.9
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: GNU General Public License v3.0
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...