This is a phylogeny workflow for batch submission to an HPC cluster, built around the Slurm queue manager.

PHACT

PHACT, PHylogeny-Aware Computation of Tolerance (for amino acid substitutions), is a computational framework based on a Snakemake workflow to predict the functional effects of missense mutations.

The approach exploits phylogenetic tree information to measure the deleteriousness of a given variant. The workflow considers not only amino acid frequencies but also the evolutionary relationships between species, the physicochemical properties of amino acids, and the dependence or independence of substitutions.

[rulegraph.svg] The PHACT workflow for a set of example queries, used to predict the effect of missense mutations.

First-time setup

The following steps are required to run the PHACT workflow:

  1. Install conda & snakemake without superuser permission (a command sketch follows this list)

    • Install miniconda, a package management system

    • Install snakemake, a workflow management system, using mamba

    • Compile and install PAML into a specific PATH, for example, ~/resources/paml4.9j

  2. Configuration of workflow (see below)

  3. Set model parameters (see below)

  4. Check PHACT workflow (see below)
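A minimal sketch of step 1, following the standard miniconda and Snakemake installation routes (the installer file name, paths, and environment name are illustrative; adjust them to your system):

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
$ source $HOME/miniconda3/etc/profile.d/conda.sh
$ conda install -n base -c conda-forge mamba
$ mamba create -c conda-forge -c bioconda -n snakemake snakemake
$ conda activate snakemake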

Genetic database

Configuration

The configuration file (config/config.yml), a dictionary of configuration parameters and their values, makes the workflow flexible. The following global parameters should be set (a minimal example follows the list).

  • workdir indicates the working directory. After cloning the repository, the full path, for example /home/emrah/PHACT, should be set properly.

  • query_fasta is a list of query files in fasta format, for example ["O95153", "O60312"], that the workflow will analyze.

  • archived_path, an optional parameter, specifies the folder into which all outputs of completed analyses are stored when the corresponding bash script available in the repository is used.
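Putting these together, a minimal config/config.yml might look as follows (the archive path is illustrative):

workdir: "/home/emrah/PHACT"
query_fasta: ["O95153", "O60312"]
archived_path: "/home/emrah/PHACT_archive"  # optional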

Meanwhile, the tools and versions used while performing each task are defined in a set of environment YAML files under the workflow/envs folder. Each file is prepared for the conda package manager, which installs and sets up an environment based on its definition.
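These files follow the standard conda environment format. A hypothetical example for the alignment step could look like the following (the actual file names and version pins in workflow/envs may differ):

name: msa
channels:
  - conda-forge
  - bioconda
dependencies:
  - mafft=7.475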

Model parameters

The PHACT framework is based on a Snakemake workflow, which is defined by specifying rules in a Snakefile (workflow/Snakefile). Rules decompose the workflow into small steps such as finding homologues of each query sequence (PSI-BLAST), performing multiple sequence alignment (MAFFT), or generating a maximum-likelihood phylogenetic tree (RAxML-NG, FastTree).

Each rule has its own model parameters, which can be set via a single configuration file (config/config.yml). For example, the method for multiple sequence alignment is selected as follows.

#alignment
mafft_method: "-fftns" # if left empty, FFT-NS-2 is used
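Other per-rule parameters referenced by the shell commands in the Code Snippets section below can be set in the same file. The keys come from those commands; the values here are illustrative, not the repository defaults:

#blast
blast_hit_number: 1000
max_e_value: 1e-5
min_identity: 30
#trimming
trimal_method: "-gappyout"
#ancestral reconstruction
raxmlng_ancestral_model: "LG+G4"
iqtree_ancestral_model: "LG+G"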

Test workflow

To check whether the workflow is defined properly and to estimate the amount of computation remaining, the dry-run parameter is used as follows.

$ cd workflow
$ snakemake --dry-run

It summarizes the total number of jobs to be performed per rule, together with the sets of input and output files to be used and created. For the 2 query files in this example, the 29 jobs listed below will be executed.

Job counts:
    count   jobs
    1       all
    2       fast_tree
    2       get_blasthits
    2       iqtree_anc_score
    2       iqtree_ancestral
    2       ml_tree
    2       msa
    2       psiblastp
    2       query_fasta
    2       raxml_anc_score
    2       raxmlng_ancestral
    2       remove_gaps
    2       trim_msa
    2       unroot_fasttree
    2       unroot_tree
    29

Running the PHACT workflow

The simplest way to run the PHACT workflow on a local computer is as follows.

  1. Clone the PHACT repository and cd into it.

     $ git clone https://github.com/CompGenomeLab/PHACT.git
     $ cd PHACT

  2. Modify the workdir and query_fasta parameters in config/config.yml to be the path of the working directory and the query IDs.

     $ pwd (prints the working directory)

  3. Prepare the environment for snakemake.

     $ conda activate snakemake

  4. Run the workflow and follow the executed rules during the analyses.

     $ snakemake -j 1 --use-conda

PHACT output

The outputs are written under results in subfolders prefixed by query ID. Each query-ID folder contains its own BLAST hits, multiple sequence alignment, maximum-likelihood phylogenetic tree, ancestral sequence reconstruction, and the computed scores. The results/query_id directory has the following structure:

.
├── 1_psiblast
├── 2_msa
├── 3_mltree
├── 4_raxmlng_ancestral
├── 5_raxmlng_ancestral_scores
├── 6_fasttree
├── 7_iqtree_ancestral
└── 8_iqtree_ancestral_scores
8 directories

Scaling the pipeline across an HPC (farm)

In addition to allowing workloads to run on a local computer with a limited number of query IDs, the PHACT framework is designed to analyze large batches of query IDs in parallel using a High Performance Computing center.

Most HPC clusters have a scheduler that decides which workloads run on which compute nodes. To interact with the scheduler, users typically need to prepare a bash script and submit it to the cluster. Snakemake can perform all of this automatically.

The PHACT framework, being based on Snakemake, provides a number of profile files prepared for the SLURM scheduler, the most widely used one today. The Slurm profile files (config/slurm_truba or config/slurm_sabanci) contain the fundamental definitions for executing the workflow on an HPC system with Slurm.

Running snakemake --profile config/slurm_truba is enough: Snakemake searches that directory and loads the profile, which can be reused at any time.
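To illustrate the mechanism only (the actual contents of config/slurm_truba differ), a Snakemake profile is a folder containing a config.yaml that maps command-line options to values, for example:

cluster: "sbatch --cpus-per-task={resources.cpus} --time={resources.walltime}"
jobs: 100
use-conda: true
keep-going: true
restart-times: 1

Here cluster is the job submission command template, and {resources.cpus} and {resources.walltime} are placeholders filled from each rule's resources (the resource names are assumptions, not necessarily the repository's exact keys).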

Tips and tricks for running the workflow

  • Caching between workflows is a very useful feature that Snakemake provides. To avoid redundant computation between workflows, using the cache as follows is recommended, although the implementation should be considered experimental.

     $ export SNAKEMAKE_OUTPUT_CACHE=<THE_PATH_OUTPUT_WILL_BE_CACHED>
     $ snakemake --use-conda --cache --profile ../config/slurm_truba

  • Monitoring the execution via a web URL is possible if a panoptes server is installed and configured properly. Run snakemake with the --wms-monitor http://ephesus.sabanciuniv.edu:5000 parameter.

  • Using the --keep-going parameter is recommended so that independent jobs keep running in case any task fails.

  • The screen command, or executing snakemake in the background, is useful for long runs of the workflow, to avoid trouble in case the connection to the terminal is lost (see the sketch after this list).
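Both options below use standard Linux tools and are not PHACT-specific; the session and log file names are illustrative:

# option 1: run inside a detachable screen session
$ screen -S phact
$ snakemake --profile config/slurm_truba
# detach with Ctrl-a d; reattach later with: screen -r phact

# option 2: run in the background, immune to hangups
$ nohup snakemake --profile config/slurm_truba > phact.log 2>&1 &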

Distribution and reproducibility

We provide a template for a reproducible and scalable research project using Snakemake together with various programming languages (Python, R) and tools managed by the conda package manager. The set of rules implements the entire research pipeline: querying genetic databases for homologues, aligning sequences, trimming the MSA, generating the phylogenetic tree, reconstructing ancestral sequences, and finally computing the score.

To obtain the same results after running the workflow regardless of the platform (local computer, HPC, cloud computing), the framework needs the same set of input files and the same software/tool versions. Since all requirements are defined explicitly, identical results can be reproduced.

Citing this work

Kuru, N., Dereli, O., Akkoyun, E., Bircan, A., Tastan, O., & Adebali, O. (2022). PHACT: Phylogeny-aware computing of tolerance for missense mutations. Molecular Biology and Evolution. https://doi.org/10.1093/molbev/msac114

Data availability

All data obtained during this study are shared at the following URL: https://phact.sabanciuniv.edu/pubs/kuru_mbe_2022/

  • Please see the Publications/Kuru_etal_2022 folder to reproduce the figures in the PHACT manuscript.

Acknowledgement

This work was supported by the Health Institutes of Turkey (TUSEB) (Project no: 4587 to O.A.) and an EMBO Installation Grant (Project no: 4163 to O.A.) funded by the Scientific and Technological Research Council of Turkey (TÜBİTAK). Most of the numerical calculations reported in this paper were performed at the High Performance and Grid Computing Center (TRUBA resources) of TÜBİTAK. The TOSUN cluster at Sabanci University was also used for computational analyses. We also thank Nehircan Özdemir for his artistic illustration of the PHACT approach.

Code Snippets

Shell commands of the main rules (labels added for orientation; see workflow/rules for the full rule definitions):

fast_tree (FastTree):
shell:
    "FastTreeMP {config[fasttree_model]} {input.trimmed_msa}  >  {output.bestTree} 2>{log}"

get_blasthits (filter PSI-BLAST hits):
shell:
    "python3 scripts/parse_blastp.py {input.blastp_out} {config[blast_hit_number]} {config[max_e_value]} {config[min_identity]} {params.blastdb} {input.query_fasta} 2> {log}"

iqtree_ancestral (IQ-TREE ancestral reconstruction):
shell:
    "iqtree2 -redo -s {input.no_gap_msa} -te {input.unrooted_tree} -m {config[iqtree_ancestral_model]} -asr -safe -nt {resources.cpus} --prefix {params.iqtree_ancestral_out_name}"

raxml_anc_score / iqtree_anc_score (compute scores from ancestral probabilities):
shell:
    "query=`python scripts/get_query.py {params.query_fasta}` && Rscript scripts/ComputeScore_Nonnormal_withDiversity_0411.R {input.tree_file} {input.probabilities} {params.fasta} {params.out} $query {config[weights]} 2>{log}"

msa (MAFFT; from line 15 of rules/msa.smk):
shell:
    "mafft{config[mafft_method]} --anysymbol --thread {resources.cpus} {input.fasta} > {output.msa_file} 2> {log}"

psiblastp (PSI-BLAST search):
shell:
    "psiblast -query {input.query_fasta} -db {params.blastdb} -outfmt {config[outfmt]} -out {output.outfile} -max_target_seqs {config[max_target_seqs]} -num_iterations {config[num_iterations]}  2> {log}"

query_fasta (extract the query sequence):
shell:
    "python scripts/make_query_fasta.py {params.query_id}  {params.blastdb} {output.fasta_file} 2> {log}"

raxmlng_ancestral (RAxML-NG ancestral reconstruction):
shell:
    "raxml-ng --ancestral --msa {input.no_gap_msa} --tree {input.unrooted_tree} --model {config[raxmlng_ancestral_model]} --prefix {params.raxml_ancestral_out_name} --threads {resources.cpus}"

remove_gaps (drop query-gap columns from the MSA):
shell:
    "python3 scripts/remove_gaps.py {input.query_file} {input.msa_file} {input.tree_file}  {output.no_gap_file} 2> {log}"

trim_msa (trimAl):
shell:
    "trimal -in {input.msa_file} -out {output.trimmed_msa} {config[trimal_method]} 2> {log}"

unroot_tree / unroot_fasttree (unroot the tree):
shell:
    "python  scripts/unroot_tree.py {input.input_tree}  {input.no_gap_msa_file} 2> {log}"
scripts/ComputeScore_Nonnormal_withDiversity_0411.R (computes the per-position scores from the tree, ancestral probabilities, and MSA; invoked by the score rule above):

library(ape)
library(tidytree)
library(stringr)
library(dplyr)
library(bio3d)
#source("./position_score.R")
args = commandArgs(trailingOnly=TRUE)
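# expected command-line arguments (see the Rscript invocation above):
# 1: newick tree file, 2: ancestral probabilities table, 3: MSA fasta,
# 4: output name prefix, 5: query (human) id, 6: comma-separated weight parameters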

aa_to_num <- function(aa) {
  amino_acids <- c("G", "A", "L", "M", "F", "W", "K", "Q", "E", "S", "P", "V", "I", "C", "Y", "H", "R", "N", "D", "T")
  num <- sapply(aa, function(a){ifelse(sum(amino_acids %in% a) == 1, as.numeric(which(amino_acids %in% a)), 21)})
  # num <- ifelse(sum(amino_acids %in% aa) == 1, as.numeric(which(amino_acids %in% aa)), 21)
  return(num)
}

num_to_aa <- function(num) {
  amino_acids <- c("G", "A", "L", "M", "F", "W", "K", "Q", "E", "S", "P", "V", "I", "C", "Y", "H", "R", "N", "D", "T")
  aa <- ifelse(num == 21, 21, amino_acids[num])
  return(aa)
}

compute_score <- function(file_nwk, file_rst, file_fasta, output_name, human_id, pos_chosen, parameters) {

  # Read tree file
  tr_org <- read.tree(file_nwk)
  x <- read.table(file = file_rst, sep = '\t', header = TRUE, fill = TRUE)
  colnames(x)[4:ncol(x)] <- gsub("p_", replacement = "", x = colnames(x)[4:ncol(x)], fixed = TRUE )
  # Tree_info: node-node, node-leaf connections
  tree_info <- as.data.frame(as_tibble(tr_org))

  # Read fasta file, MSA
  fasta <- read.fasta(file = file_fasta)
  msa <- fasta$ali

  # connections_1: Parent node, connections_2: connected node/leaf
  connections_1 <- tree_info$parent
  connections_2 <- tree_info$node

  # Names of leaves
  names_all <- tr_org[["tip.label"]]
  msa <- msa[names_all, ]
  # Number of total leaves&nodes
  num_leaves <- length(tr_org[["tip.label"]])
  num_nodes <- tr_org[["Nnode"]]

  # Unnecessary
  num_nodes_codeml <- max(connections_1,connections_2)
  if (num_nodes_codeml-num_leaves != num_nodes){
    print("Number of nodes is less than expected: CODEML")
  }

  # Distance between leaves & nodes
  dd_node <- dist.nodes(tr_org)
  dist_leaf <- dd_node[1:num_leaves, 1:num_leaves]
  dist_node <- dd_node[(num_leaves+1):(num_leaves + num_nodes), (num_leaves+1):(num_leaves + num_nodes)]

  # Human position (leaf & node)
  h_name <- human_id
  human_codeml <- names_all[grep(pattern = h_name, x = names_all, fixed = TRUE)]
  leaf_human <- tree_info[which(tree_info$label == human_codeml), "node"]
  human_plc <- leaf_human
  node_human <- tree_info[which(tree_info$label == human_codeml), "parent"]
  nodes_raxml <- as.numeric(gsub(pattern = "Node", replacement = "", x = tree_info[num_leaves+1:num_nodes, "label"])) #Node or Branch
  names(nodes_raxml) <- tree_info[num_leaves+1:num_nodes, "node"]

  # Total number of positions from ancestralProbs file
  total_pos <- max(x$Site)

  # Chosen positions (all or some)
  if (pos_chosen[1] == "all"){
    positions <- 1:total_pos
    score_all <- matrix(0, total_pos, 21)
  } else {
    positions <- pos_chosen
    score_all <- matrix(0, length(positions), 21)
  }

  ####################################################
  ####################################################

  # Connections between leaves & nodes
  chosen_leaves <- tree_info[1:num_leaves,c("parent", "node")]
  # Connections between nodes & nodes
  chosen_nodes <- tree_info[(num_leaves+2):(num_leaves +num_nodes),c("parent", "node")]
  leaf_names <- tree_info$label

  human_leaf_len <- as.double(tree_info[human_plc, "branch.length"])

  d_n <- dist_node[as.character(node_human),] + human_leaf_len
  # d_n <- d_n[nn2[,1]]
  d_l <- dist_leaf[leaf_human,]

  # chosen_nodes2: ordered connections (for probability differences)
  chosen_nodes2 <- matrix(0, num_nodes-1, 2)

  n1 <- as.numeric(chosen_nodes$parent)
  n2 <- as.numeric(chosen_nodes$node)
  dist_f <- d_n[as.character(n1)]
  dist_s <- d_n[as.character(n2)]

  # chosen_nodes2: ordered connections (for probability differences)
  chosen_nodes2[which(dist_f < dist_s), 1] <- n2[which(dist_f < dist_s)]
  chosen_nodes2[which(dist_f < dist_s), 2] <- n1[which(dist_f < dist_s)]

  chosen_nodes2[which(dist_f >= dist_s), 1] <- n1[which(dist_f >= dist_s)]
  chosen_nodes2[which(dist_f >= dist_s), 2] <- n2[which(dist_f >= dist_s)]

  ###########################################################################
  ###############      NEW PART - 15NOV                      ################
  ###########################################################################

  # Number of nodes between nodes & leaf of human
  nodes_conn <- numeric(num_nodes)
  nodes_conn[node_human-num_leaves] <- 1
  names(nodes_conn) <- names(d_n)
  chs <- c()
  chs2 <- c()
  ##########################
  inds <- chosen_nodes2[chosen_nodes2[,2]==(node_human),1]
  nodes_conn[as.character(inds)] <- 2
  chs <- inds

  s0 <- sapply(3:num_leaves, function(i){
    for (j in chs){
      inds <- chosen_nodes2[chosen_nodes2[,2]==j,1]
      if (length(inds)!=0){
        nodes_conn[as.character(inds)] <<- i
        chs2 <- c(chs2, inds)
      }
    }
    chs <<- chs2
    chs2 <- c()
  })

  # Number of nodes between leaves & leaf of human
  leaves_conn <- nodes_conn[as.character(chosen_leaves[,1])]

  #########################################3

  parameters <- unlist(str_split(parameters, pattern = ","))

  score_norm <- t(mapply(function(ps, parameter){position_score(ps, x, msa, num_nodes, num_leaves, total_pos, human_plc, node_human, nodes_raxml, human_leaf_len, dist_node, dist_leaf, parameter, leaves_conn, nodes_conn, chosen_leaves, chosen_nodes2, d_n, d_l)}, rep(positions, length(parameters)), rep(parameters, each = length(positions))))
  score_norm_with_leaf <- matrix(unlist(score_norm[ ,1]), nrow = length(positions) * length(parameters), ncol = 20, byrow = TRUE)
  score_norm_without_leaf <- matrix(unlist(score_norm[ ,2]), nrow = length(positions) * length(parameters), ncol = 20, byrow = TRUE)

  score_norm_with_leaf <- cbind(rep(positions, length(parameters)), score_norm_with_leaf)
  score_norm_without_leaf <- cbind(rep(positions, length(parameters)), score_norm_without_leaf)

  colnames(score_norm_with_leaf) <- c("Pos/AA", num_to_aa(1:20))
  colnames(score_norm_without_leaf) <- c("Pos/AA", num_to_aa(1:20))

  print_wl <- lapply(1:length(parameters), function(p){
    score_to_print_wl <- score_norm_with_leaf[(positions + length(positions)*(p - 1)), ]
    score_to_print_wol <- score_norm_without_leaf[(positions + length(positions)*(p - 1)), ]
    filename <- ifelse(parameters[p] == "0", "max05", parameters[p])
    filename <- ifelse(parameters[p] == "X", "max05_Gauss", parameters[p]) ### New Line (11.11)
    write.csv(score_to_print_wl, sprintf("%s.csv", paste(output_name, "_wl_param_", filename, sep = "")), row.names = FALSE, quote = FALSE)
    write.csv(score_to_print_wol, sprintf("%s.csv", paste(output_name, "_wol_param_", filename, sep = "")), row.names = FALSE, quote = FALSE)
  })

}

position_score <- function(ps, x, msa, num_nodes, num_leaves, total_pos, human_plc, node_human, nodes_raxml, human_leaf_len, dist_node, dist_leaf, parameter, leaves_conn, nodes_conn, chosen_leaves, chosen_nodes2, d_n, d_l) {
  position <- ps

  b1 <- position + total_pos*(0:(num_nodes-1))
  TT <- x[b1,]
  matrix_prob <- matrix(0, num_nodes, 20)

  probs <- data.matrix((TT[, (4:ncol(TT))]))
  rownames(probs) <- NULL
  rr <- aa_to_num(colnames(x)[4:ncol(TT)])
  matrix_prob[,rr] <- probs
  matrix_prob <- matrix_prob[nodes_raxml,]

  position_vec <- msa[, ps]

  position_num <- aa_to_num(position_vec)
  prob_leaves <- matrix(0, num_leaves, 20)
  prob_leaves[cbind(which(position_num <= 20), position_num[which(position_num <= 20)])] <- 1

  gaps <- which(position_num == 21)

  tot_sc <- 1
  mxx <- 1

  diff_leaves <- matrix(0, num_leaves, 20)
  diff_leaves <- prob_leaves - matrix_prob[(chosen_leaves$parent - num_leaves), ]
  diff_leaves[human_plc,]<- -diff_leaves[human_plc,]

  diff_nodes <- matrix(0, num_nodes-1, 20)
  diff_nodes <- matrix_prob[chosen_nodes2[,1] - num_leaves, ] - matrix_prob[chosen_nodes2[,2] - num_leaves, ]


  ################## weights
  weights <- weight_fnc(d_n, d_l, human_plc, parameter, leaves_conn, nodes_conn, mxx) # Updated with related parameters
  weight_leaf <- weights[1:num_leaves]
  weight_node <- tail(weights,num_nodes)
  ####################

  score <- matrix(0,1,20)

  s1 <- sapply(1:20, function(ii){
    dif_pr <- diff_nodes[1:(num_nodes-1),ii]
    dif_pr[dif_pr<0] <- 0
    # CHANGED 29.03
    sel_node <- chosen_nodes2[1:length(dif_pr), 1] - num_leaves
    score[ii] <<- score[ii] + sum(weight_node[sel_node] * dif_pr)
  })

  ### NOVEL 29.03
  aa_f <- position_num[human_plc]
  vect_human <- matrix_prob[node_human - num_leaves,]
  vect_human[aa_f]<-0

  score_without_leaf <- score
  score_without_leaf <- score_without_leaf + weight_node[(node_human-num_leaves)]*vect_human
  score <- score + weight_node[(node_human-num_leaves)]*vect_human

  s2 <- sapply(1:20, function(ii){
    diff_lf <- diff_leaves[1:num_leaves,ii]
    diff_lf[gaps] <-  0
    diff_lf[diff_lf<0] <- 0

    s1 <- sum(weight_leaf[((1:length(diff_lf)) != human_plc )] * diff_lf[((1:length(diff_lf)) != human_plc )])
    score[ii] <<- score[ii] + s1
  })

  aa_f <- position_num[human_plc]
  if (aa_f != 21){
    score[aa_f] <- score[aa_f] + weight_leaf[human_plc]*1
    score_without_leaf[aa_f] <- score_without_leaf[aa_f] + weight_leaf[human_plc]*1
  }

  sum_exc_max <- sum(score_without_leaf)-max(score_without_leaf)
  diversity <- (-(length(which(score_without_leaf<0.0001))*0.1)/20+0.1)*(sum_exc_max)

  scores <- list()
  scores$score_with_leaf <- 1- log((score*0.9 + diversity)/(num_nodes+num_leaves)+10^(-15))/log(10^(-15))
  scores$score_without_leaf <- 1- log((score_without_leaf*0.9 + diversity)/num_nodes + 10^(-15))/log(10^(-15))

  return(scores)
}

weight_fnc <- function(d_n, d_l, human_plc, parameter, leaves_conn, nodes_conn, mxx) {
  # print(parameter)
  if (parameter=="0"){
    d_l_n <- d_l[-human_plc]
    min_l <- min(d_l_n)
    d_l2 <- d_l/min_l
    d_n2 <- d_n/min_l

    d_l2 <- d_l2 +1
    d_n2 <- d_n2 +1
    weight_node <- 1/d_n2
    weight_leaf <- 1/d_l2

  } else if (parameter=="0_MinNode"){
    d_l_n <- d_l[-human_plc]
    min_l <- min(d_n)
    d_l2 <- d_l/min_l
    d_n2 <- d_n/min_l

    d_l2 <- d_l2 +1
    d_n2 <- d_n2 +1
    weight_node <- 1/d_n2
    weight_leaf <- 1/d_l2

  } else if (parameter=="0_MinNode_Mix"){
    d_l_n <- d_l[-human_plc]
    min_l <- min(d_n)
    d_l2 <- d_l/min_l
    d_n2 <- d_n/min_l

    d_l2 <- d_l2 +1
    d_n2 <- d_n2 +1
    weight_node1 <- 1/d_n2
    weight_leaf1 <- 1/d_l2

    param <- mean(c(d_n, d_l))
    weight_leaf2 <- exp(-d_l^2/param^2)
    weight_node2 <- exp(-d_n^2/param^2)

    weight_node<-sqrt(weight_node1*weight_node2)
    weight_leaf<-sqrt(weight_leaf1*weight_leaf2)

  } else if (parameter=="0_MinNode_Mix2"){
    d_l_n <- d_l[-human_plc]
    min_l <- min(d_n)
    d_l2 <- d_l/min_l
    d_n2 <- d_n/min_l

    d_l2 <- d_l2 +1
    d_n2 <- d_n2 +1
    weight_node1 <- 1/d_n2
    weight_leaf1 <- 1/d_l2

    param <- mean(c(d_n, d_l))
    weight_leaf2 <- exp(-d_l^2/param^2)/2
    weight_node2 <- exp(-d_n^2/param^2)/2

    weight_node<-sqrt(weight_node1*weight_node2)
    weight_leaf<-sqrt(weight_leaf1*weight_leaf2)

  } else if (parameter == "mean"){
    param <- mean(c(d_n, d_l))
    weight_leaf <- exp(-d_l^2/param^2)
    weight_node <- exp(-d_n^2/param^2)
  } else if (parameter == "median"){
    param <- median(c(d_n, d_l))
    weight_leaf <- exp(-d_l^2/param^2)
    weight_node <- exp(-d_n^2/param^2)
  } else if (parameter == "X"){
    d_l_n <- d_l[-human_plc]
    min_l <- min(d_l_n)
    min_n <- min(d_n)
    min_ch <- min(min_l, min_n)
    d_l2 <- d_l-min_ch
    d_n2 <- d_n-min_ch
    weight_leaf <- exp(-d_l^2)/2
    weight_node <- exp(-d_n^2)/2
    weight_leaf[human_plc] <- 1
  } else if (parameter == "CountNodes_1"){      ### NewFunction 1 (15Nov)
    weight_node = (exp(-d_n^2) + 1/nodes_conn)/2
    weight_leaf = (exp(-d_l^2) + 1/leaves_conn)/2
  } else if (parameter == "CountNodes_2"){      ### NewFunction 2 (15Nov)
    weight_node = (exp(-d_n^2) + exp(-nodes_conn^2))/2
    weight_leaf = (exp(-d_l^2) + exp(-leaves_conn^2))/2
  } else if (parameter == "CountNodes_3"){      ### NewFunction 3 (15Nov)
    weight_node = sqrt(exp(-d_n^2)*1/nodes_conn)
    weight_leaf = sqrt(exp(-d_l^2)*1/leaves_conn)
  } else if (parameter == "CountNodes_4"){      ### NewFunction 4 (15Nov)
    weight_node = exp(-(sqrt(d_n*nodes_conn))^2)
    weight_leaf = exp(-(sqrt(d_l*leaves_conn))^2)
  } else if (parameter == "Equal"){

    weight_leaf <- matrix(1,1,length(d_l))
    weight_node <- matrix(1,1,length(d_n))

  } else if (parameter == "MinThreshold"){
    max_dis <- max(c(d_l, d_n))
    # NOTE: param_min is not an argument of weight_fnc; it is assumed to be defined in the calling environment
    param_min <- as.double(param_min)
    weight_node <- (-1+param_min)/max_dis*d_n + 1
    weight_leaf <- (-1+param_min)/max_dis*d_l + 1
  } else if (parameter == "MinThreshold_Gauss"){
    max_dis <- max(c(d_l, d_n))
    param_min <- as.double(param_min)
    param <- sqrt(-max_dis^2/log(param_min))

    weight_leaf <- exp(-d_l^2/param^2)
    weight_node <- exp(-d_n^2/param^2)

  } else {
    param <- as.double(parameter)
    weight_leaf <- exp(-d_l^2/param^2)
    weight_node <- exp(-d_n^2/param^2)
  }
  weights = c(weight_leaf, weight_node)


  return(weights)
}

csv_file <- compute_score(file_nwk=args[1],file_rst=args[2],file_fasta=args[3], output_name=args[4],human_id=args[5],'all', parameters = args[6])
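
A typical standalone invocation mirrors the score rule shown earlier (all file names and the query id here are illustrative):

$ Rscript scripts/ComputeScore_Nonnormal_withDiversity_0411.R tree.nwk ancestral_probs.tsv msa.fasta out_prefix sp_O95153_HUMAN_9606 mean,median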
scripts/get_query.py (prints the query identifier '<accession>_<taxid>' parsed from the query fasta header; used by the score rule above):

import re
import sys

def get_human_id(fasta_file):
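    # scan the fasta headers and build '<accession>_<OX taxid>',
    # replacing any non-word characters with underscores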
    with open(str(fasta_file),'r') as f:
        for line in f:
            if line.startswith('>'):
                line = line[0] + line[1:].replace('>', '-') 
                header = line.split(">")[1].strip()
                human_id = header.split(" ")[0]+"_"+header.split("OX=")[1].split(" ")[0]
                human_id = re.sub(r'[^\w>]', '_',human_id)
    print(human_id)
    return human_id

get_human_id(sys.argv[1])
scripts/make_query_fasta.py (extracts the fasta record of the given accession from the database fasta file):

import sys


def make_query_fasta(acc_no,all_eu_file,output_path):

    protein_dict = {}

    with open(all_eu_file,'r') as f:
        for line in f:
            if line.startswith('>'):
                line = line[0] + line[1:].replace('>', '-') 
                header = line.split(">")[1].strip()
                if header.split("|")[1].split("|")[0] == acc_no:
                    line = next(f, '>').strip()  # default '>' guards against StopIteration at end of file
                    protein_dict[header] = ''
                    while not line.startswith('>'):
                        protein_dict[header] += line
                        line = next(f, '>').strip()

    with open(output_path,'w') as f:
        for header,sequence in protein_dict.items():
            f.write(">"+header+"\n" + sequence + "\n")


if __name__ == "__main__":
    acc_no = sys.argv[1]
    all_eu_file = sys.argv[2]
    output_path = sys.argv[3]
    make_query_fasta(acc_no,all_eu_file,output_path)
scripts/parse_blastp.py (filters PSI-BLAST hits by hit number, e-value, and identity, then writes the selected sequences to fasta):

import sys
import re


def get_fasta_dict(fasta_file):
    fasta_dict = {}

    with open(fasta_file,'r') as f:
        for line in f:
            if line.startswith('>'):
                line = line[0] + line[1:].replace('>', '-') 
                header = line.split(">")[1].strip()
                fasta_dict[header] = ''
            else:
                fasta_dict[header] += line.strip()


    return fasta_dict


def parse_blastp_out(blastp_out_file,blast_hit_number,max_e_value,min_identity):
    blast_list = []
    with open(blastp_out_file,'r') as filein_:
        for line in filein_:
            if len(blast_list) <= int(blast_hit_number):
                if line.startswith(">"):
                    best_hit = line.split(">")[1].strip()
                    best_hit = best_hit.split(" ")[0]
                    while not line.startswith(" Score"):
                        line = next(filein_)
                        if line.startswith(" Score"):
                            score = float(line.split("Score = ")[1].split(" bits")[0])
                            e_value = float(line.split("Expect = ")[1].split(",")[0])
                            identity_line = next(filein_)
                            identity = float(identity_line.split("Identities = ")[1].split("(")[1].split("%")[0])
                            if e_value < float(max_e_value) and identity>float(min_identity):
                                blast_list.append(best_hit)

    return blast_list

def write_to_file(blast_list,all_eu_file,blastp_out_file,query_fasta):
    query_fasta_dict = get_fasta_dict(query_fasta)
    all_eu_dict = get_fasta_dict(all_eu_file)
    all_eu_keys_list = []
    for k,v in all_eu_dict.items():
        all_eu_keys_list.append(k.split(" ")[0].split("|")[1])
    if set(all_eu_keys_list).issuperset(set(blast_list)):
        with open(blastp_out_file.split(".")[0]+".fasta",'w') as f:
            for k,v in all_eu_dict.items():
                if k.split(" ")[0].split("|")[1] in blast_list:
                    new_k = k.split(" ")[0]+"_"+k.split("OX=")[1].split(" ")[0]
                    new_k = re.sub(r'[^\w>]', '_',new_k)
                    f.write(">"+new_k + "\n" + v +"\n")
            for h,s in query_fasta_dict.items():
                if h.split(" ")[0].split("|")[1] in blast_list:
                    pass
                else:
                    new_h = h.split(" ")[0]+"_"+h.split("OX=")[1].split(" ")[0]
                    new_h = re.sub(r'[^\w>]', '_',new_h)
                    f.write(">"+new_h + "\n" + s +"\n")


if __name__ == "__main__":
    blastp_out_file = sys.argv[1]
    blast_hit_number = sys.argv[2]
    max_e_value = sys.argv[3]
    min_identity = sys.argv[4]
    blastdb_file = sys.argv[5]
    query_fasta = sys.argv[6]
    blast_list =  parse_blastp_out(blastp_out_file,blast_hit_number,max_e_value,min_identity)
    write_to_file(blast_list,blastdb_file,blastp_out_file,query_fasta)
scripts/remove_gaps.py (removes MSA columns that are gaps in the query and writes the degapped sequences restricted to tree leaves):

import sys
import re
from fasta_dict import *
from ete3 import Tree

def get_human_id(query_file):
    query_dict = get_fasta_dict(query_file)
    for k,v in query_dict.items():
        human_id = k.split(" ")[0]+"_"+k.split("OX=")[1].split(" ")[0]
        human_id = re.sub(r'[^\w>]', '_',human_id)
    #print(human_id)
    return human_id

def get_gap_positions(seqDict,human_id):
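    # NOTE: assumes exactly one key contains human_id; otherwise 'indices' may be
    # unbound or reflect only the last matching sequence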

    for key, value in seqDict.items():
        if human_id  in key:
            #print(value)
            indices = [i for i, x in enumerate(value) if x == "-"]
            #print(indices)
    return indices

def get_sequences_without_gap(seqDict,indices):
    new_seqDict = {}
    for k,v in seqDict.items():
        new_seqDict[k] = ''.join([v[i] for i in range(len(v)) if i not in indices])
        #print(new_seqDict)  
    return new_seqDict


def write_new_fasta(new_seqDict,fasta_file,tree_file,output_file):
    t = Tree(tree_file,format=1)
    leaves = []
    for leaf in t:
        leaves.append(leaf.name)
    with open (output_file,'w') as new_file:
        #print(new_file)
        for newheader,value in  new_seqDict.items():
            if newheader in leaves:
                new_value = re.sub(r'[BXJZUO#$]', '-',value)
                if not all([c=="-" for c in new_value]):
                    new_file.write(">" +newheader+"\n" + new_value+ "\n")



if __name__ == "__main__":
    query_file = sys.argv[1]
    fasta_file= sys.argv[2]
    tree_file= sys.argv[3]
    output_file= sys.argv[4]
    human_id= get_human_id(query_file)
    seqDict = get_fasta_dict(fasta_file)
    gap_indices = get_gap_positions(seqDict,human_id)
    new_seqDict = get_sequences_without_gap(seqDict,gap_indices)
    write_new_fasta(new_seqDict,fasta_file,tree_file,output_file)