This is a phylogeny workflow for batch submission to an HPC cluster, built around the Slurm queue manager.

PHACT

PHACT, PHylogeny-Aware Computation of Tolerance (for amino acid substitutions), is a computational framework based on a Snakemake workflow to predict the functional effects of missense mutations.

The approach exploits phylogenetic tree information to measure the deleteriousness of a given variant. The workflow considers not only amino acid frequencies but also the evolutionary relationships between species, the physicochemical properties of amino acids, and the dependence or independence of substitutions.

[rulegraph.svg] The PHACT workflow for a set of example queries, used to predict the effect of missense mutations.

First-time setup

The following steps are required to run the PHACT workflow:

  1. Install conda & snakemake without superuser permission (a command sketch follows this list)

    • Install miniconda, a package management system

    • Install snakemake, a workflow management system, using mamba

    • Compile and install PAML into a specific PATH, for example, ~/resources/paml4.9j

  2. Configuration of workflow (see below)

  3. Set model parameters (see below)

  4. Check PHACT workflow (see below)
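A minimal sketch of step 1, following the standard miniconda and Snakemake installation routes (the installer file name, paths, and environment name are illustrative; adjust them to your system):

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
$ source $HOME/miniconda3/etc/profile.d/conda.sh
$ conda install -n base -c conda-forge mamba
$ mamba create -c conda-forge -c bioconda -n snakemake snakemake
$ conda activate snakemake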

Genetic database

Configuration

The configuration file (config/config.yml), a dictionary of configuration parameters and their values, makes the workflow flexible. The following global parameters should be set (a minimal example follows the list).

  • workdir indicates the working directory. After cloning the repository, the full path, for example /home/emrah/PHACT, should be set properly.

  • query_fasta is a list of query files in fasta format, for example ["O95153", "O60312"], that the workflow will analyze.

  • archived_path, an optional parameter, specifies the folder into which all outputs of completed analyses are stored when the corresponding bash script available in the repository is used.
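Putting these together, a minimal config/config.yml might look as follows (the archive path is illustrative):

workdir: "/home/emrah/PHACT"
query_fasta: ["O95153", "O60312"]
archived_path: "/home/emrah/PHACT_archive"  # optional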

Meanwhile, the tools and versions used while performing each task are defined in a set of environment YAML files under the workflow/envs folder. Each file is prepared for the conda package manager, which installs and sets up an environment based on its definition.
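These files follow the standard conda environment format. A hypothetical example for the alignment step could look like the following (the actual file names and version pins in workflow/envs may differ):

name: msa
channels:
  - conda-forge
  - bioconda
dependencies:
  - mafft=7.475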

Model parameters

The PHACT framework is based on a Snakemake workflow, which is defined by specifying rules in a Snakefile (workflow/Snakefile). Rules decompose the workflow into small steps such as finding homologues of each query sequence (PSI-BLAST), performing multiple sequence alignment (MAFFT), or generating a maximum-likelihood phylogenetic tree (RAxML-NG, FastTree).

Each rule has its own model parameters, which can be set via a single configuration file (config/config.yml). For example, the method for multiple sequence alignment is selected as follows.

#alignment
mafft_method: "-fftns" # if left empty, FFT-NS-2 is used
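Other per-rule parameters referenced by the shell commands in the Code Snippets section below can be set in the same file. The keys come from those commands; the values here are illustrative, not the repository defaults:

#blast
blast_hit_number: 1000
max_e_value: 1e-5
min_identity: 30
#trimming
trimal_method: "-gappyout"
#ancestral reconstruction
raxmlng_ancestral_model: "LG+G4"
iqtree_ancestral_model: "LG+G"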

Test workflow

To check whether the workflow is defined properly and to estimate the amount of computation remaining, the dry-run parameter is used as follows.

$ cd workflow
$ snakemake --dry-run

It summarizes the total number of jobs to be performed per rule, together with the sets of input and output files to be used and created. For the 2 query files in this example, the 29 jobs listed below will be executed.

Job counts:
    count   jobs
    1       all
    2       fast_tree
    2       get_blasthits
    2       iqtree_anc_score
    2       iqtree_ancestral
    2       ml_tree
    2       msa
    2       psiblastp
    2       query_fasta
    2       raxml_anc_score
    2       raxmlng_ancestral
    2       remove_gaps
    2       trim_msa
    2       unroot_fasttree
    2       unroot_tree
    29

Running the PHACT workflow

The simplest way to run the PHACT workflow on a local computer is as follows.

  1. Clone the PHACT repository and cd into it.

     $ git clone https://github.com/CompGenomeLab/PHACT.git
     $ cd PHACT

  2. Modify the workdir and query_fasta parameters in config/config.yml to be the path of the working directory and the query IDs.

     $ pwd (prints the working directory)

  3. Prepare the environment for snakemake.

     $ conda activate snakemake

  4. Run the workflow and follow the executed rules during the analyses.

     $ snakemake -j 1 --use-conda

PHACT output

The outputs are written under results in subfolders prefixed by query ID. Each query-ID folder contains its own BLAST hits, multiple sequence alignment, maximum-likelihood phylogenetic tree, ancestral sequence reconstruction, and the computed scores. The results/query_id directory has the following structure:

.
├── 1_psiblast
├── 2_msa
├── 3_mltree
├── 4_raxmlng_ancestral
├── 5_raxmlng_ancestral_scores
├── 6_fasttree
├── 7_iqtree_ancestral
└── 8_iqtree_ancestral_scores
8 directories

Scaling the pipeline across an HPC (farm)

In addition to allowing workloads to run on a local computer with a limited number of query IDs, the PHACT framework is designed to analyze large batches of query IDs in parallel using a High Performance Computing center.

Most HPC clusters have a scheduler that decides which workloads run on which compute nodes. To interact with the scheduler, users typically need to prepare a bash script and submit it to the cluster. Snakemake can perform all of this automatically.

The PHACT framework, being based on Snakemake, provides a number of profile files prepared for the SLURM scheduler, the most widely used one today. The Slurm profile files (config/slurm_truba or config/slurm_sabanci) contain the fundamental definitions for executing the workflow on an HPC system with Slurm.

Running snakemake --profile config/slurm_truba is enough: Snakemake searches that directory and loads the profile, which can be reused at any time.
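To illustrate the mechanism only (the actual contents of config/slurm_truba differ), a Snakemake profile is a folder containing a config.yaml that maps command-line options to values, for example:

cluster: "sbatch --cpus-per-task={resources.cpus} --time={resources.walltime}"
jobs: 100
use-conda: true
keep-going: true
restart-times: 1

Here cluster is the job submission command template, and {resources.cpus} and {resources.walltime} are placeholders filled from each rule's resources (the resource names are assumptions, not necessarily the repository's exact keys).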

Tips and tricks for running the workflow

  • Caching between workflows is a very useful feature that Snakemake provides. To avoid redundant computation between workflows, using the cache as follows is recommended, although the implementation should be considered experimental.

     $ export SNAKEMAKE_OUTPUT_CACHE=<THE_PATH_OUTPUT_WILL_BE_CACHED>
     $ snakemake --use-conda --cache --profile ../config/slurm_truba

  • Monitoring the execution via a web URL is possible if a panoptes server is installed and configured properly. Run snakemake with the --wms-monitor http://ephesus.sabanciuniv.edu:5000 parameter.

  • Using the --keep-going parameter is recommended so that independent jobs keep running in case any task fails.

  • The screen command, or executing snakemake in the background, is useful for long runs of the workflow, to avoid trouble in case the connection to the terminal is lost (see the sketch after this list).
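Both options below use standard Linux tools and are not PHACT-specific; the session and log file names are illustrative:

# option 1: run inside a detachable screen session
$ screen -S phact
$ snakemake --profile config/slurm_truba
# detach with Ctrl-a d; reattach later with: screen -r phact

# option 2: run in the background, immune to hangups
$ nohup snakemake --profile config/slurm_truba > phact.log 2>&1 &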

Distribution and reproducibility

We provide a template for a reproducible and scalable research project using Snakemake together with various programming languages (Python, R) and tools managed by the conda package manager. The set of rules implements the entire research pipeline: querying genetic databases for homologues, aligning sequences, trimming the MSA, generating the phylogenetic tree, reconstructing ancestral sequences, and finally computing the score.

To obtain the same results after running the workflow regardless of the platform (local computer, HPC, cloud computing), the framework needs the same set of input files and the same software/tool versions. Since all requirements are defined explicitly, identical results can be reproduced.

Citing this work

Kuru, N., Dereli, O., Akkoyun, E., Bircan, A., Tastan, O., & Adebali, O. (2022). PHACT: Phylogeny-aware computing of tolerance for missense mutations. Molecular Biology and Evolution. https://doi.org/10.1093/molbev/msac114

Data availability

All data obtained during this study are shared at the following URL: https://phact.sabanciuniv.edu/pubs/kuru_mbe_2022/

  • Please see the Publications/Kuru_etal_2022 folder to reproduce the figures in the PHACT manuscript.

Acknowledgement

This work was supported by the Health Institutes of Turkey (TUSEB) (Project no: 4587 to O.A.) and an EMBO Installation Grant (Project no: 4163 to O.A.) funded by the Scientific and Technological Research Council of Turkey (TÜBİTAK). Most of the numerical calculations reported in this paper were performed at the High Performance and Grid Computing Center (TRUBA resources) of TÜBİTAK. The TOSUN cluster at Sabanci University was also used for computational analyses. We also thank Nehircan Özdemir for his artistic illustration of the PHACT approach.

Code Snippets

Shell commands of the main rules (labels added for orientation; see workflow/rules for the full rule definitions):

fast_tree (FastTree):
shell:
    "FastTreeMP {config[fasttree_model]} {input.trimmed_msa}  >  {output.bestTree} 2>{log}"

get_blasthits (filter PSI-BLAST hits):
shell:
    "python3 scripts/parse_blastp.py {input.blastp_out} {config[blast_hit_number]} {config[max_e_value]} {config[min_identity]} {params.blastdb} {input.query_fasta} 2> {log}"

iqtree_ancestral (IQ-TREE ancestral reconstruction):
shell:
    "iqtree2 -redo -s {input.no_gap_msa} -te {input.unrooted_tree} -m {config[iqtree_ancestral_model]} -asr -safe -nt {resources.cpus} --prefix {params.iqtree_ancestral_out_name}"

raxml_anc_score / iqtree_anc_score (compute scores from ancestral probabilities):
shell:
    "query=`python scripts/get_query.py {params.query_fasta}` && Rscript scripts/ComputeScore_Nonnormal_withDiversity_0411.R {input.tree_file} {input.probabilities} {params.fasta} {params.out} $query {config[weights]} 2>{log}"

msa (MAFFT; from line 15 of rules/msa.smk):
shell:
    "mafft{config[mafft_method]} --anysymbol --thread {resources.cpus} {input.fasta} > {output.msa_file} 2> {log}"

psiblastp (PSI-BLAST search):
shell:
    "psiblast -query {input.query_fasta} -db {params.blastdb} -outfmt {config[outfmt]} -out {output.outfile} -max_target_seqs {config[max_target_seqs]} -num_iterations {config[num_iterations]}  2> {log}"

query_fasta (extract the query sequence):
shell:
    "python scripts/make_query_fasta.py {params.query_id}  {params.blastdb} {output.fasta_file} 2> {log}"

raxmlng_ancestral (RAxML-NG ancestral reconstruction):
shell:
    "raxml-ng --ancestral --msa {input.no_gap_msa} --tree {input.unrooted_tree} --model {config[raxmlng_ancestral_model]} --prefix {params.raxml_ancestral_out_name} --threads {resources.cpus}"

remove_gaps (drop query-gap columns from the MSA):
shell:
    "python3 scripts/remove_gaps.py {input.query_file} {input.msa_file} {input.tree_file}  {output.no_gap_file} 2> {log}"

trim_msa (trimAl):
shell:
    "trimal -in {input.msa_file} -out {output.trimmed_msa} {config[trimal_method]} 2> {log}"

unroot_tree / unroot_fasttree (unroot the tree):
shell:
    "python  scripts/unroot_tree.py {input.input_tree}  {input.no_gap_msa_file} 2> {log}"
scripts/ComputeScore_Nonnormal_withDiversity_0411.R (computes the per-position scores from the tree, ancestral probabilities, and MSA; invoked by the score rule above):

library(ape)
library(tidytree)
library(stringr)
library(dplyr)
library(bio3d)
#source("./position_score.R")
args = commandArgs(trailingOnly=TRUE)
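# expected command-line arguments (see the Rscript invocation above):
# 1: newick tree file, 2: ancestral probabilities table, 3: MSA fasta,
# 4: output name prefix, 5: query (human) id, 6: comma-separated weight parameters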

aa_to_num <- function(aa) {
  amino_acids <- c("G", "A", "L", "M", "F", "W", "K", "Q", "E", "S", "P", "V", "I", "C", "Y", "H", "R", "N", "D", "T")
  num <- sapply(aa, function(a){ifelse(sum(amino_acids %in% a) == 1, as.numeric(which(amino_acids %in% a)), 21)})
  # num <- ifelse(sum(amino_acids %in% aa) == 1, as.numeric(which(amino_acids %in% aa)), 21)
  return(num)
}

num_to_aa <- function(num) {
  amino_acids <- c("G", "A", "L", "M", "F", "W", "K", "Q", "E", "S", "P", "V", "I", "C", "Y", "H", "R", "N", "D", "T")
  aa <- ifelse(num == 21, 21, amino_acids[num])
  return(aa)
}

compute_score <- function(file_nwk, file_rst, file_fasta, output_name, human_id, pos_chosen, parameters) {

  # Read tree file
  tr_org <- read.tree(file_nwk)
  x <- read.table(file = file_rst, sep = '\t', header = TRUE, fill = TRUE)
  colnames(x)[4:ncol(x)] <- gsub("p_", replacement = "", x = colnames(x)[4:ncol(x)], fixed = TRUE )
  # Tree_info: node-node, node-leaf connections
  tree_info <- as.data.frame(as_tibble(tr_org))

  # Read fasta file, MSA
  fasta <- read.fasta(file = file_fasta)
  msa <- fasta$ali

  # connections_1: Parent node, connections_2: connected node/leaf
  connections_1 <- tree_info$parent
  connections_2 <- tree_info$node

  # Names of leaves
  names_all <- tr_org[["tip.label"]]
  msa <- msa[names_all, ]
  # Number of total leaves&nodes
  num_leaves <- length(tr_org[["tip.label"]])
  num_nodes <- tr_org[["Nnode"]]

  # Unnecessary
  num_nodes_codeml <- max(connections_1,connections_2)
  if (num_nodes_codeml-num_leaves != num_nodes){
    print("Number of nodes is less than expected: CODEML")
  }

  # Distance between leaves & nodes
  dd_node <- dist.nodes(tr_org)
  dist_leaf <- dd_node[1:num_leaves, 1:num_leaves]
  dist_node <- dd_node[(num_leaves+1):(num_leaves + num_nodes), (num_leaves+1):(num_leaves + num_nodes)]

  # Human position (leaf & node)
  h_name <- human_id
  human_codeml <- names_all[grep(pattern = h_name, x = names_all, fixed = TRUE)]
  leaf_human <- tree_info[which(tree_info$label == human_codeml), "node"]
  human_plc <- leaf_human
  node_human <- tree_info[which(tree_info$label == human_codeml), "parent"]
  nodes_raxml <- as.numeric(gsub(pattern = "Node", replacement = "", x = tree_info[num_leaves+1:num_nodes, "label"])) #Node or Branch
  names(nodes_raxml) <- tree_info[num_leaves+1:num_nodes, "node"]

  # Total number of positions from ancestralProbs file
  total_pos <- max(x$Site)

  # Chosen positions (all or some)
  if (pos_chosen[1] == "all"){
    positions <- 1:total_pos
    score_all <- matrix(0, total_pos, 21)
  } else {
    positions <- pos_chosen
    score_all <- matrix(0, length(positions), 21)
  }

  ####################################################
  ####################################################

  # Connections between leaves & nodes
  chosen_leaves <- tree_info[1:num_leaves,c("parent", "node")]
  # Connections between nodes & nodes
  chosen_nodes <- tree_info[(num_leaves+2):(num_leaves +num_nodes),c("parent", "node")]
  leaf_names <- tree_info$label

  human_leaf_len <- as.double(tree_info[human_plc, "branch.length"])

  d_n <- dist_node[as.character(node_human),] + human_leaf_len
  # d_n <- d_n[nn2[,1]]
  d_l <- dist_leaf[leaf_human,]

  # chosen_nodes2: ordered connections (for probability differences)
  chosen_nodes2 <- matrix(0, num_nodes-1, 2)

  n1 <- as.numeric(chosen_nodes$parent)
  n2 <- as.numeric(chosen_nodes$node)
  dist_f <- d_n[as.character(n1)]
  dist_s <- d_n[as.character(n2)]

  # chosen_nodes2: ordered connections (for probability differences)
  chosen_nodes2[which(dist_f < dist_s), 1] <- n2[which(dist_f < dist_s)]
  chosen_nodes2[which(dist_f < dist_s), 2] <- n1[which(dist_f < dist_s)]

  chosen_nodes2[which(dist_f >= dist_s), 1] <- n1[which(dist_f >= dist_s)]
  chosen_nodes2[which(dist_f >= dist_s), 2] <- n2[which(dist_f >= dist_s)]

  ###########################################################################
  ###############      NEW PART - 15NOV                      ################
  ###########################################################################

  # Number of nodes between nodes & leaf of human
  nodes_conn <- numeric(num_nodes)
  nodes_conn[node_human-num_leaves] <- 1
  names(nodes_conn) <- names(d_n)
  chs <- c()
  chs2 <- c()
  ##########################
  inds <- chosen_nodes2[chosen_nodes2[,2]==(node_human),1]
  nodes_conn[as.character(inds)] <- 2
  chs <- inds

  s0 <- sapply(3:num_leaves, function(i){
    for (j in chs){
      inds <- chosen_nodes2[chosen_nodes2[,2]==j,1]
      if (length(inds)!=0){
        nodes_conn[as.character(inds)] <<- i
        chs2 <- c(chs2, inds)
      }
    }
    chs <<- chs2
    chs2 <- c()
  })

  # Number of nodes between leaves & leaf of human
  leaves_conn <- nodes_conn[as.character(chosen_leaves[,1])]

  #########################################3

  parameters <- unlist(str_split(parameters, pattern = ","))

  score_norm <- t(mapply(function(ps, parameter){position_score(ps, x, msa, num_nodes, num_leaves, total_pos, human_plc, node_human, nodes_raxml, human_leaf_len, dist_node, dist_leaf, parameter, leaves_conn, nodes_conn, chosen_leaves, chosen_nodes2, d_n, d_l)}, rep(positions, length(parameters)), rep(parameters, each = length(positions))))
  score_norm_with_leaf <- matrix(unlist(score_norm[ ,1]), nrow = length(positions) * length(parameters), ncol = 20, byrow = TRUE)
  score_norm_without_leaf <- matrix(unlist(score_norm[ ,2]), nrow = length(positions) * length(parameters), ncol = 20, byrow = TRUE)

  score_norm_with_leaf <- cbind(rep(positions, length(parameters)), score_norm_with_leaf)
  score_norm_without_leaf <- cbind(rep(positions, length(parameters)), score_norm_without_leaf)

  colnames(score_norm_with_leaf) <- c("Pos/AA", num_to_aa(1:20))
  colnames(score_norm_without_leaf) <- c("Pos/AA", num_to_aa(1:20))

  print_wl <- lapply(1:length(parameters), function(p){
    score_to_print_wl <- score_norm_with_leaf[(positions + length(positions)*(p - 1)), ]
    score_to_print_wol <- score_norm_without_leaf[(positions + length(positions)*(p - 1)), ]
    filename <- ifelse(parameters[p] == "0", "max05", parameters[p])
    filename <- ifelse(parameters[p] == "X", "max05_Gauss", parameters[p]) ### New Line (11.11)
    write.csv(score_to_print_wl, sprintf("%s.csv", paste(output_name, "_wl_param_", filename, sep = "")), row.names = FALSE, quote = FALSE)
    write.csv(score_to_print_wol, sprintf("%s.csv", paste(output_name, "_wol_param_", filename, sep = "")), row.names = FALSE, quote = FALSE)
  })

}

position_score <- function(ps, x, msa, num_nodes, num_leaves, total_pos, human_plc, node_human, nodes_raxml, human_leaf_len, dist_node, dist_leaf, parameter, leaves_conn, nodes_conn, chosen_leaves, chosen_nodes2, d_n, d_l) {
  position <- ps

  b1 <- position + total_pos*(0:(num_nodes-1))
  TT <- x[b1,]
  matrix_prob <- matrix(0, num_nodes, 20)

  probs <- data.matrix((TT[, (4:ncol(TT))]))
  rownames(probs) <- NULL
  rr <- aa_to_num(colnames(x)[4:ncol(TT)])
  matrix_prob[,rr] <- probs
  matrix_prob <- matrix_prob[nodes_raxml,]

  position_vec <- msa[, ps]

  position_num <- aa_to_num(position_vec)
  prob_leaves <- matrix(0, num_leaves, 20)
  prob_leaves[cbind(which(position_num <= 20), position_num[which(position_num <= 20)])] <- 1

  gaps <- which(position_num == 21)

  tot_sc <- 1
  mxx <- 1

  diff_leaves <- matrix(0, num_leaves, 20)
  diff_leaves <- prob_leaves - matrix_prob[(chosen_leaves$parent - num_leaves), ]
  diff_leaves[human_plc,]<- -diff_leaves[human_plc,]

  diff_nodes <- matrix(0, num_nodes-1, 20)
  diff_nodes <- matrix_prob[chosen_nodes2[,1] - num_leaves, ] - matrix_prob[chosen_nodes2[,2] - num_leaves, ]


  ################## weights
  weights <- weight_fnc(d_n, d_l, human_plc, parameter, leaves_conn, nodes_conn, mxx) # Updated with related parameters
  weight_leaf <- weights[1:num_leaves]
  weight_node <- tail(weights,num_nodes)
  ####################

  score <- matrix(0,1,20)

  s1 <- sapply(1:20, function(ii){
    dif_pr <- diff_nodes[1:(num_nodes-1),ii]
    dif_pr[dif_pr<0] <- 0
    # CHANGED 29.03
    sel_node <- chosen_nodes2[1:length(dif_pr), 1] - num_leaves
    score[ii] <<- score[ii] + sum(weight_node[sel_node] * dif_pr)
  })

  ### NOVEL 29.03
  aa_f <- position_num[human_plc]
  vect_human <- matrix_prob[node_human - num_leaves,]
  vect_human[aa_f]<-0

  score_without_leaf <- score
  score_without_leaf <- score_without_leaf + weight_node[(node_human-num_leaves)]*vect_human
  score <- score + weight_node[(node_human-num_leaves)]*vect_human

  s2 <- sapply(1:20, function(ii){
    diff_lf <- diff_leaves[1:num_leaves,ii]
    diff_lf[gaps] <-  0
    diff_lf[diff_lf<0] <- 0

    s1 <- sum(weight_leaf[((1:length(diff_lf)) != human_plc )] * diff_lf[((1:length(diff_lf)) != human_plc )])
    score[ii] <<- score[ii] + s1
  })

  aa_f <- position_num[human_plc]
  if (aa_f != 21){
    score[aa_f] <- score[aa_f] + weight_leaf[human_plc]*1
    score_without_leaf[aa_f] <- score_without_leaf[aa_f] + weight_leaf[human_plc]*1
  }

  sum_exc_max <- sum(score_without_leaf)-max(score_without_leaf)
  diversity <- (-(length(which(score_without_leaf<0.0001))*0.1)/20+0.1)*(sum_exc_max)

  scores <- list()
  scores$score_with_leaf <- 1- log((score*0.9 + diversity)/(num_nodes+num_leaves)+10^(-15))/log(10^(-15))
  scores$score_without_leaf <- 1- log((score_without_leaf*0.9 + diversity)/num_nodes + 10^(-15))/log(10^(-15))

  return(scores)
}

weight_fnc <- function(d_n, d_l, human_plc, parameter, leaves_conn, nodes_conn, mxx) {
  # print(parameter)
  if (parameter=="0"){
    d_l_n <- d_l[-human_plc]
    min_l <- min(d_l_n)
    d_l2 <- d_l/min_l
    d_n2 <- d_n/min_l

    d_l2 <- d_l2 +1
    d_n2 <- d_n2 +1
    weight_node <- 1/d_n2
    weight_leaf <- 1/d_l2

  } else if (parameter=="0_MinNode"){
    d_l_n <- d_l[-human_plc]
    min_l <- min(d_n)
    d_l2 <- d_l/min_l
    d_n2 <- d_n/min_l

    d_l2 <- d_l2 +1
    d_n2 <- d_n2 +1
    weight_node <- 1/d_n2
    weight_leaf <- 1/d_l2

  } else if (parameter=="0_MinNode_Mix"){
    d_l_n <- d_l[-human_plc]
    min_l <- min(d_n)
    d_l2 <- d_l/min_l
    d_n2 <- d_n/min_l

    d_l2 <- d_l2 +1
    d_n2 <- d_n2 +1
    weight_node1 <- 1/d_n2
    weight_leaf1 <- 1/d_l2

    param <- mean(c(d_n, d_l))
    weight_leaf2 <- exp(-d_l^2/param^2)
    weight_node2 <- exp(-d_n^2/param^2)

    weight_node<-sqrt(weight_node1*weight_node2)
    weight_leaf<-sqrt(weight_leaf1*weight_leaf2)

  } else if (parameter=="0_MinNode_Mix2"){
    d_l_n <- d_l[-human_plc]
    min_l <- min(d_n)
    d_l2 <- d_l/min_l
    d_n2 <- d_n/min_l

    d_l2 <- d_l2 +1
    d_n2 <- d_n2 +1
    weight_node1 <- 1/d_n2
    weight_leaf1 <- 1/d_l2

    param <- mean(c(d_n, d_l))
    weight_leaf2 <- exp(-d_l^2/param^2)/2
    weight_node2 <- exp(-d_n^2/param^2)/2

    weight_node<-sqrt(weight_node1*weight_node2)
    weight_leaf<-sqrt(weight_leaf1*weight_leaf2)

  } else if (parameter == "mean"){
    param <- mean(c(d_n, d_l))
    weight_leaf <- exp(-d_l^2/param^2)
    weight_node <- exp(-d_n^2/param^2)
  } else if (parameter == "median"){
    param <- median(c(d_n, d_l))
    weight_leaf <- exp(-d_l^2/param^2)
    weight_node <- exp(-d_n^2/param^2)
  } else if (parameter == "X"){
    d_l_n <- d_l[-human_plc]
    min_l <- min(d_l_n)
    min_n <- min(d_n)
    min_ch <- min(min_l, min_n)
    d_l2 <- d_l-min_ch
    d_n2 <- d_n-min_ch
    weight_leaf <- exp(-d_l^2)/2
    weight_node <- exp(-d_n^2)/2
    weight_leaf[human_plc] <- 1
  } else if (parameter == "CountNodes_1"){      ### NewFunction 1 (15Nov)
    weight_node = (exp(-d_n^2) + 1/nodes_conn)/2
    weight_leaf = (exp(-d_l^2) + 1/leaves_conn)/2
  } else if (parameter == "CountNodes_2"){      ### NewFunction 2 (15Nov)
    weight_node = (exp(-d_n^2) + exp(-nodes_conn^2))/2
    weight_leaf = (exp(-d_l^2) + exp(-leaves_conn^2))/2
  } else if (parameter == "CountNodes_3"){      ### NewFunction 3 (15Nov)
    weight_node = sqrt(exp(-d_n^2)*1/nodes_conn)
    weight_leaf = sqrt(exp(-d_l^2)*1/leaves_conn)
  } else if (parameter == "CountNodes_4"){      ### NewFunction 4 (15Nov)
    weight_node = exp(-(sqrt(d_n*nodes_conn))^2)
    weight_leaf = exp(-(sqrt(d_l*leaves_conn))^2)
  } else if (parameter == "Equal"){

    weight_leaf <- matrix(1,1,length(d_l))
    weight_node <- matrix(1,1,length(d_n))

  } else if (parameter == "MinThreshold"){
    max_dis <- max(c(d_l, d_n))
    # NOTE: param_min is not an argument of weight_fnc; it is assumed to be defined in the calling environment
    param_min <- as.double(param_min)
    weight_node <- (-1+param_min)/max_dis*d_n + 1
    weight_leaf <- (-1+param_min)/max_dis*d_l + 1
  } else if (parameter == "MinThreshold_Gauss"){
    max_dis <- max(c(d_l, d_n))
    param_min <- as.double(param_min)
    param <- sqrt(-max_dis^2/log(param_min))

    weight_leaf <- exp(-d_l^2/param^2)
    weight_node <- exp(-d_n^2/param^2)

  } else {
    param <- as.double(parameter)
    weight_leaf <- exp(-d_l^2/param^2)
    weight_node <- exp(-d_n^2/param^2)
  }
  weights = c(weight_leaf, weight_node)


  return(weights)
}

csv_file <- compute_score(file_nwk=args[1],file_rst=args[2],file_fasta=args[3], output_name=args[4],human_id=args[5],'all', parameters = args[6])
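
A typical standalone invocation mirrors the score rule shown earlier (all file names and the query id here are illustrative):

$ Rscript scripts/ComputeScore_Nonnormal_withDiversity_0411.R tree.nwk ancestral_probs.tsv msa.fasta out_prefix sp_O95153_HUMAN_9606 mean,median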
scripts/get_query.py (prints the query identifier '<accession>_<taxid>' parsed from the query fasta header; used by the score rule above):

import re
import sys

def get_human_id(fasta_file):
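    # scan the fasta headers and build '<accession>_<OX taxid>',
    # replacing any non-word characters with underscores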
    with open(str(fasta_file),'r') as f:
        for line in f:
            if line.startswith('>'):
                line = line[0] + line[1:].replace('>', '-') 
                header = line.split(">")[1].strip()
                human_id = header.split(" ")[0]+"_"+header.split("OX=")[1].split(" ")[0]
                human_id = re.sub(r'[^\w>]', '_',human_id)
    print(human_id)
    return human_id

get_human_id(sys.argv[1])
scripts/make_query_fasta.py (extracts the fasta record of the given accession from the database fasta file):

import sys


def make_query_fasta(acc_no,all_eu_file,output_path):

    protein_dict = {}

    with open(all_eu_file,'r') as f:
        for line in f:
            if line.startswith('>'):
                line = line[0] + line[1:].replace('>', '-') 
                header = line.split(">")[1].strip()
                if header.split("|")[1].split("|")[0] == acc_no:
                    line = next(f, '>').strip()  # default '>' guards against StopIteration at end of file
                    protein_dict[header] = ''
                    while not line.startswith('>'):
                        protein_dict[header] += line
                        line = next(f, '>').strip()

    with open(output_path,'w') as f:
        for header,sequence in protein_dict.items():
            f.write(">"+header+"\n" + sequence + "\n")


if __name__ == "__main__":
    acc_no = sys.argv[1]
    all_eu_file = sys.argv[2]
    output_path = sys.argv[3]
    make_query_fasta(acc_no,all_eu_file,output_path)
scripts/parse_blastp.py (filters PSI-BLAST hits by hit number, e-value, and identity, then writes the selected sequences to fasta):

import sys
import re


def get_fasta_dict(fasta_file):
    fasta_dict = {}

    with open(fasta_file,'r') as f:
        for line in f:
            if line.startswith('>'):
                line = line[0] + line[1:].replace('>', '-') 
                header = line.split(">")[1].strip()
                fasta_dict[header] = ''
            else:
                fasta_dict[header] += line.strip()


    return fasta_dict


def parse_blastp_out(blastp_out_file,blast_hit_number,max_e_value,min_identity):
    blast_list = []
    with open(blastp_out_file,'r') as filein_:
        for line in filein_:
            if len(blast_list) <= int(blast_hit_number):
                if line.startswith(">"):
                    best_hit = line.split(">")[1].strip()
                    best_hit = best_hit.split(" ")[0]
                    while not line.startswith(" Score"):
                        line = next(filein_)
                        if line.startswith(" Score"):
                            score = float(line.split("Score = ")[1].split(" bits")[0])
                            e_value = float(line.split("Expect = ")[1].split(",")[0])
                            identity_line = next(filein_)
                            identity = float(identity_line.split("Identities = ")[1].split("(")[1].split("%")[0])
                            if e_value < float(max_e_value) and identity>float(min_identity):
                                blast_list.append(best_hit)

    return blast_list

def write_to_file(blast_list,all_eu_file,blastp_out_file,query_fasta):
    query_fasta_dict = get_fasta_dict(query_fasta)
    all_eu_dict = get_fasta_dict(all_eu_file)
    all_eu_keys_list = []
    for k,v in all_eu_dict.items():
        all_eu_keys_list.append(k.split(" ")[0].split("|")[1])
    if set(all_eu_keys_list).issuperset(set(blast_list)):
        with open(blastp_out_file.split(".")[0]+".fasta",'w') as f:
            for k,v in all_eu_dict.items():
                if k.split(" ")[0].split("|")[1] in blast_list:
                    new_k = k.split(" ")[0]+"_"+k.split("OX=")[1].split(" ")[0]
                    new_k = re.sub(r'[^\w>]', '_',new_k)
                    f.write(">"+new_k + "\n" + v +"\n")
            for h,s in query_fasta_dict.items():
                if h.split(" ")[0].split("|")[1] in blast_list:
                    pass
                else:
                    new_h = h.split(" ")[0]+"_"+h.split("OX=")[1].split(" ")[0]
                    new_h = re.sub(r'[^\w>]', '_',new_h)
                    f.write(">"+new_h + "\n" + s +"\n")


if __name__ == "__main__":
    blastp_out_file = sys.argv[1]
    blast_hit_number = sys.argv[2]
    max_e_value = sys.argv[3]
    min_identity = sys.argv[4]
    blastdb_file = sys.argv[5]
    query_fasta = sys.argv[6]
    blast_list =  parse_blastp_out(blastp_out_file,blast_hit_number,max_e_value,min_identity)
    write_to_file(blast_list,blastdb_file,blastp_out_file,query_fasta)
scripts/remove_gaps.py (removes MSA columns that are gaps in the query and writes the degapped sequences restricted to tree leaves):

import sys
import re
from fasta_dict import *
from ete3 import Tree

def get_human_id(query_file):
    query_dict = get_fasta_dict(query_file)
    for k,v in query_dict.items():
        human_id = k.split(" ")[0]+"_"+k.split("OX=")[1].split(" ")[0]
        human_id = re.sub(r'[^\w>]', '_',human_id)
    #print(human_id)
    return human_id

def get_gap_positions(seqDict,human_id):
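    # NOTE: assumes exactly one key contains human_id; otherwise 'indices' may be
    # unbound or reflect only the last matching sequence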

    for key, value in seqDict.items():
        if human_id  in key:
            #print(value)
            indices = [i for i, x in enumerate(value) if x == "-"]
            #print(indices)
    return indices

def get_sequences_without_gap(seqDict,indices):
    new_seqDict = {}
    for k,v in seqDict.items():
        new_seqDict[k] = ''.join([v[i] for i in range(len(v)) if i not in indices])
        #print(new_seqDict)  
    return new_seqDict


def write_new_fasta(new_seqDict,fasta_file,tree_file,output_file):
    t = Tree(tree_file,format=1)
    leaves = []
    for leaf in t:
        leaves.append(leaf.name)
    with open (output_file,'w') as new_file:
        #print(new_file)
        for newheader,value in  new_seqDict.items():
            if newheader in leaves:
                new_value = re.sub(r'[BXJZUO#$]', '-',value)
                if not all([c=="-" for c in new_value]):
                    new_file.write(">" +newheader+"\n" + new_value+ "\n")



if __name__ == "__main__":
    query_file = sys.argv[1]
    fasta_file= sys.argv[2]
    tree_file= sys.argv[3]
    output_file= sys.argv[4]
    human_id= get_human_id(query_file)
    seqDict = get_fasta_dict(fasta_file)
    gap_indices = get_gap_positions(seqDict,human_id)
    new_seqDict = get_sequences_without_gap(seqDict,gap_indices)
    write_new_fasta(new_seqDict,fasta_file,tree_file,output_file)