Spliced peptide identification from in vitro digestions of polypeptides with purified proteasomes


Please cite the following publication if you are using invitroSPI for your research:

Roetschke, H.P., Rodriguez-Hernandez, G., Cormican, J.A. et al. InvitroSPI and a large database of proteasome-generated spliced and non-spliced peptides. Sci Data 10, 18 (2023). https://doi.org/10.1038/s41597-022-01890-6

overview

The invitroSPI pipeline consists of five main steps that are implemented in a Snakemake workflow:

  1. parsing of search result files and creation of a preliminary MS database (MSDB) containing all peptide-spectrum matches (PSMs), mapping of peptides to their substrate origins, and potential product type re-assignment

  2. scoring and filtering of PSMs using Mascot's ion score and q-value as well as the delta score described in the manuscript

  3. identification and, optionally, removal of synthesis errors using the control runs

  4. mapping of peptides to the substrate sequence accounting for potential multi-mappers

  5. database statistics and diagnostic information
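The delta-score criterion applied in step 2 can be illustrated with a toy calculation (all score values below are invented; the pipeline reads the actual ion scores from the Mascot result files):

```r
# Illustration of the delta-score criterion (invented scores, not pipeline code).
delta_score <- 0.3      # default threshold, see INPUT/config.yaml
topScore <- 45          # ion score of the top-ranked PSM
candidateScore <- 40    # ion score of a lower-ranked candidate PSM

# A lower-ranked candidate is considered too close to the top hit when
# 1 - candidateScore/topScore <= delta_score; such ambiguous spectra are discarded.
ambiguous <- (1 - candidateScore / topScore) <= delta_score
```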

Additionally, the code for computing all possible spliced and non-spliced peptides, which is used for the Mascot search, is provided in SOURCE/_generateSearchDB_polypeptides.R. We also include a script containing useful functions for downstream analyses ( invitroSPI_utils.R ).
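As a rough illustration of what such a search database contains: non-spliced peptides are substrings of the substrate, while cis-spliced peptides are ligations of two non-overlapping substrate fragments in substrate order. The sketch below is not the repository script; the length bounds (5-12 residues) and the restriction to forward cis splicing are illustrative assumptions:

```r
# Minimal sketch of candidate peptide enumeration (illustrative, not
# SOURCE/_generateSearchDB_polypeptides.R). Length limits are assumptions.
substrate <- "RTKAWNRQLYPEW"
n <- nchar(substrate)

# non-spliced peptides (PCP): all substrings within the length range
pcp <- unlist(lapply(1:n, function(i)
  sapply(i:min(n, i + 11), function(j) substr(substrate, i, j))))
pcp <- unique(pcp[nchar(pcp) >= 5])

# cis-spliced peptides (PSP): two non-overlapping splice reactants
# ligated in substrate order (forward cis splicing only, for brevity)
psp <- c()
for (i1 in 1:n) for (j1 in i1:n) {
  if (j1 + 1 > n) next
  for (i2 in (j1 + 1):n) for (j2 in i2:n) {
    pep <- paste0(substr(substrate, i1, j1), substr(substrate, i2, j2))
    if (nchar(pep) >= 5 && nchar(pep) <= 12) psp <- c(psp, pep)
  }
}
psp <- unique(psp)
```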

execution

invitroSPI relies on Conda and Snakemake. To install Conda, follow the installation guidelines for your operating system on the Conda website.
After installing Conda, you need to install Snakemake; the installation procedure is described in the Snakemake documentation.

Briefly, open the terminal on your computer and paste the following lines sequentially:
conda install -n base -c conda-forge mamba
conda activate base
mamba create -c conda-forge -c bioconda -n snakemake snakemake
Additionally, you might need to run conda update conda .

Download this repository as a .zip file (click on Code in the upper right corner of this repository --> Download ZIP), move the .zip file to the desired directory on your computer and unpack it. Make sure to edit INPUT/sample_list.csv and INPUT/config.yaml before starting the data processing (see below for a more detailed description). Open the terminal in this directory and enter: conda activate snakemake .

The pipeline can be executed by pasting snakemake --use-conda --cores all -R createMSDB into the terminal. The progress of the pipeline execution should appear in your terminal window. In case you have installed an older version of Conda/Snakemake and encounter an error when executing the pipeline, try executing snakemake --use-conda --cores all -R createMSDB --conda-frontend conda .

After your jobs have finished, enter conda deactivate to deactivate your Conda environment.

input

invitroSPI identifies spliced and non-spliced peptides from Mascot search result files. To this end, the user must provide a table sample_list.csv in the INPUT/ folder containing information about:

  • project name

  • substrate ID

  • substrate sequence

  • time point

  • search result file name

  • replicate

  • MSfile (optional)

  • metainformation (optional)

!!! If you create or edit the sample_list.csv file in Microsoft Excel, make sure to save it as an actual comma-separated file, i.e., Save as... --> Comma-separated Values (.csv). To check that the sample list is in the correct format, open it in a text editor and verify that the columns are separated by commas (and NOT semicolons). !!!

Additionally, the user must provide search result files deposited in the folder INPUT/search_results/ and list their names in the sample_list.csv table.

An example of the sample_list.csv table is given below and can be modified accordingly by the user:

project_name substrateID substrateSeq digestTime filename replicate MSfile metainformation
test_data TSN2 TSN2.fasta 4 F029125.csv 1 INPUT/metainformation_testData.csv
test_data TSN89 RTKAWNRQLYPEW 4 F029129.csv 1 INPUT/metainformation_testData.csv
test_data TSN2 VSRQLRTKAWNRQLYPEWTEAQR CTRL F029123.csv 1 INPUT/metainformation_testData.csv
test_data TSN89 RTKAWNRQLYPEW CTRL F029127.csv 1 INPUT/metainformation_testData.csv

You can either paste the substrate sequence in the sample_list directly or put the name of a single-entry .fasta file containing the substrate sequence. This file should be located in INPUT/sequences .

Note that the filenames of the control files must also be provided. For the control files, put CTRL in the digestTime column. Please provide all other time points in hours ( e.g. , 0.5 for 30 min).
MSfile is an optional column that might be helpful to keep track of the .raw/.mgf files that were used to generate the respective search result files. It will not be used to create the database.
In case you would like to include additional information in the final database (for instance species, location, instrument, substrate origin, proteasome isotype, ...), you could do so by creating a .csv table containing all this information. The .csv table must also contain a column filename corresponding to the filename in the sample list (see INPUT/metainformation_SciData2022 for an example). Provide the name of the metainformation table in the metainformation column of the sample list.
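Before launching the pipeline, it can be worth sanity-checking the sample list in R. The snippet below is a hypothetical helper, not part of invitroSPI; it builds a toy sample list in memory using the column names from the table above and checks that every substrate has a control run:

```r
# Hypothetical sanity check for a sample list (not part of the pipeline).
# In practice you would read INPUT/sample_list.csv with read.csv().
sample_list <- data.frame(
  project_name = "test_data",
  substrateID  = c("TSN2", "TSN2"),
  substrateSeq = c("VSRQLRTKAWNRQLYPEWTEAQR", "VSRQLRTKAWNRQLYPEWTEAQR"),
  digestTime   = c("4", "CTRL"),
  filename     = c("F029125.csv", "F029123.csv"),
  replicate    = 1,
  stringsAsFactors = FALSE
)

# required columns (MSfile and metainformation are optional)
required <- c("project_name", "substrateID", "substrateSeq",
              "digestTime", "filename", "replicate")
stopifnot(all(required %in% names(sample_list)))

# every substrate should come with at least one control digestion
hasCtrl <- tapply(sample_list$digestTime, sample_list$substrateID,
                  function(x) any(x == "CTRL"))
if (!all(hasCtrl)) warning("substrates without a CTRL run found")
```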

The invitroSPI workflow is constructed in such a way that the user can keep appending projects and search result files to the sample_list . However, only the samples of the current project (which is specified in INPUT/config.yaml ) are processed and stored separately in OUTPUT/project_name subdirectories.

output

The pipeline provides a final database of identified spliced and non-spliced peptides in .csv format ( OUTPUT/project_name/ProteasomeDB.csv ) as well as all intermediate files in binary format ( OUTPUT/project_name/tmp/ ).

Additionally, some diagnostic plots and database statistics are produced, which can also be found in the OUTPUT/ folder. They comprise:

  • number of unique peptides

  • number and frequencies of peptides and PSMs at each time point

  • length distribution of peptides, splice reactants and intervening sequences at each time point
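These statistics boil down to simple aggregations over the final database. The following sketch uses toy rows with invented sequences; only the column names mirror ProteasomeDB.csv:

```r
# Toy version of the database statistics (invented rows; column names
# follow ProteasomeDB.csv).
db <- data.frame(
  pepSeq      = c("RTKAW", "KAWNR", "RTKAW", "RTNRQ"),
  productType = c("PCP", "PCP", "PCP", "PSP"),
  digestTime  = c(4, 4, 20, 4),
  stringsAsFactors = FALSE
)

# number of unique peptides overall and per time point
nUnique <- length(unique(db$pepSeq))
perTime <- tapply(db$pepSeq, db$digestTime, function(x) length(unique(x)))

# peptide length distribution over unique peptides
lenDist <- table(nchar(unique(db$pepSeq)))
```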

parameters that can be modified by the user

We are using the following default parameters that are specified in the INPUT/config.yaml file and that can be changed by the user:

  • project_name : indicates which files listed in the sample_list.csv should be processed (enter valid directory names only! - no spaces or German umlauts)

  • delta_score : 0.3

  • ion_score : 20

  • q_value : 0.05

  • include_scanNum : "yes"

  • keep_synErrors : "no"

Processing of scan numbers is only possible if .mgf files were created with msconvert or Mascot Distiller. If you provide search results in another format (not recommended), please set include_scanNum to "no". If you would like to include synthesis errors (labelled as such) in the final ProteasomeDB , change the keep_synErrors flag accordingly.
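For reference, a config.yaml using the defaults above might look like this (the project name is a placeholder and must match the project_name column in your sample list):

```yaml
project_name: "test_data"   # must match project_name in sample_list.csv
delta_score: 0.3
ion_score: 20
q_value: 0.05
include_scanNum: "yes"
keep_synErrors: "no"
```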

Code Snippets

library(plyr)
library(dplyr)
library(seqinr)
library(stringr)

source("SOURCE/loadFunctions.r")

print("--------------------------------------------")
print("1) PARSE SEARCH RESULT FILES AND CREATE MSDB")
print("--------------------------------------------")

project_name = snakemake@params[["project_name"]]
include_scanNum = snakemake@params[["include_scanNum"]]

SEARCHRESULTS_PATH = "INPUT/search_results/"


### INPUT ###
sample_list = read.csv(file = snakemake@input[["sample_list"]],
                       sep = ",", header = T, stringsAsFactors = F)

# sample_list = read.table(file = "INPUT/sample_list.csv",
#                          sep = ",", header = T, stringsAsFactors = F)

# filter sample list
sample_list = sample_list[sample_list$project_name == project_name, ]

# add meta-information
if (!any(is.na(sample_list$metainformation)) & length(unique(sample_list$metainformation)) == 1) {
  metainfo = read.csv(sample_list$metainformation[1], stringsAsFactors = F)
  meta = T
} else {
  meta = F
}

### MAIN PART ###

if (!dir.exists(paste0("OUTPUT/", project_name))) {
  dir.create(paste0("OUTPUT/", project_name))
}

columns = c(            "runID" ,
                        "substrateID" ,
                        "substrateSeq" ,
                        "digestTime" ,
                        "pepSeq" ,
                        "productType",
                        "spliceType",
                        "positions",
                        "scanNum",
                        "rank",
                        "ionScore",
                        "qValue",
                        "charge",
                        "PTM")

MSDB = list()
pb = txtProgressBar(min = 0, max = nrow(sample_list), style = 3)

for (i in 1:nrow(sample_list)) {
  # setTxtProgressBar(pb, i)

  # load search result file
  con = file(paste0(SEARCHRESULTS_PATH, sample_list$filename[i]),"r")
  i1 = readLines(con)
  close(con)

  # remove header from search result file
  # and remove not assigned queries if there are any
  start = grep("prot_hit_num",i1)

  currentSearchFile = read.table(paste(SEARCHRESULTS_PATH, sample_list$filename[i],sep="/"),
                                 skip = start-1,
                                 sep = ",", header = TRUE, fill = TRUE)

  kk = which(currentSearchFile$prot_hit_num == "")
  if (length(kk) > 0) {
    kk = min(kk)
    currentSearchFile = currentSearchFile[c(1:(kk-2)),]
  }

  substrateID = sample_list$substrateID[i]
  digestTime = sample_list$digestTime[i]


  if(nrow(currentSearchFile) > 0){

    # empty DB for current search result file
    currentDB = data.frame(matrix(ncol = length(columns), nrow = nrow(currentSearchFile)))
    colnames(currentDB) = columns

    # fill data frame with info straight outta .csv
    currentDB$charge <- currentSearchFile$pep_exp_z
    currentDB$rank <- currentSearchFile$pep_rank
    currentDB$pepSeq <- currentSearchFile$pep_seq
    currentDB$ionScore <- currentSearchFile$pep_score
    currentDB$qValue <- currentSearchFile$pep_expect


    currentDB$PTM <- currentSearchFile$pep_var_mod
    currentDB$scanNum <- currentSearchFile$pep_query

    # currentDB[which(duplicated(currentDB) | duplicated(currentDB, fromLast = T)),] %>% nrow()
    # cntS = currentSearchFile[, c("pep_exp_z","pep_rank","pep_seq", "pep_score", "pep_expect", "pep_var_mod", "pep_query")]
    # cntS[which(duplicated(cntS) | duplicated(cntS, fromLast = T)),] %>% nrow()

    #get actual scan numbers, not query numbers
    # varies between Mascot and Mascot Distiller
    if(include_scanNum != "no") {

      warn = F

      scans = rep(NA,dim(currentSearchFile)[1])
      for(ii in 1:dim(currentSearchFile)[1]){
        tit = currentSearchFile$pep_scan_title[ii]

        if (str_detect(tit, pattern = "scan=")) {

          scans[ii] = as.numeric(strsplit(strsplit(tit,split="scan=")[[1]][2],split="~")[[1]][1])

        } else if (str_detect(tit, pattern = "Scan ")) {

          scans[ii] = as.numeric(unlist(strsplit(tit, " "))[3])

        } else if (str_detect(tit, pattern = coll(".")) | str_detect(tit, pattern = "TSN")) {
          scans[ii] = as.numeric(strsplit(tit,split="\\.")[[1]][2])

        } else {
          warn = T
        }

      }
      currentDB$scanNum <- scans

      if (warn) {
        print("could not extract scan numbers - check search result formatting!")
        currentDB$scanNum = currentSearchFile$pep_query
      }

    }


    #extract product type and position information
    pepName <- as.character(currentSearchFile$prot_acc)
    pepType <- c()
    pepPos <- c()
    for (z in 1:length(pepName)){
      splitPepName <- unlist(strsplit(pepName[z], "_"))
      pepType <- c(pepType, splitPepName[1])
      splitPepName <- splitPepName[-1]
      pepPos <- c(pepPos, paste(splitPepName, collapse = "_"))

    }

    currentDB$productType <- pepType
    currentDB$positions <- pepPos

    # substrate sequence
    if (grepl(pattern = ".fasta", sample_list$substrateSeq[i])) {

      cntSeq = read.fasta(file = paste0("INPUT/sequences/", sample_list$substrateSeq[i]),
                          seqtype = "AA", strip.desc = T)[[1]] %>%
        unlist() %>%
        paste(collapse = "")

    } else {
      cntSeq = sample_list$substrateSeq[i]
    }

    # add metainfo
    currentDB$substrateID = substrateID
    currentDB$substrateSeq <- cntSeq
    currentDB$digestTime <- digestTime
    currentDB$runID = paste(substrateID, digestTime,sample_list$filename[i], sample_list$replicate[i], sep = "-")

    # add further meta-information if provided
    if (meta) {

      cntMeta = metainfo[metainfo$filename == sample_list$filename[i], ]
      currentDB = cbind(currentDB, cntMeta[rep(1,nrow(currentDB)),]) %>%
        as.data.frame()

      dp = which(duplicated(names(currentDB)))
      if (length(dp) > 0) { currentDB = currentDB[,-dp]}

    }

    # assign substrate hits as PCP
    substratehits = which(currentDB$substrateSeq == currentDB$pepSeq)
    if (length(substratehits) > 0) {
      currentDB$productType[substratehits] = "PCP"
      currentDB$positions[substratehits] = paste(1,
                                                 nchar(currentDB$substrateSeq[substratehits]),
                                                 sep = "_")
    }

    print("")
    paste0("obtain correct annotations for positions and splice types for run: ",
           currentDB$runID[1]) %>%
      print()

    currentDB = currentDB %>% mapping()
    currentDB$spliceType[currentDB$productType == "PCP"] = NA

  } else {
    currentDB = data.frame(matrix(ncol = length(columns), nrow = 0))
    colnames(currentDB) = columns
    print("No peptides listed in this file! Skipping...")
  }

  # append data frame for current search file to master data frame
  MSDB[[i]] = currentDB

}

MSDB = plyr::ldply(MSDB)

### OUTPUT ###
save(MSDB, file = unlist(snakemake@output[["MSDB"]]))
library(stringr)
library(plyr)
library(dplyr)

print("------------------------------------------------")
print("2) EXTRACT HIGH-QUALITY PSMs FROM SEARCH RESULTS")
print("------------------------------------------------")

project_name = snakemake@params[["project_name"]]
delta_score = snakemake@params[["delta_score"]]
ion_score = snakemake@params[["ion_score"]]
q_value = snakemake@params[["q_value"]]

print(paste0("Delta score: ", delta_score))
print(paste0("Mascot ion score: ", ion_score))
print(paste0("q-value: ", q_value))


### INPUT ###
sample_list = read.csv(file = snakemake@input[["sample_list"]],
                       sep = ",", header = T, stringsAsFactors = F)
# filter sample list
sample_list = sample_list[sample_list$project_name == project_name, ]

load(snakemake@input[["MSDB"]])


### MAIN PART ###
columns = names(MSDB)
runIDs = unique(MSDB$runID)

allPSMs = list()
allPSMs_Delta = list()

pb = txtProgressBar(min = 0, max = length(runIDs), style = 3)
for (i in 1:length(runIDs)) {

  setTxtProgressBar(pb, i)

  selected <- data.frame(matrix(ncol = length(columns), nrow = 0))
  colnames(selected) = columns

  selectedDeltaRecord = data.frame(matrix(ncol = length(columns), nrow = 0))
  colnames(selectedDeltaRecord) <- columns

  runIDTable = MSDB[MSDB$runID == runIDs[i],]
  scanNum = as.character(runIDTable$scanNum) %>% unique()

  # bah = sapply(scanNum, function(x){
  #   all(runIDTable$rank[runIDTable$scanNum == x] > 1)
  # }) %>% which()
  # if (length(bah) > 0) {
  #   print("THERE ARE PSMS WITHOUT A TOP-RANKED PEPTIDE!!")
  # }


  # Iterate through scanNum and sort by rank
  for (k in 1:length(scanNum)) {

    PSPcandidatesIndex = NULL
    PCPcandidatesIndex = NULL

    deltaRecord = data.frame(matrix(ncol = length(columns), nrow = 0))
    colnames(deltaRecord ) = columns

    scanNumTable = runIDTable[runIDTable$scanNum == scanNum[k],]
    scanNumTable = scanNumTable[order(scanNumTable$rank),]

    # # How many entries in filteredScans are rank 1?
    scanNumTable = unique(scanNumTable)
    filteredScans = scanNumTable
    filteredScans = filteredScans[order(filteredScans$rank),]

    # print(filteredScans)

    # extract the top score
    topScore <- filteredScans[1, "ionScore"]

    # extract scans with rank 1
    allRanks = table(filteredScans$rank)
    NoTopRankedScans <- allRanks[names(allRanks) == 1]

    # it can somehow happen that there is no top-ranked peptide...
    if (length(NoTopRankedScans) == 0) {
      filteredScans <- filteredScans[0,]
      paste0("!!! NO RANK 1 PEPTIDES FOUND IN SCAN ", scanNum[k], ", RUN-ID ", runIDs[i], " !!!") %>%
        print()
      print("likely, there is an issue with the Mascot search settings - check!")

    } else if (NoTopRankedScans > 1) {

      topIndices = which(filteredScans$rank == 1)
      NoTopRankedPeps = gsub("I","L",filteredScans$pepSeq[topIndices]) %>%
        unique() %>%
        length()


      PCPindex = which(filteredScans$productType[filteredScans$rank == 1]=="PCP")
      PSPindex = which(filteredScans$productType[filteredScans$rank == 1]=="PSP")

      # if more than one peptide is rank 1
      # keep scan only if there is just a single PCP, otherwise discard
      if (NoTopRankedPeps > 1) {

        if (length(PCPindex) >= 1) {

          if (length(unique(gsub("I","L",filteredScans$pepSeq[PCPindex]))) == 1) {
            filteredScans = filteredScans[PCPindex[1],]
          } else {
            filteredScans = filteredScans[0,]
          }

          } else {

            if (length(PSPindex) > 1) {
              index = which(filteredScans$productType[which(filteredScans$rank == 1)]=="PSP")
              PSPcandidatescores <- filteredScans[index, "ionScore"]
              ind = which(1-PSPcandidatescores/topScore <= delta_score)
              deltaRecord = rbind(deltaRecord,filteredScans[index[ind],])
            }

            filteredScans = filteredScans[0,]

        }

      } else {
        filteredScans = filteredScans[1,]
      } 

    }
    # if there is only one scan on rank 1, there is also only one peptide

    # if there is only a single scan on rank 1
    # (nrow(filteredScans) will be larger than 1 since the previous if condition did not apply)
    # if the 1st rank is a PCP, assign it as such
    else if (NoTopRankedScans == 1 & filteredScans[1, "productType"] == "PCP") {

      filteredScans <- filteredScans[1,]

      # If rank 1 entry is PSP check if a likely PCP is present in the lower ranks (PCPcandidates)
      # also check if there are lower-ranked PSPs
    } else if (NoTopRankedScans == 1 & filteredScans[1, "productType"] == "PSP") {

      # lower-ranked PCPs and PSPs
      PCPcandidatesIndex_temp <- which(as.numeric(filteredScans$rank) > 1 & as.character(filteredScans$productType) == "PCP")
      PSPcandidatesIndex_temp <- which(as.numeric(filteredScans$rank) > 1 & as.character(filteredScans$productType) == "PSP")

      # if there are no PSPs or PCPs with a lower rank, assign the PSP
      if(length(PCPcandidatesIndex_temp)==0 & length(PSPcandidatesIndex_temp)==0){
        filteredScans <- filteredScans[1,]
      }

      # if there are PSPs with a lower rank --> get highest-ranked PSM
      if (length(PSPcandidatesIndex_temp)>0) {
        #keeps only top PSP candidate
        PSPcandidatesIndex = which(as.numeric(filteredScans$rank)==min(as.numeric(filteredScans$rank)[PSPcandidatesIndex_temp]) & as.character(filteredScans$productType) == "PSP")[1]
        # keeps all PSP candidates
        PSPcandidatesIndexAll = PSPcandidatesIndex_temp
      } else {
        PSPcandidatesIndex = NULL
      }

      # if there are PCPs with a lower rank --> get highest-ranked PSM
      if (length(PCPcandidatesIndex_temp)>0) {
        PCPcandidatesIndex = which(as.numeric(filteredScans$rank)==min(as.numeric(filteredScans$rank)[PCPcandidatesIndex_temp]) & as.character(filteredScans$productType) == "PCP")
      } else {
        PCPcandidatesIndex = NULL
      }

      # for lower-ranked PCPs --> calculate Delta score
      # Keep top ranked PSP if deltascore > 0.3 and remove rest
      if (length(PCPcandidatesIndex) > 0) {


        PCPcandidatescore <- filteredScans[PCPcandidatesIndex[1], "ionScore"]
        # if the Delta score between the top-ranked PSP and a lower-ranked PCP
        # is larger than the threshold
        if (1 - PCPcandidatescore / topScore > delta_score) {

          # if there are any lower-ranked PSPs
          # Calculate delta_score scores also for lower-ranked PSPs
          if (length(PSPcandidatesIndex)>0) {
            PSPcandidatescore <- filteredScans[PSPcandidatesIndex, "ionScore"]

            # lower-ranked PSP, but with very low ion score --> assign top-ranked PSP
            if (1 - PSPcandidatescore / topScore > delta_score) {
              filteredScans <- filteredScans [1,]

              # discard spectrum bc there are several PSP candidates with similar
              # ion score
            } else if (1 - PSPcandidatescore / topScore <= delta_score) {

              PSPcandidatescores <- filteredScans[PSPcandidatesIndexAll, "ionScore"]
              ind = which(1-PSPcandidatescores/topScore <= delta_score)
              deltaRecord = rbind(deltaRecord,filteredScans[PSPcandidatesIndexAll[ind],])

              filteredScans <- filteredScans[0,]
            }

            # if there are no lower-ranked PSPs --> assign the top-ranked PSP
          } else if (length(PSPcandidatesIndex)==0){
            filteredScans <- filteredScans[1,]
          }

          # if the Delta score is too small:
          # Take candidate PCP as new top ranked and remove rest
        } else if (1 - PCPcandidatescore / topScore <= delta_score) {
          if (length(PCPcandidatesIndex) == 1) {
            filteredScans <- filteredScans[PCPcandidatesIndex,]
          } else {

            # more than one lower-ranked PCP that survived Delta filter
            if (length(unique(gsub("I","L",filteredScans$pepSeq[PCPcandidatesIndex]))) == 1) {
              filteredScans <- filteredScans[PCPcandidatesIndex[1],]
            } else {
              filteredScans = filteredScans[0,]
            }

          }

        }

        # end if there are both PCPs and PSPs in lower ranks
        # if there are lower ranked PSPs but no PCPs: continue

      } else if (length(PSPcandidatesIndex) > 0 & length(PCPcandidatesIndex)==0){
        PSPcandidatescore <- filteredScans[PSPcandidatesIndex, "ionScore"]

        # keep the candidate PSP if the Delta score is large enough
        if (1 - PSPcandidatescore / topScore > delta_score) {
          filteredScans <- filteredScans [1,]

          # if the Delta score is too small --> discard entire scan
        } else if (1 - PSPcandidatescore / topScore <= delta_score) {

          PSPcandidatescores <- filteredScans[PSPcandidatesIndexAll, "ionScore"]
          ind = which(1-PSPcandidatescores/topScore <= delta_score)
          deltaRecord = rbind(deltaRecord,filteredScans[PSPcandidatesIndexAll[ind],])

          filteredScans <- filteredScans[0,]
        }

      }
    }

    # apply filter for q-value and ion score
    if(nrow(filteredScans) > 0){
      if(as.double(filteredScans$qValue) >= q_value | as.double(filteredScans$ionScore) <= ion_score){
        filteredScans <- filteredScans[0,]
      }else{
        selected <- rbind(filteredScans, selected)
      }
    }

    if(nrow(deltaRecord)> 0){
      k = which(as.double(deltaRecord$qValue) < q_value & as.double(deltaRecord$ionScore) > ion_score)
      deltaRecord <- deltaRecord[k,]

      if(nrow(deltaRecord)> 0){
        selectedDeltaRecord <- rbind(deltaRecord,selectedDeltaRecord)
      }
    }

  }

  allPSMs[[i]] = selected
  allPSMs_Delta[[i]] = selectedDeltaRecord

}

allPSMs = plyr::ldply(allPSMs)
allPSMs_Delta = plyr::ldply(allPSMs_Delta)

paste0("number of PSMs: ", nrow(allPSMs)) %>%
  print()


### OUTPUT ###

save(allPSMs, file = unlist(snakemake@output[["allPSMs"]]))
save(allPSMs_Delta, file = unlist(snakemake@output[["allPSMs_Delta"]]))
library(stringr)
library(dplyr)

print("----------------------------")
print("3) IDENTIFY SYNTHESIS ERRORS")
print("----------------------------")


### INPUT ###
load(snakemake@input[["allPSMs"]])
keep_synErrors = snakemake@params[["keep_synErrors"]]

# gsub("I","L",subControls0$pepSeq) %>% unique() %>% length()
# gsub("I","L",ProteasomeDB$pepSeq[str_detect(ProteasomeDB$productType, "synError")]) %>% unique() %>% length()
# gsub("I","L",ProteasomeDB$pepSeq[str_detect(ProteasomeDB$productType, "_synErrorPrecursor")]) %>% unique() %>% length()
# gsub("I","L",ProteasomeDB$pepSeq[str_detect(ProteasomeDB$productType, "_synError$")]) %>% unique() %>% length()


### MAIN PART ###
subControls0 = allPSMs[allPSMs$digestTime == "CTRL", ]
extracted0 = allPSMs[allPSMs$digestTime != "CTRL", ]

unqTSNctrl<-unique(subControls0$substrateID)
unqTSNextr<-unique(extracted0$substrateID)
unqTSN = unqTSNextr

allExtracted = numeric()
for (i in 1:length(unqTSN))
{

  subIndex=which(subControls0$substrateID == unqTSN[i])
  extIndex=which(extracted0$substrateID == unqTSN[i])
  extracted<-extracted0[extIndex,]
  subControls<-subControls0[subIndex,]


  # Get all unique peptide seqs present in the substrate controls
  subControlSeq <- unique(gsub("I","L",subControls$pepSeq))
  subControlPCP = unique(gsub("I","L",subControls$pepSeq[subControls$productType == "PCP"]))

  # get SR2s of detected sequences
  extPos = str_split_fixed(extracted$positions,"[:punct:]",Inf)[,c(1:4)]
  extPos = apply(extPos,2,as.numeric)
  extracted = extracted %>%
    mutate(sr2s = gsub("I","L",str_sub(extracted$substrateSeq, extPos[,3], extPos[,4])),
           synErrSR2 = ifelse(sr2s %in% subControlPCP, "yes", "no"),
           synErrSR2 = ifelse(productType == "PCP", NA, synErrSR2)) %>%
    select(-sr2s)


  # Check which entries of extracted are also present in the substrate controls.
  errorIndex <- which(gsub("I","L",extracted$pepSeq) %in% subControlSeq)
  errorSeqs <- unique(gsub("I","L",extracted$pepSeq[errorIndex]))

  # We conservatively regard all matches as synthesis errors and delete them from extracted
  # extracted0 <- extracted0[-extIndex[errorIndex], ]

  # Match PSPs to errors
  PSPtoRemove <- c()
  for (b in 1:nrow(extracted)){
    candidateType <- extracted$productType[b]
    if (str_detect(candidateType,"PSP")){
      candidateSeq <- gsub("I","L",extracted$pepSeq[b])

      if (any(grepl(candidateSeq, subControlSeq))){
        PSPtoRemove <- c(PSPtoRemove, b)
      }
    }
  }

  mergeIndex=(unique(c(errorIndex, PSPtoRemove)))
  k = which(PSPtoRemove%in%errorIndex)
  if(length(k)>0){
    PSPtoRemove = PSPtoRemove[-k]
  }
  # Label PSPs that were found to match errors
  if (length(errorIndex)>0) {
    extracted$productType[errorIndex] = paste(extracted$productType[errorIndex],"_synError",sep="")
  }
  if (length(PSPtoRemove)>0) {
    extracted$productType[PSPtoRemove] = paste(extracted$productType[PSPtoRemove],"_synErrorPrecursor",sep="")
  }

  allExtracted = rbind(allExtracted,extracted)

} # the for loop

extracted = allExtracted
subControls = subControls0

# Also, check if timepoint 0 measurements or substrate controls are missing for any substrates
# controlSubstrates <- c(unique(subControls$substrateID), zeroSubstrates)
controlSubstrates <- unique(subControls$substrateID)
extractedSubstrates <- unique(extracted$substrateID)

missingSubIndex <- which(!(extractedSubstrates %in% controlSubstrates))

# remove synthesis errors
if (keep_synErrors == "no") {

    if(length(missingSubIndex) > 0){
      print("WARNING! The following substrates do not have any substrate control measurements:")

      print(extractedSubstrates[missingSubIndex])

      print("They are being removed from your table!")
      # remove all substrates that lack control measurements (the original
      # indexing reused a stale loop variable and removed only one of them)
      extracted <- extracted[!(extracted$substrateID %in% extractedSubstrates[missingSubIndex]), ]

    }else{
      print("Success! All substrates in extracted are covered by substrate control measurements!")
    }


  print("Removing synthesis errors from the final dataset")

  kk = which(str_detect(extracted$productType, "_synError"))
  if (length(kk) > 0) {
    extracted = extracted[-kk, ]

    print("........................................................")
    paste0("number of synthesis errors that are being removed: ", length(kk)) %>%
      print()
    print("........................................................")

  }

}

### OUTPUT ###

save(extracted, file = unlist(snakemake@output[["PSMs"]]))
library(seqinr)
source("SOURCE/loadFunctions.r")

print("-----------------------------")
print("4) MAPPING AND POSTPROCESSING")
print("-----------------------------")


### INPUT ###
load(snakemake@input[["PSMs"]])
load(snakemake@input[["allPSMs_Delta"]])



### MAIN PART ##
d = mapping(extracted)
d_delta = mapping(allPSMs_Delta)


### OUTPUT ###
write.csv(d, file = unlist(snakemake@output[["ProteasomeDB"]]),
          row.names = F)
write.csv(d_delta, file = unlist(snakemake@output[["DeltaPeptides"]]),
          row.names = F)
library(dplyr)
library(stringr)
library(grid)
library(ggplot2)
library(gridExtra)

theme_set(theme_bw())

source("SOURCE/invitroSPI_utils.R")
source("SOURCE/plottingFunctions.R")

print("-------------------------------")
print("5) GENERATE DATABASE STATISTICS")
print("-------------------------------")


### INPUT ###
ProteasomeDB = read.csv(snakemake@input[["ProteasomeDB"]],
                        stringsAsFactors = F)
# ProteasomeDB = read.csv("OUTPUT/test_data/ProteasomeDB.csv", stringsAsFactors = F)

S = ProteasomeDB$substrateID %>% unique()
tps = ProteasomeDB$digestTime %>% unique() %>% as.numeric() %>% sort()


### MAIN PART ###

# ----- 1) pre-processing -----

Peps = ProteasomeDB %>%
  remSynthErrors() %>%
  ILredundancy() %>%
  filterPepLength() %>%
  disentangleMultimappers.Type() %>%
  disentangleMultimappers.SRlen() %>%
  disentangleMultimappers.IVSeqLen() %>%
  disentangleMultimappers.AA() %>%
  DBcosmetics()

ProteasomeDB = ProteasomeDB %>%
  remSynthErrors()


# ----- 2) general stats + peptides over time -----

uniquePeps = Peps %>% uniquePeptides()

# plotNumberofPeptides(ProteasomeDB,
#                      outname = "OUTPUT/test_data/number_of_products.pdf")
plotNumberofPeptides(ProteasomeDB,
                     outname = unlist(snakemake@output[["number_of_products"]]))

# pdf("OUTPUT/test_data/DB_stats.pdf", height = 4, width = 6)
pdf(file = unlist(snakemake@output[["DB_stats"]]), height = 4, width = 6)
generalStats(uniquePeps, tp = "all") %>% grid.table()

for (t in 1:length(tps)) {
  cntDB = Peps[Peps$digestTime == tps[t], ] %>%
    uniquePeptides()
  grid.newpage()
  generalStats(DB = cntDB, tp = tps[t]) %>% grid.table()

}
dev.off()



# ----- 3) coverage maps -----

# pdf("OUTPUT/test_data/coverage_map.pdf", height = 12, width = 16)
# pdf(unlist(snakemake@output[["coverage_map"]]), height = 12, width = 16)
# for (s in 1:length(S)) {
#   
#   cntDB = uniquePeps[uniquePeps$substrateID == S[s], ]
#   out = plotCoverage(cntDB, name = S[s], tp = "all")
#   replayPlot(out)
#   
# }
# 
# dev.off()

# ----- 4) length distributions + product type frequencies -----

# pdf("OUTPUT/Delta03/length_distributions.pdf", height = 10, width = 14)
pdf(file = unlist(snakemake@output[["length_distributions"]]), height = 12, width = 16)

freq = TypeFrequency(uniquePeps)
pl = PepLength(uniquePeps, tp="all")
srlen = SRLength(uniquePeps, tp="all")
ivlen = IVSeqLength(uniquePeps, tp="all")

cntG = grid.arrange(freq,pl,srlen,ivlen,
                    nrow = 2, ncol = 2)

print(cntG)

for (t in 1:length(tps)) {

  cntDB = uniquePeps[uniquePeps$digestTime == tps[t], ]

  freq = TypeFrequency(cntDB)
  pl = PepLength(cntDB, tp=tps[t])
  srlen = SRLength(cntDB, tp=tps[t])
  ivlen = IVSeqLength(cntDB, tp=tps[t])

  cntG = grid.arrange(freq,pl,srlen,ivlen,
                      nrow = 2, ncol = 2)

  print(cntG)

}
dev.off()
library(data.table)
library(parallel)
library(dplyr)
library(seqinr)
library(stringr)
library(beepr)


# numCPU = if (detectCores() > 8) detectCores()-1 else if (detectCores() >= 4) 4 else 1
# cl <- makeForkCluster(numCPU)


# ----- user parameters -----
nmers = c(5, 40)
generateTrans = T  # if false, generate only cis peptides
outdir = "../DATA/polypeptideDBs/"
suppressWarnings(dir.create(outdir, recursive = T))

# change here (you can read in a dataframe with the name and the sequence)
sample_list = read.csv("~/Documents/SCIENCE/_tools/aSPIre+invitroSPI/data/sample_list_aSPIre.csv", stringsAsFactors = F)
substrates = sample_list %>%
  select(substrateID, substrateSeq) %>%
  unique()


# ----- functions -----

getCis = function(subSeq, N, Lext,L) {

  if (L-1 >= N) {

    allCis = sapply(seq(1,L-N), function(i){

      sapply(seq(i+Lext-1,N-Lext+i-1), function(j){

        sr2 = N-j+i-1
        sapply(seq(j+2, L-sr2+1), function(k){

          n = k+sr2-1
          pepSeq = paste(str_sub(subSeq, start=i, end=j),
                         str_sub(subSeq, start=k, end=n), sep = "")
          positions = paste(i,j,k,n, sep = "_")

          return(c(pepSeq, positions))

        })
      })
    })
    CIS = unlist(allCis)
    cisPep = data.frame(pepSeq = CIS[seq(1,length(CIS),2)],
                        PositionSubstrate = CIS[seq(2,length(CIS),2)])
  } else {
    cisPep=NA
  }

  return(cisPep)
}
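As a quick sanity check of the `i_j_k_n` position encoding produced above, a cis-spliced peptide can be reconstructed from its position string (an illustrative sketch on a made-up substrate, not part of the pipeline):

```r
library(stringr)

# i_j delimit splice reactant 1, k_n delimit splice reactant 2
subSeq <- "ACDEFGHK"
pos <- as.numeric(str_split("1_2_5_7", "_", simplify = TRUE))
pep <- paste0(str_sub(subSeq, pos[1], pos[2]),  # SR1: residues 1-2 -> "AC"
              str_sub(subSeq, pos[3], pos[4]))  # SR2: residues 5-7 -> "FGH"
pep  # "ACFGH"
```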



getRevCis = function(subSeq, L, N, Lext=1) {


  if (L >= N) {

    allRevCis = sapply(seq(1,L-N+1), function(k){

      sapply(seq(k+Lext-1,N-Lext+k-1), function(n){

        sr1 = N-n+k-1
        sapply(seq(n+1, L-sr1+1), function(i){

          j = i+sr1-1

          pepSeq = paste(str_sub(subSeq, start=i, end=j),
                         str_sub(subSeq, start=k, end=n), sep = "")
          positions = paste(i,j,k,n, sep = "_")
          return(c(pepSeq, positions))
        })
      })
    })

    REVCIS = unlist(allRevCis)
    revcisPep = data.frame(pepSeq = REVCIS[seq(1,length(REVCIS),2)],
                           PositionSubstrate = REVCIS[seq(2,length(REVCIS),2)])

  } else {
    revcisPep = NA
  }


  return(revcisPep)
}




getAllTrans = function(L, N, subSeq, Lext=1) {

  allSRs = lapply(seq(Lext,N-Lext), function(sr1){

    # SR1
    Is = seq(1, L-sr1+1)
    Js = Is+sr1-1

    # SR2
    Ks = seq(1,L-N+sr1+1)
    Ns = Ks+N-sr1-1

    SR1_pos = paste(Is,Js,sep = "_")
    SR2_pos = paste(Ks,Ns,sep = "_")

    SR1 = str_sub(subSeq,Is,Js)
    SR2 = str_sub(subSeq,Ks,Ns)

    PEP = outer(SR1, SR2, paste, sep="") %>% as.vector()
    positions = outer(SR1_pos,SR2_pos,paste, sep = "_") %>% as.vector()

    return(data.frame(pepSeq = PEP,
                      PositionSubstrate = positions
    ))

  })

  transPep = plyr::ldply(allSRs)


  return(transPep)
}
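The `outer()` call at the heart of `getAllTrans` pairs every SR1 with every SR2 of complementary length; its behaviour can be checked on a toy case (illustrative only, values made up):

```r
library(stringr)
library(dplyr)

subSeq <- "ACDE"                  # 4-residue toy substrate
sr1 <- str_sub(subSeq, 1:3, 2:4)  # all dipeptide SR1s: "AC" "CD" "DE"
sr2 <- sr1                        # for N = 4, SR2 is also a dipeptide
peps <- outer(sr1, sr2, paste, sep = "") %>% as.vector()
length(peps)  # 9 candidate trans products, including self-pairings like "ACAC"
```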


getPCP = function(L, subSeq, N) {

  if (L > N) {

    allPCP = sapply(seq(1, L-N+1), function(i){
      j = N+i-1
      positions = paste(i, j, sep = "_")
      pepSeq = str_sub(subSeq, i, j)
      return(c(pepSeq, positions))
    })

    # sapply returns a 2 x n matrix: row 1 = sequences, row 2 = positions
    PCP_Pep = data.frame(pepSeq = allPCP[1, ],
                         PositionSubstrate = allPCP[2, ])
  } else {
    PCP_Pep = NA
  }

  return(PCP_Pep)
}
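`getPCP` enumerates every substrate window of length `N`; its output format can be illustrated with a small hand-rolled equivalent (a sketch on toy data, mirroring the `start_end` position encoding):

```r
library(stringr)

subSeq <- "ACDEFG"; N <- 3
starts <- seq(1, nchar(subSeq) - N + 1)
pcp <- data.frame(pepSeq = str_sub(subSeq, starts, starts + N - 1),
                  PositionSubstrate = paste(starts, starts + N - 1, sep = "_"))
pcp
#   pepSeq PositionSubstrate
# 1    ACD               1_3
# 2    CDE               2_4
# 3    DEF               3_5
# 4    EFG               4_6
```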


# ----- compute DB -----

for (i in 1:nrow(substrates)) {


  Name = substrates$substrateID[i]
  print("...............")
  print(Name)
  print("...............")

  # get substrate sequence and length
  subSeq = substrates$substrateSeq[i]
  if (grepl("\\.fasta$", subSeq)) {
    subSeq = read.fasta(file = subSeq, seqtype = "AA", strip.desc = T, as.string = T) %>%
      unlist() %>%
      suppressWarnings()
  }

  L = nchar(subSeq)

  # initiate empty fasta
  write.fasta(sequences = subSeq, names = Name,
              file.out=paste0(outdir, Name, "_allPSP_0.fasta"),
              open="w")

  start_time = Sys.time()
  # generate peptides
  if (generateTrans) {
    print("generating cis and trans-spliced peptides.....")

    # all cis and trans
    allPSP <- lapply(seq(nmers[1], nmers[2]), function(N){
      paste0("generating ",N,"-mers") %>% print()

      DB = getAllTrans(subSeq = subSeq, L = L, N = N, Lext=1)
      write.fasta(as.list(DB$pepSeq), names = paste0("PSP_", DB$PositionSubstrate),
                  file.out = paste0(outdir, Name, "_allPSP_0.fasta"),
                  open="a")
    })


    print("generating non-spliced peptides.....")
    Nmax = if(nmers[2] >= L-1) L-1 else nmers[2]

    allPCP <- lapply(seq(nmers[1], Nmax), function(N){
      paste0("generating ",N,"-mers") %>% print()

      DB = getPCP(subSeq = subSeq, L = L, N = N)
      write.fasta(as.list(DB$pepSeq), names = paste0("PSP_", DB$PositionSubstrate),
                  file.out = paste0(outdir, Name, "_allPSP_0.fasta"),
                  open="a")
    })

  } else {
    print("generating cis-spliced peptides.....")

    # forward cis
    allCis <- lapply(seq(nmers[1], nmers[2]), function(N){
      paste0("generating fwd cis ",N,"-mers") %>% print()

      DB = getCis(subSeq = subSeq, L = L, N = N, Lext=1)
      write.fasta(as.list(DB$pepSeq), names = paste0("PSP_", DB$PositionSubstrate),
                  file.out = paste0(outdir, Name, "_allPSP_0.fasta"),
                  open="a")
    })

    # reverse cis
    allRevCis <- lapply(seq(nmers[1], nmers[2]), function(N){
      paste0("generating rev cis ",N,"-mers") %>% print()

      DB = getRevCis(subSeq = subSeq, L = L, N = N, Lext=1)
      write.fasta(as.list(DB$pepSeq), names = paste0("PSP_", DB$PositionSubstrate),
                  file.out = paste0(outdir, Name, "_allPSP_0.fasta"),
                  open="a")
    })


  }

  end_time = Sys.time()
  paste0("..... finished after: ", difftime(end_time, start_time, units = "s")) %>% print()

}


print("FINISHED!")
beep("mario")
library(dplyr)
library(stringr)


########## plotting colours ########## 
plottingCols = c(
  PCP = "darkorange",
  PSP = "dodgerblue",
  cis = "darkslateblue",
  revCis = "lightskyblue",
  trans = "darkorchid",
  allcis = "darkslateblue",

  ProteasomeDB = "olivedrab4",
  randomDB = "lightsteelblue"
)


########## disentangle rows containing multimappers ##########
disentangleMultimappers.Type = function(DB) {

  print("DISENTANGLE MULTI-MAPPERS FOR PRODUCT TYPE")

  k = which(str_detect(DB$positions, coll(";")))

  if (length(k) > 0) {

    DB_mm = DB[k, ]
    DB_nomm = DB[-k, ]

    mm = list()

    pb = txtProgressBar(min = 0, max = nrow(DB_mm), style = 3)
    for (r in 1:nrow(DB_mm)) {

      setTxtProgressBar(pb, r)

      cnt_types = str_split(DB_mm$spliceType[r], coll(";"), simplify = T) %>%
        paste()

      cnt_pos = str_split(DB_mm$positions[r], coll(";"), simplify = T) %>%
        paste()

      cntDB = DB_mm[r, ]

      # all multi-mappers correspond to the same product type
      if (cnt_types %>% unique() %>% length() == 1) {
        cntDB$spliceType = cnt_types %>% unique()

        # in case of cis and trans --> keep cis
      } else if ( any("trans" %in% cnt_types) ) {

        x = which(cnt_types == "trans")
        cnt_types = cnt_types[-x]
        cnt_pos = cnt_pos[-x]

        if (cnt_types %>% unique() %>% length() == 1) {
          cntDB$spliceType = cnt_types %>% unique()

        } else {
          cntDB$spliceType = "type_multi-mapper"
        }

        cntDB$positions = cnt_pos %>% paste(collapse = ";")

      } else {
        cntDB$spliceType = "type_multi-mapper"
        cntDB$positions = cnt_pos %>% paste(collapse = ";")
      }

      mm[[r]] = cntDB
    }

    mm = plyr::ldply(mm)
    DB = rbind(DB_nomm, mm) %>%
      as.data.frame()
  }

  # beep("mario")
  return(DB)
}
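The reassignment rule above (discard `trans` explanations when a cis mapping exists for the same peptide) can be traced on one hypothetical multi-mapper row:

```r
cnt_types <- c("cis", "trans")          # two candidate explanations for one PSM
cnt_pos   <- c("1_3_5_7", "1_3_20_22")  # corresponding "i_j_k_n" strings

x <- which(cnt_types == "trans")        # drop the trans mapping
cnt_types <- cnt_types[-x]
cnt_pos   <- cnt_pos[-x]

spliceType <- unique(cnt_types)               # "cis"
positions  <- paste(cnt_pos, collapse = ";")  # "1_3_5_7"
```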


disentangleMultimappers.AA = function(DB, retSinglePos = T) {

  print("DISENTANGLE MULTI-MAPPERS FOR AA AT sP1 AND sP1'")

  DB$AAmultimapper = "no"

  # only considering PSP multi-mappers
  k = which(str_detect(DB$positions, coll(";")) & !str_detect(DB$productType, "PCP"))

  if (length(k) > 0) {
    DB_mm = DB[k, ]
    DB_nomm = DB[-k, ]

    pb = txtProgressBar(min = 0, max = nrow(DB_mm), style = 3)
    for (r in 1:nrow(DB_mm)) {

      setTxtProgressBar(pb, r)

      cnt_pos = strsplit(DB_mm$positions[r], ";") %>%
        unlist() %>%
        str_split_fixed(pattern = coll("_"), n = Inf)

      sP1 = str_sub(DB_mm$substrateSeq[r], start = cnt_pos[, 2], end = cnt_pos[, 2])
      sP1d = str_sub(DB_mm$substrateSeq[r], start = cnt_pos[, 3], end = cnt_pos[, 3])

      if ((length(unique(sP1)) > 1) | (length(unique(sP1d)) > 1)) {
        DB_mm$AAmultimapper[r] = "yes"

      } else if (retSinglePos) {
        DB_mm$positions[r] = paste(cnt_pos[1, c(1:4)], collapse = "_")
      }

    }

    DB = rbind(DB_nomm, DB_mm) %>%
      as.data.frame()
  }


  # beep("mario")
  return(DB)
}


disentangleMultimappers.SRlen = function(DB, retSinglePos = T) {

  print("DISENTANGLE MULTI-MAPPERS FOR SR LENGTH")

  DB$SRmultimapper = "no"

  # only considering PSP multi-mappers
  k = which(str_detect(DB$positions, coll(";")) & !str_detect(DB$productType, "PCP"))

  if (length(k) > 0) {

    DB_mm = DB[k, ]
    DB_nomm = DB[-k, ]

    pb = txtProgressBar(min = 0, max = nrow(DB_mm), style = 3)
    for (r in 1:nrow(DB_mm)) {

      setTxtProgressBar(pb, r)

      cnt_pos = strsplit(DB_mm$positions[r], ";") %>%
        unlist() %>%
        str_split_fixed(pattern = coll("_"), n = Inf)

      SR1len = as.numeric(cnt_pos[, 2]) - as.numeric(cnt_pos[, 1]) + 1
      SR2len = as.numeric(cnt_pos[, 4]) - as.numeric(cnt_pos[, 3]) + 1

      if ((length(unique(SR1len)) > 1) | (length(unique(SR2len)) > 1)) {
        DB_mm$SRmultimapper[r] = "yes"

      }  else if (retSinglePos) {
        DB_mm$positions[r] = paste(cnt_pos[1, c(1:4)], collapse = "_")
      }

    }

    DB = rbind(DB_nomm, DB_mm) %>%
      as.data.frame()


  }


  # beep("mario")
  return(DB)
}

disentangleMultimappers.IVSeqLen = function(DB, retSinglePos = T) {

  print("DISENTANGLE MULTI-MAPPERS FOR INTERVENING SEQUENCE LENGTH")

  DB$IVseqmultimapper = "no"

  # only considering PSP multi-mappers
  k = which(str_detect(DB$positions, coll(";")) & !str_detect(DB$productType, "PCP"))

  if (length(k) > 0) {

    DB_mm = DB[k, ]
    DB_nomm = DB[-k, ]

    pb = txtProgressBar(min = 0, max = nrow(DB_mm), style = 3)
    for (r in 1:nrow(DB_mm)) {

      setTxtProgressBar(pb, r)

      cnt_pos = strsplit(DB_mm$positions[r], ";") %>%
        unlist() %>%
        str_split_fixed(pattern = coll("_"), n = Inf)

      cnt_types = str_split(DB_mm$spliceType[r], coll(";"), simplify = T) %>%
        paste()

      ivLens = rep(NA, nrow(cnt_pos))
      for (p in 1:nrow(cnt_pos)) {

        # cis and trans
        if (as.numeric(cnt_pos[p, 3]) >= as.numeric(cnt_pos[p, 1])) {
          ivLens[p] = (abs(as.numeric(cnt_pos[p, 3]) - as.numeric(cnt_pos[p, 2])) - 1) %>%
            as.numeric()

        # revCis
        } else {
          ivLens[p] = (abs(as.numeric(cnt_pos[p, 1]) - as.numeric(cnt_pos[p, 4])) - 1) %>%
            as.numeric()
        }

      }


      if (length(unique(ivLens)) > 1) {
        DB_mm$IVseqmultimapper[r] = "yes"

      }  else if (retSinglePos) {
        DB_mm$positions[r] = paste(cnt_pos[1, c(1:4)], collapse = "_")
      }

    }

    DB = rbind(DB_nomm, DB_mm) %>%
      as.data.frame()

  }

  # beep("mario")
  return(DB)
}


removeMultimappers.Type = function(DB) {
  print("REMOVE PEPTIDES THAT CAN SOLELY BE MULTI-MAPPERS FOR PRODUCT TYPE")

  k = which(DB$spliceType == "type_multi-mapper")

  if (length(k) > 0) {
    DB = DB[-k, ]
  }

  return(DB)
}


removeMultimappers.AA = function(DB) {
  print("REMOVE PEPTIDES THAT ARE MULTI-MAPPERS IN TERMS OF AA AT sP1 OR sP1'")

  k = which(DB$AAmultimapper == "yes")

  if (length(k) > 0) {
    DB = DB[-k, ]
  }

  return(DB)
}


removeMultimappers.SRlen = function(DB) {
  print("REMOVE PEPTIDES THAT ARE MULTI-MAPPERS IN TERMS OF SR LENGTH")

  k = which(DB$SRmultimapper == "yes")

  if (length(k) > 0) {
    DB = DB[-k, ]
  }

  return(DB)
}

removeMultimappers.IVSeqLen = function(DB) {
  print("REMOVE PEPTIDES THAT ARE MULTI-MAPPERS IN TERMS OF INTERVENING SEQUENCE LENGTH")

  k = which(DB$IVseqmultimapper == "yes")

  if (length(k) > 0) {
    DB = DB[-k, ]
  }

  return(DB)
}


########## filter for peptide length ########## 
filterPepLength = function(DB, cutoff=5) {
  print(paste0("REMOVE PEPTIDES THAT ARE SHORTER THAN ", cutoff, " AA"))

  k = which(nchar(DB$pepSeq) < cutoff)

  if (length(k) > 0) {
    DB = DB[-k, ]
  }

  return(DB)
}

########## synthesis errors ########## 

remSynthErrors = function(DB) {
  print("REMOVE SYNTHESIS ERRORS")

  k = which(str_detect(DB$productType, "synError"))
  if (length(k) > 0) {
    DB = DB[-k, ]
  }

  return(DB)
}


keepSynthErrors = function(DB) {
  print("FILTER FOR SYNTHESIS ERRORS")

  k = which(str_detect(DB$productType, "synError"))
  if (length(k) > 0) {
    DB = DB[k, ]
  }

  return(DB)
}

########## early time points only ########## 
filterEarlyTimepoints = function(DB) {
  print("FILTER FOR EARLY TIME POINTS")

  DB = DB[which(DB$digestTime %in% c(2, 4)), ]

  return(DB)
}


########## 20S standard proteasome only ########## 
filter20Sstandard = function(DB) {
  print("FILTER FOR 20S STANDARD PROTEASOME")

  if ("protIsotype" %in% names(DB)) {
    DB = DB[DB$protIsotype %in% c("20S standard", "20S K562"), ]
  }

  return(DB)
}


########## I/L redundancy in product sequences #########
ILredundancy = function(DB) {

  print("REPLACE ALL ISOLEUCINES BY LEUCINES")

  DB$pepSeq = str_replace_all(DB$pepSeq, "I", "L")
  DB$substrateSeq = str_replace_all(DB$substrateSeq, "I", "L")

  return(DB)
}

########## unique product sequences ########## 
uniquePeptides = function(DB) {
  print("UNIQUE PEPTIDES")

  DB = DB %>%
    distinct(substrateID, substrateSeq, pepSeq, productType,
             #spliceType,
             positions, .keep_all = T)

  # DB2 = DB %>%
  #   group_by(substrateID, substrateSeq, pepSeq, productType,
  #            spliceType, positions) %>% slice(1)

  # DB3 = DB[, c("substrateID", "substrateSeq", "pepSeq",
  #             "productType", "spliceType", "positions")] %>%
  #   unique()

  # DB4 = DB[!duplicated(DB[, c("substrateID", "substrateSeq", "pepSeq",
  #                             "productType", "spliceType", "positions")]),]

  return(DB)
}

uniquePeptides.perTP = function(DB) {
  print("UNIQUE PEPTIDES FOR EACH TIME POINT")

  DB = DB %>%
    distinct(substrateID, substrateSeq, digestTime, pepSeq, productType,
             spliceType, positions, .keep_all = T)


  return(DB)
}

########## cosmetics ########## 
DBcosmetics = function(DB) {

  print("COSMETICS AND STATS")

  if ("X" %in% names(DB)) {
    DB$X = NULL
  }

  # replace NA in splice type column of PCPs
  DB$spliceType[str_detect(DB$productType, "PCP")] = "PCP"

  # # statistics
  # print(paste0("number of identified products: ", nrow(DB)))
  # 
  # print(table(DB$spliceType))
  # print(table(DB$spliceType) / nrow(DB))
  # 
  # print("substrates")
  # print(paste0("number of substrates: ", DB$substrateSeq %>% unique() %>% length()))
  # DB$substrateID %>%
  #   unique() %>% 
  #   paste(collapse = ", ") %>%
  #   print()

  return(DB)
}


########## load random DBs #########
loadRandom = function(fpath) {

  print("LOAD RANDOM DATABASES")

  # read cis, revCis, trans and PCP
  # concatenate and return list of data frames

  cis = read.csv(fpath[str_detect(fpath, "_cis")], stringsAsFactors = F)
  revCis = read.csv(fpath[str_detect(fpath, "_revCis")], stringsAsFactors = F)
  trans = read.csv(fpath[str_detect(fpath, "_trans")], stringsAsFactors = F)

  if (any(str_detect(fpath, "_PCP"))) {
    PCP = read.csv(fpath[str_detect(fpath, "_PCP")], stringsAsFactors = F)
  } else {
    PCP = NA
  }

  random = list(cis = cis,
                revCis = revCis,
                trans = trans,
                PCP = PCP)

  return(random)
}
generalStats = function(DB, tp) {
  cntTBL = table(DB$spliceType) %>% as.data.frame()
  cntTBL = cbind(cntTBL, 
                 (table(DB$spliceType) / nrow(DB)) %>% as.data.frame())
  names(cntTBL) = c(paste0("time point: ", tp), "# peptides", "type", "frequency")
  cntTBL$frequency = round(cntTBL$frequency, 4)

  return(cntTBL)
}


# ----- number of peptides over time ----
getNumberofPeptides = function(ProteasomeDB) {

  DB_filtered = ProteasomeDB %>%
    filterPepLength() %>%
    disentangleMultimappers.Type() %>%
    DBcosmetics()

  # fraction of PSMs
  psms = DB_filtered %>%
    group_by(substrateID,spliceType, digestTime) %>%
    summarise(n = n()) %>%
    ungroup() %>% group_by(substrateID, digestTime) %>%
    mutate(freq = n / sum(n))

  psms_allTP = DB_filtered %>%
    group_by(substrateID,spliceType) %>%
    summarise(n = n()) %>%
    ungroup() %>% group_by(substrateID) %>%
    mutate(freq = n / sum(n)) %>%
    mutate(digestTime = 0)

  # number + frequency of unique peptides
  peps = DB_filtered %>%
    uniquePeptides() %>%
    group_by(substrateID,spliceType, digestTime) %>%
    summarise(n = n()) %>%
    ungroup() %>% group_by(substrateID, digestTime) %>%
    mutate(freq = n / sum(n))

  peps_allTP = DB_filtered %>%
    uniquePeptides() %>%
    group_by(substrateID,spliceType) %>%
    summarise(n = n()) %>%
    ungroup() %>% group_by(substrateID) %>%
    mutate(freq = n / sum(n)) %>%
    mutate(digestTime = 0)

  return(list(psms = rbind(psms, psms_allTP[, names(psms)]),
              peps = rbind(peps, peps_allTP[, names(peps)])))

}


plotNumberofPeptides = function(ProteasomeDB, outname) {

  res = getNumberofPeptides(ProteasomeDB)

  psms = res$psms
  peps = res$peps

  if (length(unique(ProteasomeDB$digestTime)) == 1) {
    psms = psms[psms$digestTime == 0, ]
    peps = peps[peps$digestTime == 0, ]
  }

  # number of PSMs over time
  psms$digestTime = factor(psms$digestTime, levels = sort(unique(as.numeric(psms$digestTime))))
  psms$spliceType = factor(psms$spliceType, levels = c("PCP", "cis", "revCis", "trans", "type_multi-mapper"))

  theme_set(theme_classic())

  a = ggplot(psms, aes(x=digestTime, y=n, col=spliceType)) +
    geom_boxplot(position=position_dodge(width=0.8)) +
    geom_point(alpha = .3, position=position_dodge(width=0.8)) +
    scale_y_continuous(trans='log10') +
    scale_x_discrete(labels = gsub("^0$","all",sort(unique(psms$digestTime)))) +
    scale_color_manual("product type",
                       values = c(plottingCols["PCP"], plottingCols["cis"], plottingCols["revCis"],
                                  plottingCols["trans"], "gray"),
                       labels = c("non-spliced", "forward cis", "reverse cis", "trans", "multi-mapper")) +
    ggtitle("number of PSMs") +
    ylab("# PSMs") +
    xlab("digestion time")

  a2 = ggplot(psms, aes(x=digestTime, y=freq, col=spliceType)) +
    geom_boxplot(position=position_dodge(width=0.8)) +
    scale_x_discrete(labels = gsub("^0$","all",sort(unique(psms$digestTime)))) +
    scale_color_manual("product type",
                       values = c(plottingCols["PCP"], plottingCols["cis"], plottingCols["revCis"],
                                  plottingCols["trans"], "gray"),
                       labels = c("non-spliced", "forward cis", "reverse cis", "trans", "multi-mapper")) +
    ylim(c(0,1)) +
    ggtitle("frequency of PSMs") +
    ylab("frequency") +
    xlab("digestion time")


  peps$digestTime = factor(peps$digestTime, levels = sort(unique(as.numeric(peps$digestTime))))
  peps$spliceType = factor(peps$spliceType, levels = c("PCP", "cis", "revCis", "trans", "type_multi-mapper"))

  b = ggplot(peps, aes(x=digestTime, y=n, col=spliceType)) +
    geom_boxplot(position=position_dodge(width=0.8)) +
    geom_point(alpha = .3, position=position_dodge(width=0.8)) +
    scale_y_continuous(trans='log10') +
    scale_x_discrete(labels = gsub("^0$","all",sort(unique(peps$digestTime)))) +
    scale_color_manual("product type",
                       values = c(plottingCols["PCP"], plottingCols["cis"], plottingCols["revCis"],
                                  plottingCols["trans"], "gray"),
                       labels = c("non-spliced", "forward cis", "reverse cis", "trans", "multi-mapper")) +
    ggtitle("number of unique peptides") +
    ylab("# unique peptides") +
    xlab("digestion time")

  b2 = ggplot(peps, aes(x=digestTime, y=freq, col=spliceType)) +
    geom_boxplot(position=position_dodge(width=0.8)) +
    scale_x_discrete(labels = gsub("^0$","all",sort(unique(peps$digestTime)))) +
    scale_color_manual("product type",
                       values = c(plottingCols["PCP"], plottingCols["cis"], plottingCols["revCis"],
                                  plottingCols["trans"], "gray"),
                       labels = c("non-spliced", "forward cis", "reverse cis", "trans", "multi-mapper")) +
    ylim(c(0,1)) +
    ggtitle("frequency of unique peptides") +
    ylab("frequency") +
    xlab("digestion time")

  p = grid.arrange(a,b, a2,b2, nrow=2, ncol=2)
  ggsave(filename = outname,
         plot = p, device = cairo_pdf, height = 10, width = 14, dpi = "retina")
}


# ----- coverage maps ------

plotCoverage = function(ProteasomeDB, tp, name) {

  k = which(str_detect(ProteasomeDB$positions, ";"))
  if (length(k) > 0) {
    print("removing all multi-mappers")
    ProteasomeDB = ProteasomeDB[-k,]
  }

  pcp_pos = ProteasomeDB$positions[ProteasomeDB$productType == "PCP"]
  psp_pos = ProteasomeDB$positions[ProteasomeDB$productType == "PSP"]
  S = ProteasomeDB$substrateSeq[1]

  # --- PCPs ---
  pcp_pos = str_split(pcp_pos, "_", simplify = T)
  pcp_pos = apply(pcp_pos,2,as.numeric) %>%
    as.data.frame()
  pcp_pos = pcp_pos[order(pcp_pos$V1, decreasing = F), ]

  par(mfrow = c(2,1))
  y0 =0.1
  plot(NULL, ylim=c(0,nrow(pcp_pos)*0.05),
       ylab = "", xlab="substrate",
       xlim = c(0, nchar(S)),
       main = paste0(name, ": ", tp, " hrs"),
       axes=F)

  for (i in 1:nrow(pcp_pos)) {

    if (length(which(pcp_pos$V1[1:i] <= pcp_pos$V1[i] & 
                     pcp_pos$V2[1:i] >= pcp_pos$V1[i])) > 1) {

      cnt.intercept = length(which(pcp_pos$V1[1:i] <= pcp_pos$V1[i] & 
                                     pcp_pos$V2[1:i] >= pcp_pos$V1[i]))*0.1

    } else {cnt.intercept = y0}

    segments(x0 = pcp_pos$V1[i], y0 = cnt.intercept,
             x1 = pcp_pos$V2[i], y1 = cnt.intercept,
             col = plottingCols["PCP"], lwd = .5)
  }

  xlabel = c("0",
             paste(S %>% strsplit("") %>% unlist(), seq(1, nchar(S)), sep = ""))
  text(x = seq(0, nchar(S)), par("usr")[3]-.3, labels = xlabel,
       srt=90, xpd=T, cex=.5)
  axis(2, labels = NA)


  # --- PSPs ---
  psp_pos = str_split(psp_pos, "_", simplify = T)
  psp_pos = apply(psp_pos,2,as.numeric) %>%
    as.data.frame()
  psppos = psp_pos[order(psp_pos$V1, decreasing = F), ]

  psp_pos = rbind(as.matrix(psppos[,c(1:2)]), as.matrix(psppos[,c(3:4)])) %>%
    as.data.frame()

  y0 =0.1
  plot(NULL, ylim=c(0,nrow(psp_pos)*0.05),
       ylab = "", xlab="substrate",
       xlim = c(0, nchar(S)),
       main = paste0(name, ": ", tp, " hrs"),
       axes=F)

  for (i in 1:nrow(psp_pos)) {

    if (length(which(psp_pos$V1[1:i] <= psp_pos$V1[i] & 
                     psp_pos$V2[1:i] >= psp_pos$V1[i])) > 1) {

      cnt.intercept = length(which(psp_pos$V1[1:i] <= psp_pos$V1[i] & 
                                     psp_pos$V2[1:i] >= psp_pos$V1[i]))*0.1

    } else {cnt.intercept = y0}

    segments(x0 = psp_pos$V1[i], y0 = cnt.intercept,
             x1 = psp_pos$V2[i], y1 = cnt.intercept,
             col = plottingCols["PSP"], lwd = .5)
  }

  xlabel = c("0",
             paste(S %>% strsplit("") %>% unlist(), seq(1, nchar(S)), sep = ""))
  text(x = seq(0, nchar(S)), par("usr")[3]-.3, labels = xlabel,
       srt=90, xpd=T, cex=.5)
  axis(2, labels = NA)

  out = recordPlot()
  return(out)
}


# ----- length distributions -----
# function for violin plots
ViolinPlot = function(data) {

  # print stats
  s = data %>%
    dplyr::group_by(type) %>%
    dplyr::summarise(n = dplyr::n(),
                     mean = mean(value),
                     median = median(value),
                     std = sd(value))
  print.data.frame(s)

  theme_set(theme_classic())

  # plotting
  data$type = factor(data$type, levels = c("PCP", "cis", "revCis", "trans"))

  sc = ggplot(data = data, aes(x = type, y = value, fill = type)) +
    geom_violin(size = .1) +
    stat_summary(fun = median, fun.min = median, fun.max = median,
                 geom = "crossbar", 
                 width = .2,
                 position = position_dodge(width = 4)) +
    xlab("product type") +
    scale_fill_manual("product type",
                      values = c(plottingCols["PCP"], plottingCols["cis"], plottingCols["revCis"],
                                 plottingCols["trans"]),
                      labels = c("non-spliced", "forward cis", "reverse cis", "trans")) +
    theme(axis.text.x = element_text(angle = 90),
          legend.position = "none")

  # add labs and title downstream

  return(sc)
}
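A minimal usage sketch for `ViolinPlot` (toy data, not from the database; assumes ggplot2 and dplyr are loaded and that `plottingCols` is the named colour vector defined in `invitroSPI_utils.R`):

```r
# toy input: one row per peptide, with its length ("value") and product type
toy = data.frame(value = c(9, 11, 10, 12, 8, 13, 14, 9),
                 type  = rep(c("PCP", "cis", "revCis", "trans"), each = 2))

# prints per-type n/mean/median/sd and returns a ggplot object
p = ViolinPlot(data = toy)
p + ylab("peptide length (aa residues)")
```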


# peptide length
PepLength = function(DB, tp) {

  print("PEPTIDE LENGTH")
  DB = removeMultimappers.Type(DB)

  pl = data.frame(value = nchar(DB$pepSeq),
                  type = DB$spliceType)

  pepLen = ViolinPlot(data = pl) +
    ylab("peptide product length (aa residues)") +
    scale_x_discrete(labels = c("non-spliced", "forward cis", "reverse cis", "trans")) +
    # scale_y_continuous(limits = c(0, 40), breaks = seq(0, 40, by = 5)) +
    ggtitle("peptide length distribution",
            subtitle = paste0(tp, " hrs"))

  return(pepLen)
}


# SR length
SRLength = function(DB, tp) {

  print("SPLICE-REACTANT LENGTH")

  DB = DB[str_detect(DB$productType, "PSP"), ] %>%
    removeMultimappers.Type() %>%
    removeMultimappers.SRlen()

  srlen = function(db) {
    pos = str_split_fixed(db$positions, pattern = "_", n = Inf)

    sr = data.frame(value_sr1 = as.numeric(pos[, 2]) - as.numeric(pos[, 1]) + 1,
                    value_sr2 = as.numeric(pos[, 4]) - as.numeric(pos[, 3]) + 1,
                    type = db$spliceType)
    sr$type = factor(sr$type, levels = c("cis", "revCis", "trans"))
    return(sr)
  }

  sr = srlen(db = DB)

  sr1 = sr[, c("value_sr1", "type")]
  names(sr1)[1] = "value"
  sr2 = sr[, c("value_sr2", "type")]
  names(sr2)[1] = "value"

  sr1.plot = ViolinPlot(data = sr1) +
    ylab("splice reactant length (aa residues)") +
    scale_x_discrete(labels = c("forward cis", "reverse cis", "trans")) +
    # scale_y_continuous(limits = c(0, 40), breaks = seq(0, 40, by = 5)) +
    ggtitle("SR1 length distribution",
            subtitle = paste0(tp, " hrs"))

  sr2.plot = ViolinPlot(data = sr2) +
    ylab("splice reactant length (aa residues)") +
    scale_x_discrete(labels = c("forward cis", "reverse cis", "trans")) +
    # scale_y_continuous(limits = c(0, 40), breaks = seq(0, 40, by = 5)) +
    ggtitle("SR2 length distribution",
            subtitle = paste0(tp, " hrs"))

  sr.plot = arrangeGrob(grobs = list(sr1.plot, sr2.plot), ncol=2)

  return(sr.plot)
}
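As a worked example of the coordinate convention parsed by `srlen` above (a hypothetical `positions` string, not taken from the database): a spliced peptide annotated `"3_7_10_12"` has SR1 spanning substrate residues 3-7 and SR2 spanning residues 10-12.

```r
# split the positions string into the four substrate coordinates
pos = as.numeric(strsplit("3_7_10_12", "_")[[1]])

sr1.len = pos[2] - pos[1] + 1  # 7 - 3 + 1 = 5 residues
sr2.len = pos[4] - pos[3] + 1  # 12 - 10 + 1 = 3 residues
```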


# intervening sequence length
IVSeqLength = function(DB, tp) {

  print("INTERVENING SEQUENCE LENGTH")

  DB = DB[str_detect(DB$productType, "PSP"), ] %>%
    removeMultimappers.Type() %>%
    removeMultimappers.IVSeqLen()

  ivlen = function(db) {
    # intervening sequence length
    # calculated as in Specht et al., Paes et al. and SPI-ART
    pos = str_split_fixed(db$positions, pattern = "_", n = Inf)
    iv = rep(NA, nrow(db))

    cistrans.idx = which(db$spliceType %in% c("cis", "trans"))
    iv[cistrans.idx] = (abs(as.numeric(pos[cistrans.idx, 3]) - as.numeric(pos[cistrans.idx, 2])) - 1) %>%
      as.numeric()

    revcis.idx = which(db$spliceType == "revCis")
    iv[revcis.idx] = (abs(as.numeric(pos[revcis.idx, 1]) - as.numeric(pos[revcis.idx, 4])) - 1) %>%
      as.numeric()

    ivl = data.frame(value = iv,
                     type = db$spliceType)
    ivl$type = factor(ivl$type, levels = c("cis", "revCis", "trans"))
    return(ivl)
  }

  iv = ivlen(db = DB)
  iv = iv[iv$type != "trans", ]

  iv.plot = ViolinPlot(data = iv) +
    ylab("intervening sequence length (aa residues)") +
    scale_x_discrete(labels = c("forward cis", "reverse cis")) +
    ggtitle("intervening sequence length distribution",
            subtitle = paste0(tp, " hrs"))

  return(iv.plot)
}
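A standalone check of the intervening-sequence formula used in `ivlen`: for cis (and trans) products it is the distance between the end of SR1 and the start of SR2, whereas for revCis it is the distance between the end of SR2 and the start of SR1. The coordinates below are hypothetical illustrations.

```r
# cis: SR1 = residues 3-7, SR2 = residues 10-12 -> residues 8 and 9 intervene
pos = as.numeric(strsplit("3_7_10_12", "_")[[1]])
iv.cis = abs(pos[3] - pos[2]) - 1

# revCis: SR1 = 10-12, SR2 = 3-7 -> the same two residues intervene
pos = as.numeric(strsplit("10_12_3_7", "_")[[1]])
iv.revCis = abs(pos[1] - pos[4]) - 1
```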


# ------ type frequencies -----

TypeFrequency = function(DB) {

  print("FREQUENCY OF PRODUCT TYPES")

  Freq = DB %>% 
    removeMultimappers.Type() %>%
    group_by(spliceType) %>% 
    summarise(n = n()) %>%
    mutate(freq = n / sum(n)) %>%
    arrange(desc(spliceType)) %>%
    mutate(ypos = cumsum(freq)- 0.5*freq )

  print(Freq)

  Freq$spliceType = factor(Freq$spliceType, levels = c("PCP", "cis", "revCis", "trans"))

  freqs = ggplot(Freq, aes(x="", y=freq, fill=spliceType)) +
    geom_bar(stat="identity", width=1) +
    coord_polar("y", start=0) +
    scale_fill_manual("product type",
                      values = c(plottingCols["PCP"], plottingCols["cis"], plottingCols["revCis"],
                                 plottingCols["trans"]),
                      labels = c("non-spliced", "forward cis", "reverse cis", "trans")) +
    theme_void() +
    geom_text(aes(y = ypos, label=paste0("n = ", n)), size = 4)

  return(freqs)
}
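The `ypos` column computed in `TypeFrequency` centres each label within its pie slice: the cumulative frequency up to and including a slice, minus half that slice's width. A quick standalone check with hypothetical frequencies:

```r
# hypothetical product-type frequencies (must sum to 1), in plotting order
freq = c(trans = 0.1, revCis = 0.2, cis = 0.3, PCP = 0.4)

# slice-centre positions along the pie circumference
ypos = cumsum(freq) - 0.5 * freq
# trans 0.05, revCis 0.20, cis 0.45, PCP 0.80
```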


Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/QuantSysBio/invitroSPI
Name: invitrospi
Version: 1


Copyright: Public Domain
License: GNU General Public License v3.0