Spliced peptide identification from in vitro digestions of polypeptides with purified proteasomes


Please cite the following publication if you are using invitroSPI for your research:

Roetschke, H.P., Rodriguez-Hernandez, G., Cormican, J.A. et al. InvitroSPI and a large database of proteasome-generated spliced and non-spliced peptides. Sci Data 10, 18 (2023). https://doi.org/10.1038/s41597-022-01890-6

overview

The invitroSPI pipeline consists of five main steps that are implemented in a Snakemake workflow:

  1. parsing of search result files and creation of a preliminary MS database (MSDB) containing all peptide-spectrum matches (PSMs), mapping of peptides to their substrate origins, and potential product type re-assignment

  2. scoring and filtering of PSMs using Mascot's ion score and q-value as well as the delta score described in the manuscript

  3. identification and, optionally, removal of synthesis errors using the control runs

  4. mapping of peptides to the substrate sequence accounting for potential multi-mappers

  5. database statistics and diagnostic information
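The delta-score criterion applied in step 2 can be illustrated with a toy calculation (all score values below are invented; the pipeline reads the actual ion scores from the Mascot result files):

```r
# Illustration of the delta-score criterion (invented scores, not pipeline code).
delta_score <- 0.3      # default threshold, see INPUT/config.yaml
topScore <- 45          # ion score of the top-ranked PSM
candidateScore <- 40    # ion score of a lower-ranked candidate PSM

# A lower-ranked candidate is considered too close to the top hit when
# 1 - candidateScore/topScore <= delta_score; such ambiguous spectra are discarded.
ambiguous <- (1 - candidateScore / topScore) <= delta_score
```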

Additionally, the code for computing all possible spliced and non-spliced peptides, which is used for the Mascot search, is provided in SOURCE/_generateSearchDB_polypeptides.R. We also include a script containing useful functions for downstream analyses ( invitroSPI_utils.R ).
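As a rough illustration of what such a search database contains: non-spliced peptides are substrings of the substrate, while cis-spliced peptides are ligations of two non-overlapping substrate fragments in substrate order. The sketch below is not the repository script; the length bounds (5-12 residues) and the restriction to forward cis splicing are illustrative assumptions:

```r
# Minimal sketch of candidate peptide enumeration (illustrative, not
# SOURCE/_generateSearchDB_polypeptides.R). Length limits are assumptions.
substrate <- "RTKAWNRQLYPEW"
n <- nchar(substrate)

# non-spliced peptides (PCP): all substrings within the length range
pcp <- unlist(lapply(1:n, function(i)
  sapply(i:min(n, i + 11), function(j) substr(substrate, i, j))))
pcp <- unique(pcp[nchar(pcp) >= 5])

# cis-spliced peptides (PSP): two non-overlapping splice reactants
# ligated in substrate order (forward cis splicing only, for brevity)
psp <- c()
for (i1 in 1:n) for (j1 in i1:n) {
  if (j1 + 1 > n) next
  for (i2 in (j1 + 1):n) for (j2 in i2:n) {
    pep <- paste0(substr(substrate, i1, j1), substr(substrate, i2, j2))
    if (nchar(pep) >= 5 && nchar(pep) <= 12) psp <- c(psp, pep)
  }
}
psp <- unique(psp)
```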

execution

invitroSPI relies on Conda and Snakemake. To install Conda, follow the installation guidelines for your operating system on the Conda website.
After installing Conda, you need to install Snakemake; the installation procedure is described in the Snakemake documentation.

Briefly, open the terminal on your computer and paste the following lines sequentially:
conda install -n base -c conda-forge mamba
conda activate base
mamba create -c conda-forge -c bioconda -n snakemake snakemake
Additionally, you might need to run conda update conda .

Download this repository as a .zip file (click on Code in the upper right corner of this repository --> Download ZIP), move the .zip file to the desired directory on your computer and unpack it. Make sure to edit INPUT/sample_list.csv and INPUT/config.yaml before starting the data processing (see below for a more detailed description). Open the terminal in this directory and enter: conda activate snakemake .

The pipeline can be executed by pasting snakemake --use-conda --cores all -R createMSDB into the terminal. The progress of the pipeline execution should appear in your terminal window. In case you have installed an older version of Conda/Snakemake and encounter an error when executing the pipeline, try executing snakemake --use-conda --cores all -R createMSDB --conda-frontend conda .

After your jobs have finished, enter conda deactivate to deactivate your Conda environment.

input

invitroSPI identifies spliced and non-spliced peptides from Mascot search result files. To this end, the user must provide a table sample_list.csv in the INPUT/ folder containing information about:

  • project name

  • substrate ID

  • substrate sequence

  • time point

  • search result file name

  • replicate

  • MSfile (optional)

  • metainformation (optional)

!!! If you create or edit the sample_list.csv file in Microsoft Excel, make sure to save it as an actual comma-separated file, i.e., Save as... --> Comma-separated Values (.csv). To check that the sample list is in the correct format, open it in a text editor and verify that the columns are separated by commas (and NOT semicolons). !!!

Additionally, the user must provide search result files deposited in the folder INPUT/search_results/ and list their names in the sample_list.csv table.

An example of the sample_list.csv table is given below and can be modified accordingly by the user:

project_name substrateID substrateSeq digestTime filename replicate MSfile metainformation
test_data TSN2 TSN2.fasta 4 F029125.csv 1 INPUT/metainformation_testData.csv
test_data TSN89 RTKAWNRQLYPEW 4 F029129.csv 1 INPUT/metainformation_testData.csv
test_data TSN2 VSRQLRTKAWNRQLYPEWTEAQR CTRL F029123.csv 1 INPUT/metainformation_testData.csv
test_data TSN89 RTKAWNRQLYPEW CTRL F029127.csv 1 INPUT/metainformation_testData.csv

You can either paste the substrate sequence in the sample_list directly or put the name of a single-entry .fasta file containing the substrate sequence. This file should be located in INPUT/sequences .

Note that the filenames of the control files must also be provided. For the control files, put CTRL in the digestTime column. Please provide all other time points in hours ( e.g. , 0.5 for 30 min).
MSfile is an optional column that might be helpful to keep track of the .raw/.mgf files that were used to generate the respective search result files. It will not be used to create the database.
In case you would like to include additional information in the final database (for instance species, location, instrument, substrate origin, proteasome isotype, ...), you could do so by creating a .csv table containing all this information. The .csv table must also contain a column filename corresponding to the filename in the sample list (see INPUT/metainformation_SciData2022 for an example). Provide the name of the metainformation table in the metainformation column of the sample list.
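Before launching the pipeline, it can be worth sanity-checking the sample list in R. The snippet below is a hypothetical helper, not part of invitroSPI; it builds a toy sample list in memory using the column names from the table above and checks that every substrate has a control run:

```r
# Hypothetical sanity check for a sample list (not part of the pipeline).
# In practice you would read INPUT/sample_list.csv with read.csv().
sample_list <- data.frame(
  project_name = "test_data",
  substrateID  = c("TSN2", "TSN2"),
  substrateSeq = c("VSRQLRTKAWNRQLYPEWTEAQR", "VSRQLRTKAWNRQLYPEWTEAQR"),
  digestTime   = c("4", "CTRL"),
  filename     = c("F029125.csv", "F029123.csv"),
  replicate    = 1,
  stringsAsFactors = FALSE
)

# required columns (MSfile and metainformation are optional)
required <- c("project_name", "substrateID", "substrateSeq",
              "digestTime", "filename", "replicate")
stopifnot(all(required %in% names(sample_list)))

# every substrate should come with at least one control digestion
hasCtrl <- tapply(sample_list$digestTime, sample_list$substrateID,
                  function(x) any(x == "CTRL"))
if (!all(hasCtrl)) warning("substrates without a CTRL run found")
```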

The invitroSPI workflow is constructed in such a way that the user can keep appending projects and search result files to the sample_list . However, only the samples of the current project (which is specified in INPUT/config.yaml ) are processed and stored separately in OUTPUT/project_name subdirectories.

output

The pipeline provides a final database of identified spliced and non-spliced peptides in .csv format ( OUTPUT/project_name/ProteasomeDB.csv ) as well as all intermediate files in binary format ( OUTPUT/project_name/tmp/ ).

Additionally, some diagnostic plots and database statistics are produced, which can also be found in the OUTPUT/ folder. They comprise:

  • number of unique peptides

  • number and frequencies of peptides and PSMs at each time point

  • length distribution of peptides, splice reactants and intervening sequences at each time point
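These statistics boil down to simple aggregations over the final database. The following sketch uses toy rows with invented sequences; only the column names mirror ProteasomeDB.csv:

```r
# Toy version of the database statistics (invented rows; column names
# follow ProteasomeDB.csv).
db <- data.frame(
  pepSeq      = c("RTKAW", "KAWNR", "RTKAW", "RTNRQ"),
  productType = c("PCP", "PCP", "PCP", "PSP"),
  digestTime  = c(4, 4, 20, 4),
  stringsAsFactors = FALSE
)

# number of unique peptides overall and per time point
nUnique <- length(unique(db$pepSeq))
perTime <- tapply(db$pepSeq, db$digestTime, function(x) length(unique(x)))

# peptide length distribution over unique peptides
lenDist <- table(nchar(unique(db$pepSeq)))
```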

parameters that can be modified by the user

We are using the following default parameters that are specified in the INPUT/config.yaml file and that can be changed by the user:

  • project_name : indicates which files listed in the sample_list.csv should be processed (enter valid directory names only! - no spaces or German umlauts)

  • delta_score : 0.3

  • ion_score : 20

  • q_value : 0.05

  • include_scanNum : "yes"

  • keep_synErrors : "no"

Processing of scan numbers is only possible if .mgf files were created with msconvert or Mascot Distiller. If you provide search results in another format (not recommended), please set include_scanNum to "no". If you would like to include synthesis errors (labelled as such) in the final ProteasomeDB , change the keep_synErrors flag accordingly.
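For reference, a config.yaml using the defaults above might look like this (the project name is a placeholder and must match the project_name column in your sample list):

```yaml
project_name: "test_data"   # must match project_name in sample_list.csv
delta_score: 0.3
ion_score: 20
q_value: 0.05
include_scanNum: "yes"
keep_synErrors: "no"
```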

Code Snippets

library(plyr)
library(dplyr)
library(seqinr)
library(stringr)

source("SOURCE/loadFunctions.r")

print("--------------------------------------------")
print("1) PARSE SEARCH RESULT FILES AND CREATE MSDB")
print("--------------------------------------------")

project_name = snakemake@params[["project_name"]]
include_scanNum = snakemake@params[["include_scanNum"]]

SEARCHRESULTS_PATH = "INPUT/search_results/"


### INPUT ###
sample_list = read.csv(file = snakemake@input[["sample_list"]],
                       sep = ",", header = T, stringsAsFactors = F)

# sample_list = read.table(file = "INPUT/sample_list.csv",
#                          sep = ",", header = T, stringsAsFactors = F)

# filter sample list
sample_list = sample_list[sample_list$project_name == project_name, ]

# add meta-information
if (!any(is.na(sample_list$metainformation)) & length(unique(sample_list$metainformation)) == 1) {
  metainfo = read.csv(sample_list$metainformation[1], stringsAsFactors = F)
  meta = T
} else {
  meta = F
}

### MAIN PART ###

if (!dir.exists(paste0("OUTPUT/", project_name))) {
  dir.create(paste0("OUTPUT/", project_name))
}

columns = c(            "runID" ,
                        "substrateID" ,
                        "substrateSeq" ,
                        "digestTime" ,
                        "pepSeq" ,
                        "productType",
                        "spliceType",
                        "positions",
                        "scanNum",
                        "rank",
                        "ionScore",
                        "qValue",
                        "charge",
                        "PTM")

MSDB = list()
pb = txtProgressBar(min = 0, max = nrow(sample_list), style = 3)

for (i in 1:nrow(sample_list)) {
  # setTxtProgressBar(pb, i)

  # load search result file
  con = file(paste0(SEARCHRESULTS_PATH, sample_list$filename[i]),"r")
  i1 = readLines(con)
  close(con)

  # remove header from search result file
  # and remove not assigned queries if there are any
  start = grep("prot_hit_num",i1)

  currentSearchFile = read.table(paste(SEARCHRESULTS_PATH, sample_list$filename[i],sep="/"),
                                 skip = start-1,
                                 sep = ",", header = TRUE, fill = TRUE)

  kk = which(currentSearchFile$prot_hit_num == "")
  if (length(kk) > 0) {
    kk = min(kk)
    currentSearchFile = currentSearchFile[c(1:(kk-2)),]
  }

  substrateID = sample_list$substrateID[i]
  digestTime = sample_list$digestTime[i]


  if(nrow(currentSearchFile) > 0){

    # empty DB for current search result file
    currentDB = data.frame(matrix(ncol = length(columns), nrow = nrow(currentSearchFile)))
    colnames(currentDB) = columns

    # fill data frame with info straight outta .csv
    currentDB$charge <- currentSearchFile$pep_exp_z
    currentDB$rank <- currentSearchFile$pep_rank
    currentDB$pepSeq <- currentSearchFile$pep_seq
    currentDB$ionScore <- currentSearchFile$pep_score
    currentDB$qValue <- currentSearchFile$pep_expect


    currentDB$PTM <- currentSearchFile$pep_var_mod
    currentDB$scanNum <- currentSearchFile$pep_query

    # currentDB[which(duplicated(currentDB) | duplicated(currentDB, fromLast = T)),] %>% nrow()
    # cntS = currentSearchFile[, c("pep_exp_z","pep_rank","pep_seq", "pep_score", "pep_expect", "pep_var_mod", "pep_query")]
    # cntS[which(duplicated(cntS) | duplicated(cntS, fromLast = T)),] %>% nrow()

    #get actual scan numbers, not query numbers
    # varies between Mascot and Mascot Distiller
    if(include_scanNum != "no") {

      warn = F

      scans = rep(NA,dim(currentSearchFile)[1])
      for(ii in 1:dim(currentSearchFile)[1]){
        tit = currentSearchFile$pep_scan_title[ii]

        if (str_detect(tit, pattern = "scan=")) {

          scans[ii] = as.numeric(strsplit(strsplit(tit,split="scan=")[[1]][2],split="~")[[1]][1])

        } else if (str_detect(tit, pattern = "Scan ")) {

          scans[ii] = as.numeric(unlist(strsplit(tit, " "))[3])

        } else if (str_detect(tit, pattern = coll(".")) | str_detect(tit, pattern = "TSN")) {
          scans[ii] = as.numeric(strsplit(tit,split="\\.")[[1]][2])

        } else {
          warn = T
        }

      }
      currentDB$scanNum <- scans

      if (warn) {
        print("could not extract scan numbers - check search result formatting!")
        currentDB$scanNum = currentSearchFile$pep_query
      }

    }


    #extract product type and position information
    pepName <- as.character(currentSearchFile$prot_acc)
    pepType <- c()
    pepPos <- c()
    for (z in 1:length(pepName)){
      splitPepName <- unlist(strsplit(pepName[z], "_"))
      pepType <- c(pepType, splitPepName[1])
      splitPepName <- splitPepName[-1]
      pepPos <- c(pepPos, paste(splitPepName, collapse = "_"))

    }

    currentDB$productType <- pepType
    currentDB$positions <- pepPos

    # substrate sequence
    if (grepl(pattern = ".fasta", sample_list$substrateSeq[i])) {

      cntSeq = read.fasta(file = paste0("INPUT/sequences/", sample_list$substrateSeq[i]),
                          seqtype = "AA", strip.desc = T)[[1]] %>%
        unlist() %>%
        paste(collapse = "")

    } else {
      cntSeq = sample_list$substrateSeq[i]
    }

    # add metainfo
    currentDB$substrateID = substrateID
    currentDB$substrateSeq <- cntSeq
    currentDB$digestTime <- digestTime
    currentDB$runID = paste(substrateID, digestTime,sample_list$filename[i], sample_list$replicate[i], sep = "-")

    # add further meta-information if provided
    if (meta) {

      cntMeta = metainfo[metainfo$filename == sample_list$filename[i], ]
      currentDB = cbind(currentDB, cntMeta[rep(1,nrow(currentDB)),]) %>%
        as.data.frame()

      dp = which(duplicated(names(currentDB)))
      if (length(dp) > 0) { currentDB = currentDB[,-dp]}

    }

    # assign substrate hits as PCP
    substratehits = which(currentDB$substrateSeq == currentDB$pepSeq)
    if (length(substratehits) > 0) {
      currentDB$productType[substratehits] = "PCP"
      currentDB$positions[substratehits] = paste(1,
                                                 nchar(currentDB$substrateSeq[substratehits]),
                                                 sep = "_")
    }

    print("")
    paste0("obtain correct annotations for positions and splice types for run: ",
           currentDB$runID[1]) %>%
      print()

    currentDB = currentDB %>% mapping()
    currentDB$spliceType[currentDB$productType == "PCP"] = NA

  } else {
    currentDB = data.frame(matrix(ncol = length(columns), nrow = 0))
    colnames(currentDB) = columns
    print("No peptides listed in this file! Skipping...")
  }

  # append data frame for current search file to master data frame
  MSDB[[i]] = currentDB

}

MSDB = plyr::ldply(MSDB)

### OUTPUT ###
save(MSDB, file = unlist(snakemake@output[["MSDB"]]))
library(stringr)
library(plyr)
library(dplyr)

print("------------------------------------------------")
print("2) EXTRACT HIGH-QUALITY PSMs FROM SEARCH RESULTS")
print("------------------------------------------------")

project_name = snakemake@params[["project_name"]]
delta_score = snakemake@params[["delta_score"]]
ion_score = snakemake@params[["ion_score"]]
q_value = snakemake@params[["q_value"]]

print(paste0("Delta score: ", delta_score))
print(paste0("Mascot ion score: ", ion_score))
print(paste0("q-value: ", q_value))


### INPUT ###
sample_list = read.csv(file = snakemake@input[["sample_list"]],
                       sep = ",", header = T, stringsAsFactors = F)
# filter sample list
sample_list = sample_list[sample_list$project_name == project_name, ]

load(snakemake@input[["MSDB"]])


### MAIN PART ###
columns = names(MSDB)
runIDs = unique(MSDB$runID)

allPSMs = list()
allPSMs_Delta = list()

pb = txtProgressBar(min = 0, max = length(runIDs), style = 3)
for (i in 1:length(runIDs)) {

  setTxtProgressBar(pb, i)

  selected <- data.frame(matrix(ncol = length(columns), nrow = 0))
  colnames(selected) = columns

  selectedDeltaRecord = data.frame(matrix(ncol = length(columns), nrow = 0))
  colnames(selectedDeltaRecord) <- columns

  runIDTable = MSDB[MSDB$runID == runIDs[i],]
  scanNum = as.character(runIDTable$scanNum) %>% unique()

  # bah = sapply(scanNum, function(x){
  #   all(runIDTable$rank[runIDTable$scanNum == x] > 1)
  # }) %>% which()
  # if (length(bah) > 0) {
  #   print("THERE ARE PSMS WITHOUT A TOP-RANKED PEPTIDE!!")
  # }


  # Iterate through scanNum and sort by rank
  for (k in 1:length(scanNum)) {

    PSPcandidatesIndex = NULL
    PCPcandidatesIndex = NULL

    deltaRecord = data.frame(matrix(ncol = length(columns), nrow = 0))
    colnames(deltaRecord ) = columns

    scanNumTable = runIDTable[runIDTable$scanNum == scanNum[k],]
    scanNumTable = scanNumTable[order(scanNumTable$rank),]

    # # How many entries in filteredScans are rank 1?
    scanNumTable = unique(scanNumTable)
    filteredScans = scanNumTable
    filteredScans = filteredScans[order(filteredScans$rank),]

    # print(filteredScans)

    # extract the top score
    topScore <- filteredScans[1, "ionScore"]

    # extract scans with rank 1
    allRanks = table(filteredScans$rank)
    NoTopRankedScans <- allRanks[names(allRanks) == 1]

    # it can somehow happen that there is no top-ranked peptide...
    if (length(NoTopRankedScans) == 0) {
      filteredScans <- filteredScans[0,]
      paste0("!!! NO RANK 1 PEPTIDES FOUND IN SCAN ", scanNum[k], ", RUN-ID ", runIDs[i], " !!!") %>%
        print()
      print("likely, there is an issue with the Mascot search settings - check!")

    } else if (NoTopRankedScans > 1) {

      topIndices = which(filteredScans$rank == 1)
      NoTopRankedPeps = gsub("I","L",filteredScans$pepSeq[topIndices]) %>%
        unique() %>%
        length()


      PCPindex = which(filteredScans$productType[filteredScans$rank == 1]=="PCP")
      PSPindex = which(filteredScans$productType[filteredScans$rank == 1]=="PSP")

      # if more than one peptide is rank 1
      # keep scan only if there is just a single PCP, otherwise discard
      if (NoTopRankedPeps > 1) {

        if (length(PCPindex) >= 1) {

          if (length(unique(gsub("I","L",filteredScans$pepSeq[PCPindex]))) == 1) {
            filteredScans = filteredScans[PCPindex[1],]
          } else {
            filteredScans = filteredScans[0,]
          }

          } else {

            if (length(PSPindex) > 1) {
              index = which(filteredScans$productType[which(filteredScans$rank == 1)]=="PSP")
              PSPcandidatescores <- filteredScans[index, "ionScore"]
              ind = which(1-PSPcandidatescores/topScore <= delta_score)
              deltaRecord = rbind(deltaRecord,filteredScans[index[ind],])
            }

            filteredScans = filteredScans[0,]

        }

      } else {
        filteredScans = filteredScans[1,]
      } 

    }
    # if there is only one scan on rank 1, there is also only one peptide

    # if there is only a single scan on rank 1
    # (nrow(filteredScans) will be larger than 1 since the previous if condition did not apply)
    # if the 1st rank is a PCP, assign it as such
    else if (NoTopRankedScans == 1 & filteredScans[1, "productType"] == "PCP") {

      filteredScans <- filteredScans[1,]

      # If rank 1 entry is PSP check if a likely PCP is present in the lower ranks (PCPcandidates)
      # also check if there are lower-ranked PSPs
    } else if (NoTopRankedScans == 1 & filteredScans[1, "productType"] == "PSP") {

      # lower-ranked PCPs and PSPs
      PCPcandidatesIndex_temp <- which(as.numeric(filteredScans$rank) > 1 & as.character(filteredScans$productType) == "PCP")
      PSPcandidatesIndex_temp <- which(as.numeric(filteredScans$rank) > 1 & as.character(filteredScans$productType) == "PSP")

      # if there are no PSPs or PCPs with a lower rank, assign the PSP
      if(length(PCPcandidatesIndex_temp)==0 & length(PSPcandidatesIndex_temp)==0){
        filteredScans <- filteredScans[1,]
      }

      # if there are PSPs with a lower rank --> get highest-ranked PSM
      if (length(PSPcandidatesIndex_temp)>0) {
        #keeps only top PSP candidate
        PSPcandidatesIndex = which(as.numeric(filteredScans$rank)==min(as.numeric(filteredScans$rank)[PSPcandidatesIndex_temp]) & as.character(filteredScans$productType) == "PSP")[1]
        # keeps all PSP candidates
        PSPcandidatesIndexAll = PSPcandidatesIndex_temp
      } else {
        PSPcandidatesIndex = NULL
      }

      # if there are PCPs with a lower rank --> get highest-ranked PSM
      if (length(PCPcandidatesIndex_temp)>0) {
        PCPcandidatesIndex = which(as.numeric(filteredScans$rank)==min(as.numeric(filteredScans$rank)[PCPcandidatesIndex_temp]) & as.character(filteredScans$productType) == "PCP")
      } else {
        PCPcandidatesIndex = NULL
      }

      # for lower-ranked PCPs --> calculate Delta score
      # Keep top ranked PSP if deltascore > 0.3 and remove rest
      if (length(PCPcandidatesIndex) > 0) {


        PCPcandidatescore <- filteredScans[PCPcandidatesIndex[1], "ionScore"]
        # if the Delta score between the top-ranked PSP and a lower-ranked PCP
        # is larger than the threshold
        if (1 - PCPcandidatescore / topScore > delta_score) {

          # if there are any lower-ranked PSPs
          # Calculate delta_score scores also for lower-ranked PSPs
          if (length(PSPcandidatesIndex)>0) {
            PSPcandidatescore <- filteredScans[PSPcandidatesIndex, "ionScore"]

            # lower-ranked PSP, but with very low ion score --> assign top-ranked PSP
            if (1 - PSPcandidatescore / topScore > delta_score) {
              filteredScans <- filteredScans [1,]

              # discard spectrum bc there are several PSP candidates with similar
              # ion score
            } else if (1 - PSPcandidatescore / topScore <= delta_score) {

              PSPcandidatescores <- filteredScans[PSPcandidatesIndexAll, "ionScore"]
              ind = which(1-PSPcandidatescores/topScore <= delta_score)
              deltaRecord = rbind(deltaRecord,filteredScans[PSPcandidatesIndexAll[ind],])

              filteredScans <- filteredScans[0,]
            }

            # if there are no lower-ranked PSPs --> assign the top-ranked PSP
          } else if (length(PSPcandidatesIndex)==0){
            filteredScans <- filteredScans[1,]
          }

          # if the Delta score is too small:
          # Take candidate PCP as new top ranked and remove rest
        } else if (1 - PCPcandidatescore / topScore <= delta_score) {
          if (length(PCPcandidatesIndex) == 1) {
            filteredScans <- filteredScans[PCPcandidatesIndex,]
          } else {

            # more than one lower-ranked PCP that survived Delta filter
            if (length(unique(gsub("I","L",filteredScans$pepSeq[PCPcandidatesIndex]))) == 1) {
              filteredScans <- filteredScans[PCPcandidatesIndex[1],]
            } else {
              filteredScans = filteredScans[0,]
            }

          }

        }

        # end if there are both PCPs and PSPs in lower ranks
        # if there are lower ranked PSPs but no PCPs: continue

      } else if (length(PSPcandidatesIndex) > 0 & length(PCPcandidatesIndex)==0){
        PSPcandidatescore <- filteredScans[PSPcandidatesIndex, "ionScore"]

        # keep the candidate PSP if the Delta score is large enough
        if (1 - PSPcandidatescore / topScore > delta_score) {
          filteredScans <- filteredScans [1,]

          # if the Delta score is too small --> discard entire scan
        } else if (1 - PSPcandidatescore / topScore <= delta_score) {

          PSPcandidatescores <- filteredScans[PSPcandidatesIndexAll, "ionScore"]
          ind = which(1-PSPcandidatescores/topScore <= delta_score)
          deltaRecord = rbind(deltaRecord,filteredScans[PSPcandidatesIndexAll[ind],])

          filteredScans <- filteredScans[0,]
        }

      }
    }

    # apply filter for q-value and ion score
    if(nrow(filteredScans) > 0){
      if(as.double(filteredScans$qValue) >= q_value | as.double(filteredScans$ionScore) <= ion_score){
        filteredScans <- filteredScans[0,]
      }else{
        selected <- rbind(filteredScans, selected)
      }
    }

    if(nrow(deltaRecord)> 0){
      k = which(as.double(deltaRecord$qValue) < q_value & as.double(deltaRecord$ionScore) > ion_score)
      deltaRecord <- deltaRecord[k,]

      if(nrow(deltaRecord)> 0){
        selectedDeltaRecord <- rbind(deltaRecord,selectedDeltaRecord)
      }
    }

  }

  allPSMs[[i]] = selected
  allPSMs_Delta[[i]] = selectedDeltaRecord

}

allPSMs = plyr::ldply(allPSMs)
allPSMs_Delta = plyr::ldply(allPSMs_Delta)

paste0("number of PSMs: ", nrow(allPSMs)) %>%
  print()


### OUTPUT ###

save(allPSMs, file = unlist(snakemake@output[["allPSMs"]]))
save(allPSMs_Delta, file = unlist(snakemake@output[["allPSMs_Delta"]]))
library(stringr)
library(dplyr)

print("----------------------------")
print("3) IDENTIFY SYNTHESIS ERRORS")
print("----------------------------")


### INPUT ###
load(snakemake@input[["allPSMs"]])
keep_synErrors = snakemake@params[["keep_synErrors"]]

# gsub("I","L",subControls0$pepSeq) %>% unique() %>% length()
# gsub("I","L",ProteasomeDB$pepSeq[str_detect(ProteasomeDB$productType, "synError")]) %>% unique() %>% length()
# gsub("I","L",ProteasomeDB$pepSeq[str_detect(ProteasomeDB$productType, "_synErrorPrecursor")]) %>% unique() %>% length()
# gsub("I","L",ProteasomeDB$pepSeq[str_detect(ProteasomeDB$productType, "_synError$")]) %>% unique() %>% length()


### MAIN PART ###
subControls0 = allPSMs[allPSMs$digestTime == "CTRL", ]
extracted0 = allPSMs[allPSMs$digestTime != "CTRL", ]

unqTSNctrl<-unique(subControls0$substrateID)
unqTSNextr<-unique(extracted0$substrateID)
unqTSN = unqTSNextr

allExtracted = numeric()
for (i in 1:length(unqTSN))
{

  subIndex=which(subControls0$substrateID == unqTSN[i])
  extIndex=which(extracted0$substrateID == unqTSN[i])
  extracted<-extracted0[extIndex,]
  subControls<-subControls0[subIndex,]


  # Get all unique peptide seqs present in the substrate controls
  subControlSeq <- unique(gsub("I","L",subControls$pepSeq))
  subControlPCP = unique(gsub("I","L",subControls$pepSeq[subControls$productType == "PCP"]))

  # get SR2s of detected sequences
  extPos = str_split_fixed(extracted$positions,"[:punct:]",Inf)[,c(1:4)]
  extPos = apply(extPos,2,as.numeric)
  extracted = extracted %>%
    mutate(sr2s = gsub("I","L",str_sub(extracted$substrateSeq, extPos[,3], extPos[,4])),
           synErrSR2 = ifelse(sr2s %in% subControlPCP, "yes", "no"),
           synErrSR2 = ifelse(productType == "PCP", NA, synErrSR2)) %>%
    select(-sr2s)


  # Check which entries of extracted are also present in the substrate controls.
  errorIndex <- which(gsub("I","L",extracted$pepSeq) %in% subControlSeq)
  errorSeqs <- unique(gsub("I","L",extracted$pepSeq[errorIndex]))

  # We conservatively regard all matches as synthesis errors and delete them from extracted
  # extracted0 <- extracted0[-extIndex[errorIndex], ]

  # Match PSPs to errors
  PSPtoRemove <- c()
  for (b in 1:nrow(extracted)){
    candidateType <- extracted$productType[b]
    if (str_detect(candidateType,"PSP")){
      candidateSeq <- gsub("I","L",extracted$pepSeq[b])

      if (any(grepl(candidateSeq, subControlSeq))){
        PSPtoRemove <- c(PSPtoRemove, b)
      }
    }
  }

  mergeIndex=(unique(c(errorIndex, PSPtoRemove)))
  k = which(PSPtoRemove%in%errorIndex)
  if(length(k)>0){
    PSPtoRemove = PSPtoRemove[-k]
  }
  # Label PSPs that were found to match errors
  if (length(errorIndex)>0) {
    extracted$productType[errorIndex] = paste(extracted$productType[errorIndex],"_synError",sep="")
  }
  if (length(PSPtoRemove)>0) {
    extracted$productType[PSPtoRemove] = paste(extracted$productType[PSPtoRemove],"_synErrorPrecursor",sep="")
  }

  allExtracted = rbind(allExtracted,extracted)

} # the for loop

extracted = allExtracted
subControls = subControls0

# Also, check if timepoint 0 measurements or substrate controls are missing for any substrates
# controlSubstrates <- c(unique(subControls$substrateID), zeroSubstrates)
controlSubstrates <- unique(subControls$substrateID)
extractedSubstrates <- unique(extracted$substrateID)

missingSubIndex <- which(!(extractedSubstrates %in% controlSubstrates))

# remove synthesis errors
if (keep_synErrors == "no") {

    if(length(missingSubIndex) > 0){
      print("WARNING! The following substrates do not have any substrate control measurements:")

      print(extractedSubstrates[missingSubIndex])

      print("They are being removed from your table!")
      # remove all substrates that lack control measurements (the original
      # indexing reused a stale loop variable and removed only one of them)
      extracted <- extracted[!(extracted$substrateID %in% extractedSubstrates[missingSubIndex]), ]

    }else{
      print("Success! All substrates in extracted are covered by substrate control measurements!")
    }


  print("Removing synthesis errors from the final dataset")

  kk = which(str_detect(extracted$productType, "_synError"))
  if (length(kk) > 0) {
    extracted = extracted[-kk, ]

    print("........................................................")
    paste0("number of synthesis errors that are being removed: ", length(kk)) %>%
      print()
    print("........................................................")

  }

}

### OUTPUT ###

save(extracted, file = unlist(snakemake@output[["PSMs"]]))
library(seqinr)
source("SOURCE/loadFunctions.r")

print("-----------------------------")
print("4) MAPPING AND POSTPROCESSING")
print("-----------------------------")


### INPUT ###
load(snakemake@input[["PSMs"]])
load(snakemake@input[["allPSMs_Delta"]])



### MAIN PART ##
d = mapping(extracted)
d_delta = mapping(allPSMs_Delta)


### OUTPUT ###
write.csv(d, file = unlist(snakemake@output[["ProteasomeDB"]]),
          row.names = F)
write.csv(d_delta, file = unlist(snakemake@output[["DeltaPeptides"]]),
          row.names = F)
library(dplyr)
library(stringr)
library(grid)
library(ggplot2)
library(gridExtra)

theme_set(theme_bw())

source("SOURCE/invitroSPI_utils.R")
source("SOURCE/plottingFunctions.R")

print("-------------------------------")
print("5) GENERATE DATABASE STATISTICS")
print("-------------------------------")


### INPUT ###
ProteasomeDB = read.csv(snakemake@input[["ProteasomeDB"]],
                        stringsAsFactors = F)
# ProteasomeDB = read.csv("OUTPUT/test_data/ProteasomeDB.csv", stringsAsFactors = F)

S = ProteasomeDB$substrateID %>% unique()
tps = ProteasomeDB$digestTime %>% unique() %>% as.numeric() %>% sort()


### MAIN PART ###

# ----- 1) pre-processing -----

Peps = ProteasomeDB %>%
  remSynthErrors() %>%
  ILredundancy() %>%
  filterPepLength() %>%
  disentangleMultimappers.Type() %>%
  disentangleMultimappers.SRlen() %>%
  disentangleMultimappers.IVSeqLen() %>%
  disentangleMultimappers.AA() %>%
  DBcosmetics()

ProteasomeDB = ProteasomeDB %>%
  remSynthErrors()


# ----- 2) general stats + peptides over time -----

uniquePeps = Peps %>% uniquePeptides()

# plotNumberofPeptides(ProteasomeDB,
#                      outname = "OUTPUT/test_data/number_of_products.pdf")
plotNumberofPeptides(ProteasomeDB,
                     outname = unlist(snakemake@output[["number_of_products"]]))

# pdf("OUTPUT/test_data/DB_stats.pdf", height = 4, width = 6)
pdf(file = unlist(snakemake@output[["DB_stats"]]), height = 4, width = 6)
generalStats(uniquePeps, tp = "all") %>% grid.table()

for (t in 1:length(tps)) {
  cntDB = Peps[Peps$digestTime == tps[t], ] %>%
    uniquePeptides()
  grid.newpage()
  generalStats(DB = cntDB, tp = tps[t]) %>% grid.table()

}
dev.off()



# ----- 3) coverage maps -----

# pdf("OUTPUT/test_data/coverage_map.pdf", height = 12, width = 16)
# pdf(unlist(snakemake@output[["coverage_map"]]), height = 12, width = 16)
# for (s in 1:length(S)) {
#   
#   cntDB = uniquePeps[uniquePeps$substrateID == S[s], ]
#   out = plotCoverage(cntDB, name = S[s], tp = "all")
#   replayPlot(out)
#   
# }
# 
# dev.off()

# ----- 4) length distributions + product type frequencies -----

# pdf("OUTPUT/Delta03/length_distributions.pdf", height = 10, width = 14)
pdf(file = unlist(snakemake@output[["length_distributions"]]), height = 12, width = 16)

freq = TypeFrequency(uniquePeps)
pl = PepLength(uniquePeps, tp="all")
srlen = SRLength(uniquePeps, tp="all")
ivlen = IVSeqLength(uniquePeps, tp="all")

cntG = grid.arrange(freq,pl,srlen,ivlen,
                    nrow = 2, ncol = 2)

print(cntG)

for (t in 1:length(tps)) {

  cntDB = uniquePeps[uniquePeps$digestTime == tps[t], ]

  freq = TypeFrequency(cntDB)
  pl = PepLength(cntDB, tp=tps[t])
  srlen = SRLength(cntDB, tp=tps[t])
  ivlen = IVSeqLength(cntDB, tp=tps[t])

  cntG = grid.arrange(freq,pl,srlen,ivlen,
                      nrow = 2, ncol = 2)

  print(cntG)

}
dev.off()
library(data.table)
library(parallel)
library(dplyr)
library(seqinr)
library(stringr)
library(beepr)


# numCPU = if (detectCores() > 8) detectCores()-1 else if (detectCores() >= 4) 4 else 1
# cl <- makeForkCluster(numCPU)


# ----- user parameters -----
nmers = c(5, 40)
generateTrans = T  # if false, generate only cis peptides
outdir = "../DATA/polypeptideDBs/"
suppressWarnings(dir.create(outdir, recursive = T))

# change here (you can read in a dataframe with the name and the sequence)
sample_list = read.csv("~/Documents/SCIENCE/_tools/aSPIre+invitroSPI/data/sample_list_aSPIre.csv", stringsAsFactors = F)
substrates = sample_list %>%
  select(substrateID, substrateSeq) %>%
  unique()


# ----- functions -----

getCis = function(subSeq, N, Lext,L) {

  if (L-1 >= N) {

    allCis = sapply(seq(1,L-N), function(i){

      sapply(seq(i+Lext-1,N-Lext+i-1), function(j){

        sr2 = N-j+i-1
        sapply(seq(j+2, L-sr2+1), function(k){

          n = k+sr2-1
          pepSeq = paste(str_sub(subSeq, start=i, end=j),
                         str_sub(subSeq, start=k, end=n), sep = "")
          positions = paste(i,j,k,n, sep = "_")

          return(c(pepSeq, positions))

        })
      })
    })
    CIS = unlist(allCis)
    cisPep = data.frame(pepSeq = CIS[seq(1,length(CIS),2)],
                        PositionSubstrate = CIS[seq(2,length(CIS),2)])
  } else {
    cisPep=NA
  }

  return(cisPep)
}
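As a quick sanity check of the `i_j_k_n` position encoding produced above, a cis-spliced peptide can be reconstructed from its position string (an illustrative sketch on a made-up substrate, not part of the pipeline):

```r
library(stringr)

# i_j delimit splice reactant 1, k_n delimit splice reactant 2
subSeq <- "ACDEFGHK"
pos <- as.numeric(str_split("1_2_5_7", "_", simplify = TRUE))
pep <- paste0(str_sub(subSeq, pos[1], pos[2]),  # SR1: residues 1-2 -> "AC"
              str_sub(subSeq, pos[3], pos[4]))  # SR2: residues 5-7 -> "FGH"
pep  # "ACFGH"
```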



getRevCis = function(subSeq, L, N, Lext=1) {


  if (L >= N) {

    allRevCis = sapply(seq(1,L-N+1), function(k){

      sapply(seq(k+Lext-1,N-Lext+k-1), function(n){

        sr1 = N-n+k-1
        sapply(seq(n+1, L-sr1+1), function(i){

          j = i+sr1-1

          pepSeq = paste(str_sub(subSeq, start=i, end=j),
                         str_sub(subSeq, start=k, end=n), sep = "")
          positions = paste(i,j,k,n, sep = "_")
          return(c(pepSeq, positions))
        })
      })
    })

    REVCIS = unlist(allRevCis)
    revcisPep = data.frame(pepSeq = REVCIS[seq(1,length(REVCIS),2)],
                           PositionSubstrate = REVCIS[seq(2,length(REVCIS),2)])

  } else {
    revcisPep = NA
  }


  return(revcisPep)
}




getAllTrans = function(L, N, subSeq, Lext=1) {

  allSRs = lapply(seq(Lext,N-Lext), function(sr1){

    # SR1
    Is = seq(1, L-sr1+1)
    Js = Is+sr1-1

    # SR2
    Ks = seq(1,L-N+sr1+1)
    Ns = Ks+N-sr1-1

    SR1_pos = paste(Is,Js,sep = "_")
    SR2_pos = paste(Ks,Ns,sep = "_")

    SR1 = str_sub(subSeq,Is,Js)
    SR2 = str_sub(subSeq,Ks,Ns)

    PEP = outer(SR1, SR2, paste, sep="") %>% as.vector()
    positions = outer(SR1_pos,SR2_pos,paste, sep = "_") %>% as.vector()

    return(data.frame(pepSeq = PEP,
                      PositionSubstrate = positions
    ))

  })

  transPep = plyr::ldply(allSRs)


  return(transPep)
}
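The `outer()` call at the heart of `getAllTrans` pairs every SR1 with every SR2 of complementary length; its behaviour can be checked on a toy case (illustrative only, values made up):

```r
library(stringr)
library(dplyr)

subSeq <- "ACDE"                  # 4-residue toy substrate
sr1 <- str_sub(subSeq, 1:3, 2:4)  # all dipeptide SR1s: "AC" "CD" "DE"
sr2 <- sr1                        # for N = 4, SR2 is also a dipeptide
peps <- outer(sr1, sr2, paste, sep = "") %>% as.vector()
length(peps)  # 9 candidate trans products, including self-pairings like "ACAC"
```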


getPCP = function(L, subSeq, N) {

  if (L > N) {

    allPCP = sapply(seq(1, L-N+1), function(i){
      j = N+i-1
      positions = paste(i, j, sep = "_")
      pepSeq = str_sub(subSeq, i, j)
      return(c(pepSeq, positions))
    })

    # sapply returns a 2 x n matrix: row 1 = sequences, row 2 = positions
    PCP_Pep = data.frame(pepSeq = allPCP[1, ],
                         PositionSubstrate = allPCP[2, ])
  } else {
    PCP_Pep = NA
  }

  return(PCP_Pep)
}
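`getPCP` enumerates every substrate window of length `N`; its output format can be illustrated with a small hand-rolled equivalent (a sketch on toy data, mirroring the `start_end` position encoding):

```r
library(stringr)

subSeq <- "ACDEFG"; N <- 3
starts <- seq(1, nchar(subSeq) - N + 1)
pcp <- data.frame(pepSeq = str_sub(subSeq, starts, starts + N - 1),
                  PositionSubstrate = paste(starts, starts + N - 1, sep = "_"))
pcp
#   pepSeq PositionSubstrate
# 1    ACD               1_3
# 2    CDE               2_4
# 3    DEF               3_5
# 4    EFG               4_6
```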


# ----- compute DB -----

for (i in 1:nrow(substrates)) {


  Name = substrates$substrateID[i]
  print("...............")
  print(Name)
  print("...............")

  # get substrate sequence and length
  subSeq = substrates$substrateSeq[i]
  if (grepl("\\.fasta$", subSeq)) {
    subSeq = read.fasta(file = subSeq, seqtype = "AA", strip.desc = T, as.string = T) %>%
      unlist() %>%
      suppressWarnings()
  }

  L = nchar(subSeq)

  # initiate empty fasta
  write.fasta(sequences = subSeq, names = Name,
              file.out=paste0(outdir, Name, "_allPSP_0.fasta"),
              open="w")

  start_time = Sys.time()
  # generate peptides
  if (generateTrans) {
    print("generating cis and trans-spliced peptides.....")

    # all cis and trans
    allPSP <- lapply(seq(nmers[1], nmers[2]), function(N){
      paste0("generating ",N,"-mers") %>% print()

      DB = getAllTrans(subSeq = subSeq, L = L, N = N, Lext=1)
      write.fasta(as.list(DB$pepSeq), names = paste0("PSP_", DB$PositionSubstrate),
                  file.out = paste0(outdir, Name, "_allPSP_0.fasta"),
                  open="a")
    })


    print("generating non-spliced peptides.....")
    Nmax = if(nmers[2] >= L-1) L-1 else nmers[2]

    allPCP <- lapply(seq(nmers[1], Nmax), function(N){
      paste0("generating ",N,"-mers") %>% print()

      DB = getPCP(subSeq = subSeq, L = L, N = N)
      write.fasta(as.list(DB$pepSeq), names = paste0("PSP_", DB$PositionSubstrate),
                  file.out = paste0(outdir, Name, "_allPSP_0.fasta"),
                  open="a")
    })

  } else {
    print("generating cis-spliced peptides.....")

    # forward cis
    allCis <- lapply(seq(nmers[1], nmers[2]), function(N){
      paste0("generating fwd cis ",N,"-mers") %>% print()

      DB = getCis(subSeq = subSeq, L = L, N = N, Lext=1)
      write.fasta(as.list(DB$pepSeq), names = paste0("PSP_", DB$PositionSubstrate),
                  file.out = paste0(outdir, Name, "_allPSP_0.fasta"),
                  open="a")
    })

    # reverse cis
    allRevCis <- lapply(seq(nmers[1], nmers[2]), function(N){
      paste0("generating rev cis ",N,"-mers") %>% print()

      DB = getRevCis(subSeq = subSeq, L = L, N = N, Lext=1)
      write.fasta(as.list(DB$pepSeq), names = paste0("PSP_", DB$PositionSubstrate),
                  file.out = paste0(outdir, Name, "_allPSP_0.fasta"),
                  open="a")
    })


  }

  end_time = Sys.time()
  paste0("..... finished after: ", difftime(end_time, start_time, units = "s")) %>% print()

}


print("FINISHED!")
beep("mario")
library(dplyr)
library(stringr)


########## plotting colours ########## 
plottingCols = c(
  PCP = "darkorange",
  PSP = "dodgerblue",
  cis = "darkslateblue",
  revCis = "lightskyblue",
  trans = "darkorchid",
  allcis = "darkslateblue",

  ProteasomeDB = "olivedrab4",
  randomDB = "lightsteelblue"
)


########## disentangle rows containing multimappers ##########
disentangleMultimappers.Type = function(DB) {

  print("DISENTANGLE MULTI-MAPPERS FOR PRODUCT TYPE")

  k = which(str_detect(DB$positions, coll(";")))

  if (length(k) > 0) {

    DB_mm = DB[k, ]
    DB_nomm = DB[-k, ]

    mm = list()

    pb = txtProgressBar(min = 0, max = nrow(DB_mm), style = 3)
    for (r in 1:nrow(DB_mm)) {

      setTxtProgressBar(pb, r)

      cnt_types = str_split(DB_mm$spliceType[r], coll(";"), simplify = T) %>%
        paste()

      cnt_pos = str_split(DB_mm$positions[r], coll(";"), simplify = T) %>%
        paste()

      cntDB = DB_mm[r, ]

      # all multi-mappers correspond to the same product type
      if (cnt_types %>% unique() %>% length() == 1) {
        cntDB$spliceType = cnt_types %>% unique()

        # in case of cis and trans --> keep cis
      } else if ( any("trans" %in% cnt_types) ) {

        x = which(cnt_types == "trans")
        cnt_types = cnt_types[-x]
        cnt_pos = cnt_pos[-x]

        if (cnt_types %>% unique() %>% length() == 1) {
          cntDB$spliceType = cnt_types %>% unique()

        } else {
          cntDB$spliceType = "type_multi-mapper"
        }

        cntDB$positions = cnt_pos %>% paste(collapse = ";")

      } else {
        cntDB$spliceType = "type_multi-mapper"
        cntDB$positions = cnt_pos %>% paste(collapse = ";")
      }

      mm[[r]] = cntDB
    }

    mm = plyr::ldply(mm)
    DB = rbind(DB_nomm, mm) %>%
      as.data.frame()
  }

  # beep("mario")
  return(DB)
}
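The reassignment rule above (discard `trans` explanations when a cis mapping exists for the same peptide) can be traced on one hypothetical multi-mapper row:

```r
cnt_types <- c("cis", "trans")          # two candidate explanations for one PSM
cnt_pos   <- c("1_3_5_7", "1_3_20_22")  # corresponding "i_j_k_n" strings

x <- which(cnt_types == "trans")        # drop the trans mapping
cnt_types <- cnt_types[-x]
cnt_pos   <- cnt_pos[-x]

spliceType <- unique(cnt_types)               # "cis"
positions  <- paste(cnt_pos, collapse = ";")  # "1_3_5_7"
```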


disentangleMultimappers.AA = function(DB, retSinglePos = T) {

  print("DISENTANGLE MULTI-MAPPERS FOR AA AT sP1 AND sP1'")

  DB$AAmultimapper = "no"

  # only considering PSP multi-mappers
  k = which(str_detect(DB$positions, coll(";")) & !str_detect(DB$productType, "PCP"))

  if (length(k) > 0) {
    DB_mm = DB[k, ]
    DB_nomm = DB[-k, ]

    pb = txtProgressBar(min = 0, max = nrow(DB_mm), style = 3)
    for (r in 1:nrow(DB_mm)) {

      setTxtProgressBar(pb, r)

      cnt_pos = strsplit(DB_mm$positions[r], ";") %>%
        unlist() %>%
        str_split_fixed(pattern = coll("_"), n = Inf)

      sP1 = str_sub(DB_mm$substrateSeq[r], start = cnt_pos[, 2], end = cnt_pos[, 2])
      sP1d = str_sub(DB_mm$substrateSeq[r], start = cnt_pos[, 3], end = cnt_pos[, 3])

      if ((length(unique(sP1)) > 1) | (length(unique(sP1d)) > 1)) {
        DB_mm$AAmultimapper[r] = "yes"

      } else if (retSinglePos) {
        DB_mm$positions[r] = paste(cnt_pos[1, c(1:4)], collapse = "_")
      }

    }

    DB = rbind(DB_nomm, DB_mm) %>%
      as.data.frame()
  }


  # beep("mario")
  return(DB)
}


disentangleMultimappers.SRlen = function(DB, retSinglePos = T) {

  print("DISENTANGLE MULTI-MAPPERS FOR SR LENGTH")

  DB$SRmultimapper = "no"

  # only considering PSP multi-mappers
  k = which(str_detect(DB$positions, coll(";")) & !str_detect(DB$productType, "PCP"))

  if (length(k) > 0) {

    DB_mm = DB[k, ]
    DB_nomm = DB[-k, ]

    pb = txtProgressBar(min = 0, max = nrow(DB_mm), style = 3)
    for (r in 1:nrow(DB_mm)) {

      setTxtProgressBar(pb, r)

      cnt_pos = strsplit(DB_mm$positions[r], ";") %>%
        unlist() %>%
        str_split_fixed(pattern = coll("_"), n = Inf)

      SR1len = as.numeric(cnt_pos[, 2]) - as.numeric(cnt_pos[, 1]) + 1
      SR2len = as.numeric(cnt_pos[, 4]) - as.numeric(cnt_pos[, 3]) + 1

      if ((length(unique(SR1len)) > 1) | (length(unique(SR2len)) > 1)) {
        DB_mm$SRmultimapper[r] = "yes"

      }  else if (retSinglePos) {
        DB_mm$positions[r] = paste(cnt_pos[1, c(1:4)], collapse = "_")
      }

    }

    DB = rbind(DB_nomm, DB_mm) %>%
      as.data.frame()


  }


  # beep("mario")
  return(DB)
}

disentangleMultimappers.IVSeqLen = function(DB, retSinglePos = T) {

  print("DISENTANGLE MULTI-MAPPERS FOR INTERVENING SEQUENCE LENGTH")

  DB$IVseqmultimapper = "no"

  # only considering PSP multi-mappers
  k = which(str_detect(DB$positions, coll(";")) & !str_detect(DB$productType, "PCP"))

  if (length(k) > 0) {

    DB_mm = DB[k, ]
    DB_nomm = DB[-k, ]

    pb = txtProgressBar(min = 0, max = nrow(DB_mm), style = 3)
    for (r in 1:nrow(DB_mm)) {

      setTxtProgressBar(pb, r)

      cnt_pos = strsplit(DB_mm$positions[r], ";") %>%
        unlist() %>%
        str_split_fixed(pattern = coll("_"), n = Inf)

      cnt_types = str_split(DB_mm$spliceType[r], coll(";"), simplify = T) %>%
        paste()

      ivLens = rep(NA, nrow(cnt_pos))
      for (p in 1:nrow(cnt_pos)) {

        # cis and trans
        if (as.numeric(cnt_pos[p, 3]) >= as.numeric(cnt_pos[p, 1])) {
          ivLens[p] = (abs(as.numeric(cnt_pos[p, 3]) - as.numeric(cnt_pos[p, 2])) - 1) %>%
            as.numeric()

        # revCis
        } else {
          ivLens[p] = (abs(as.numeric(cnt_pos[p, 1]) - as.numeric(cnt_pos[p, 4])) - 1) %>%
            as.numeric()
        }

      }


      if (length(unique(ivLens)) > 1) {
        DB_mm$IVseqmultimapper[r] = "yes"

      }  else if (retSinglePos) {
        DB_mm$positions[r] = paste(cnt_pos[1, c(1:4)], collapse = "_")
      }

    }

    DB = rbind(DB_nomm, DB_mm) %>%
      as.data.frame()

  }

  # beep("mario")
  return(DB)
}


removeMultimappers.Type = function(DB) {
  print("REMOVE PEPTIDES THAT CAN SOLELY BE MULTI-MAPPERS FOR PRODUCT TYPE")

  k = which(DB$spliceType == "type_multi-mapper")

  if (length(k) > 0) {
    DB = DB[-k, ]
  }

  return(DB)
}


removeMultimappers.AA = function(DB) {
  print("REMOVE PEPTIDES THAT ARE MULTI-MAPPERS IN TERMS OF AA AT sP1 OR sP1'")

  k = which(DB$AAmultimapper == "yes")

  if (length(k) > 0) {
    DB = DB[-k, ]
  }

  return(DB)
}


removeMultimappers.SRlen = function(DB) {
  print("REMOVE PEPTIDES THAT ARE MULTI-MAPPERS IN TERMS OF SR LENGTH")

  k = which(DB$SRmultimapper == "yes")

  if (length(k) > 0) {
    DB = DB[-k, ]
  }

  return(DB)
}

removeMultimappers.IVSeqLen = function(DB) {
  print("REMOVE PEPTIDES THAT ARE MULTI-MAPPERS IN TERMS OF INTERVENING SEQUENCE LENGTH")

  k = which(DB$IVseqmultimapper == "yes")

  if (length(k) > 0) {
    DB = DB[-k, ]
  }

  return(DB)
}


########## filter for peptide length ########## 
filterPepLength = function(DB, cutoff=5) {
  print(paste0("REMOVE PEPTIDES THAT ARE SHORTER THAN ", cutoff, " AA"))

  k = which(nchar(DB$pepSeq) < cutoff)

  if (length(k) > 0) {
    DB = DB[-k, ]
  }

  return(DB)
}

########## synthesis errors ########## 

remSynthErrors = function(DB) {
  print("REMOVE SYNTHESIS ERRORS")

  k = which(str_detect(DB$productType, "synError"))
  if (length(k) > 0) {
    DB = DB[-k, ]
  }

  return(DB)
}


keepSynthErrors = function(DB) {
  print("FILTER FOR SYNTHESIS ERRORS")

  k = which(str_detect(DB$productType, "synError"))
  if (length(k) > 0) {
    DB = DB[k, ]
  }

  return(DB)
}

########## early time points only ########## 
filterEarlyTimepoints = function(DB) {
  print("FILTER FOR EARLY TIME POINTS")

  DB = DB[which(DB$digestTime %in% c(2, 4)), ]

  return(DB)
}


########## 20S standard proteasome only ########## 
filter20Sstandard = function(DB) {
  print("FILTER FOR 20S STANDARD PROTEASOME")

  if ("protIsotype" %in% names(DB)) {
    DB = DB[DB$protIsotype %in% c("20S standard", "20S K562"), ]
  }

  return(DB)
}


########## I/L redundancy in product sequences #########
ILredundancy = function(DB) {

  print("REPLACE ALL ISOLEUCINES BY LEUCINES")

  DB$pepSeq = str_replace_all(DB$pepSeq, "I", "L")
  DB$substrateSeq = str_replace_all(DB$substrateSeq, "I", "L")

  return(DB)
}

########## unique product sequences ########## 
uniquePeptides = function(DB) {
  print("UNIQUE PEPTIDES")

  DB = DB %>%
    distinct(substrateID, substrateSeq, pepSeq, productType,
             #spliceType,
             positions, .keep_all = T)

  # DB2 = DB %>%
  #   group_by(substrateID, substrateSeq, pepSeq, productType,
  #            spliceType, positions) %>% slice(1)

  # DB3 = DB[, c("substrateID", "substrateSeq", "pepSeq",
  #             "productType", "spliceType", "positions")] %>%
  #   unique()

  # DB4 = DB[!duplicated(DB[, c("substrateID", "substrateSeq", "pepSeq",
  #                             "productType", "spliceType", "positions")]),]

  return(DB)
}

uniquePeptides.perTP = function(DB) {
  print("UNIQUE PEPTIDES FOR EACH TIME POINT")

  DB = DB %>%
    distinct(substrateID, substrateSeq, digestTime, pepSeq, productType,
             spliceType, positions, .keep_all = T)


  return(DB)
}

########## cosmetics ########## 
DBcosmetics = function(DB) {

  print("COSMETICS AND STATS")

  if ("X" %in% names(DB)) {
    DB$X = NULL
  }

  # replace NA in splice type column of PCPs
  DB$spliceType[str_detect(DB$productType, "PCP")] = "PCP"

  # # statistics
  # print(paste0("number of identified products: ", nrow(DB)))
  # 
  # print(table(DB$spliceType))
  # print(table(DB$spliceType) / nrow(DB))
  # 
  # print("substrates")
  # print(paste0("number of substrates: ", DB$substrateSeq %>% unique() %>% length()))
  # DB$substrateID %>%
  #   unique() %>% 
  #   paste(collapse = ", ") %>%
  #   print()

  return(DB)
}


########## load random DBs #########
loadRandom = function(fpath) {

  print("LOAD RANDOM DATABASES")

  # read cis, revCis, trans and PCP
  # concatenate and return list of data frames

  cis = read.csv(fpath[str_detect(fpath, "_cis")], stringsAsFactors = F)
  revCis = read.csv(fpath[str_detect(fpath, "_revCis")], stringsAsFactors = F)
  trans = read.csv(fpath[str_detect(fpath, "_trans")], stringsAsFactors = F)

  if (any(str_detect(fpath, "_PCP"))) {
    PCP = read.csv(fpath[str_detect(fpath, "_PCP")], stringsAsFactors = F)
  } else {
    PCP = NA
  }

  random = list(cis = cis,
                revCis = revCis,
                trans = trans,
                PCP = PCP)

  return(random)
}
generalStats = function(DB, tp) {
  cntTBL = table(DB$spliceType) %>% as.data.frame()
  cntTBL = cbind(cntTBL, 
                 (table(DB$spliceType) / nrow(DB)) %>% as.data.frame())
  names(cntTBL) = c(paste0("time point: ", tp), "# peptides", "type", "frequency")
  cntTBL$frequency = round(cntTBL$frequency, 4)

  return(cntTBL)
}


# ----- number of peptides over time ----
getNumberofPeptides = function(ProteasomeDB) {

  DB_filtered = ProteasomeDB %>%
    filterPepLength() %>%
    disentangleMultimappers.Type() %>%
    DBcosmetics()

  # fraction of PSMs
  psms = DB_filtered %>%
    group_by(substrateID,spliceType, digestTime) %>%
    summarise(n = n()) %>%
    ungroup() %>% group_by(substrateID, digestTime) %>%
    mutate(freq = n / sum(n))

  psms_allTP = DB_filtered %>%
    group_by(substrateID,spliceType) %>%
    summarise(n = n()) %>%
    ungroup() %>% group_by(substrateID) %>%
    mutate(freq = n / sum(n)) %>%
    mutate(digestTime = 0)

  # number + frequency of unique peptides
  peps = DB_filtered %>%
    uniquePeptides() %>%
    group_by(substrateID,spliceType, digestTime) %>%
    summarise(n = n()) %>%
    ungroup() %>% group_by(substrateID, digestTime) %>%
    mutate(freq = n / sum(n))

  peps_allTP = DB_filtered %>%
    uniquePeptides() %>%
    group_by(substrateID,spliceType) %>%
    summarise(n = n()) %>%
    ungroup() %>% group_by(substrateID) %>%
    mutate(freq = n / sum(n)) %>%
    mutate(digestTime = 0)

  return(list(psms = rbind(psms, psms_allTP[, names(psms)]),
              peps = rbind(peps, peps_allTP[, names(peps)])))

}


plotNumberofPeptides = function(ProteasomeDB, outname) {

  res = getNumberofPeptides(ProteasomeDB)

  psms = res$psms
  peps = res$peps

  if (length(unique(ProteasomeDB$digestTime)) == 1) {
    psms = psms[psms$digestTime == 0, ]
    peps = peps[peps$digestTime == 0, ]
  }

  # number of PSMs over time
  psms$digestTime = factor(psms$digestTime, levels = sort(unique(as.numeric(psms$digestTime))))
  psms$spliceType = factor(psms$spliceType, levels = c("PCP", "cis", "revCis", "trans", "type_multi-mapper"))

  theme_set(theme_classic())

  a = ggplot(psms, aes(x=digestTime, y=n, col=spliceType)) +
    geom_boxplot(position=position_dodge(width=0.8)) +
    geom_point(alpha = .3, position=position_dodge(width=0.8)) +
    scale_y_continuous(trans='log10') +
    scale_x_discrete(labels = gsub("^0$","all",sort(unique(psms$digestTime)))) +
    scale_color_manual("product type",
                       values = c(plottingCols["PCP"], plottingCols["cis"], plottingCols["revCis"],
                                  plottingCols["trans"], "gray"),
                       labels = c("non-spliced", "forward cis", "reverse cis", "trans", "multi-mapper")) +
    ggtitle("number of PSMs") +
    ylab("# PSMs") +
    xlab("digestion time")

  a2 = ggplot(psms, aes(x=digestTime, y=freq, col=spliceType)) +
    geom_boxplot(position=position_dodge(width=0.8)) +
    scale_x_discrete(labels = gsub("^0$","all",sort(unique(psms$digestTime)))) +
    scale_color_manual("product type",
                       values = c(plottingCols["PCP"], plottingCols["cis"], plottingCols["revCis"],
                                  plottingCols["trans"], "gray"),
                       labels = c("non-spliced", "forward cis", "reverse cis", "trans", "multi-mapper")) +
    ylim(c(0,1)) +
    ggtitle("frequency of PSMs") +
    ylab("frequency") +
    xlab("digestion time")


  peps$digestTime = factor(peps$digestTime, levels = sort(unique(as.numeric(peps$digestTime))))
  peps$spliceType = factor(peps$spliceType, levels = c("PCP", "cis", "revCis", "trans", "type_multi-mapper"))

  b = ggplot(peps, aes(x=digestTime, y=n, col=spliceType)) +
    geom_boxplot(position=position_dodge(width=0.8)) +
    geom_point(alpha = .3, position=position_dodge(width=0.8)) +
    scale_y_continuous(trans='log10') +
    scale_x_discrete(labels = gsub("^0$","all",sort(unique(peps$digestTime)))) +
    scale_color_manual("product type",
                       values = c(plottingCols["PCP"], plottingCols["cis"], plottingCols["revCis"],
                                  plottingCols["trans"], "gray"),
                       labels = c("non-spliced", "forward cis", "reverse cis", "trans", "multi-mapper")) +
    ggtitle("number of unique peptides") +
    ylab("# unique peptides") +
    xlab("digestion time")

  b2 = ggplot(peps, aes(x=digestTime, y=freq, col=spliceType)) +
    geom_boxplot(position=position_dodge(width=0.8)) +
    scale_x_discrete(labels = gsub("^0$","all",sort(unique(peps$digestTime)))) +
    scale_color_manual("product type",
                       values = c(plottingCols["PCP"], plottingCols["cis"], plottingCols["revCis"],
                                  plottingCols["trans"], "gray"),
                       labels = c("non-spliced", "forward cis", "reverse cis", "trans", "multi-mapper")) +
    ylim(c(0,1)) +
    ggtitle("frequency of unique peptides") +
    ylab("frequency") +
    xlab("digestion time")

  p = grid.arrange(a,b, a2,b2, nrow=2, ncol=2)
  ggsave(filename = outname,
         plot = p, device = cairo_pdf, height = 10, width = 14, dpi = "retina")
}


# ----- coverage maps ------

plotCoverage = function(ProteasomeDB, tp, name) {

  k = which(str_detect(ProteasomeDB$positions, ";"))
  if (length(k) > 0) {
    print("removing all multi-mappers")
    ProteasomeDB = ProteasomeDB[-k,]
  }

  pcp_pos = ProteasomeDB$positions[ProteasomeDB$productType == "PCP"]
  psp_pos = ProteasomeDB$positions[ProteasomeDB$productType == "PSP"]
  S = ProteasomeDB$substrateSeq[1]

  # --- PCPs ---
  pcp_pos = str_split(pcp_pos, "_", simplify = T)
  pcp_pos = apply(pcp_pos,2,as.numeric) %>%
    as.data.frame()
  pcp_pos = pcp_pos[order(pcp_pos$V1, decreasing = F), ]

  par(mfrow = c(2,1))
  y0 =0.1
  plot(NULL, ylim=c(0,nrow(pcp_pos)*0.05),
       ylab = "", xlab="substrate",
       xlim = c(0, nchar(S)),
       main = paste0(name, ": ", tp, " hrs"),
       axes=F)

  for (i in 1:nrow(pcp_pos)) {

    if (length(which(pcp_pos$V1[1:i] <= pcp_pos$V1[i] & 
                     pcp_pos$V2[1:i] >= pcp_pos$V1[i])) > 1) {

      cnt.intercept = length(which(pcp_pos$V1[1:i] <= pcp_pos$V1[i] & 
                                     pcp_pos$V2[1:i] >= pcp_pos$V1[i]))*0.1

    } else {cnt.intercept = y0}

    segments(x0 = pcp_pos$V1[i], y0 = cnt.intercept,
             x1 = pcp_pos$V2[i], y1 = cnt.intercept,
             col = plottingCols["PCP"], lwd = .5)
  }

  xlabel = c("0",
             paste(S %>% strsplit("") %>% unlist(), seq(1, nchar(S)), sep = ""))
  text(x = seq(0, nchar(S)), par("usr")[3]-.3, labels = xlabel,
       srt=90, xpd=T, cex=.5)
  axis(2, labels = NA)


  # --- PSPs ---
  psp_pos = str_split(psp_pos, "_", simplify = T)
  psp_pos = apply(psp_pos,2,as.numeric) %>%
    as.data.frame()
  psppos = psp_pos[order(psp_pos$V1, decreasing = F), ]

  psp_pos = rbind(as.matrix(psppos[,c(1:2)]), as.matrix(psppos[,c(3:4)])) %>%
    as.data.frame()

  y0 =0.1
  plot(NULL, ylim=c(0,nrow(psp_pos)*0.05),
       ylab = "", xlab="substrate",
       xlim = c(0, nchar(S)),
       main = paste0(name, ": ", tp, " hrs"),
       axes=F)

  for (i in 1:nrow(psp_pos)) {

    if (length(which(psp_pos$V1[1:i] <= psp_pos$V1[i] & 
                     psp_pos$V2[1:i] >= psp_pos$V1[i])) > 1) {

      cnt.intercept = length(which(psp_pos$V1[1:i] <= psp_pos$V1[i] & 
                                     psp_pos$V2[1:i] >= psp_pos$V1[i]))*0.1

    } else {cnt.intercept = y0}

    segments(x0 = psp_pos$V1[i], y0 = cnt.intercept,
             x1 = psp_pos$V2[i], y1 = cnt.intercept,
             col = plottingCols["PSP"], lwd = .5)
  }

  xlabel = c("0",
             paste(S %>% strsplit("") %>% unlist(), seq(1, nchar(S)), sep = ""))
  text(x = seq(0, nchar(S)), par("usr")[3]-.3, labels = xlabel,
       srt=90, xpd=T, cex=.5)
  axis(2, labels = NA)

  out = recordPlot()
  return(out)
}


# ----- length distributions -----
# function for violin plots
ViolinPlot = function(data) {

  # print stats
  s = data %>%
    dplyr::group_by(type) %>%
    dplyr::summarise(n = dplyr::n(),
                     mean = mean(value),
                     median = median(value),
                     std = sd(value))
  print.data.frame(s)

  theme_set(theme_classic())

  # plotting
  data$type = factor(data$type, levels = c("PCP", "cis", "revCis", "trans"))

  sc = ggplot(data = data, aes(x = type, y = value, fill = type)) +
    geom_violin(size = .1) +
    stat_summary(fun = median, fun.min = median, fun.max = median,
                 geom = "crossbar", 
                 width = .2,
                 position = position_dodge(width = 4)) +
    xlab("product type") +
    scale_fill_manual("product type",
                      values = c(plottingCols["PCP"], plottingCols["cis"], plottingCols["revCis"],
                                 plottingCols["trans"]),
                      labels = c("non-spliced", "forward cis", "reverse cis", "trans")) +
    theme(axis.text.x = element_text(angle = 90),
          legend.position = "none")

  # add labs and title downstream

  return(sc)
}
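A minimal usage sketch for `ViolinPlot` (toy data, not from the database; assumes ggplot2 and dplyr are loaded and that `plottingCols` is the named colour vector defined in `invitroSPI_utils.R`):

```r
# toy input: one row per peptide, with its length ("value") and product type
toy = data.frame(value = c(9, 11, 10, 12, 8, 13, 14, 9),
                 type  = rep(c("PCP", "cis", "revCis", "trans"), each = 2))

# prints per-type n/mean/median/sd and returns a ggplot object
p = ViolinPlot(data = toy)
p + ylab("peptide length (aa residues)")
```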


# peptide length
PepLength = function(DB, tp) {

  print("PEPTIDE LENGTH")
  DB = removeMultimappers.Type(DB)

  pl = data.frame(value = nchar(DB$pepSeq),
                  type = DB$spliceType)

  pepLen = ViolinPlot(data = pl) +
    ylab("peptide product length (aa residues)") +
    scale_x_discrete(labels = c("non-spliced", "forward cis", "reverse cis", "trans")) +
    # scale_y_continuous(limits = c(0, 40), breaks = seq(0, 40, by = 5)) +
    ggtitle("peptide length distribution",
            subtitle = paste0(tp, " hrs"))

  return(pepLen)
}


# SR length
SRLength = function(DB, tp) {

  print("SPLICE-REACTANT LENGTH")

  DB = DB[str_detect(DB$productType, "PSP"), ] %>%
    removeMultimappers.Type() %>%
    removeMultimappers.SRlen()

  srlen = function(db) {
    pos = str_split_fixed(db$positions, pattern = "_", n = Inf)

    sr = data.frame(value_sr1 = as.numeric(pos[, 2]) - as.numeric(pos[, 1]) + 1,
                    value_sr2 = as.numeric(pos[, 4]) - as.numeric(pos[, 3]) + 1,
                    type = db$spliceType)
    sr$type = factor(sr$type, levels = c("cis", "revCis", "trans"))
    return(sr)
  }

  sr = srlen(db = DB)

  sr1 = sr[, c("value_sr1", "type")]
  names(sr1)[1] = "value"
  sr2 = sr[, c("value_sr2", "type")]
  names(sr2)[1] = "value"

  sr1.plot = ViolinPlot(data = sr1) +
    ylab("splice reactant length (aa residues)") +
    scale_x_discrete(labels = c("forward cis", "reverse cis", "trans")) +
    # scale_y_continuous(limits = c(0, 40), breaks = seq(0, 40, by = 5)) +
    ggtitle("SR1 length distribution",
            subtitle = paste0(tp, " hrs"))

  sr2.plot = ViolinPlot(data = sr2) +
    ylab("splice reactant length (aa residues)") +
    scale_x_discrete(labels = c("forward cis", "reverse cis", "trans")) +
    # scale_y_continuous(limits = c(0, 40), breaks = seq(0, 40, by = 5)) +
    ggtitle("SR2 length distribution",
            subtitle = paste0(tp, " hrs"))

  sr.plot = arrangeGrob(grobs = list(sr1.plot, sr2.plot), ncol=2)

  return(sr.plot)
}
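As a worked example of the coordinate convention parsed by `srlen` above (a hypothetical `positions` string, not taken from the database): a spliced peptide annotated `"3_7_10_12"` has SR1 spanning substrate residues 3-7 and SR2 spanning residues 10-12.

```r
# split the positions string into the four substrate coordinates
pos = as.numeric(strsplit("3_7_10_12", "_")[[1]])

sr1.len = pos[2] - pos[1] + 1  # 7 - 3 + 1 = 5 residues
sr2.len = pos[4] - pos[3] + 1  # 12 - 10 + 1 = 3 residues
```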


# intervening sequence length
IVSeqLength = function(DB, tp) {

  print("INTERVENING SEQUENCE LENGTH")

  DB = DB[str_detect(DB$productType, "PSP"), ] %>%
    removeMultimappers.Type() %>%
    removeMultimappers.IVSeqLen()

  ivlen = function(db) {
    # intervening sequence length
    # calculated as in Specht et al., Paes et al. and SPI-ART
    pos = str_split_fixed(db$positions, pattern = "_", n = Inf)
    iv = rep(NA, nrow(db))

    cistrans.idx = which(db$spliceType %in% c("cis", "trans"))
    iv[cistrans.idx] = (abs(as.numeric(pos[cistrans.idx, 3]) - as.numeric(pos[cistrans.idx, 2])) - 1) %>%
      as.numeric()

    revcis.idx = which(db$spliceType == "revCis")
    iv[revcis.idx] = (abs(as.numeric(pos[revcis.idx, 1]) - as.numeric(pos[revcis.idx, 4])) - 1) %>%
      as.numeric()

    ivl = data.frame(value = iv,
                     type = db$spliceType)
    ivl$type = factor(ivl$type, levels = c("cis", "revCis", "trans"))
    return(ivl)
  }

  iv = ivlen(db = DB)
  iv = iv[iv$type != "trans", ]

  iv.plot = ViolinPlot(data = iv) +
    ylab("intervening sequence length (aa residues)") +
    scale_x_discrete(labels = c("forward cis", "reverse cis")) +
    ggtitle("intervening sequence length distribution",
            subtitle = paste0(tp, " hrs"))

  return(iv.plot)
}
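A standalone check of the intervening-sequence formula used in `ivlen`: for cis (and trans) products it is the distance between the end of SR1 and the start of SR2, whereas for revCis it is the distance between the end of SR2 and the start of SR1. The coordinates below are hypothetical illustrations.

```r
# cis: SR1 = residues 3-7, SR2 = residues 10-12 -> residues 8 and 9 intervene
pos = as.numeric(strsplit("3_7_10_12", "_")[[1]])
iv.cis = abs(pos[3] - pos[2]) - 1

# revCis: SR1 = 10-12, SR2 = 3-7 -> the same two residues intervene
pos = as.numeric(strsplit("10_12_3_7", "_")[[1]])
iv.revCis = abs(pos[1] - pos[4]) - 1
```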


# ------ type frequencies -----

TypeFrequency = function(DB) {

  print("FREQUENCY OF PRODUCT TYPES")

  Freq = DB %>% 
    removeMultimappers.Type() %>%
    group_by(spliceType) %>% 
    summarise(n = n()) %>%
    mutate(freq = n / sum(n)) %>%
    arrange(desc(spliceType)) %>%
    mutate(ypos = cumsum(freq)- 0.5*freq )

  print(Freq)

  Freq$spliceType = factor(Freq$spliceType, levels = c("PCP", "cis", "revCis", "trans"))

  freqs = ggplot(Freq, aes(x="", y=freq, fill=spliceType)) +
    geom_bar(stat="identity", width=1) +
    coord_polar("y", start=0) +
    scale_fill_manual("product type",
                      values = c(plottingCols["PCP"], plottingCols["cis"], plottingCols["revCis"],
                                 plottingCols["trans"]),
                      labels = c("non-spliced", "forward cis", "reverse cis", "trans")) +
    theme_void() +
    geom_text(aes(y = ypos, label=paste0("n = ", n)), size = 4)

  return(freqs)
}
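The `ypos` column computed in `TypeFrequency` centres each label within its pie slice: the cumulative frequency up to and including a slice, minus half that slice's width. A quick standalone check with hypothetical frequencies:

```r
# hypothetical product-type frequencies (must sum to 1), in plotting order
freq = c(trans = 0.1, revCis = 0.2, cis = 0.3, PCP = 0.4)

# slice-centre positions along the pie circumference
ypos = cumsum(freq) - 0.5 * freq
# trans 0.05, revCis 0.20, cis 0.45, PCP 0.80
```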


Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/QuantSysBio/invitroSPI
Name: invitrospi
Version: 1


Copyright: Public Domain
License: GNU General Public License v3.0