Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This repository was moved on 02/12/2021! Find the latest version here . The old code has been left here, but no new updates will be made.
Code Snippets
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 | from sys import argv from os import path, SEEK_CUR from scipy.stats import norm from collections import defaultdict import argparse import csv import pysam import pdb genome_length = int(3e9) def main(argv): #get arguments parser = argparse.ArgumentParser(description='simulate viral integrations') parser.add_argument('--sim-info', help='information from simulation', required = True, type=str) parser.add_argument('--sim-bam', help='bam file from read simulation (must be sorted, indexed)', required = True, type=str) parser.add_argument('--soft-threshold', help='threshold for number of bases that must be mapped when finding integrations in soft-clipped reads', type=int, default=20) parser.add_argument('--discordant-threshold', help='threshold for number of bases that may be unmapped when finding integrations in discordant read pairs', type=int, default=20) parser.add_argument('--mean-frag-len', help='mean framgement length used when simulating reads', type=int, default=500) parser.add_argument('--sd-frag-len', help='standard devation of framgement length used when simulating reads', type=int, default=30) parser.add_argument('--window-frac', help='fraction of the distribution of fragment sizes to use when searching for discordant read pairs', type=float, default=0.99) parser.add_argument('--output', help='output file', required=False, default='annotated.tsv') args = parser.parse_args() # read in bam file samfile = pysam.AlignmentFile(args.sim_bam) # iterate over integrations in info file and pull out reads crossing each one # the sam/bam file has the same coordinates as the info file # so just use the coordinates of the left and right junctions from this files with open(args.sim_info, newline='') as info_file, open(args.output, 'w', newline='') as output: # use csv to read and write files reader = csv.DictReader(info_file, delimiter = '\t') writer_fieldnames = list(reader.fieldnames) + ['left_chimeric', 'right_chimeric', 'left_discord', 'right_discord', 'multiple_discord', 'fake_discord', 'discordant_and_chimeric'] writer = csv.DictWriter(output, delimiter = '\t', fieldnames = writer_fieldnames) writer.writeheader() # set window size for looking for discordant pairs and # looking for multiple ints with the same discordant pair window_width = window_size(args.mean_frag_len, args.sd_frag_len, args.window_frac) # create a buffer of integrations that all fall within this window width buffer = [next(reader)] while True: buffer, next_buffer = get_ints_in_window(buffer, window_width, reader, info_file) # find reads crossing each integration in the buffer buffer = find_reads_crossing_ints(buffer, samfile, args, window_width) # check for read pairs that cross multiple junctions buffer = find_multiple_discordant(buffer) # check for read pairs that are discordant at one integration and chimeric at another integration # assume that integrations are independent (even though they aren't in simulation), so these are not detectable buffer = find_multiple_chimeric_discordant(buffer) # write rows in buffer for row in buffer: writer.writerow(row) buffer = next_buffer # check if we've reached the end of the file if len(buffer) == 0: break samfile.close() def find_multiple_chimeric_discordant(buffer): """ Assume that in real data, integrations are independent, so two integrations do not occur close enough to each other in the same cell that we can detect them. This may or may not be the case, but the number of these kinds of pairs of integrations is assumed to be small enough that they're not worth worrying about. If a read pair has one read that is chimeric, but it's also discordant about the same or a different integration, only the chimeric read will be detected (and it won't be flagged as a discordant read pair). Assuming that these events are a tiny minority in real data, here don't flag the discordant pairs as something that should be detected (but instead add them to the 'discordant_and_chimeric' category) """ to_delete = {} for i, row in enumerate(buffer): # for each row, find reads that are both chimeric # and discordant for any other integration left_discord = row['left_discord'].split(';') right_discord = row['right_discord'].split(';') to_delete[i] = {'left' : [], 'right' : []} # check for these reads in other rows for row_2 in buffer: # skip if this is the same as the one we're looking at if row_2 == row: continue # get read names for chimeric reads for this row chimeric = row_2['left_chimeric'].split(";") chimeric += row_2['right_chimeric'].split(";") chimeric = [read[:-2] for read in chimeric] # check if chimeric reads in row_2 also occur in the list # of discordant reads in row for read in chimeric: if read in left_discord: to_delete[i]['left'].append(read) if read in right_discord: to_delete[i]['right'].append(read) # remove reads in to_delete for row_num in to_delete.keys(): # remove reads from left_discord left_discord = buffer[row_num]['left_discord'].split(';') left_discord = [read for read in left_discord if read not in to_delete[row_num]['left']] buffer[row_num]['left_discord'] = ";".join(left_discord) # remove reads from right_discord right_discord = buffer[row_num]['right_discord'].split(';') right_discord = [read for read in right_discord if read not in to_delete[row_num]['right']] buffer[row_num]['right_discord'] = ";".join(right_discord) # removed reads go to 'discordant_and_chimeric' buffer[row_num]['discordant_and_chimeric'] = [f"{read}_left" for read in to_delete[row_num]['left']] buffer[row_num]['discordant_and_chimeric'] += [f"{read}_right" for read in to_delete[row_num]['right']] buffer[row_num]['discordant_and_chimeric'] = ";".join(buffer[row_num]['discordant_and_chimeric']) return buffer def find_multiple_discordant(buffer): """ a read might be discordant about two or more integration sites for example, if a read pair crosses the right border of one integration and the left border of a close by integration, both read pairs will be mapped to the virus and therefore this pair won't be detected as an integration deal with discordant pairs that cross multiple integrations in one of two ways: 1. if a pair is discordant about the right side of one integration and the left side of a close by integration, then both reads are mapped to the virus and this read should be removed from both. Keep track of these in a different category ('fake_discord') 2. if a pair crosses right side of one integration, and both sides of a nearby integration, then one read maps to virus (in the first integration), and the other maps to host (after the second) and therefore this pair should be indicated to be an integration. However, if we want to score based on integration position, we need to decide which integration this read 'belongs' to. Keep these reads in a seperate category ('multiple_discord') for all the integrations that they cross, to reflect the fact that the 'belong' to multiple integrations """ # keep track of which reads we've already seen seen = dict() # for reads that we find at multiple junctions, # keep track of which junctions so we can work out what to do with that read multiples = dict() for i, row in enumerate(buffer): # check reads found around left side for read in row['left_discord'].split(';'): if read == '': continue if read in seen.keys(): # if this is the first double, add the entry from seen if read not in multiples: multiples[read] = [] multiples[read].append(seen[read]) # add entry to multiples for this time we found the read multiples[read].append({'int_id' : row['id'], 'side' : 'left' , 'buffer_row' : i}) else: seen[read] = {'int_id' : row['id'], 'side' : 'left', 'buffer_row' : i} # check right side for read in row['right_discord'].split(';'): if read == '': continue if read in seen.keys(): # if this is the first double, add the entry from seen if read not in multiples: multiples[read] = [] multiples[read].append(seen[read]) # add entry to multiples for this time we found the read multiples[read].append({'int_id' : row['id'], 'side' : 'right', 'buffer_row' : i}) else: seen[read] = {'int_id' : row['id'], 'side' : 'right', 'buffer_row' : i} # add extra keys to dicts row['multiple_discord'] = [] row['fake_discord'] = [] # deal with reads crossing multiple integrations for read in multiples: # if the side of the first junction matches the side of the last junction # this is a 'multiple_discord' sides = [find['side'] for find in multiples[read]] if sides[0] == sides[-1]: new_type = 'multiple_discord' else: new_type = 'fake_discord' for find in multiples[read]: # find row and side buffer_row = find['buffer_row'] old_type = f"{find['side']}_discord" # remove from list of reads of old type reads = buffer[buffer_row][old_type].split(";") reads.remove(read) buffer[buffer_row][old_type] = ";".join(reads) # add to list of new type buffer[buffer_row][new_type].append(f"{read}_{find['side']}") # for each row, join lists of 'multiple_disord' and 'fake_discord' for row in buffer: row['multiple_discord'] = ";".join(row['multiple_discord']) row['fake_discord'] = ";".join(row['fake_discord']) return buffer def times_already_found(read, multiples): """ check how many times a read has already been found when checking for discordant about multiple integration sides """ if read in multiples: return len(multiples[read]) else: return 0 def find_reads_crossing_ints(buffer, samfile, args, window_width): """ find reads crossing the integration site, and add them to the row reads crossing the integration site can be chimeric about the left or right junction or discordant about the left or right junction if they're discordant about both junctions, they actually go from host sequence to host sequence and therefore aren't actually discordant """ for row in buffer: # get information about integration site location chr = row['chr'] left_start = int(row['leftStart']) left_stop = int(row['leftStop']) right_start = int(row['rightStart']) right_stop = int(row['rightStop']) assert left_start >= 0 and right_start >= 0 # find chimeric reads left_chimeric = get_chimeric(chr, left_start, left_stop, samfile, args.soft_threshold, buffer) right_chimeric = get_chimeric(chr, right_start, right_stop, samfile, args.soft_threshold, buffer) row['left_chimeric'] = ";".join(left_chimeric) row['right_chimeric'] = ";".join(right_chimeric) # find discordant read pairs left_discord = get_discordant(chr, left_start, left_stop, samfile, args.discordant_threshold, window_width, buffer) right_discord = get_discordant(chr, right_start, right_stop, samfile, args.discordant_threshold, window_width, buffer) # if a read is both chimeric and discordant, chimeric takes priority # (but it shouldn't be both if the clipping threshold is the same for both read types) left_chimeric, left_discord = remove_chimeric_from_discord(left_chimeric, left_discord) right_chimeric, right_discord = remove_chimeric_from_discord(right_chimeric, right_discord) left_chimeric, right_discord = remove_chimeric_from_discord(left_chimeric, right_discord) right_chimeric, left_discord = remove_chimeric_from_discord(right_chimeric, left_discord) row['left_discord'] = ";".join(left_discord) row['right_discord'] = ";".join(right_discord) return buffer def get_ints_in_window(buffer, window_width, reader, reader_handle): """ get a list of integrations (dicts corresponding to one row from the int-info file) that are within window_width of each other """ assert len(buffer) == 1 # get position and chromosome from buffer start = int(buffer[0]['rightStop']) chr = buffer[0]['chr'] # get more integraions until we're not in the window anymore or we're at the end of the file # to avoid having to go back a line, save the first line not added to 'next_buffer' while True: # get next row try: row = next(reader) except StopIteration: next_buffer = [] break # compare previous integration with this integration prev_start = start prev_chr = chr start = int(row['leftStart']) chr = row['chr'] # check if next row is a window width away if (start - prev_start < window_width) and prev_chr == chr: # add to buffer buffer.append(row) start = int(row['rightStop']) else: # don't add the row but this will be the start of the next buffer next_buffer = [row] break return buffer, next_buffer def get_discordant(chr, start, stop, samfile, threshold, window_width, buffer): """ Get any discordant read pairs which cross an integration site In other words, get pairs where one mate is mapped on the host side, and the other on the virus side A read is considered mapped if it has at most threshold (default 20) unmapped bases This includes any pairs where the integration site falls within threshold (default 20) bases of the end of the read (for a read mapped on the left of the integration), or within threshold bases of the start of the read for a read on the mapped on the right of the integration Avoid an exhaustive search by extracting only the reads in a window around the integration site Set this window based on the mean length and standard deviation of fragment size used in simulation and the fraction of the fragment length distribution we want to cover. Find discordant pairs by finding pairs for which an integration (start, stop) falls between the 20th base from the end of the left read and the 20th base of the right read Current criteria in discordant.pl is that a pair must have one read mapped and the other unmapped to be considered a discordant read pair. To be considered 'mapped', a read must have 20 or fewer unmapped bases. To be considered 'unmapped', a read must have less than 20 unmapped bases Note that a read pair might have one integration fall between read1 and read2, but read1 or read2 might also cross a second integration. This pair is therefore both discordant about one integration, and also one member of the pair is chimeric A consequence of this is that one read maps only to vector or host, but the other maps to both. This discordant pair cannot currently be detected, since the pipeline currently detects discordant read-pairs only if one read is mapped to vector and not host, and vice versa for the other read. However, this pair really is evidence for integration at the first site (as a discordant read-pair), so include it in the output as such. """ reads = [] # extract read pairs in the desired window # pysam numbering is 0-based, with the only exception being the region string in the fetch() and pileup() methods. window_start = start - window_width if window_start < 1: window_start = 1 window_stop = stop + window_width if window_stop > samfile.get_reference_length(chr): window_stop = samfile.get_reference_length(chr) if window_stop == window_start: window_stop += 1 for read1, read2 in read_pair_generator(samfile, f"{chr}:{window_start}-{window_stop}"): # check mate is mapped if read1.is_unmapped or read2.is_unmapped: continue # check reference for this read is the same as mate if read1.reference_name != read2.reference_name: continue # if this read is forward, mate must be reverse and vice versa if (read1.is_reverse == read2.is_reverse): continue # if the integration site falls between left_boundary and right_boundary # (which are coordinates within the reference) # this pair crosses the integration site if read1.is_reverse is False: left_boundary = get_boundary(read1, threshold, side = "right") right_boundary = get_boundary(read2, threshold, side = "left") else: left_boundary = get_boundary(read2, threshold, side = "right") right_boundary = get_boundary(read1, threshold, side = "left") # if left_boundary is greater than right_boundary, reads overlap if left_boundary >= right_boundary: continue assert left_boundary is not None assert right_boundary is not None assert left_boundary < right_boundary if within(start, stop, left_boundary, right_boundary): reads.append(read1.qname) # # # TODO - need to decide if integrations should be excluded on the basis # # below (if it's not the case that one read is mapped to host and the other to virus) # # these events can't be currently detected in the pipeline, but are real evidence # # of integration, so for now include them in the output # # r1_mapped = get_mapped_ref(read1, buffer, threshold) # r2_mapped = get_mapped_ref(read2, buffer, threshold) # # assert r1_mapped['host'] or r1_mapped['virus'] # assert r2_mapped['host'] or r2_mapped['virus'] # # if r1_mapped['host'] != r2_mapped['host'] and r1_mapped['virus'] != r2_mapped['virus']: # reads.append(read1.qname) return reads def get_mapped_ref(read, buffer, threshold): """ figure out if each read in this read pair will be mapped to host or vector/virus returns a dict with the keys 'host' and 'virus', and values True or False depending on if the read is mapped to either or not """ assert read.is_unmapped is False read_mapped = {'host':False, 'virus':False, 'int' : []} # get first and last position to which read is mapped in reference first = read.get_reference_positions()[0] last = read.get_reference_positions()[-1] prev_start = 0 for row in buffer: # figure out if we need to include the ambiguous bases or not if row['juncTypes'].split(',')[0] == 'gap': left_host_junc = int(row['leftStart']) left_virus_junc = int(row['leftStop']) else: left_host_junc = int(row['leftStop']) left_virus_junc = int(row['leftStart']) if row['juncTypes'].split(',')[1] == 'gap': right_virus_junc = int(row['rightStart']) right_host_junc = int(row['rightStop']) else: right_virus_junc = int(row['rightStop']) right_host_junc = int(row['rightStart']) # if read is between the start of the chromosome and the start of the integration if intersects(first, last, prev_start, left_host_junc): # check we have at least threshold bases mapped if check_threshold(read, prev_start, left_host_junc, threshold): read_mapped['host'] = True # if read is between the left and right junctions if intersects(first, last, left_virus_junc, right_virus_junc): if check_threshold(read, left_virus_junc, right_virus_junc, threshold): read_mapped['virus'] = True read_mapped['int'].append(row['id']) prev_start = right_host_junc # if the prev_start is to the right of the position of the last mapped base in the read # then we don't need to continue if prev_start > last: break # check if read is between the end of the last integration and the end of the chromosome # don't know what the end of the chromosome is, so just use a large number (genome_length) if intersects(first, last, prev_start, genome_length): if check_threshold(read, prev_start, genome_length, threshold): read_mapped['host'] = True return read_mapped def get_chimeric(chr, start, stop, samfile, threshold, buffer): """ find reads that cross an interval defined as chr:start-stop in samfile the interval must be at least threshold bases from the start and end of the read """ reads = [] # get reads that cross interval # pysam numbering is 0-based, with the only exception being the region string in the fetch() and pileup() methods. # The same is true for any coordinates passed to the samtools command utilities directly, such as pysam.fetch(). for read in samfile.fetch(chr, start, stop + 1): # check that interval is at least threshold bases from either end of the read mapped = get_mapped_ref(read, buffer, threshold) assert (mapped['host'] or mapped['virus']) if mapped['host'] is True and mapped['virus'] is True: reads.append(read.query_name + read_num(read)) return reads def read_num(read): """ return '/1' if read is R1, '/2' if read is R2, or empty string otherwise """ if read.is_read1 is True: return "/1" elif read.is_read2 is True: return "/2" else: return "" def check_threshold(read, start, stop, threshold): """" check that there are least threshold bases that map to an interval (defined by start and stop) """ rstart = read.get_reference_positions()[0] # need to account for 0-based numbering (stop already has this accounted for) rstop = read.get_reference_positions()[-1] + 1 assert intersects(rstart, rstop, start, stop) if (rstop - start) < threshold: return False if (stop - rstart) < threshold: return False return True def window_size(mean_frag_len, sd_frag_len, window_frac): """ to avoid exhaustive search for discordant reads, work out window size for extracting reads when finding discordant read pairs based on mean and standard deviation of fragment length distribution, and fraction of this distribution we want to cover For example, if the mean fragment length is 500 bp, the standard deviation is 30 bp, and we want to cover 0.99 of this distribution, the window size would be 570 bp, which accounts for 99% of the fragment sizes in this distribution (one-tailed) """ # get one-tailed value which contains window_frac of fragments upper = norm.ppf(window_frac, loc = mean_frag_len, scale = sd_frag_len) return int(round(upper)) def get_boundary(read, threshold, side = "left"): """ get first position in reference that is at least threshold bases from the start or end of the read and isn't None if side is 'left', return the 0-based position after threshold mapped bases from the start of the read if side is 'right', return the 0-based position before threshold mapped bases from the end of the read """ assert isinstance(threshold, int) assert threshold >= 0 assert threshold <= read.qlen assert side in ['left', 'right'] aligned_pairs = read.get_aligned_pairs() if side == "left": # if we want to look on the right hand side, we need to look backwards # through the aligned pairs aligned_pairs = list(reversed(aligned_pairs)) # python numbering is zero-based and half-open threshold -= 1 # iterate over bases in aligned_pairs, starting from the threshold value for pair in aligned_pairs[threshold:]: # check the base is aligned if (pair[1] is not None): return pair[1] def within(start1, stop1, start2, stop2): """ compare two intervals, each with a start and stop value return true if the first interval is completely within in the second use half-open intervals, so [8, 8) is not within [5, 8) within(8, 8, 5, 8) => False within(6, 6, 5, 8) => True within(5, 8, 5, 8) => True within(4, 6, 5, 8) => False within(5, 9, 5, 8) => False """ assert start1 <= stop1 assert start2 <= stop2 assert isinstance(start1, int) assert isinstance(start2, int) assert isinstance(stop1, int) assert isinstance(stop2, int) if start1 >= start2 and start1 < stop2: if stop1 - 1 >= start2 and stop1 - 1 < stop2: return True return False def intersects(start1, stop1, start2, stop2): """ compare two intervals, each with a start and stop value return true if the first interval is intersects the second use half-open intervals, so [2, 3) and [3, 4) don't intersect """ assert start1 <= stop1 assert start2 <= stop2 assert isinstance(start1, int) assert isinstance(start2, int) assert isinstance(stop1, int) assert isinstance(stop2, int) # if first interval is completely to the left of the second interval if stop1 <= start2: return False # if first interval is completely to the right of the second interval if start1 >= stop2: return False # otherwise, they intersect return True def read_pair_generator(bam, region_string=None): """ https://www.biostars.org/p/306041/ Generate read pairs in a BAM file or within a region string. Reads are added to read_dict until a pair is found. """ read_dict = defaultdict(lambda: [None, None]) for read in bam.fetch(region=region_string): if read.is_secondary or read.is_supplementary: continue qname = read.query_name if qname not in read_dict: if read.is_read1: read_dict[qname][0] = read else: read_dict[qname][1] = read else: if read.is_read1: yield read, read_dict[qname][1] else: yield read_dict[qname][0], read del read_dict[qname] def remove_chimeric_from_discord(chimeric, discord): """ check for read ids that are in both chimeric and discord - remove them from discord if they're in both """ # chimeric reads have /1 or /2 added to_delete = [] chimeric_tmp = [read[:-2] for read in chimeric] for read in discord: if read in chimeric_tmp: print(f" removed a read that was in both chimeric and discord: {read}") to_delete.append(read) # remove reads flagged for deletion discord = [read for read in discord if read not in to_delete] return chimeric, discord if __name__ == "__main__": main(argv[1:]) |
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 | #location in host genome (chr, start, stop) #part of viral genome inserted (virus, start, stop) #integration type (whole, portion, rearrangement, etc) #overlaps/gaps at each junction # reports all integration sites relative to the host genome, independent of other integrations # intergrations are stored internally though the Integration class ###import libraries from Bio import SeqIO from Bio.Alphabet.IUPAC import unambiguous_dna from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from scipy.stats import poisson import pandas as pd import argparse from sys import argv from os import path import numpy as np import scipy import pdb import csv import re import pprint import time ### max_attempts = 200000 #maximum number of times to try to place an integration site default_ints = 5 default_epi = 0 default_p_whole = 0.3 default_p_rearrange = 0.05 default_p_delete = 0.05 default_lambda_split = 1.5 default_p_overlap = 0.3 default_p_gap = 0.3 default_lambda_junction = 3 default_p_host_del = 0.0 default_lambda_host_del = 1000 default_min_sep = 500 search_length_overlap = 10000 # number of bases to search either side of randomly generated position for making overlaps lambda_junc_percentile = 0.99 fasta_extensions = [".fa", ".fna", ".fasta"] pp = pprint.PrettyPrinter(indent=4) def main(argv): #get arguments parser = argparse.ArgumentParser(description='simulate viral integrations') parser.add_argument('--host', help='host fasta file', required = True, type=str) # parser.add_argument('--simug_snp_indel', help='output file from simuG to map location of snps and indels') parser.add_argument('--virus', help = 'virus fasta file', required = True, type=str) parser.add_argument('--ints', help = 'output fasta file', type=str, default="integrations.fa") parser.add_argument('--int_info', help = 'output tsv with info about viral integrations', type=str, default="integrations.tsv") parser.add_argument('--int_num', help = f"number of integrations to be carried out [{default_ints}]", type=int, default=default_ints) parser.add_argument('--p_whole', help = f"probability of a virus being integrated completely [{default_p_whole}]", type = float, default=default_p_whole) parser.add_argument('--p_rearrange', help=f"probability of an integrated piece of virus being rearranged [{default_p_rearrange}]", type=float, default=default_p_rearrange) parser.add_argument('--p_delete', help=f"probability of an integrated piece of virus containing a deletion [{default_p_delete}]", type=float, default=default_p_delete) parser.add_argument('--lambda_split', help = f"mean of poisson distriubtion for number of pieces to split an integrated viral chunk into when rearranging or deleting [{default_lambda_split}]", type=float, default=default_lambda_split) parser.add_argument('--p_overlap', help=f"probability of a junction to be overlapped (common bases between virus and host) [{default_p_overlap}]", type=float, default=default_p_overlap) parser.add_argument('--p_gap', help=f"probability of a junction to have a gap [{default_p_gap}]", type=float, default=default_p_gap) parser.add_argument('--lambda_junction', help = f"mean of poisson distribution of number of bases in a gap/overlap [{default_lambda_junction}]", type=float, default=default_lambda_junction) parser.add_argument('--p_host_deletion', help=f"probability of a host deletion at the integation site [{default_p_host_del}]", type=float, default=default_p_host_del) parser.add_argument('--lambda_host_deletion', help = f"mean of poisson distribution of number of bases deleted from the host [{default_lambda_host_del}]", type=float, default=default_lambda_host_del) parser.add_argument('--epi_num', help = f"number of episomes to added to output fasta [{default_ints}]", type=int, default=default_ints) parser.add_argument('--epi_info', help = 'output tsv with info about episomes', type=str, default="episomes.tsv") parser.add_argument('--seed', help = 'seed for random number generator', default=12345, type=int) parser.add_argument('--verbose', help = 'display extra output for debugging', action="store_true") parser.add_argument('--model-check', help = 'check integration model every new integration', action="store_true") parser.add_argument('--min-sep', help='minimum separation for integrations', type=int, default=default_min_sep) parser.add_argument('--min-len', help='minimum length of integrated viral fragments', type=int, default=None) parser.add_argument('--max-len', help='maximum length of integrated viral fragments', type=int, default=None) args = parser.parse_args() args.simug_snp_indel = None #### generate integrations #### # generate dictionary of probabilities and lambda/means as input for Integrations class probs = {'p_whole' : args.p_whole, 'p_rearrange' : args.p_rearrange, 'p_delete' : args.p_delete, 'lambda_split' : args.lambda_split, 'p_overlap' : args.p_overlap, 'p_gap' : args.p_gap, 'lambda_junction' : args.lambda_junction, 'p_host_del' : args.p_host_deletion, 'lambda_host_del' : args.lambda_host_deletion} # initialise Events if args.verbose is True: print("initialising a new Events object") seqs = Events(args.host, args.virus, seed=args.seed, verbose = True, min_len = args.min_len, max_len=args.max_len, simug_snp_indel = args.simug_snp_indel) # add integrations seqs.add_integrations(probs, args.int_num, max_attempts, model_check = args.model_check, min_sep = args.min_sep) seqs.add_episomes(probs, args.epi_num, max_attempts) # save outputs seqs.save_fasta(args.ints) seqs.save_integrations_info(args.int_info) seqs.save_episomes_info(args.epi_info) class Events(dict): """ base class for this script - stores two kinds of events: Integrations and Episomes Integrations is a list-like class of Integration objects, which contain information about pieces of virus that have been integrated and their junctions with the surrounding chromomsomal sequence Episomes is a list-like class of ViralChunk objects, which contain information about pieces of virus that are episomal (present in sequence data but not integrated into the host chromosomes) """ def __init__(self, host_fasta_path, virus_fasta_path, fasta_extensions = ['.fa', '.fna', '.fasta'] , seed = 12345, verbose = False, min_len = None, max_len=None, simug_snp_indel = None, simug_cnv = None, simug_inversion = None, simug_tranlocation = None): """ initialise events class by importing a host and viral fasta, and setting up the random number generator """ # expectations for inputs assert isinstance(fasta_extensions, list) assert isinstance(seed, int) assert isinstance(verbose, bool) assert isinstance(min_len, int) or min_len is None if min_len is not None: assert min_len > 0 if max_len is not None: assert max_len > 1 self.verbose = verbose self.min_len = min_len self.max_len = max_len # check and import fasta files if self.verbose is True: print("importing host fasta", flush=True) # read host fasta - use index which doesn't load sequences into memory because host is large genome if self.checkFastaExists(host_fasta_path, fasta_extensions): self.host = SeqIO.to_dict(SeqIO.parse(host_fasta_path, 'fasta', alphabet=unambiguous_dna)) else: raise OSError("Could not open host fasta") if self.verbose is True: print(f"imported host fasta with {len(self.host)} chromosomes:", flush=True) host_chrs = {key:len(self.host[key]) for key in self.host.keys()} for key, length in host_chrs.items(): print(f"\thost chromosome '{key}' with length {length}", flush=True) # read virus fasta - make dictionary in memory of sequences if self.checkFastaExists(virus_fasta_path, fasta_extensions): self.virus = SeqIO.to_dict(SeqIO.parse(virus_fasta_path, 'fasta', alphabet=unambiguous_dna)) # convert all viral sequences to upper case to make it easier to check output for this_virus in self.virus.values(): this_virus.seq = this_virus.seq.upper() else: raise OSError("Could not open virus fasta") # check for SNP and indel map file from simuG - need to account for indels in output if simug_snp_indel is not None: self.indel = [row for row in self.read_simug_file(simug_snp_indel) if row['variant_type'] == 'INDEL'] else: self.indel = None # other types of structural variants self.cnv = self.read_simug_file(simug_cnv) self.inversion = self.read_simug_file(simug_inversion) self.tranlocation = self.read_simug_file(simug_tranlocation) # check that minimum length is greater than the length of all the viral references if self.min_len is not None: if not all([len(virus) >= self.min_len for virus in self.virus.values()]): raise ValueError(f"specified minimum length is more than the length of one of the input viruses") if any([len(virus) == self.min_len for virus in self.virus.values()]): print(f"warning: minimum length is equal to the length of one or more references - integrations involving these references will all be whole, regardless of p_whole") # check maximum length if self.max_len is not None and self.min_len is not None: if self.min_len > self.max_len: raise ValueError("Minimum length cannot be greater than maximum") if self.verbose is True: print(f"imported virus fasta with {len(self.virus)} sequences:", flush=True) virus_seqs = {key:len(self.virus[key]) for key in self.virus.keys()} for key, length in virus_seqs.items(): print(f"\tviral sequence '{key}' with length {length}", flush=True) # instantiate random number generator self.rng = np.random.default_rng(seed) def read_simug_file(self, filename): """ open simuG output file and read contents into memory """ if filename is None: return None assert path.isfile(filename) lines = [] with open(filename, 'r') as infile: reader = csv.DictReader(infile, delimiter = '\t') for row in reader: lines.append(row) return lines def add_integrations(self, probs, int_num, max_attempts = 50, model_check = False, min_sep = 1): """ Add an Integrations object with int_num integrations, with types specified by probs, to self """ assert isinstance(max_attempts, int) assert isinstance(int_num, int) assert isinstance(min_sep, int) assert max_attempts > 0 assert int_num >= 0 assert min_sep > 0 self.min_sep = min_sep self.max_attempts = max_attempts self.model_check = model_check # can only add integrations once if 'ints' in self: raise ValueError("integrations have already been added to this Events object") # is there a better error type for this? # check that the number of requested integrations will fit in the host, given the requested minimum separation # rule of thumb is that we allow 4*min_sep for each integration total_host_length = sum([len(seq) for seq in self.host.values()]) if self.min_sep * 4 * int_num > total_host_length: raise ValueError("The requested number of integrations, with the specified minimum separation, are not likely to fit into the specified host. Either decrease the number of integrations or the minimum separation.") # we require that the minimum length of integrations is longer than the integrated virus # so check the value of lambda_junction relative to min_len and the length of the shortest viral reference # require that both are greater than the 99th percentile of the poisson distribution defined by lambda_junction self.check_junction_length(probs) # instantiate Integrations object self.ints = Integrations(self.host, self.virus, probs, self.rng, self.max_attempts, self.model_check, self.min_sep, self.min_len, self.max_len, self.indel) # add int_num integrations if self.verbose is True: print(f"performing {int_num} integrations", flush=True) counter = 0 while len(self.ints) < int_num: t0 = time.time() if self.ints.add_integration() is False: counter += 1 # check for too many attempts if counter > max_attempts: raise ValueError('too many failed attempts to add integrations') t1 = time.time() print(f"added integration {len(self.ints)} in {(t1-t0):.2f}s", flush=True) print() # if we had fewer than 50% of our attempts left if (counter / max_attempts) > 0.5: print(f"warning: there were {counter} failed integrations", flush=True) def add_episomes(self, probs, epi_num, max_attepmts = 50): """ Add an Integrations object with int_num integrations, with types specified by probs, to self """ assert isinstance(max_attempts, int) assert isinstance(epi_num, int) assert max_attempts > 0 assert epi_num >= 0 # can only add episomes once if 'epis' in self: raise ValueError("episomes have already been added to this Events object") # we require that the minimum length of integrations is longer than the integrated virus # so check the value of lambda_junction relative to min_len and the length of the shortest viral reference # require that both are greater than the 99th percentile of the poisson distribution defined by lambda_junction self.check_junction_length(probs) # instantiate Episomes object self.epis = Episomes(self.virus, self.rng, probs, max_attempts, self.min_len) # add epi_num episomes if self.verbose is True: print(f"adding {epi_num} episomes", flush=True) counter = 0 while len(self.epis) < epi_num: if self.epis.add_episome() is False: counter += 1 if counter > max_attempts: raise ValueError('too many failed attempts to add episomes') def checkFastaExists(self, file, fasta_extensions): #check file exists exists = path.isfile(file) if not(exists): return False #check extension prefix = path.splitext(file)[-1] if prefix: if not prefix in fasta_extensions: return False return True def check_junction_length(self, probs): """ we require that the minimum length of integrations is longer than the integrated virus so check the value of lambda_junction relative to min_len and the length of the shortest viral reference require that both are greater than the 99th percentile of the poisson distribution defined by lambda_junction """ if probs['p_gap'] + probs['p_overlap'] == 0: return thresh = poisson.ppf(lambda_junc_percentile, probs['lambda_junction']) if self.min_len is not None: if thresh * 2 > self.min_len: raise ValueError( "There is likely to be a lot of clashes between the length of the left and right junctions, and the \ length of the integrations. Set a shorter lambda_jucntion or a longer minimum length" ) if thresh * 2 > min([len(virus) for virus in self.virus.values()]): raise ValueError( "There is likely to be a lot of clashes in which the length of the left and right junctions, and the \ length of the integrations. Set a shorter lambda_jucntion or use longer viral references." ) def save_fasta(self, file): """ save output sequences to file """ if 'ints' in vars(self): assert len(self.ints) >= 0 self.ints._Integrations__save_fasta(file, append = False) if 'epis' in vars(self): assert len(self.epis) >= 0 self.epis._Episomes__save_fasta(file, append = True) if ('ints' not in vars(self)) and ('epis' not in vars(self)): print("warning: no integrations or episomes have been added") if self.verbose is True: print(f"saved fasta with integrations and episomes to {file}") def save_integrations_info(self, file): """ save info about integrations to file """ assert len(self.ints) >= 0 self.ints._Integrations__save_info(file) if self.verbose is True: print(f"saved information about integrations to {file}") def save_episomes_info(self, file): """ save info about episomes to file """ assert 'epis' in vars(self) assert len(self.epis) >= 0 self.epis._Episomes__save_info(file) if self.verbose is True: print(f"saved information about episomes to {file}", flush=True) class Integrations(list): """ Class to store all integrations for a given host and virus This class stores the sequences for the host and virus (dictionaries of biopython seqRecord objects) And probabilities of each type of viral integration - whole/portion, rearrangement/deletion, gap/overlap, means of poisson distributions, host deletions and their mean lengths, etc These probabilities and means are stored in a dictionary - probs, which must contain the following values: p_whole - probability of an integration being of the whole virus [0.3]. probability of a portion is 1-p_whole p_rearrange - probability of a chunk being rearranged p_delete - probability of a chunk containing a deletion note that rearrangements and deletions are not mutually exclusive - an integration can have both a rearrangement and a deletion. if this is the case, deletions are always performed first. lambda_split - when splitting a chunk for a rearrangement or deletion, mean of the poission distribution for number of pieces in which to split chunk p_overlap - probability of a junction to contain an overlap (common bases between virus and host) p_gap - probability of a junction to contain a gap (random bases inserted between host and virus) (probability of neither a rearrangement or a deletion is 1 - (p_overlap + p_gap) ) lambda_junction - mean of poisson distribution of number of bases involved in overlap or gap p_host_del - probability that there will be a deletion from the host at integration site lambda_host_del - mean of poisson distribution of number of bases deleted from host genome at integration site """ def __init__(self, host, virus, probs, rng, max_attempts=50, model_check=False, min_sep=1, min_len=None, max_len=None, indel=None): """ Function to initialise Integrations object """ # checks for inputs assert isinstance(virus, dict) assert isinstance(probs, dict) assert isinstance(max_attempts, int) assert max_attempts > 0 assert isinstance(model_check, bool) assert isinstance(min_sep, int) assert min_sep > 0 assert isinstance(min_len, int) or min_len is None if min_len is not None: assert min_len > 0 assert isinstance(max_len, int) or max_len is None if max_len is not None: assert max_len > 1 if min_len is not None: assert max_len >= min_len assert isinstance(indel, list) or indel is None # assign properties common to all integrations self.host = host self.virus = virus self.probs = probs self.rng = rng self.max_attempts = max_attempts self.model_check = model_check self.min_sep = min_sep self.min_len = min_len self.max_len = max_len self.indel = indel # default values for probs self.set_probs_default('p_whole', default_p_whole) self.set_probs_default('p_rearrange', default_p_rearrange) self.set_probs_default('p_delete', default_p_delete) self.set_probs_default('lambda_split', default_lambda_split) self.set_probs_default('p_overlap', default_p_overlap) self.set_probs_default('p_gap', default_p_gap) self.set_probs_default('lambda_junction', default_lambda_junction) self.set_probs_default('p_host_del', default_p_host_del) self.set_probs_default('lambda_host_del', default_lambda_host_del) # checks on assigned variables assert 'p_whole' in probs assert 'p_rearrange' in probs assert 'p_delete' in probs assert 'lambda_split' in probs assert 'p_overlap' in probs assert 'p_gap' in probs assert 'lambda_junction' in probs assert 'p_host_del' in probs assert 'lambda_host_del' in probs # check that probabilities are between 0 and 1 self.check_prob(self.probs['p_whole'], 'p_whole') #self.check_prob(self.probs['p_rearrange'] + self.probs['p_delete'], 'the sum of p_rearrange and p_delete') self.check_prob(self.probs['p_rearrange'], 'p_rearrange') self.check_prob(self.probs['p_delete'], 'p_delete') self.check_prob(self.probs['p_overlap'] + self.probs['p_gap'], 'the sum of p_overlap and p_gap') self.check_prob(self.probs['p_host_del'], 'p_host_deletion') # check that lambda values are positive floats self.check_float_or_int_positive(self.probs['lambda_split'], 'lambda_split') self.check_float_or_int_positive(self.probs['lambda_junction'], 'lambda_junction') self.check_float_or_int_positive(self.probs['lambda_host_del'], 'lambda_host_deletion') # initialize model of integrations # each chromosome is an entry in the model dict # each chromosome is composed of a list of sequences, each composed of a dictionary # each sequence is a dictionary, specifying the 'origin' of the bases - 'host', 'virus', 'ambig' # as well as the start and stop and orientation of that sequence (in a list) # this is to be updated every time we do an integration self.model = {chr : [{'origin': 'host', 'coords':(0, len(seq)), "ori" : "+", 'seq_name': chr}] for chr, seq in host.items()} def set_probs_default(self, key, default): """ check if a key is in the probs dictionary, and if not set it to a default """ if key not in self.probs: self.probs[key] = default def check_float_or_int_positive(self, value, name): """ Check that a value, with a name, is a positive float or int """ if not isinstance(value, float) and not isinstance(value, int): raise ValueError(f"{name} must be a float (it's currently {type(value)})") if value <= 0: raise ValueError(f"{name} must be greater than zero (it's currently {value})") def check_prob(self, value, name): """ Check that a value, with a name, is a valid probability (float between 0 and 1): """ if not isinstance(value, float): raise ValueError(f"{name} must be a float (it's currently {type(value)})") if value < 0 or value > 1: raise ValueError(f"{name} must be between 0 and 1") def poisson_with_minimum_and_maximum(self, lambda_poisson, min=-np.inf, max=np.inf): """ get an integer from the poisson distribution with the specified lambda with a minimum value """ assert max >= min if min == max: return min counter = 0 while True: n = int(self.rng.poisson(lam = lambda_poisson)) if n >= min and n <= max: return n counter += 1 if counter > self.max_attempts: print(f"warning: too many attempts to get value with minimum {min} and maximum {max} from poisson distribution with mean {lambda_poisson}", flush=True) return None def add_integration(self): """ Add an integration by appending an Integration object to self. """ # call functions that randomly set properties of this integrations counter = 0 while True: chunk_props = self.set_chunk_properties() if len(chunk_props) != 0: break counter += 1 if counter > self.max_attempts: raise ValueError("too many attempts to set chunk properties") assert len(chunk_props) == 6 counter = 0 while True: junc_props = self.set_junc_properties() if len(junc_props) != 0: break counter += 1 if counter > self.max_attempts: raise ValueError("too many attempts to set junction properties") assert len(junc_props) == 4 # make an integration integration = Integration(self.host, self.virus, model = self.model, rng = self.rng, int_id = len(self), chunk_props = chunk_props, junc_props = junc_props, min_sep = self.min_sep ) # append to self if nothing went wrong with this integration if integration.chunk.pieces is not None: if integration.pos is not None: assert integration.id not in [item.id for item in self] self.append(integration) self.__update_model(integration) return True return False def set_junc_properties(self): """ randomly set the properties of the junctions (self.junc_props) for this Integration dict with the following properties: - junc_types = iterable of length 2 specifying type of left and right junctions (one of gap, overlap, clean) - n_junc = iterable of length 2 specifying length of left and right junctions - host_del = boolean specifying if there should be a deletion from the host at the integration site - host_del_len = integer specifying the number of bases to be deleted from the host at the integration site """ junc_props = {} # get type of left junction p_clean = 1 - self.probs['p_overlap'] - self.probs['p_gap'] prob_juncs = [self.probs['p_overlap'], self.probs['p_gap'], p_clean] junc_props['junc_types'] = self.rng.choice(a = ['overlap', 'gap', 'clean'], size = 2, p = prob_juncs) # get number of bases in left junction if junc_props['junc_types'][0] == 'clean': n_left_junc = 0 elif junc_props['junc_types'][0] in ['gap', 'overlap']: n_left_junc = self.poisson_with_minimum_and_maximum(self.probs['lambda_junction'], min = 1) if n_left_junc is None: return {} else: return {} # get number of bases in right junction if junc_props['junc_types'][1] == 'clean': n_right_junc = 0 elif junc_props['junc_types'][1] in ['gap', 'overlap']: n_right_junc = self.poisson_with_minimum_and_maximum(self.probs['lambda_junction'], min = 1) if n_right_junc is None: return {} else: return {} junc_props['n_junc'] = (n_left_junc, n_right_junc) # check that if we have a clean junction, it's length is 0 assert not(junc_props['junc_types'][0] == 'clean') or junc_props['n_junc'][0] == 0 assert not(junc_props['junc_types'][1] == 'clean') or junc_props['n_junc'][1] == 0 # check that if we don't have a clean junction, it's length is greater than zero assert not(junc_props['junc_types'][0] != 'clean') or junc_props['n_junc'][0] > 0 assert not(junc_props['junc_types'][1] != 'clean') or junc_props['n_junc'][1] > 0 # decide if this integration should have a deletion from the host host_deletion = self.rng.choice(a = [True, False], p = (self.probs['p_host_del'], 1 - self.probs['p_host_del'])) junc_props['host_del'] = bool(host_deletion) # if we're doing a host deletion, get number of bases to be deleted if junc_props['host_del'] is True: junc_props['host_del_len'] = self.poisson_with_minimum_and_maximum(self.probs['lambda_host_del'], min = 1) if junc_props['host_del_len'] is None: return {} elif junc_props['host_del'] is False: junc_props['host_del_len'] = 0 else: return {} # check that return junc_props def set_chunk_properties(self): """ randomly set the properties of the viral chunk for this Integration returns dict with the following properties: - is_whole: boolean specifying if the ViralChunk is whole (if false, chunk is just a portion) - n_fragments: number of fragments into which ViralChunk should be split - n_delete: number of fragments to delete from ViralChunk (should always leave at least two fragments after deletion) - n_swaps: number of swaps to make when rearranging ViralChunk - min_len: the minimum length of the viral chunk - optional (if not present, for integration will be set to the number of overlap bases + 1, for episome will be set to 1 - max_len: the maxumum length of the viral chunk """ chunk_props = {} # get if integration should be whole or portion p_portion = 1 - self.probs['p_whole'] chunk_props['is_whole'] = bool(self.rng.choice(a = [True, False], p = [self.probs['p_whole'], p_portion])) # get if integration should be rearranged p_not_rearrange = 1 - self.probs['p_rearrange'] is_rearrange = bool(self.rng.choice(a = [True, False], p = [self.probs['p_rearrange'], p_not_rearrange])) # get if integration should contain deletion p_not_delete = 1 - self.probs['p_delete'] is_delete = bool(self.rng.choice(a = [True, False], p = [self.probs['p_delete'], p_not_delete])) # get number of fragments - ignored if both isDelete and isRearrange are both False # must have at least two pieces for a rearrangment, or three for a deletion min_split = 1 if is_rearrange is True: min_split = 2 if is_delete is True: min_split = 3 # set the number of fragments for the chunk if is_delete is False and is_rearrange is False: chunk_props['n_fragments'] = 1 else: chunk_props['n_fragments'] = self.poisson_with_minimum_and_maximum(self.probs['lambda_split'], min = min_split) if chunk_props['n_fragments'] is None: return {} assert chunk_props['n_fragments'] > 0 # if we're doing a deletion, get the number of fragments to delete if is_delete is True: chunk_props['n_delete'] = int(self.rng.choice(range(0, chunk_props['n_fragments'] - 1))) else: chunk_props['n_delete'] = 0 # if we're doing a rearrangement, get the number of swaps to make if is_rearrange is True: chunk_props['n_swaps'] = self.poisson_with_minimum_and_maximum(self.probs['lambda_split'], min = 1) if chunk_props['n_swaps'] is None: return {} else: chunk_props['n_swaps'] = 0 # set minimum length of chunk chunk_props['min_len'] = self.min_len chunk_props['max_len'] = self.max_len return chunk_props def __update_model(self, integration): """ update self.model for a new integration """ # find segment in which integration should occur for i, seg in enumerate(self.model[integration.chr]): if seg['origin'] != 'host': continue if integration.pos >= seg['coords'][0] and integration.pos <= seg['coords'][1]: break t0 = time.time() # remove this element from the list seg = self.model[integration.chr].pop(i) host_start = seg['coords'][0] host_stop = seg['coords'][1] left_overlap_bases = integration.junc_props['n_junc'][0] if integration.junc_props['junc_types'][0] == 'overlap' else 0 right_overlap_bases = integration.junc_props['n_junc'][1] if integration.junc_props['junc_types'][1] == 'overlap' else 0 overlap_bases = left_overlap_bases + right_overlap_bases # create host segment before this integration and add to list host = {'origin' : 'host', 'seq_name' : integration.chr, 'ori' : '+'} # if the integration had a left overlap, we need to trim these bases from the host # note that int.pos is always to the left of any overlaps # so we don't need to consider them here host['coords'] = (host_start, integration.pos) assert host['coords'][1] > host['coords'][0] self.model[integration.chr].insert(i, host) i += 1 # if we have ambiguous bases at the left junction, add these to the list too assert len(integration.junc_props['junc_bases'][0]) == integration.junc_props['n_junc'][0] if integration.junc_props['junc_types'][0] in ['gap', 'overlap']: # features common to both ambiguous types ambig = {'origin': 'ambig', 'ori' : "+"} ambig['bases'] = integration.junc_props['junc_bases'][0] # gap features if integration.junc_props['junc_types'][0] == 'gap': ambig['seq_name'] = 'gap' ambig['coords'] = (0, integration.junc_props['n_junc'][0]) # overlap features elif integration.junc_props['junc_types'][0] == 'overlap': ambig['seq_name'] = integration.chr ambig['coords'] = (integration.pos, integration.pos + left_overlap_bases) assert str(self.host[integration.chr][ambig['coords'][0]:ambig['coords'][1]].seq).lower() == integration.junc_props['junc_bases'][0].lower() else: raise ValueError(f"unrecgonised integration type: {integration.junc_props[0]}") assert ambig['coords'][1] > ambig['coords'][0] self.model[integration.chr].insert(i, ambig) i += 1 # add each piece of the viral chunk too for j in range(len(integration.chunk.pieces)): virus = {'origin': 'virus', 'coords': (integration.chunk.pieces[j][0], integration.chunk.pieces[j][1]), 'ori' : integration.chunk.oris[j], 'seq_name' : integration.chunk.virus} assert virus['coords'][1] > virus['coords'][0] self.model[integration.chr].insert(i, virus) i += 1 t0 = time.time() # if we have ambiguous bases at the right junction, add these assert len(integration.junc_props['junc_bases'][1]) == integration.junc_props['n_junc'][1] if integration.junc_props['junc_types'][1] in ['gap', 'overlap']: ambig = {'origin': 'ambig', 'bases' : integration.junc_props['junc_bases'][1], 'ori' : "+"} if integration.junc_props['junc_types'][1] == 'gap': ambig['seq_name'] = 'gap' ambig['coords'] = (0, integration.junc_props['n_junc'][1]) # overlap features elif integration.junc_props['junc_types'][1] == 'overlap': ambig['seq_name'] = integration.chr # if the left junction was also an overlap, we need to account for this in the coordinates if integration.junc_props['junc_types'][0] == 'overlap': start = integration.pos + left_overlap_bases stop = start + right_overlap_bases else: start = integration.pos stop = start + right_overlap_bases ambig['coords'] = (start, stop) assert str(self.host[integration.chr][ambig['coords'][0]:ambig['coords'][1]].seq).lower() == integration.junc_props['junc_bases'][1].lower() else: raise ValueError(f"unrecgonised integration type: {integration.junc_props[0]}") assert ambig['coords'][1] > ambig['coords'][0] self.model[integration.chr].insert(i, ambig) i += 1 # finally, add second portion of host host = {'origin': 'host', 'seq_name': integration.chr, 'ori': '+'} # accounting for bases deleted from the host and overlaps # note that integration.pos is always to the left of any overlapped bases, so here is where we need to account for them host['coords'] = (integration.pos + integration.junc_props['host_del_len'] + overlap_bases, host_stop) assert host['coords'][1] > host['coords'][0] self.model[integration.chr].insert(i, host) i += 1 if self.model_check is True: self.__check_model() def __check_model(self): """ check model is valid by checking various properties """ n_ints = 0 next_int = True t0 = time.time() for chr in self.model.keys(): host_pos = 0 for seg in self.model[chr]: assert seg['coords'][1] > seg['coords'][0] assert seg['origin'] in ('host', 'virus', 'ambig') assert 'seq_name' in seg if seg['origin'] == 'host': next_int = True assert seg['seq_name'] == chr assert seg['ori'] == '+' # check that host position is only increasing assert seg['coords'][0] >= host_pos host_pos = seg['coords'][1] elif seg['origin'] == 'virus': if next_int is True: n_ints += 1 next_int = False assert seg['seq_name'] in list(self.virus.keys()) elif seg['origin'] == 'ambig': assert 'bases' in seg if seg['seq_name'] != 'gap': assert seg['seq_name'] in list(self.host.keys()) host_bases = str(self.host[chr][seg['coords'][0]:seg['coords'][1]].seq).lower() seg_bases = seg['bases'].lower() assert host_bases == seg_bases assert n_ints == len(self) t1 = time.time() print(f"checked model validity in {t1-t0}s") def __save_fasta(self, filename, append = False): """ Save host fasta with integrated viral bases to a fasta file """ assert isinstance(append, bool) if append is True: write_type = "a" if append is False: write_type = "w+" with open(filename, write_type) as handle: # loop over chromosomes for chr in self.host.keys(): # print chromosome name handle.write(f">{chr}\n") # loop over entries in the model for this chromosome for entry in self.model[chr]: start = entry['coords'][0] stop = entry['coords'][1] # if host if entry['origin'] == 'host': handle.write(str(self.host[chr].seq[start:stop])) # if ambiguous write bases - note that overlapped bases have been trimmed from host and virus # so we're ok to write them here elif entry['origin'] == 'ambig': handle.write(entry['bases']) # if viral elif entry['origin'] == 'virus': virus = entry['seq_name'] if entry['ori'] == '+': handle.write(str(self.virus[virus].seq[start:stop])) elif entry['ori'] == '-': handle.write(str(self.virus[virus].seq[start:stop].reverse_complement())) else: raise ValueError(f"unregconised orientation {entry['ori']} in {entry}") else: raise ValueError(f"unrecgonised model feature on chr {chr}: {entry}") handle.write("\n") def __handle_indels(self, integration): """ this function gets the offset due to indels that occurs before an input integration it also remove from the list of indels any indels that occur in the region of the input host chomosome that is deleted after this integration it also calculates the length of the host deletion in the original fasta, which differs from the length in the input fasta (integration.host_deleted) if the deleted region contains indels (which are not present in the original fasta it uses a list of indels: self.indel_list_for_consuming each time indels are accounted for, they are removed from the list (consumed), so they can only be counted once for increasing positions 03/03/2021 Abandoned the indel feature for now, but left this function in case I decide to come back to it. The main problem I have is the way I'm defining the integration position - this is to the left of any ambiguous bases However, it should probably be after the left ambiguous bases and before the right ambiguous bases. Deletions should also occur before the right ambiguous bases, so that the homology occurs after the deletion With the current definition of the integration position, it's difficult to handle indels in a consistent way, but changing to the more sensible definition is a big job """ host_position_offset = 0 host_deletion_offset = 0 if self.indel is None: integration.host_deleted_original_fasta = integration.host_deleted return host_position_offset, host_deletion_offset if not hasattr(self, "indel_list_for_consuming"): self.indel_list_for_consuming = list(self.indel) # find indels before the position of this integration prior_indels = [] i = 0 while i < len(self.indel_list_for_consuming): indel = self.indel_list_for_consuming[i] if indel['ref_chr'] != integration.chr: continue # if this indel is before the integration position if integration.pos >= int(indel['sim_end']): host_position_offset += len(indel['sim_allele']) - len(indel['ref_allele']) self.indel_list_for_consuming.pop(i) # if this integration is after the integration position (assumes indels are sorted) elif integration.pos < int(indel['sim_start']): break else: i += 1 # find indels that are in the deleted region del_region_indels = [] i = 0 while i < len(self.indel_list_for_consuming): indel = self.indel_list_for_consuming[i] if indel['ref_chr'] != integration.chr: continue # if integration is in this insertion if int(indel['sim_start']) <= integration.pos and int(indel['sim_end']) > integration.pos: # split insertion at integration # add portion of insertion that's before integration.pos to host position offset # add portion of insertion that's after integration.pos to deletion offset pass # if indel is in deleted region # add indel length to deletion offset # if insertion straddles end of deleted region pass # split insertion at end of deleted region # add portion of insetion that's before end of deleted region to deletion offset # retain portion of insertion that's after end of deleted region for next integration # if indel is after deleted region, we're done elif integration.pos < int(indel['sim_stop']): break else: i += 1 integration.host_deleted_original_fasta = integration.host_deleted + host_deletion_offset return host_position_offset, host_deletion_offset def __save_info(self, filename): """ Output the following info for each integration (one integration per line) into a tab-separated file: Note that all co-orindates are 0-based - chr: host chromosome on which integration occurs - hPos: 0-based position in the ORIGINAL host fasta (before adding variation with simuG) at which integration occurs - hPos_input_fasta: 0-based position in the input host fasta at which integration occurs (relative to input fasta) - leftStart, leftStop: position of the ambiguous bases (gap, overlap or clean junction) on left side, in final output fasta (accounting for any previous integrations) - rightStart, rightStop: position of the ambiguous bases (gap, overlap or clean junction) on right side in final output fasta (accounting for any previous integrations) - hDeleted: number of bases deleted from host chromosome in original fasta (before adding variation) - hDeleted_input_fasta: number of bases deleted from host chromosome in input fasta (after adding variation) - virus: name of integrated virus - viral breakpoints: a comma separated list of viral breakpoints which together indicate the integrated bases adjacent pairs of breakpoints indicate portions of the virus that have been integrated - vOris: orientation (+ or -) of each portion of the virus that was integrated - juncTypes: types (gap, overlap, clean) of left and right junctions, respectively - juncBases: bases at left and right junctions, respectively """ #pp.pprint(self.model) # dictionary will keep track of the number of bases previously integrated on each chromosome previous_ints = {key:0 for key in self.host.keys()} # length of previous integrations deleted_bases_input_fasta = {key:0 for key in self.host.keys()} # deletions because of integrations deleted_bases_original_fasta = {key:0 for key in self.host.keys()} sim_to_original_offset = {key:0 for key in self.host.keys()} # add this to sim position to get original position (due to indels) self.__check_model() self.sort() with open(filename, "w", newline='') as csvfile: intwriter = csv.writer(csvfile, delimiter = '\t') intwriter.writerow(['id', 'chr', 'hPos', 'hPos_input_fasta', 'leftStart', 'leftStop', 'rightStart', 'rightStop', 'hDeleted', 'hDelted_input_fasta', 'virus', 'vBreakpoints', 'vOris', 'juncTypes', 'juncBases', 'juncLengths', 'whole', 'rearrangement', 'deletion', 'n_swaps', 'n_delete']) for i, integration in enumerate(self): assert integration.pos is not None # to calculate hPos, we need to account for variants introduced by simuG host_position_offset, host_deletion_offset = self.__handle_indels(integration) sim_to_original_offset[integration.chr] += host_position_offset h_pos = integration.pos + sim_to_original_offset[integration.chr] # calculate start and stop position for this integration left_start = integration.pos + previous_ints[integration.chr] - deleted_bases_input_fasta[integration.chr] left_stop = left_start + integration.junc_props['n_junc'][0] right_start = left_stop + len(integration.chunk.bases) right_stop = right_start + integration.junc_props['n_junc'][1] # update previous_ints - total integrated bases # are the integrated viral bases, and the bases in the gaps previous_ints[integration.chr] += len(integration.chunk.bases) if integration.junc_props['junc_types'][0] == 'gap': previous_ints[integration.chr] += integration.junc_props['n_junc'][0] if integration.junc_props['junc_types'][1] == 'gap': previous_ints[integration.chr] += integration.junc_props['n_junc'][1] # update deleted_bases deleted_bases_input_fasta[integration.chr] += integration.junc_props['host_del_len'] deleted_bases_original_fasta[integration.chr] += integration.host_deleted_original_fasta # indels in deleted region didn't happen, so we need to account for this in the position offset sim_to_original_offset[integration.chr] -= host_deletion_offset # format lists into comma-separated strings breakpoints = ";".join([str(i) for i in integration.chunk.pieces]) oris = ','.join(integration.chunk.oris) junc_types = ",".join(integration.junc_props['junc_types']) junc_bases = ",".join(integration.junc_props['junc_bases']) junc_lengths = ",".join([str(i) for i in integration.junc_props['n_junc']]) # write a row for this integration intwriter.writerow([integration.id, integration.chr, h_pos, integration.pos, left_start, left_stop, right_start, right_stop, integration.host_deleted, integration.host_deleted_original_fasta, integration.chunk.virus, breakpoints, oris, junc_types, junc_bases, junc_lengths, integration.chunk.chunk_props['is_whole'], integration.chunk.chunk_props['n_swaps'] > 0, integration.chunk.chunk_props['n_delete'] > 0, integration.chunk.chunk_props['n_swaps'], integration.chunk.chunk_props['n_delete'] ]) def __str__(self): return f"Viral integrations object with {len(self)} integrations of viral sequences {list(self.virus.keys())} into host chromosomes {list(self.host.keys())}" def __repr__(self): return f"Object of type Integrations with properties {self}" class Episomes(Integrations): """ Episomes may be added to the output fasta to mimic contamination of sample with purely viral sequences this class stores a list of ViralChunk objects which make up the contaminating viral sequences This class is intended to be used by the Integrations class Since Integrations and Episomes use some similar methods, this class inherits from Integrations in order to avoid duplication """ def __init__(self, virus, rng, probs, max_attempts = 50, min_len = None, max_len=None): """ initialises an empty Episomes list, storing the properties common to all episomes """ # checks for inputs assert isinstance(virus, dict) assert isinstance(probs, dict) assert isinstance(max_attempts, int) assert isinstance(min_len, int) or min_len is None if min_len is None: min_len = 1 else: assert min_len > 0 assert isinstance(max_len, int) or max_len is None if max_len is not None: assert max_len > 1 if min_len is not None: assert max_len >= min_len # assign properties common to all episomes self.virus = virus self.probs = probs self.rng = rng self.max_attempts = max_attempts self.min_len = min_len self.max_len = max_len # default values for probs self.set_probs_default('p_whole', default_p_whole) self.set_probs_default('p_rearrange', default_p_rearrange) self.set_probs_default('p_delete', default_p_delete) self.set_probs_default('lambda_split', default_lambda_split) # checks on assigned variables assert 'p_whole' in probs assert 'p_rearrange' in probs assert 'p_delete' in probs assert 'lambda_split' in probs # check that probabilities are between 0 and 1 self.check_prob(self.probs['p_whole'], 'p_whole') #self.check_prob(self.probs['p_rearrange'] + self.probs['p_delete'], 'the sum of p_rearrange and p_delete') self.check_prob(self.probs['p_rearrange'], 'p_rearrange') self.check_prob(self.probs['p_delete'], 'p_delete') self.check_prob(self.probs['p_overlap'] + self.probs['p_gap'], 'the sum of p_overlap and p_gap') # when adding episomes, we will want to keep track of the number of episomes for each virus self.virus_counts = {virus: 0 for virus in self.virus.keys()} def add_episome(self): """ get a viral chunk and add to self """ # call functions that randomly set properties of this integrations chunk_props = self.set_chunk_properties() assert len(chunk_props) != 2 # get a viral chunk chunk = ViralChunk(self.virus, self.rng, chunk_props) # check for valid chunk if chunk.pieces is not None: # add id to chunk chunk.id = f"{chunk.virus}__{self.virus_counts[chunk.virus]}" self.virus_counts[chunk.virus] += 1 # append to self self.append(chunk) return True return False def __save_fasta(self, filename, append = True): """ save each ViralChunk as a separate sequence in an output fasta """ assert isinstance(append, bool) if append is True: write_type = "a" if append is False: write_type = "w+" # virus counts will keep track of the number of episomes # for each virus virus_counts = {virus: 0 for virus in self.virus.keys()} # open file for writing with open(filename, write_type) as handle: for chunk in self: # write name of virus and current count as an id handle.write(f">{chunk.id}\n") virus_counts[chunk.virus] += 1 # write viral bases handle.write(f"{chunk.bases}\n") def __save_info(self, filename): """ save chunk.chunk_properties into tab-separated format fields to write: - id: a unique identifier for each episome - virus: virus from which this episome comes from - start: coordinate (in virus) of left-most base in chunk - stop: coordinate (in virus) of right-most base in chunk - pieces: breakpoints of each piece of this chunk - oris: orientation of each piece of this chunk - is_whole: if this episome was originally the whole virus - is_rearrange: if this episome was rearranged - is_deletion: if this episome contains a deletion - n_swaps: number of swaps made when rearranging - n_delete: number of pieces deleted from chunk """ assert len(self) >= 0 # define header header = ["id", "virus", "start", "stop", "pieces", "oris", "is_whole", "is_rearrange", "is_deletion", "n_swaps", "n_delete"] with open(filename, "w", newline='') as csvfile: epiwriter = csv.writer(csvfile, delimiter = '\t') epiwriter.writerow(header) for chunk in self: epiwriter.writerow([chunk.id, chunk.virus, chunk.start, chunk.stop, chunk.pieces, chunk.oris, chunk.chunk_props['is_whole'], chunk.chunk_props['n_swaps'] > 0, chunk.chunk_props['n_swaps'], chunk.chunk_props['n_delete'] > 0, chunk.chunk_props['n_delete']]) def __str__(self): return f"Episomes object with {len(self)} episomes drawn from viral sequences {list(self.virus.keys())}" def __repr__(self): return f"Object of type Episomes with properties {self}" class Integration(dict): """ Class to store the properties of an individual integration. If properly instantiated, it stores the properties of the junctions either side of the integrated bases, and a ViralChunk object (self.chunk) which stores the properties of the integrated bases themselves. This class is intended to be used by the Integrations class, which is essentially a list of Integration objects """ def __init__(self, host, virus, model, rng, int_id, chunk_props, junc_props, min_sep): """ Function to initialise Integration object portionType is 'whole' or 'portion' - the part of the virus that has been inserted overlapType is two-member tuple of 'clean', 'gap' or 'overlap' - defines the junction at each end of the integration when initialising an Integration, need to provide: - host (as a dict of SeqRecord objects or similar) - virus (as a dict of SeqRecord ojbects or similar) - model - self.model of Integrations object - specifies where existing integrations have occured - rng (a numpy.random.Generator object for setting random properties) - chunk_props (a dict of properties for initialising ViralChunk object) - junc_props (a dict of properties defining the junctions of this integration) - junc_types = iterable of length 2 specifying type of left and right junctions (one of gap, overlap, clean) - n_junc = iterable of length 2 specifying length of left and right junction objects of this class have the following properties: self.id - an integer unique to this integrations self.chr - string: host chromosome on which integration should be located self.chunk - ViralChunk object which is integrated self.chunk_props - properties of the integration chunk that were specified as input. dict with fields: - is_whole: boolean specifying if the ViralChunk is whole (if false, chunk is just a portion) - n_fragments: number of fragments into which ViralChunk should be split - n_delete: number of fragments to delete from ViralChunk (should always leave at least two fragments after deletion) - n_swaps: number of swaps to make when rearranging ViralChunk self.junc_props - properties of the integration chunk that were specified as input - junc_types = iterable of length 2 specifying type of left and right junctions (one of gap, overlap, clean) - n_junc = iterable of length 2 specifying length of left and right junction - junc_bases = iterable of length 2 specifying bases involved in each junction - host_del = boolean specifying if there should be a deletion from the host - host_del_len = number of bases to be deleted from the host at this integration site if anything went wrong with initialisation, self.pos is set to None - check this to make sure integration is valid after assigning a """ # assign chunk_props and junc_props to self self.chunk_props = chunk_props self.junc_props = junc_props # check inputs assert isinstance(virus, dict) assert isinstance(model, dict) for key in host.keys(): assert key in model assert isinstance(chunk_props, dict) # leave most of the checking to ViralChunk assert isinstance(junc_props, dict) assert len(junc_props) == 4 assert 'junc_types' in junc_props assert len(self.junc_props['junc_types']) == 2 assert all([i in ['clean', 'gap', 'overlap'] for i in self.junc_props['junc_types']]) assert 'n_junc' in junc_props assert len(self.junc_props['n_junc']) == 2 assert all([isinstance(i, int) for i in self.junc_props['n_junc']]) assert all([i >=0 for i in self.junc_props['n_junc']]) assert 'host_del' in junc_props assert isinstance(junc_props['host_del'], bool) assert 'host_del_len' in junc_props assert isinstance(junc_props['host_del_len'], int) assert junc_props['host_del_len'] >= 0 if junc_props['host_del'] is True: assert junc_props['host_del_len'] > 0 if junc_props['host_del'] is False: assert junc_props['host_del_len'] == 0 assert search_length_overlap > 0 and isinstance(search_length_overlap, int) assert isinstance(min_sep, int) assert min_sep > 1 # set parameters that won't be changed self.search_length_overlap = search_length_overlap self.id = int_id # get random chromosome on which to do integration (weighted by chromosome length) lens = [len(i) for i in host.values()] lens = [i/sum(lens) for i in lens] self.chr = str(rng.choice(list(host.keys()), p=lens)) # set minimum length for viral chunk - longer than the number of bases involved in the junction if self.chunk_props['min_len'] is None: self.chunk_props['min_len'] = self.n_overlap_bases() + 1 if self.chunk_props['min_len'] < self.n_overlap_bases() + 1: self.chunk_props['min_len'] = self.n_overlap_bases() + 1 # need to check that minimum length is still greater than maximum length if self.chunk_props['max_len'] is not None: if self.chunk_props['max_len'] < self.chunk_props['min_len']: raise ValueError('The maximum length ({self.chunk_props["max_len"]}) is less than the length of the overlapped bases ({self.n_overlap_bases()}). Please specify a longer maximum length or a smaller number for the mean of the number of bases involved in the junction.') # get viral chunk self.chunk = ViralChunk(virus, rng, self.chunk_props) # if specified properties (chunk_props) are incompatible with the initialised chunk # self.chunk.pieces will be None if self.chunk.pieces is None: self.pos = None return # set the number of bases in overlaps self.junc_props['n_junc'] = [junc_props['n_junc'][0], junc_props['n_junc'][1]] # but overwrite in the case of a clean junction if self.junc_props['junc_types'][0] == 'clean': self.junc_props['n_junc'][0] = 0 if self.junc_props['junc_types'][1] == 'clean': self.junc_props['n_junc'][1] = 0 # number of bases in overlaps must be less than the length of the integrated chunk if self.n_overlap_bases() >= len(self.chunk.bases): self.pos = None return # set bases belonging to junction self.junc_props['junc_bases'] = (self.get_junc_bases(rng, 'left'), self.get_junc_bases(rng, 'right')) # get a position at which to integrate pos_success = self.get_int_position(host[self.chr].seq, rng, model, min_sep) if pos_success is False: self.pos = None return # number of bases deleted from host chromosome if junc_props['host_del'] is False: self.host_deleted = 0 else: self.host_deleted = junc_props['host_del_len'] # but only delete up to the length of the segment in which this integration occurs, # so that we don't delete any integrations as well for seg in model[self.chr]: # skip viral and ambiguous segments if seg['seq_name'] != self.chr: continue # find if this is the segment in which the integration occurs if not self.has_overlap(self.pos, self.pos, seg['coords'][0], seg['coords'][1]): continue # are we trying to delete past the end of the segment? if self.pos + self.host_deleted + self.n_overlap_bases() >= (seg['coords'][1] - min_sep): self.host_deleted = seg['coords'][1] - self.pos - min_sep - self.n_overlap_bases() self.junc_props['host_del_len'] = self.host_deleted if self.host_deleted < 0: self.pos = None return break # double check for valid chunk assert self.chunk.bases == self.chunk.get_bases(virus) assert 'junc_bases' in self.junc_props assert len(self.junc_props['junc_bases']) == 2 assert len(self.junc_props['junc_bases'][0]) == self.junc_props['n_junc'][0] assert len(self.junc_props['junc_bases'][1]) == self.junc_props['n_junc'][1] assert all([len(i) == 2 for i in self.chunk.pieces]) assert len(self.chunk.pieces) == len(self.chunk.oris) assert all([piece[1] > piece[0] for piece in self.chunk.pieces]) def n_overlap_bases(self): """ Get the total number of bases in overlaps """ assert len(self.junc_props['n_junc']) == 2 assert len(self.junc_props['junc_types']) == 2 n = 0 if self.junc_props['junc_types'][0] == 'overlap': n += self.junc_props['n_junc'][0] if self.junc_props['junc_types'][1] == 'overlap': n += self.junc_props['n_junc'][1] return n def get_random_position(self, model, rng, min_sep): """ based on current model, get a random position that is available for integration (that is, doesn't already have an integration) do this by getting all the host parts of the model, and choosing a random one (weighted by their length) and then choosing a random position from within this part """ # get a list of host coordinates host_coords = [segment['coords'] for segment in model[self.chr] if segment['origin'] == 'host'] # enforce minimum separation host_coords = [(coords[0] + min_sep, coords[1] - min_sep) for coords in host_coords] # ensure lengths are positive host_coords = [coord for coord in host_coords if (coord[1] - coord[0]) > 0] # get lengths of each range to weight choosing a part lengths = [coord[1] - coord[0] for coord in host_coords] lengths = [length/sum(lengths) for length in lengths] # get a random part, weighted by lengths part = rng.choice(host_coords, p = lengths) # get a random postion from this part return rng.integers(low = part[0], high = part[1], endpoint=True, dtype=int) def get_int_position(self, chr, rng, model, min_sep): """ get a position at which to perform the integration the position depends on the overlap type at each side of the overlap if both sides are 'clean' or 'gap', position can be random 'overlap' junctions place constraints on the integration location because the need to be placed where there are overlaps """ # if at both ends the junction is either 'clean' or 'gap', just get a random positon if all([True if (i in ['clean', 'gap']) else False for i in self.junc_props['junc_types']]): self.pos = self.get_random_position(model, rng, min_sep) return True # if overlap at both ends, look for both overlaps in host chromosome next to each other elif self.junc_props['junc_types'][0] == 'overlap' and self.junc_props['junc_types'][1] == 'overlap': # make string with both overlap bases self.pos = self.find_overlap_bases(self.junc_props['junc_bases'][0] + self.junc_props['junc_bases'][1], chr, rng, model, min_sep) # check for unsuccessful find if self.pos == -1: self.pos = None return False ## need to remove overlapped bases from viral chunk, and adjust chunk start, breakpoints and oris self.delete_left_bases(self.junc_props['n_junc'][0]) self.delete_right_bases(self.junc_props['n_junc'][1]) return True # if one end is an overlap, find those bases in the host chromosome # left overlap elif self.junc_props['junc_types'][0] == 'overlap': # find position with overlap at which to do overlap self.pos = self.find_overlap_bases(self.junc_props['junc_bases'][0], chr, rng, model, min_sep) # check for unsuccessful find if self.pos == -1: self.pos = None return False ## need to remove overlapped bases from viral chunk, and adjust chunk start, breakpoints and oris self.delete_left_bases(self.junc_props['n_junc'][0]) return True # right overlap elif self.junc_props['junc_types'][1] == 'overlap': # find position with overlap at which to do overlap self.pos = self.find_overlap_bases(self.junc_props['junc_bases'][1], chr, rng, model, min_sep) # check for unsuccessful find if self.pos == -1: self.pos = None return False ## need to remove overlapped bases from viral chunk, and adjust chunk start, breakpoints and oris self.delete_right_bases(self.junc_props['n_junc'][1]) return True else: raise ValueError(f"junction types {self.junc_props['junc_types']} are not implemented yet") return False def find_overlap_bases(self, overlap_bases, chr, rng, model, min_sep): """ find bases from an overlap in the host chromosome """ # get position around which to search start = self.get_random_position(model, rng, min_sep) # get start and stop positions for bases in host chromosome to search for overlaps stop = start + self.search_length_overlap # make sure that we aren't searching after the end of the chromosome if stop > len(chr): stop = len(chr) # find overlapping bases in host segments in model for seg in model[self.chr]: # check that this segment comes from the host if seg['origin'] != 'host': continue # check that this segment overlaps with desired start and stop from above if not self.has_overlap(start, stop, seg['coords'][0], seg['coords'][1]): continue # look for desired overlap bases in this segment search_start = seg['coords'][0] + min_sep search_stop = seg['coords'][1] - min_sep if search_stop <= search_start: continue if seg['ori'] == "+": found = re.search(overlap_bases, str(chr[search_start:search_stop]), re.IGNORECASE) if found: return found.span()[0] + search_start else: # not implemented - we assume no flipping of host chromosome segments raise ValueError("host chromosome segment {seg} has a negative orientation!") # check for unsuccessful find return -1 def has_overlap(self, start_1, stop_1, start_2, stop_2): """ check to see if two intervals, specified by start_1 and stop_1; and start_2 and stop_2, overlap """ # check inputs assert isinstance(start_1, int) assert isinstance(start_2, int) assert isinstance(stop_1, int) assert isinstance(stop_2, int) assert start_1 >= 0 assert start_2 >= 0 assert stop_1 >= 0 assert stop_2 >= 0 assert start_1 <= stop_1 assert start_2 <= stop_2 # interval 1 start and stop to the left of interval 2 if (start_1 < start_2) and (stop_1 < start_2): return False # interval 2 start and stop are to the right of interval 2 if (start_1 > stop_2) and (stop_1 > stop_2): return False # otherwise they must overlap return True def delete_left_bases(self, n): """ delete bases on the left after adding an overlap - need to adjust chunk bases, oris and breakpoints and start """ assert self.junc_props['junc_types'][0] == 'overlap' # check we're not trying to delete more bases than there are in the chunk assert n < len(self.chunk.bases) # adjust bases self.chunk.bases = self.chunk.bases[n:] # adjust breakpoints - delete bases one by one from breakpoints # until we've deleted enough baes deleted_bases = 0 to_delete = [] i = 0 while deleted_bases < n: # if this piece is a forward piece if self.chunk.oris[i] == '+': # detele one base self.chunk.pieces[i][0] += 1 deleted_bases += 1 # if this piece is a reverse piece we need to remove from the end # because self.chunk.bases has already taken orientations into account elif self.chunk.oris[i] == "-": self.chunk.pieces[i][1] -= 1 deleted_bases += 1 else: print(f"unrecgonised orientation {self.chunk.oris[i]} in chunk {vars(self.chunk)}") self.pos = None return # if we're left with a piece of length 0, flag this piece for deletion if self.chunk.pieces[i][0] == self.chunk.pieces[i][1]: to_delete.append(i) i += 1 # remove chunks that we want to delete self.chunk.pieces = [self.chunk.pieces[i] for i in range(len(self.chunk.pieces)) if (i not in to_delete)] self.chunk.oris = [self.chunk.oris[i] for i in range(len(self.chunk.oris)) if (i not in to_delete)] #adjust start and stop breakpoints = [piece[0] for piece in self.chunk.pieces] breakpoints += [piece[1] for piece in self.chunk.pieces] breakpoints.sort() self.chunk.start = breakpoints[0] self.chunk.stop = breakpoints[-1] def delete_right_bases(self, n): """ delete bases on the left after adding an overlap - need to adjust chunk bases, oris and breakpoints and stop """ assert self.junc_props['junc_types'][1] == 'overlap' # check we're not trying to delete more bases than there are in the chunk assert n < len(self.chunk.bases) # adjust stop self.chunk.stop -= n # adjust bases self.chunk.bases = self.chunk.bases[:-n] # adjust breakpoints deleted_bases = 0 to_delete = [] i = len(self.chunk.pieces) - 1 # start at the last piece in the chunk while deleted_bases < n: assert i >= 0 # if this piece is a forward piece if self.chunk.oris[i] == "+": # delete one base self.chunk.pieces[i][1] -= 1 deleted_bases += 1 elif self.chunk.oris[i] == "-": # delete one base self.chunk.pieces[i][0] += 1 deleted_bases += 1 else: print(f"unrecgonised orientation {self.chunk.oris[i]} in chunk {vars(self.chunk)} ") self.pos = None return # if we're left with a piece of length 0, flag this piece for deletion if self.chunk.pieces[i][0] == self.chunk.pieces[i][1]: to_delete.append(i) i -= 1 # remove chunks that we want to delete self.chunk.pieces = [self.chunk.pieces[i] for i in range(len(self.chunk.pieces)) if (i not in to_delete)] self.chunk.oris = [self.chunk.oris[i] for i in range(len(self.chunk.oris)) if (i not in to_delete)] #adjust start and stop breakpoints = [piece[0] for piece in self.chunk.pieces] breakpoints += [piece[1] for piece in self.chunk.pieces] breakpoints.sort() self.chunk.start = breakpoints[0] self.chunk.stop = breakpoints[-1] def get_junc_bases(self, rng, side): """ Get the bases at the left or right junction, depending on whether the junction is a gap, overlap, or clean junction """ assert side in ['left', 'right'] assert len(self.junc_props['junc_types']) == 2 assert len(self.junc_props['n_junc']) == 2 assert self.junc_props['junc_types'][0] in ['gap', 'overlap', 'clean'] assert self.junc_props['junc_types'][1] in ['gap', 'overlap', 'clean'] if side == 'left': n_bases = self.junc_props['n_junc'][0] # no bases in a clean junction if self.junc_props['junc_types'][0] == 'clean': return "" # random bases in a gap elif self.junc_props['junc_types'][0] == 'gap': return self.get_n_random_bases(rng, n_bases) # first n bases of viral chunk in an overlap elif self.junc_props['junc_types'][0] == 'overlap': return self.chunk.bases[:n_bases] else: raise ValueError(f"unrecgonised type: {self.junc_props['junc_types'][0]}") elif side == 'right': n_bases = self.junc_props['n_junc'][1] if self.junc_props['junc_types'][1] == "clean": return "" elif self.junc_props['junc_types'][1] == 'gap': return self.get_n_random_bases(rng, n_bases) elif self.junc_props['junc_types'][1] == 'overlap': return self.chunk.bases[-n_bases:] else: raise ValueError(f"unrecgonised type: {self.junc_props['junc_types'][1]}") else: raise ValueError(f"unrecgonised side: {side}") def get_n_random_bases(self, rng, n_bases): """ get a string composed of n random nucleotides """ return "".join(rng.choice(['A', 'T', 'G', 'C'], n_bases)) def __str__(self): return f"Viral integration into host chromosome {self.chr}'" def __repr__(self): return f"Object of type Integration with properties {self}" def __lt__(self, other): """ Integrations can be ranked by both chromosome name and position on that chromosome """ # make sure that we're comparing with another integration assert isinstance(other, Integration) # first check chromosome name assert isinstance(self.chr, str) and isinstance(other.chr, str) assert isinstance(self.pos, int) and isinstance(other.pos, int) return (self.chr.lower(), self.pos) < (other.chr.lower(), other.pos) def __eq__(self, other): """ Integrations can be ranked by both chromosome name and position on that chromosome """ # make sure that we're comparing with another integration assert isinstance(other, Integration) # first check chromosome name assert isinstance(self.chr, str) and isinstance(other.chr, str) assert isinstance(self.pos, int) and isinstance(other.pos, int) return (self.chr.lower(), self.pos) == (other.chr.lower(), other.pos) class ViralChunk(dict): """ Class to store properties of an integrated chunk of virus. Intended to be used by the Integrations and Episomes classes """ def __init__(self, virus, rng, chunk_props): """ function to get a chunk of a virus virus is the dictionary of seqrecords imported using biopython desired properties of chunk are input as a dict with the following attributes: - is_whole: boolean specifying if the ViralChunk is whole (if false, chunk is just a portion) - n_fragments: number of fragments into which ViralChunk should be split - n_delete: number of fragments to delete from ViralChunk (should always leave at least two fragments after deletion) - n_swaps: number of swaps to make when rearranging ViralChunk - min_len: the minimum length of this chunk (integer greater than 1) - max_len: the maxumum length of this chunk (None, or integer greater than min_len) the bases attribute of a ViralChunk consist of only the bases that are unique to the virus. So in the case of an Integration of a ViralChunk with a 'overlap' type junction, the bases, breakpoints and oris attributes are re-assigned to remove the overlapped bases """ # check inputs assert isinstance(virus, dict) assert len(virus) > 0 assert isinstance(chunk_props, dict) assert len(chunk_props) == 6 assert 'is_whole' in chunk_props assert isinstance(chunk_props['is_whole'], bool) assert 'n_fragments' in chunk_props assert isinstance(chunk_props['n_fragments'], int) assert chunk_props['n_fragments'] > 0 assert 'n_delete' in chunk_props assert isinstance(chunk_props['n_delete'], int) assert chunk_props['n_delete'] >= 0 assert 'n_swaps' in chunk_props assert isinstance(chunk_props['n_delete'], int) assert chunk_props['n_delete'] >= 0 assert 'min_len' in chunk_props assert isinstance(chunk_props['min_len'], int) or chunk_props['min_len'] is None if chunk_props['min_len'] is None: chunk_props['min_len'] = 1 assert chunk_props['min_len'] > 0 assert 'max_len' in chunk_props assert isinstance(chunk_props['max_len'], int) or chunk_props['max_len'] is None if chunk_props['max_len'] is not None and chunk_props['min_len'] is not None: assert chunk_props['max_len'] >= chunk_props['min_len'] # check that the minimum length specified is longer than all the viruses # otherwise we might fail if not all([chunk_props['min_len'] <= len(vir.seq) for vir in virus.values()]): raise ValueError("minimum length must be longer than all the Viruses") # get a random virus to integrate self.virus = str(rng.choice(list(virus.keys()))) self.chunk_props = chunk_props # if the length of the virus is equal to min_len, is_whole must be true if len(virus[self.virus]) == chunk_props['min_len']: chunk_props['is_whole'] = True if self.chunk_props['is_whole'] is True: self.start = 0 self.stop = len(virus[self.virus]) elif self.chunk_props['is_whole'] is False: self.start = int(rng.integers(low = 0, high = len(virus[self.virus].seq) - chunk_props['min_len'])) if chunk_props['max_len'] is None: self.stop = int(rng.integers(low = self.start + chunk_props['min_len'], high = len(virus[self.virus].seq) ) ) elif chunk_props['max_len'] == chunk_props['min_len']: self.stop = self.start + chunk_props['max_len'] else: self.stop = int(rng.integers(low = self.start + chunk_props['min_len'], high = min(len(virus[self.virus].seq), self.start+chunk_props['max_len']) ) ) else: raise ValueError("self.chunk_props['is_whole'] must be either True or False") # breakpoints are the start and stop coordinates of pieces of the virus that have been # integrated # set breakpoints self.pieces = [[self.start, self.stop]] self.oris = str(rng.choice(('+', '-'))) # do a deletion if applicable if self.chunk_props['n_delete'] > 0: self.__delete(rng) # if something went wrong, breakpoints will be None if self.pieces is None: return if self.chunk_props['n_swaps'] > 0: self.__rearrange(rng) # if something went wrong, breakpoints will be None if self.pieces is None: return # get bases self.bases = self.get_bases(virus) def get_bases(self, virus): """ return bases of viral chunk as a string note that objects of class ViralChunk do not store the whole viral sequence, so need to provide this in order to get the bases """ bases = [] # get integrated bases for i, points in enumerate(self.pieces): start = points[0] stop = points[1] if self.oris[i] == '+': bases += [str(virus[self.virus].seq[start:stop])] else: bases += [str(virus[self.virus].seq[start:stop].reverse_complement())] return "".join(bases) def __split_into_pieces(self, rng): """ get random, unique breakpoints to divide a viral chunk into pieces there must be at least min_breakpoints, which results in min_breakpoints + 1 pieces this is a list of coordinates, not tuples (unlike self.pieces) """ # shouldn't do this if this chunk has already been split assert len(self.pieces) == 1 assert len(self.oris) == 1 # check that we're not trying to divide a chunk into more pieces than there are bases if self.chunk_props['n_fragments'] >= self.stop - self.start: self.pieces = None return # get the number of pieces to divide into num_breakpoints = self.chunk_props['n_fragments'] - 1 # get random breakpoints from within this chunk breakpoints = rng.choice(range(self.start + 1, self.stop - 1), size = num_breakpoints, replace = False) # set self.pieces breakpoints = [self.start] + sorted(breakpoints) + [self.stop] self.pieces = [[int(breakpoints[i]), int(breakpoints[i+1])] for i in range(len(breakpoints) - 1)] # set self.oris self.oris = [self.oris[0]] * len(self.pieces) return def __swap_orientations(self, breakpoint, side): """ Given a breakpoint, swap all of the orientations (+ to - or vice versa) for all of the pieces on the left or right of this breakpoint """ if side == 'left': for i, ori in enumerate(self.oris[:breakpoint]): if ori == "+": self.oris[i] = "-" else: self.oris[i] = "+" else: for i, ori in enumerate(self.oris[breakpoint:]): if ori == "+": self.oris[i] = "-" else: self.oris[i] = "+" def __delete(self, rng): """ Divide a viral chunk up into multiple pieces and remove one of those pieces """ # deletions are always performed first, so the chunk should not have been split yet assert len(self.pieces) == 1 # want to have at least two pieces left assert self.chunk_props['n_fragments'] - self.chunk_props['n_delete'] >= 2 # split chunk into at n_fragments pieces self.__split_into_pieces(rng) if self.pieces is None: return assert len(self.pieces) == self.chunk_props['n_fragments'] # decide which portions to delete i_delete = rng.choice(range(1, len(self.pieces) - 1), self.chunk_props['n_delete'], replace=False) # do deletion self.pieces = [piece for i, piece in enumerate(self.pieces) if i not in i_delete] self.oris = [ori for i, ori in enumerate(self.oris) if i not in i_delete] assert len(self.pieces) == self.chunk_props['n_fragments'] - self.chunk_props['n_delete'] def __rearrange(self, rng): """ Divide a viral chunk up into multiple pieces and randomise their order and orientiations """ # split the chunk if it hasn't already been split if len(self.pieces) == 1: # split chunk into at least three pieces self.__split_into_pieces(rng) if self.pieces is None: return assert len(self.pieces) == self.chunk_props['n_fragments'] else: assert len(self.pieces) > 1 # if we only have two pieces, we should only do one swap # so that we don't end up back with the same fragment # there are other ways to end up with the same fragment after swaps # but don't worry about them for now - TODO if len(self.pieces) == 2: self.chunk_props['n_swaps'] = 1 for i in range(self.chunk_props['n_swaps']): # pick a point about which to swap if 1 == len(self.pieces) - 1: i_swap = 1 else: i_swap = rng.choice(range(1, len(self.pieces) - 1)) # swap everything to the left of this position with everything on the right self.pieces = self.pieces[i_swap:] + self.pieces[:i_swap] # 50 % chance of swapping the orientations of all the pieces for each side if bool(rng.choice((True, False))) is True: self.__swap_orientations(i_swap, 'left') if bool(rng.choice((True, False))) is True: self.__swap_orientations(i_swap, 'right') def __str__(self): return f"Viral chunk of virus {self.virus} ({self.start}, {self.stop}) and orientations {self.oris}" def __repr__(self): return f"Object of type ViralChunk with properties {self}" if __name__ == "__main__": main(argv[1:]) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 | import sys import argparse import csv def main(args): parser = argparse.ArgumentParser(description='Convert tab-separated file with integration locations into bed file with junction locations') parser.add_argument('--sim_info', '-i', help='Tab-separated output file with location of integrations', required=True) parser.add_argument('--output', '-o', help='Output bed file', default='ints.bed') args = parser.parse_args(args[1:]) with open(args.sim_info, 'r', newline='') as infile, open(args.output, 'w', newline='') as outfile: inreader = csv.DictReader(infile, delimiter='\t') outwriter = csv.writer(outfile, delimiter='\t') for line in inreader: chrom = line['chr'] pos = int(line['hPos']) # left ambiguous bases left_type = line['juncTypes'].split(',')[0] left_len = int(line['juncLengths'].split(',')[0]) # check if left junction has supporting reads if not (line['left_chimeric'] == '' and line['left_discord'] == ''): # write left junction #outwriter.writerow((chrom, pos, pos+left_len, '+')) outwriter.writerow((chrom, pos, pos+left_len)) # right ambiguous bases right_type = line['juncTypes'].split(',')[1] right_len = int(line['juncLengths'].split(',')[1]) # deleted bases deleted = int(line['hDeleted']) # check if right junction has supporting reads if not (line['right_chimeric'] == '' and line['right_discord'] == ''): # write right junction #outwriter.writerow((chrom, pos+left_len+deleted, pos+left_len+deleted+right_len, '-')) outwriter.writerow((chrom, pos+left_len+deleted, pos+left_len+deleted+right_len)) if __name__ == "__main__": main(sys.argv) |
19 20 21 22 23 24 25 26 | shell: """ python3 scripts/annotate_reads.py \ --sim-info {input.info} \ --sim-bam {input.bam} \ --output {output.annotated_info} \ {params} """ |
42 43 44 45 46 | shell: """ python3 scripts/sim_bed.py -i {input.info} -o {output.tmp} sort -k1,1 -k2,2n {output.tmp} > {output.bed} """ |
31 32 33 34 | shell: """ art_illumina --noALN {params} """ |
50 51 52 53 54 55 | shell: """ rm -f {output.bam}*tmp*bam samtools sort -o {output.bam} {input.sam} samtools index {output.bam} """ |
21 22 | run: sim_df[sim_df['experiment'] == wildcards.exp].to_csv(output.tsv, sep='\t', index=False) |
98 99 100 101 102 103 104 105 106 107 108 | shell: """ python3 scripts/insert_virus.py \ --host {input.host} \ --virus {input.virus} \ --ints {output.sim_fasta} \ --int_info {output.sim_info} \ --epi_info {output.epi_info} \ {params} \ --verbose """ |
Support
Do you know this workflow well? If so, you can
request seller status , and start supporting this workflow.
Created: 1yr ago
Updated: 1yr ago
Maitainers:
public
URL:
https://github.com/szsctt/intvi_simulation
Name:
intvi_simulation
Version:
1
Downloaded:
0
Copyright:
Public Domain
License:
None
- Future updates
Related Workflows
ENCODE pipeline for histone marks developed for the psychENCODE project
psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project.
The o...
Near-real time tracking of SARS-CoV-2 in Connecticut
Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...
ATLAS - Three commands to start analyzing your metagenome data
Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...
raw sequence reads
Genome assembly
Annotation track
checkm2
gunc
prodigal
snakemake-wrapper-utils
MEGAHIT
Atlas
BBMap
Biopython
BioRuby
Bwa-mem2
cd-hit
CheckM
DAS
Diamond
eggNOG-mapper v2
MetaBAT 2
Minimap2
MMseqs
MultiQC
Pandas
Picard
pyfastx
SAMtools
SemiBin
Snakemake
SPAdes
SqueezeMeta
TADpole
VAMB
CONCOCT
ete3
gtdbtk
h5py
networkx
numpy
plotly
psutil
utils
metagenomics
RNA-seq workflow using STAR and DESeq2
This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...
This Snakemake pipeline implements the GATK best-practices workflow
This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...