Genome Assembly Validation via Inter-SUNK distances in ONT reads


Set up source files in config.yaml and ont.tsv

config.yaml requires:

  • a manifest of ONT data for validation "ont.tsv",

  • and the k-mer length to use for validation, "SUNK_len" (default: 20)

Optionally, for more detailed output plots (uses more RAM; experimental), set plot_detailed: True in config.yaml
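
For illustration, a minimal config.yaml consistent with these options might look like the sketch below. The key name used for the manifest path ("ont") is an assumption for this example; match it to the provided .test/config.yaml:

ont: ont.tsv           # manifest of ONT data (key name assumed; see .test/config.yaml)
SUNK_len: 20           # k-mer length used to identify SUNKs
plot_detailed: False   # optional; set to True for detailed plots (more RAM, experimental)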

The ONT manifest is a .tsv file "ont.tsv" with the following columns:

  • sample: must be unique for each file

  • hap[1,2]_ONT : location of ONT reads (gzipped or raw .fa or .fq)

  • hap[1,2]_asm : location of haplotype assembly (FASTA, must also have .fai index in same directory)

  • hap[1,2]_bed : BED format regions of assembly to visualize (optional)

  • hap[1,2]_colotrack : BED format color track to include in visualizations (optional, no headers, columns Chrom Start End Color)
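
For illustration only, a single-sample manifest with the optional BED columns omitted might look like the sketch below (paths and sample name are placeholders; columns are space-aligned here for readability, but the actual file must be tab-separated, and the expected header row should be checked against .test/ont.tsv):

sample     hap1_ONT            hap2_ONT            hap1_asm       hap2_asm
mysample   reads_hap1.fq.gz    reads_hap2.fq.gz    asm_hap1.fa    asm_hap2.fa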

I recommend a single input file for each haplotype, with reads haplotype-phased using canu and parental Illumina data (phasing is not currently incorporated into this pipeline). Hi-C phasing of ONT reads is also possible: see the "hic" branch.

Prerequisites: Snakemake (tested with versions 6.12.1, 7.8.2, 7.14.0)

To run the Snakefile locally on the provided test cases (AMY locus of CHM13/1 pseudodiploid and HG02723), generating the optional image outputs, execute:

git clone https://github.com/pdishuck/GAVISUNK
cd GAVISUNK
snakemake -R --use-conda --cores 8 --configfile .test/config.yaml --resources load=1000

.BED results are found in the results/[sample]/final_outs directory, along with .tsv files summarizing the probability of each validation gap having been spanned by ONT reads, given the input library.

Automated visualizations of validation gaps are found in the results/pngs/gaps/[sample] directory.

For the included test cases, the following output should be generated: results/gaps/AMY_HG02723/AMY_HG02723_hap1_AMY_h1_84861_524275.png, corresponding to the region displayed in Figure 1 of the manuscript.

Example SGE execution (your cluster's parameters may vary):

alias snakesub='mkdir -p log; snakemake --ri --jobname "{rulename}.{jobid}" --drmaa " -V -cwd -j y -o ./log -e ./log -l h_rt=48:00:00 -l mfree={resources.mem}G -pe serial {threads} -w n -S /bin/bash" -w 60'
snakesub --use-conda -j150 > snakelog 2>&1 &

Troubleshooting tips:

  • To get Snakemake conda environments to work correctly, you may need to deactivate your local conda environment ($PATH issues, as in https://github.com/snakemake/snakemake/issues/883).
  • GAVISUNK is written to execute from its top-level directory.
  • Old versions of dependencies that are in your $PATH by default may interfere with GAVISUNK operation; see the dependencies in workflow/envs/viz.yaml.

How to cite:

Philip C Dishuck, Allison N Rozanski, Glennis A Logsdon, David Porubsky, Evan E Eichler, GAVISUNK: genome assembly validation via inter-SUNK distances in Oxford Nanopore reads, Bioinformatics, Volume 39, Issue 1, January 2023, btac714, https://doi.org/10.1093/bioinformatics/btac714

Code Snippets

shell:
  """
  # concatenate the two haplotype assemblies and index the combined FASTA
  zcat -f {input.hap1_asm} {input.hap2_asm} > {output.combined}
  samtools faidx {output.combined}
  """
shell:
  """
  # count canonical k-mers of length {params.sunk_len}, keeping only those seen exactly once (-U 1), i.e. candidate SUNKs
  jellyfish count -m {params.sunk_len} -s 10000000 -t {threads} -C -c 1 -U 1 {input.asm} -o {output.counts}
  """
shell:
  """
  # dump the singleton k-mers to a plain-text list, then write a FASTA in which each SUNK is both header and sequence
  jellyfish dump -c -t {input.counts}  | awk '{{print $1}}' > {output.db}
  awk '{{ print ">"$0"\\n"$0 }}' {output.db} > {output.fa}
  """
shell:
  """
  # build the mrsFAST index of the combined assembly (window size 14)
  mrsfast --ws 14 --index {input.asm}
  """
shell:
  """
  # map SUNK sequences back to the assembly with zero mismatches (-e 0), then coordinate-sort the alignments
  mrsfast --search {input.ref} --threads {threads} --mem 16 --seq {input.db} -o {output.sam} --disable-nohits -e 0
  samtools sort -@ {threads} {output.sam} -o {output.bam}
  """
shell:
  """
  # convert SUNK alignments to BED, merge overlapping/adjacent SUNKs into groups, and record each SUNK's merged group
  bedtools bamtobed -i {input.bam} > {output.bed} && \
  bedtools merge -i {output.bed} > {output.bedmerge} && \
  bedtools intersect -a {output.bed} -b {output.bedmerge} -wo | cut -f 1,2,4,8 > {output.locs}
  """
script:
  "../scripts/viz_detailed.py"
script:
  "../scripts/viz.py"
script:
  "../scripts/covprob.py"
shell:
  """
  # convert reads to FASTQ with dummy base qualities and split them into chunks for parallel processing
  cat {input.reads} | seqtk seq -F '#' | rustybam fastq-split {output.reads}
  """
shell:
  """
  workflow/scripts/kmerpos_annot3 {input.ONT} {input.db} {input.locs} {output}
  """
shell:
  """
  cat {input.gather_ONT_pos} > {output.ONT_pos}
  """
shell:
  """
  # record the length of each ONT read
  workflow/scripts/rlen {input.ONT} {output}
  """
shell:
  """
  workflow/scripts/diag_filter_v3 {input.ONT_pos} {input.fai} > {output.ONT_pos_diag}
  """
shell:
  """
  workflow/scripts/diag_filter_step2 {input.ONT_pos} {input.ONT_pos_diag} > {output.ONT_pos_diag_final}
  """
From line 108 of rules/tagONT.smk
shell:
  """
  cat {input.gather_ONT_pos} > {output.ONT_pos}
  cat {input.gather_ONT_len} > {output.ONT_len}
  """
From line 127 of rules/tagONT.smk
shell:
  """
  python workflow/scripts/badsunks_AR.py {input.hap1_fai} {input.hap2_fai} {input.hap1_sunkpos} {input.hap2_sunkpos} {output.badsunks}
  """
From line 148 of rules/tagONT.smk
script:
  "../scripts/split_locs.py"
From line 167 of rules/tagONT.smk
shell:
  """
  python workflow/scripts/process-by-contig_lowmem_AR.py {input.locs_contig} {input.sunk_pos_contig} {input.rlen} {input.bad_sunks} {output.outputdf} {output.bed}
  touch {output.bed}
  """
From line 187 of rules/tagONT.smk
shell:
  '''
  cat {input.bed} > {output.allbeds}
  '''
From line 206 of rules/tagONT.smk
shell:
  """
  python workflow/scripts/get_gaps.py {input.hap1_fai} {input.hap2_fai}  {wildcards.sample} results/{wildcards.sample}/bed_files/  results/{wildcards.sample}/final_out/
  """
From line 228 of rules/tagONT.smk
shell:
  """
  # extend each gap by 200 kbp on both sides (clipped to contig boundaries using the .fai genome file)
  bedtools slop -i {input.gaps} -g {input.fai} -b 200000 > {output.gaps_slop}
  """
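
# --- workflow/scripts/badsunks_AR.py (filename inferred from the rule above that invokes it) ---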
import argparse
import pandas as pd
from matplotlib import (pyplot as plt,
                        lines)
import seaborn as sns
import os

def main():
    parser = argparse.ArgumentParser()

    parser.add_argument("fai1", help="list of contigs, hap1")
    parser.add_argument("fai2", help="list of contigs, hap2")
    parser.add_argument("sunkpos1", help=".sunkpos file of SUNK locations on hap1 ONT reads")
    parser.add_argument("sunkpos2", help=".sunkpos file of SUNK locations on hap2 ONT reads")
    parser.add_argument("outpath", help="output file")
    args = parser.parse_args()

    sunkposcat = pd.read_csv(args.sunkpos1,sep="\t",header=None,names=['rname','pos','chrom','start','ID'],dtype={'rname':'string','pos':'uint32','chrom':'category','start':'uint32','ID':'uint32'})# 


    sunkposcat['ID2'] = sunkposcat['chrom'].astype(str) +":"+ sunkposcat['ID'].astype(str)
    kmer_counts = sunkposcat.ID2.value_counts()
    kmer_counts2 = pd.DataFrame(kmer_counts,index=None)
    kmer_counts2.reset_index(inplace=True)
    kmer_counts2.rename(columns={'ID2':'count','index':'ID2'},inplace=True)

    kmer_counts2[['contig','ID']] =  kmer_counts2.ID2.str.split(":",1,expand=True)

    fai = pd.read_csv(args.fai1, sep='\t',header=None,names=['contig','length','cumlen','3','4'])
    contiglist = set(fai.contig.tolist())

    kmer_counts2['correct_hap'] = kmer_counts2['contig'].isin(contiglist)

    kmer_counts2.correct_hap.value_counts()

    fig,ax = plt.subplots(figsize=(22,22))
    g = sns.histplot(data = kmer_counts2.query('count < 150 & count > 1'), x='count',binwidth=1,hue='correct_hap')
#    plt.savefig("badsunks_hap1.png")

    # meancov = kmer_counts2[kmer_counts2['count'] > kmer_counts2['count'].median()/2]['count'].mode()
    meancov = kmer_counts2.query("correct_hap")['count'].mode().values[0]
    print("mean coverage", meancov)

    limit = meancov + (meancov)**0.5*4 #4 SDs above mean (though apparent mean is depressed by seq accuracy)

    badsunks1a = set(kmer_counts2.query("correct_hap").query("(count > @limit) or (count < 2)")['ID2'])
    print(len(badsunks1a))

    badsunks1b = set(kmer_counts2.query("not correct_hap").query("(count > @limit)")['ID2'])
    print(len(badsunks1b))


    # sunkposfile = args.sunkpos
    sunkposfile = args.sunkpos2

    sunkposcat = pd.read_csv(sunkposfile,sep="\t",header=None,names=['rname','pos','chrom','start','ID'],dtype={'rname':'string','pos':'uint32','chrom':'category','start':'uint32','ID':'uint32'})# 

    sunkposcat['ID2'] = sunkposcat['chrom'].astype(str) +":"+ sunkposcat['ID'].astype(str)
    kmer_counts = sunkposcat.ID2.value_counts()
    kmer_counts2 = pd.DataFrame(kmer_counts,index=None)
    kmer_counts2.reset_index(inplace=True)
    kmer_counts2.rename(columns={'ID2':'count','index':'ID2'},inplace=True)

    kmer_counts2[['contig','ID']] =  kmer_counts2.ID2.str.split(":",1,expand=True)



    fai = pd.read_csv(args.fai2 ,sep='\t',header=None,names=['contig','length','cumlen','3','4'])
    contiglist = set(fai.contig.tolist())

    kmer_counts2['correct_hap'] = kmer_counts2['contig'].isin(contiglist)

    kmer_counts2.correct_hap.value_counts()

    fig,ax = plt.subplots(figsize=(22,22))
    g = sns.histplot(data = kmer_counts2.query('count < 150 & count > 1'), x='count',binwidth=1,hue='correct_hap')
 #   plt.savefig("badsunks_hap2.png")

    # meancov = kmer_counts2[kmer_counts2['count'] > kmer_counts2['count'].median()/2]['count'].mode()
    meancov = kmer_counts2.query("correct_hap")['count'].mode().values[0]
    print("mean coverage", meancov)


    limit = meancov + (meancov)**0.5*4 #4 SDs above mean (though apparent mean is depressed by seq accuracy)


    badsunks2a = set(kmer_counts2.query("correct_hap").query("(count > @limit) or (count < 2)")['ID2'])
    print(len(badsunks2a))


    badsunks2b = set(kmer_counts2.query("not correct_hap").query("(count > @limit)")['ID2'])
    print(len(badsunks2b))


    badsunks_all = badsunks1a |badsunks1b | badsunks2a | badsunks2b
    print(len(badsunks_all))

    outfile = open(args.outpath, "w")
    for element in badsunks_all:
        outfile.write(element + "\n")
    outfile.close()

main()
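
# --- workflow/scripts/covprob.py (inferred from the "script:" directive above) ---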
import pandas as pd, numpy as np
import os
from sympy.solvers import solve, solveset
from sympy import Symbol, Reals
import pyranges as pr

import argparse

print(snakemake.input)



# genome_size=3.055e6
fai = pd.read_csv(snakemake.input.fai,sep='\t',header=None,names=['contig','length','cumlen','3','4'])
genome_size = fai['length'].sum()/1000
# print("assembly size: ", genome_size)
df = pd.read_csv(snakemake.input.bed, header=None, names="Chromosome Start End".split(), sep="\t")
df['type'] = "gap"
# df2 = pd.read_csv(breakdir + "hap" + hap + ".nodata.bed", header=None, names="Chromosome Start End".split(), sep="\t")
# df2['type'] = 'nodata'
# df3 = pd.concat([df,df2])
df3 = df
df3.reset_index(inplace=True)
breaks = pr.PyRanges(df3)




# breaks

# df3.eval("End - Start").sort_values()

# kmermergefile = os.path.dirname(snakemake.input.pos_locs) +"/"+ cont2 + "_" + snakemake.wildcards.hap + ".loc"

kmermerge = pd.read_csv(snakemake.input.locs,sep="\t",header=None,names=['chrom','loc','kmer','ID'])# this could be split too
kmermerge = kmermerge.drop_duplicates(subset='kmer')
kmermerge['ID2'] = kmermerge['chrom'].astype(str) +":"+ kmermerge['ID'].astype(str)
kmer_groups = kmermerge.drop_duplicates(subset='ID2') #  
kmer_groups['dist'] = kmer_groups.ID.diff().fillna(kmer_groups.ID).astype(int) #rows)
kmer_groups['dist'] = kmer_groups['dist'].map( lambda x: max(x,0) ) 


lendf = pd.read_csv(snakemake.input.rlen,sep="\t",header=None,names=['rname','len'],dtype={'rname':'string','len':'uint32'})
lendf.drop_duplicates(inplace=True)
lendf.set_index("rname",drop=True,inplace=True)
lendf.sort_values(by="len",ascending=False,inplace=True)
lendf['bp']=lendf['len'].cumsum()
# lendf['cov'] = lendf['bp']/3.055e9

lendf['kbp'] = lendf['len']/1000
lendf['kbp'] = lendf['kbp'].astype(int)

kbp_cnt = lendf['kbp'].value_counts(sort=False)


def cov_prob(intsize,readlen,readcnt,basequal=1): # everything in kbp ?
    if intsize > readlen:
#         print("no help here")
        return 1
    return  ( (1-   (  ((readlen-intsize) /genome_size) * (basequal**2) )     )**(readcnt) )


x = Symbol('x')
p=0.94 # read quality
q=1-p
r=int(snakemake.config['SUNK_len']) # kmer size

roots2 = solveset( 1 - x + q*((p)**(r)) * ((x)**(r+1)), x, domain=Reals)
pinv = 1/p
roots2 = [x for x in roots2  if x > 0]
roots2.sort(key=lambda e: abs(pinv - e))
assert(len(roots2) == 2)
roots3  = roots2[1]

print("Should be unequal: ",roots3, pinv)



n=30 # group size
qn = ( ( 1- p*roots3)/(q*(r + 1 - r*roots3))) * (1 / (roots3**(n+1)))
pn = float(1-qn)
# print(qn, pn) # odds perfect 20mer not seen in kmergroup length 22



covprobs = []
for intsize in range(1,3500):
    cum_prob = 1
#     counter = 0
    for i,x in kbp_cnt.iteritems():
        prob = cov_prob(intsize, i, x,pn) 
        cum_prob = cum_prob * prob
    covprobs.append((intsize,1-cum_prob))
#         counter += 1    
covs,probs = zip(*covprobs)



covprobsdict = {k:v for k,v in covprobs}
covprobsdict[0] = 1.0

for cov, prob in covprobs:
    if prob <=0.99: 
        print(cov,prob)
        break



max_gaps = []
max_gaps2 = []
for x in breaks.df.itertuples():
    ix,ix2,contig,regstart,regend,types = x
    xmin = regstart
    xmax = regend    
    kmers = kmer_groups.query("(chrom == @contig) & (ID > (@xmin-2)) & (ID < (@xmax+2))")
#     print(len(kmers))
    kmerlist = list(pd.unique(kmers.ID))       
    max_gaps.append(kmers['dist'].max())    
    max_gaps2.append((contig,regstart,regend,kmers['dist'].max(),kmers.iloc[kmers['dist'].argmax()].kmer)) 


# breaks

# kmer_groups.dist.max()



df1 = breaks.df
df1['max_gap'] = max_gaps
df1['covprob'] = df1['max_gap'].apply(lambda x: covprobsdict[int(x/1000)] )
# df1.sort_values(by='covprob').head(25)

# df1.drop(columns='index').to_csv(snakemake.output.tsv,index=False,sep="\t")
df1.to_csv(snakemake.output.tsv,index=False,sep="\t")
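
# --- workflow/scripts/get_gaps.py (inferred from the rule above that invokes it) ---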
import pandas as pd, numpy as np
import os
import argparse
import pyranges as pr

parser = argparse.ArgumentParser()
parser.add_argument("fai1")
parser.add_argument("fai2")
parser.add_argument("sample")
parser.add_argument('indir')
parser.add_argument('outdir')
args = parser.parse_args()
outdir = args.outdir
indir = args.indir
samp = args.sample

def main():
    print("hap1")
    fai = pd.read_csv(args.fai1,sep='\t',header=None,names=['contig','length','cumlen','3','4'])
    contigs = fai.contig.tolist()
    allbreaks={}
    outbreaks = []
    bp1 = 0
    bp2 = 0
    nodata = []
    for x in contigs:
        # print(x)
        cont2 = x.replace("#","_")        
        contiglen = fai.query("contig == @x").length.values[0]
        try: bedin = pd.read_csv(indir + cont2 + "_hap1.bed",sep="\t",header=None,names=['contig','start','end'])
    #     try: bedin = pd.read_csv(x + ".bed",sep="\t",header=None,names=['contig','start','end'])        

        except:
            print("No bed for", x)
            nodata.append((x, 0, contiglen ))
            bp2 += contiglen
            continue
        if len(bedin) == 0:
            nodata.append((x, 0, contiglen ))
            bp2 += contiglen
            continue 

        bed = pr.PyRanges(bedin.rename(columns={'contig':'Chromosome','start':'Start','end':'End'}))
        bedin2 = bed.merge().df
        bedin2.columns = ['contig', 'start', 'end']
        bedin2['end_prev'] = bedin2['end'].shift(1,fill_value=np.nan)


        breaks = []
        first=True
        for row in bedin2.itertuples():
            if row.end < row.end_prev: 
                print("weird", x)
                break
            if first:
                first=False
                # print("confirmed region begins: ",row.start)
            else:
                # print(int(row.end_prev), row.start,)
                breaks.append((int(row.end_prev), row.start,))
                outbreaks.append((x, int(row.end_prev), row.start-1,))
                bp1 += (row.start - int(row.end_prev))

    #     print(row.end)
    #     print("unconfirmed bp at contig end: ", contiglen - row.end)
    # #     print(breaks)
        allbreaks[x] = breaks
    print("breaks: ", len(outbreaks), bp1)
    print("no data: ", len(nodata), bp2)
    pd.DataFrame(nodata).to_csv(outdir + "hap1.nodata.bed",header=False,sep="\t",index=False)
    pd.DataFrame(outbreaks).to_csv(outdir + "hap1.gaps.bed",header=False,sep="\t",index=False)

    fai = pd.read_csv(args.fai2,sep='\t',header=None,names=['contig','length','cumlen','3','4'])
    contigs = fai.contig.tolist()

    print("hap2")
    allbreaks={}
    outbreaks = []
    bp1 = 0
    bp2 = 0
    nodata = []
    for x in contigs:
        cont2 = x.replace("#","_")        # print(x)
        contiglen = fai.query("contig == @x").length.values[0]
    #     try: bedin = pd.read_csv(x + "_hap2.bed",sep="\t",header=None,names=['contig','start','end'])
        try: bedin = pd.read_csv(indir + cont2 + "_hap2.bed",sep="\t",header=None,names=['contig','start','end'])
        except:
            print("No bed for", x)
            nodata.append((x, 0, contiglen ))
            bp2 += contiglen
            continue
        if len(bedin) == 0:
            nodata.append((x, 0, contiglen ))
            bp2 += contiglen
            continue 

        bed = pr.PyRanges(bedin.rename(columns={'contig':'Chromosome','start':'Start','end':'End'}))
        bedin2 = bed.merge().df
        bedin2.columns = ['contig', 'start', 'end']
        bedin2['end_prev'] = bedin2['end'].shift(1,fill_value=np.nan)


        breaks = []
        first=True
        for row in bedin2.itertuples():
            if first:
                first=False
                # print("confirmed region begins: ",row.start)
            else:
                # print(int(row.end_prev), row.start,)
                breaks.append((int(row.end_prev), row.start,))
                outbreaks.append((x, int(row.end_prev), row.start-1,))
                bp1 += (row.start - int(row.end_prev))

    #     print(row.end)
    #     print("unconfirmed bp at contig end: ", contiglen - row.end)
    # #     print(breaks)
        allbreaks[x] = breaks
    print("breaks: ", len(outbreaks), bp1)
    print("no data: ", len(nodata), bp2)
    pd.DataFrame(nodata).to_csv(outdir + "hap2.nodata.bed",header=False,sep="\t",index=False)
    pd.DataFrame(outbreaks).to_csv(outdir + "hap2.gaps.bed",header=False,sep="\t",index=False)
main()
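
# --- workflow/scripts/process-by-contig_lowmem_AR.py (inferred from the rule above that invokes it) ---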
import networkx as nx
import graph_tool as gt
from graph_tool.topology import extract_largest_component

import timeit

from scipy.spatial.distance import pdist
import itertools as it

import numpy as npt
import pandas as pd, numpy as np
from matplotlib import (pyplot as plt,
                        lines)
import seaborn as sns
import timeit

from matplotlib.patches import Circle, Wedge, Polygon
from matplotlib.collections import PatchCollection
import matplotlib.pyplot as plt

import seaborn as sns
import numpy as np

from matplotlib import (pyplot as plt,
                        lines)
import matplotlib.ticker as mtick
import argparse


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument("SUNKs", help=".loc file of SUNKs aligned to assembly")
    parser.add_argument("sunkpos", help=".sunkpos file of SUNK locations on ONT reads")
    parser.add_argument("rlen", help=".rlen file of ONT read lengths")
    parser.add_argument("badsunks", help="list of SUNKs to exclude")
    parser.add_argument("outputfile", help="output file for intermediate")
    parser.add_argument("outputbed", help="output file for bed")

    parser.add_argument("--minlen", help="minimum length filter for reads to visualize",default=10000,type=int)
    parser.add_argument("--opt_filt", help="apply optional filter",action='store_true')

    args = parser.parse_args()
    #contig = args.contig
    ofile = args.outputfile



    lendf = pd.read_csv(args.rlen,sep="\t",header=None,names=['rname','len'],dtype={'rname':'string','len':'uint32'})
    lendf.set_index("rname",drop=True,inplace=True)
    minlen = args.minlen


    kmermergefile = args.SUNKs
    kmermerge = pd.read_csv(kmermergefile,sep="\t",header=None,names=['chrom','loc','kmer','ID'])# this could be split too
    kmermerge = kmermerge.drop_duplicates(subset='kmer')
    kmermerge['ID2'] = kmermerge['chrom'].astype(str) +":"+ kmermerge['ID'].astype(str)
    sunkposfile = args.sunkpos
    sunkposcat = pd.read_csv(sunkposfile,sep="\t",header=None,names=['rname','pos','chrom','start','ID'],dtype={'rname':'string','pos':'uint32','chrom':'category','start':'uint32','ID':'uint32'})# 

    contig = sunkposcat['chrom'].unique().tolist()[0]




    ## get list of bad sunks
    with open(args.badsunks, "r") as f:
        badsunkin = set(f.read().splitlines())
    sunkposcat['ID2'] = sunkposcat['chrom'].astype(str) +":"+ sunkposcat['ID'].astype(str)

    sunkposcat = sunkposcat.query("ID2 not in @badsunkin")


    # contig = row.chrom
    # regstart = row.start
    # regend = row.end
    # xmin = regstart
    # xmax = regend

    # contig = 'chr20'
    kmer_groups = kmermerge.drop_duplicates(subset='ID') #  
    pd.options.mode.chained_assignment = None  # default='warn
    kmer_groups['dist'] = kmer_groups.ID.diff().fillna(kmer_groups.ID).astype(int) #rows)

    # contig = bedreg1['chrom'].values[0]
    # xmin = bedreg1['start'].values[0]
    # xmax = bedreg1['end'].values[0]

    #Filter 1: only keep reads with at least two SUNKs within target region
    multisunk = sunkposcat.groupby('rname',as_index=False).agg({'ID':'nunique'}).sort_values(by='ID').query('ID > 1')
    if len(multisunk) == 0:
      pd.DataFrame(list([contig])).to_csv(ofile,header=False,sep="\t",index=False)
      exit()
    #   continue

    multisunkset = set(multisunk.rname.tolist())

    ### RESTRICT TO REGION
    sunkpos3 = sunkposcat.query('rname in @multisunkset ').sort_values(by=['rname','start']) # reads with multiple SUNK group matches # & start >= @xmin & start <= @xmax
    #         sunkpos3 = sunkpos2.drop_duplicates(subset=['rname','ID'],keep='first') # one representation per read * SUNK group # done in kmerpos_annot script


    sunkpos3b = sunkpos3.drop_duplicates()

    minlen = 10000
    minlenset = set(lendf.query("len >= @minlen").index.tolist())
    sub = sunkpos3b.query("rname in @minlenset") #sunkpos3





    # part2 bookmark

    # minlenset = set(lendf.query("len >= @minlen").index.tolist())


    ### keeep pos
    start = timeit.default_timer()

    sub['start'] = sub['start'].astype('int64')
    sub['pos'] = sub['pos'].astype('int64')

    # rnamelist = list(pd.unique(distdf2.rname))
    # rnamelist = ['00060b17-6288-4b65-a67e-93e7498e0633']
    # rnamelist = ['fffdf958-07a2-4bc1-95aa-7c6b05cc70c6']
    outputs= []
    counter=0
    subgraphs=[]
    start2=timeit.default_timer()
    times=[]
    sub['start'] = sub['start'].astype('int64')
    sub['pos'] = sub['pos'].astype('int64')
    grouped = sub.groupby("rname") # .drop(columns='rname')
    for rname, g in grouped:
    #     if counter > 10:break
        counter +=1 
        start=timeit.default_timer()
        startarray = pdist(g[['start']])
        posarray = pdist(g[['pos']])
        signarray = pdist(g[['pos']], lambda u,v: u[0]>v[0] )
        idarray = list(it.combinations(g['ID'],2))
        locarray = list(it.combinations(g['pos'],2)) # keep track of ID-Pos corresponence
        with np.errstate(divide='ignore'): # kmers occasionally seen in multiple locations on same read 
            diffarray = posarray/startarray
        mask = np.logical_not((diffarray >= 1.1) + (diffarray <= 0.9))
        if sum(mask) < 1: continue
        # only calculate on masked version (distdf2 equiv)
        signarray = signarray.astype(int)
        vals,counts = np.unique(signarray[mask], return_counts = True)
        trueOrient = vals[np.argmax(counts)]
        mask2 = np.logical_and(mask,(signarray == trueOrient))

        stack = np.column_stack([idarray,locarray])[mask2,:]
        sub2 = pd.DataFrame(np.column_stack([idarray,locarray])[mask2,:])
        sub2.columns = ['ID','ID2','pos1','pos2']
        mergedf = pd.DataFrame(np.row_stack([stack[:,[0,2]],stack[:,[1,3]]]))
        mergedf.columns = ['ID','pos']

        mergedf = pd.concat( [
            sub2[['ID','pos1']].rename(columns={'pos1':'pos'}),
            sub2[['ID2','pos2']].rename(columns={'ID2':'ID','pos2':'pos'})
            ]) # .drop_duplicates().query("ID > @xmin & ID < @xmax")


        multipos = mergedf.groupby("ID")['pos'].agg(nunique='nunique').query("nunique>1")

        if len(multipos)>=1:
            badrowlist = []
    #     multipos

            for i in multipos.index:
        #         i=38570942
                goodpos = mergedf.query("ID == @i")['pos'].value_counts().index[0] # any checks that this is good?
                badrowlist += list(sub2.query("(ID == @i) & not ((pos1 == @goodpos)|(pos2 == @goodpos))").index)



            sub2b = sub2.drop(badrowlist)
        else: sub2b = sub2


        mergedf2 = pd.concat( [
            sub2b[['ID','pos1']].rename(columns={'pos1':'pos'}),
            sub2b[['ID2','pos2']].rename(columns={'ID2':'ID','pos2':'pos'})
            ]) # .drop_duplicates().query("ID > @xmin & ID < @xmax")

        g = gt.Graph(directed=False)
        # g.vp.name = g.new_vertex_property("int64_t")
        name = g.add_edge_list(sub2b[['ID','ID2']].values,hashed=True, hash_type='int64_t') # eprops = [a_pos1,a_pos2]
        g.vp.name = name
        glarge = gt.topology.extract_largest_component(g)

        glarge.vp.name.get_array()
        df=pd.DataFrame(g.vp.name.get_array()[glarge.get_vertices()],columns=['ID'])
        df['rname'] = rname
        outputs.append(df)
        end=timeit.default_timer()
        times.append(end-start)

    if len(outputs) == 0:
      pd.DataFrame(list([contig])).to_csv(ofile,header=False,sep="\t",index=False)
      exit()
    else: 
      outputsdf = pd.concat(outputs)
      outputsdf.to_csv(ofile,header=False,sep="\t",index=False)
    end2=timeit.default_timer()




    # made outputdf

    ### make contig-wide graph
    graphtimes=[]
    start = timeit.default_timer()
    graphtimes.append(start)
    idarray = pd.DataFrame(outputsdf.groupby("rname").apply(lambda g: list(it.combinations(g['ID'],2)))).explode([0])
    end = timeit.default_timer()
    graphtimes.append(end)
    idarray.rename(columns={0:'IDs',},inplace=True)
    idarray[['ID', 'ID2']] = pd.DataFrame(idarray['IDs'].tolist(), index=idarray.index)
    graphtimes=[]
    end = timeit.default_timer()
    graphtimes.append(end)

    g = gt.Graph(directed=False)
    # g.vp.name = g.new_vertex_property("int64_t")
    name = g.add_edge_list(idarray[['ID','ID2']].values,hashed=True, hash_type='int64_t') # eprops = [a_pos1,a_pos2]
    g.vp.name = name

    end = timeit.default_timer()
    graphtimes.append(end)

    # glarge = gt.topology.extract_largest_component(g)
    c = gt.topology.label_components(g)[0]

    end = timeit.default_timer()
    graphtimes.append(end)


    regions = []
    for s in list(pd.unique(c.a)):
        color='lightgray'
        offset=counter
        sunks = g.vp.name.get_array()[gt.GraphView(g, vfilt=c.a == s).get_vertices()]
        if len(sunks) <= 2: continue
        start = int(sunks.min())
        end = int(sunks.max())
        span = end-start
        regions.append((start,end,span,len(sunks)))
    #     ax.plot([start,end],[rownum+offset,rownum+offset],color=color,linewidth=9,solid_capstyle='butt',zorder=1)    
        end = timeit.default_timer()
        counter +=1 

    regdf = pd.DataFrame(regions,columns=['start','end','span','sunks']).sort_values(by='start')    

    regdf['contig'] = contig
    regdf[['contig','start','end']].to_csv(args.outputbed,header=False,sep="\t",index=False)

main()
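
# --- workflow/scripts/split_locs.py (inferred from the "script:" directive above) ---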
import pandas as pd
import os


def splitlocs(hap_type,group_df, dirname, exten):
  for name, contig_group in group_df:
    contig_name = name.replace("#", "_")
    output_path = f'{dirname}/{contig_name}_{hap_type}.{exten}'
    print(output_path)
    contig_group.to_csv(output_path, header=None, index=None, sep="\t")
  return(True)


ONT_pos_df = pd.read_csv(snakemake.input.ONT_pos, header=None, sep="\t", names=['read','read_pos','contig','contig_start','contig_stop'])
kmer_loc = pd.read_csv(snakemake.input.kmer_loc, header=None, sep="\t", names=['contig','pos1','SUNK','pos2'])
contig_list = ONT_pos_df['contig'].unique().tolist()
kmer_loc_subset = kmer_loc[kmer_loc['contig'].isin(contig_list)]
dir_name = os.path.dirname(snakemake.output.flag)  
group_ONT = ONT_pos_df.groupby(['contig'])
group_loc = kmer_loc_subset.groupby(['contig'])
output_complete = splitlocs(snakemake.wildcards.hap,group_ONT, dir_name,"sunkpos")
output_complete_2 = splitlocs(snakemake.wildcards.hap,group_loc, dir_name,"loc")
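
# --- workflow/scripts/viz_detailed.py (inferred from the "script:" directive and the "_detailed" output filenames) ---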
import pandas as pd
from matplotlib import (pyplot as plt,
                        lines)
from matplotlib.collections import PatchCollection
from matplotlib.patches import Circle, Wedge, Polygon, Rectangle
import seaborn as sns
import numpy as nptwi
import matplotlib.ticker as mtick
import argparse
#print(pd.__version__)
import os
from ncls import NCLS

print(snakemake.input)

runmode=snakemake.wildcards.runmode
opt_keys = list(snakemake.input.keys())

regbedfile = snakemake.input.bed
if os.stat(regbedfile).st_size == 0:
    print("empty gap file")
    exit()

bedreg1 = pd.read_csv(regbedfile, delimiter="\t",encoding='utf-8',header=None)
header = ['chrom','start','end']
bedreg1.columns = header + [''] * (len(bedreg1.columns) - len(header))  

lendf = pd.read_csv(snakemake.input.rlen,sep="\t",header=None,names=['rname','len'],dtype={'rname':'string','len':'uint32'})
lendf.drop_duplicates(inplace=True)
lendf.set_index("rname",drop=True,inplace=True)


bedreg1['chrom'].value_counts()
contigs = list(bedreg1['chrom'].value_counts().index)
# print(len(contigs))

if 'colorbed' in opt_keys:
    dm = pd.read_csv(snakemake.input.colorbed, delim_whitespace=True, names=['chr','chrStart','chrEnd','color'], header=None,dtype = {'chr':'string','chrStart':'int','chrEnd':'int','color':'string'})
    dm = dm[dm.chrStart >= 0]
    dm["y"] = 1
    dm["func"] = ""





# sunkposfile = os.path.dirname(snakemake.input.pos_locs)+"/"+ cont2 + "_" + snakemake.wildcards.hap + ".sunkpos"
sunkposfile = snakemake.input.pos_locs
sunkposcat = pd.read_csv(sunkposfile,sep="\t",header=None,names=['rname','pos','chrom','start','ID'],dtype={'rname':'string','pos':'int64','chrom':'category','start':'int64','ID':'int64'})

sunkposcat.set_index("rname",drop=True,inplace=True)    
sunkposcat['ID2'] = sunkposcat['chrom'].astype(str) +":"+ sunkposcat['ID'].astype(str)

for contig in contigs:

    cont2 = contig.replace("#","_")
    bedreg= bedreg1.query('chrom == @contig')

    #print(type(snakemake.input.interout))
    #my_str=os.path.dirname(snakemake.input.interout)
    #print(my_str)

    try: outputsdf= pd.read_csv(os.path.dirname(snakemake.input.interout) + "/" + cont2 + "_" + snakemake.wildcards.hap + ".tsv" ,sep="\t",names=['ID','rname'],dtype = {'ID':'int','rname':'string'})
    except:
        print("outputsdf not found for", contig)

    outputsdf.set_index("rname",inplace=True,drop=True)

    if len(outputsdf) <2:
        print("NO READS FOR", contig)


    kmermergefile = os.path.dirname(snakemake.input.splits) +"/"+ cont2 + "_" + snakemake.wildcards.hap + ".loc"
    kmermerge = pd.read_csv(kmermergefile,sep="\t",header=None,names=['chrom','loc','kmer','ID'])# this could be split too
    kmermerge = kmermerge.drop_duplicates(subset='kmer')
    plotdir = os.path.dirname(snakemake.output.flag) + "/"
    #print(plotdir)

    kmermerge['ID2'] = kmermerge['chrom'].astype(str) +":"+ kmermerge['ID'].astype(str)


    kmer_groups = kmermerge.drop_duplicates(subset='ID') #  
    kmer_groups['dist'] = kmer_groups.ID.diff().fillna(kmer_groups.ID).astype(int) #rows)

    sunkpos_sub = sunkposcat.query("chrom == @contig")
    multisunk = sunkpos_sub.groupby(sunkpos_sub.index,as_index=True).agg({'ID':'nunique'}).sort_values(by='ID').query('ID > 1')

    if len(multisunk) == 0:
      print("No viable reads for this region: "+str(contig)+"_"+str(regstart)+"_"+str(regend))

    multisunklist = list(multisunk.index)
    multisunkset = set(multisunklist)

    ### RESTRICT TO REGION
    # sunkpos3 = sunkposcat.loc[multisunklist] # reads with multiple SUNK group matches # & start >= @xmin & start <= @xmax
# orig sunkposfile stuff

    sunkposfile_split = os.path.dirname(snakemake.input.splits)+"/"+ cont2 + "_" + snakemake.wildcards.hap + ".sunkpos"
    sunkposcat_split = pd.read_csv(sunkposfile_split,sep="\t",header=None,names=['rname','pos','chrom','start','ID'],dtype={'rname':'string','pos':'int64','chrom':'category','start':'int64','ID':'int64'})
    sunkposcat_split['ID2'] = sunkposcat_split['chrom'].astype(str) +":"+ sunkposcat_split['ID'].astype(str)
    sunkpos3 = sunkposcat_split.query('rname in @multisunkset ').sort_values(by=['rname','start']) # reads with multiple SUNK group matches # & start >= @xmin & start <= @xmax
    sunkpos3.set_index("rname",drop=True,inplace=True)



    # sunkpos3.set_index("rname",drop=True,inplace=True)

    for row in bedreg.itertuples():
        contig = row.chrom
        regstart = row.start
        regend = row.end
        xmin = regstart
        xmax = regend
        breakloc = int((xmin + xmax)/2)
        rightset = set(outputsdf.query("ID > @breakloc").index)
        # print("skipped ", skip)
        #         print(len(pd.unique(outputsdf['rname'])))
        #     end = timeit.default_timer()
        #     print(end-start) # 


        sunklines=True

        # PLOT

        df = outputsdf.query("ID > @xmin & ID < @xmax")


        if len(df) == 0:
            print("NO HITS FOR", row)
        #     continue
        #         df.set_index("rname",inplace=True,drop=True)


        kmerlist = list(pd.unique(kmer_groups.query("ID > @xmin & ID < @xmax").ID))
        kmerset = set(kmerlist)
        never_seen2 = kmerset - set(df.ID)


        constart = df.ID.min()
        conend = df.ID.max()
        # plt.figure(figsize=(25,15))
        # plt.grid(False)
        maxes={}
        first=True

        fig,ax = plt.subplots(figsize=(20,9))
        fig.subplots_adjust(right=0.75)
        ax.set_xlim(regstart,regend)

        intervals = []
        valid_intervals = []
        poss = []
        poss_bad = []

        for rname in list(pd.unique(df.index)):
            try: rlen = lendf.loc[rname].len
            except:
                print("no rlen for: ", rname)
                if runmode == 'user_bed':
                    continue
            sub = df.loc[rname]

            try: idlist = set(sub.ID)
            except: idlist = set([sub.ID])

            possub = sunkpos3.loc[rname].query("ID in @idlist")
            posfirst = possub.iloc[0]
            poslast = possub.iloc[-1]
            forward = True
            badlocs_trans = []

            possuball = sunkposcat.loc[rname].drop_duplicates(subset=['chrom','ID'])
            badlocs_same = possuball.query(" chrom==@contig & (not ID in @idlist) ")    
        #     badlocs_diff = possuball.query(" chrom != @contig ")    
            badlocs_diff = possuball.query(" chrom != @contig  ")  
            badlocs_diff_most = badlocs_diff.chrom.value_counts().index[0]
            badlocs_diff = badlocs_diff.query("chrom == @badlocs_diff_most") 
        #     if len(badlocs_diff) > 0 : break


            if posfirst.pos > poslast.pos: forward=False


            if forward:
                posstart = posfirst.start - posfirst.pos
                posend =  poslast.start + (rlen - poslast.pos)
                badlocs_same_trans = [ (posfirst.start + (x - posfirst.pos)) for x in badlocs_same.pos.tolist() ]
                badlocs_diff_trans = [ (posfirst.start + (x - posfirst.pos)) for x in badlocs_diff.pos.tolist() ]
            else:
                posend = poslast.pos + poslast.start
                posstart =  posfirst.start - (rlen - posfirst.pos)
                badlocs_same_trans = [ (posfirst.start - (x - posfirst.pos)) for x in badlocs_same.pos.tolist() ]
                badlocs_diff_trans = [ (posfirst.start - (x - posfirst.pos)) for x in badlocs_diff.pos.tolist() ]
        #     print(posstart,posend)

            start = posstart
            end = posend

            if runmode == 'gaps':
                plotgroup=1
                columns = ['Start','End','Plotgroup']
                if rname in rightset: plotgroup=0
                intervals.append((start,end,min(idlist),max(idlist),plotgroup))
            elif runmode == 'user_bed':
                intervals.append((start,end,min(idlist),max(idlist),plotgroup))
                columns = ['Start','End']
            poss_bad.append(badlocs_diff_trans)
        #     valid_intervals.append((min(idlist),max(idlist),plotgroup))

            poss.append(list(idlist))

        interval_df = pd.DataFrame(intervals,columns=['Start','End','valid_Start','valid_End','Plotgroup'],)
        # valid_interval_df = pd.DataFrame(valid_intervals,columns=['Start','End','Plotgroup'],)


        ncls = NCLS(interval_df.Start,interval_df.End,interval_df.index)

        row_asn = dict()


        # interval_df.sort_values(by=['Plotgroup','Start'],inplace=True)
        interval_df.sort_values(by=['Start'],inplace=True)

        rows_used = set([0])
        if runmode == 'user_bed':
            for i,s,e in interval_df.itertuples():
                overlap = ncls.find_overlap(s,e)
                rows_overlap = set()
                for i2,s2,e2 in overlap:
                    rows_overlap.add( row_asn.get(e2,0))
                rows_avail = rows_used - rows_overlap
                if len(rows_avail) == 0:
                    row_pick = sorted(rows_used)[-1]+1
                    rows_used.add(row_pick)
                    row_asn[i] = row_pick
                else:
                    row_asn[i] = sorted(rows_avail)[0]
            interval_df['row_asn'] = interval_df.index.map(lambda x: row_asn[x])
            patches = []
            ticks = []
            for i,s,e,r in interval_df.itertuples():
                patches.append(Rectangle((s, r), e-s, 0.7))
                tick = [Rectangle((x, r), 1, 0.7)  for x in poss[i]]
                ticks = ticks + tick
        elif runmode == 'gaps':
            for pgi in [0,1]:
                intv_sub = interval_df.query("Plotgroup == @pgi")
                intv_sub['End'] = intv_sub['End'] + 500 # Minimum separation between reads
                if pgi==1:
                    intv_sub = intv_sub.sort_values(by=['End'],ascending=False)
                for i,s,e,sv,ev,pg in intv_sub.itertuples():

                    overlap = ncls.find_overlap(s,e)
                    rows_overlap = set()
                    for i2,s2,e2 in overlap:
                        rows_overlap.add( row_asn.get(e2,0))

                    rows_avail = rows_used - rows_overlap
                    if len(rows_avail) == 0:
                        row_pick = sorted(rows_used)[-1]+1
                        rows_used.add(row_pick)
                        row_asn[i] = row_pick
                    else:
                        row_asn[i] = sorted(rows_avail)[0]
                    maxrow_prev = max(row_asn.values())

        interval_df['row_asn'] = interval_df.index.map(lambda x: row_asn[x])

        patches = []
        valid_patches = []                     
        ticks = []
        bad_ticks = []
        for i,s,e,sv,ev,pg,r in interval_df.itertuples():
            patches.append(Rectangle((s, r+0.2), e-s, 0.4))
            valid_patches.append(Rectangle((sv, r), ev-sv, 0.7))                         
            tick = [Rectangle((x, r), 1, 0.7)  for x in poss[i]]
            bad_tick = [Rectangle((x, r+0.2), 1, 0.4)  for x in poss_bad[i]]
            bad_ticks = bad_ticks + bad_tick
            ticks = ticks + tick

        never_seen3 = [x for x in never_seen2 if not ((x<constart) | (x>conend))]

        if 'colorbed' in opt_keys:
            # print("COLORBED!!!")
            patches2=[]
            for r in dm[['chrStart','chrEnd','color','chr']].query('chr==@contig & chrEnd > @xmin & chrStart < @xmax ').itertuples():
                polygon = Polygon([[r[1],-4],[r[1],-5],[r[2],-5],[r[2],-4]], True, color=r[3],alpha=1)
                patches2.append(polygon)
            p = PatchCollection(patches2, alpha=1, match_original=True)
            # print("patchlen: ",len(patches2))
            ax.add_collection(p)

        ax.add_collection(PatchCollection(valid_patches,color='lightgray',zorder=2,aa=True,edgecolor='w',linewidth=0.01))
        ax.add_collection(PatchCollection(patches,color='paleturquoise',zorder=1,aa=True,edgecolor='w',linewidth=0.01))                     
        ax.add_collection(PatchCollection(ticks,color='k',zorder=7,linewidth=0.5,edgecolor='k',aa=True)) # aa=True
        ax.add_collection(PatchCollection(bad_ticks,color='r',zorder=8,linewidth=0.5,edgecolor='r',aa=True)) # aa=True

        ax.scatter(x=never_seen3,y=[-2.5]*len(never_seen3),color='k',s=55,marker="|",facecolors=None,linewidths=0.4,zorder=2)


        if sunklines:
            sl = [Rectangle((x, -1),0.5, (max(rows_used) + 1.5) )  for x in kmerlist]
            ax.add_collection(PatchCollection(sl,color='darkgray',zorder=0.9,linewidth=0.5,edgecolor='lightgray',aa=True,linestyle="--")) # aa=True

        # ax.scatter(x=kmerlist,y=[-1.5]*len(kmerlist),color='k',s=55,marker="|",facecolors=None,linewidths=0.4,zorder=2)
        # # plt.scatter(data=graph3,x='ID',y='counter',color='red',s=2)
        fmt = '{x:,.0f}'
        tick = mtick.StrMethodFormatter(fmt)
        ax.xaxis.set_major_formatter(tick)


        ax.set_ylabel("ONT Read Depth")
        ax.set_xlabel("Contig coordinate")

        plt.rcParams['svg.fonttype'] = 'none'
        plt.savefig(plotdir + "/" + snakemake.wildcards.sample +"_" + snakemake.wildcards.hap + "_" +cont2+"_"+str(regstart)+"_"+str(regend)+"_detailed.svg",format="svg",pad_inches=0,bbox_inches='tight')
        plt.savefig(plotdir + "/" + snakemake.wildcards.sample +"_" + snakemake.wildcards.hap + "_" +cont2+"_"+str(regstart)+"_"+str(regend)+"_detailed.png",pad_inches=0,bbox_inches='tight',dpi=300)
        plt.savefig(plotdir + "/" + snakemake.wildcards.sample +"_" + snakemake.wildcards.hap + "_" +cont2+"_"+str(regstart)+"_"+str(regend)+"_detailed.pdf",pad_inches=0,bbox_inches='tight')


        #plt.savefig(args.plotdir +row.name +"__" +  args.sampname +"_" + args.hap + "_" +contig+"_"+str(regstart)+"_"+str(regend)+".svg",format="svg",pad_inches=0,bbox_inches='tight')

        #         plt.savefig(plotdir+contig+"_"+str(regstart)+"_"+str(regend)+"_v4.png",dpi=300,pad_inches=0,bbox_inches='tight')
        # plt.savefig(plotdir+contig+"_"+str(regstart)+"_"+str(regend)+"_v4.pdf",dpi=300,pad_inches=0,bbox_inches='tight')
        # plt.savefig(plotdir+contig+"_"+str(regstart)+"_"+str(regend)+"_v4.svg",dpi=300,pad_inches=0,bbox_inches='tight')  
        # plt.savefig(args.plotdir +row.name +"__" +  args.sampname +"_" + args.hap + "_" +contig+"_"+str(regstart)+"_"+str(regend)+".png",dpi=300,pad_inches=0,bbox_inches='tight')

        #         plt.savefig(args.plotdir + args.sampname +"_" + args.hap + "_" + row.name +"_" +contig+"_"+str(regstart)+"_"+str(regend)+".png",dpi=300,pad_inches=0,bbox_inches='tight')
        # plt.savefig("breakplots/"+contig+"_"+str(regstart)+"_"+str(regend)+"_.pdf",dpi=300,pad_inches=0,bbox_inches='tight')
        # plt.savefig("breakplots/"+contig+"_"+str(regstart)+"_"+str(regend)+"_.svg",dpi=300,pad_inches=0,bbox_inches='tight')  
        plt.close()

        print("Plotting complete for " +contig+"_"+str(regstart)+"_"+str(regend)) 
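
# --- workflow/scripts/viz.py (inferred from the "script:" directive above) ---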
import pandas as pd
from matplotlib import (pyplot as plt,
                        lines)
from matplotlib.collections import PatchCollection
from matplotlib.patches import Circle, Wedge, Polygon, Rectangle
import seaborn as sns
import numpy as nptwi
import matplotlib.ticker as mtick
import argparse
#print(pd.__version__)
import os
from ncls import NCLS

print(snakemake.input)

runmode=snakemake.wildcards.runmode
opt_keys = list(snakemake.input.keys())

regbedfile = snakemake.input.bed
if os.stat(regbedfile).st_size == 0:
    print("empty gap file")
    exit()

bedreg1 = pd.read_csv(regbedfile, delimiter="\t",encoding='utf-8',header=None)
header = ['chrom','start','end']
bedreg1.columns = header + [''] * (len(bedreg1.columns) - len(header))  

lendf = pd.read_csv(snakemake.input.rlen,sep="\t",header=None,names=['rname','len'],dtype={'rname':'string','len':'uint32'})
lendf.drop_duplicates(inplace=True)
lendf.set_index("rname",drop=True,inplace=True)


bedreg1['chrom'].value_counts()
contigs = list(bedreg1['chrom'].value_counts().index)
print(len(contigs))



if 'colorbed' in opt_keys:
    dm = pd.read_csv(snakemake.input.colorbed, delim_whitespace=True, names=['chr','chrStart','chrEnd','color'], header=None,dtype = {'chr':'string','chrStart':'int','chrEnd':'int','color':'string'})
    dm = dm[dm.chrStart >= 0]
    dm["y"] = 1
    dm["func"] = ""

for contig in contigs:

    cont2 = contig.replace("#","_")
    print(contig)
    bedreg= bedreg1.query('chrom == @contig')

    #print(type(snakemake.input.interout))
    #my_str=os.path.dirname(snakemake.input.interout)
    #print(my_str)

    try: outputsdf= pd.read_csv(os.path.dirname(snakemake.input.interout) + "/" + cont2 + "_" + snakemake.wildcards.hap + ".tsv" ,sep="\t",names=['ID','rname'],dtype = {'ID':'int','rname':'string'})
    except:
        print("outputsdf not found for", contig)

    outputsdf.set_index("rname",inplace=True,drop=True)

    if len(outputsdf) <2:
        print("NO READS FOR", contig)


    kmermergefile = os.path.dirname(snakemake.input.pos_locs) +"/"+ cont2 + "_" + snakemake.wildcards.hap + ".loc"
    kmermerge = pd.read_csv(kmermergefile,sep="\t",header=None,names=['chrom','loc','kmer','ID'])# this could be split too
    kmermerge = kmermerge.drop_duplicates(subset='kmer')
    plotdir = os.path.dirname(snakemake.output.flag) + "/"
    #print(plotdir)
    sunkposfile = os.path.dirname(snakemake.input.pos_locs)+"/"+ cont2 + "_" + snakemake.wildcards.hap + ".sunkpos"
    sunkposcat = pd.read_csv(sunkposfile,sep="\t",header=None,names=['rname','pos','chrom','start','ID'],dtype={'rname':'string','pos':'int64','chrom':'category','start':'int64','ID':'int64'})

    kmermerge['ID2'] = kmermerge['chrom'].astype(str) +":"+ kmermerge['ID'].astype(str)
    sunkposcat['ID2'] = sunkposcat['chrom'].astype(str) +":"+ sunkposcat['ID'].astype(str)
    print("sunkposcat len: ", len(sunkposcat))

    kmer_groups = kmermerge.drop_duplicates(subset='ID') #  
    kmer_groups['dist'] = kmer_groups.ID.diff().fillna(kmer_groups.ID).astype(int) #rows)

    multisunk = sunkposcat.groupby('rname',as_index=False).agg({'ID':'nunique'}).sort_values(by='ID').query('ID > 1')
    print("Filter 1 multisunk len: ", len(multisunk))

    if len(multisunk) == 0:
      print("No viable reads for this region: "+str(contig)+"_"+str(regstart)+"_"+str(regend))


    multisunkset = set(multisunk.rname.tolist())

    ### RESTRICT TO REGION
    sunkpos3 = sunkposcat.query('rname in @multisunkset ').sort_values(by=['rname','start']) # reads with multiple SUNK group matches # & start >= @xmin & start <= @xmax

    print(len(sunkpos3))
    print(len(pd.unique(sunkpos3.rname)))
    print(len(pd.unique(sunkposcat.rname)))

    sunkpos3.set_index("rname",drop=True,inplace=True)

    for row in bedreg.itertuples():
        contig = row.chrom
        regstart = row.start
        regend = row.end
        xmin = regstart
        xmax = regend
        breakloc = int((xmin + xmax)/2)
        rightset = set(outputsdf.query("ID > @breakloc").index)
        df = outputsdf.query("ID > @xmin & ID < @xmax")

        if len(df) == 0:
            print("NO HITS FOR", row)
            continue

        kmerlist = list(pd.unique(kmer_groups.query("ID > @xmin & ID < @xmax").ID))
        kmerset = set(kmerlist)
        never_seen2 = kmerset - set(df.ID)
        print(len(never_seen2))
        constart = df.ID.min()
        conend = df.ID.max()
        maxes={}
        first=True
        fig,ax = plt.subplots(figsize=(20,9))
        fig.subplots_adjust(right=0.75)
        ax.set_xlim(regstart,regend)
        intervals = []
        poss = []
        print("NUMBER OF RNAMES: ",len(list(pd.unique(df.index))))
        for rname in list(pd.unique(df.index)):
            try: rlen = lendf.loc[rname].len
            except:
                print("no rlen for: ", rname)
                if runmode == 'user_bed':
                    continue
            sub = df.loc[rname]

            try: idlist = set(sub.ID)
            except: idlist = set([sub.ID])

            possub = sunkpos3.loc[rname].query("ID in @idlist")
            posfirst = possub.iloc[0]
            poslast = possub.iloc[-1]
            forward = True
            badlocs_trans = []

            if posfirst.pos > poslast.pos: forward=False

            if forward:
                posstart = posfirst.start - posfirst.pos
                posend =  poslast.start + (rlen - poslast.pos)  
            else:
                posend = poslast.pos + poslast.start
                posstart =  posfirst.start - (rlen - posfirst.pos)

            start = posstart
            end = posend


            if runmode == 'gaps':
                plotgroup=1
                columns = ['Start','End','Plotgroup']
                if rname in rightset: plotgroup=0
                intervals.append((start,end,plotgroup))
            elif runmode == 'user_bed':
                intervals.append((start,end))
                columns = ['Start','End']

            poss.append(list(idlist))

        interval_df = pd.DataFrame(intervals,columns=columns)
        ncls = NCLS(interval_df.Start,interval_df.End,interval_df.index)
        row_asn = dict()
        interval_df.sort_values(by='Start',inplace=True)
        print("INTERVAL_DF len: ", len(interval_df))
        rows_used = set([0])
        if runmode == 'user_bed':
            for i,s,e in interval_df.itertuples():
                overlap = ncls.find_overlap(s,e)
                rows_overlap = set()
                for i2,s2,e2 in overlap:
                    rows_overlap.add( row_asn.get(e2,0))
                rows_avail = rows_used - rows_overlap
                if len(rows_avail) == 0:
                    row_pick = sorted(rows_used)[-1]+1
                    rows_used.add(row_pick)
                    row_asn[i] = row_pick
                else:
                    row_asn[i] = sorted(rows_avail)[0]
            interval_df['row_asn'] = interval_df.index.map(lambda x: row_asn[x])
            patches = []
            ticks = []
            for i,s,e,r in interval_df.itertuples():
                patches.append(Rectangle((s, r), e-s, 0.7))
                tick = [Rectangle((x, r), 1, 0.7)  for x in poss[i]]
                ticks = ticks + tick
        elif runmode == 'gaps':
            for pgi in [0,1]:
                intv_sub = interval_df.query("Plotgroup == @pgi")
                intv_sub['End'] = intv_sub['End'] + 500 # Minimum separation between reads
                if pgi==1:
                    intv_sub = intv_sub.sort_values(by=['End'],ascending=False)
                for i,s,e,pg in intv_sub.itertuples():

                    overlap = ncls.find_overlap(s,e)
                    rows_overlap = set()
                    for i2,s2,e2 in overlap:
                        rows_overlap.add( row_asn.get(e2,0))

                    rows_avail = rows_used - rows_overlap
                    if len(rows_avail) == 0:
                        row_pick = sorted(rows_used)[-1]+1
                        rows_used.add(row_pick)
                        row_asn[i] = row_pick
                    else:
                        row_asn[i] = sorted(rows_avail)[0]
                    maxrow_prev = max(row_asn.values())

            interval_df['row_asn'] = interval_df.index.map(lambda x: row_asn[x])
            patches = []
            ticks = []
            for i,s,e,pg,r in interval_df.itertuples():
                patches.append(Rectangle((s, r), e-s, 0.7))
                tick = [Rectangle((x, r), 1, 0.7)  for x in poss[i]]
                ticks = ticks + tick

        never_seen3 = [x for x in never_seen2 if not ((x<constart) | (x>conend))]

        if 'colorbed' in opt_keys:
            patches2=[]
            for r in dm[['chrStart','chrEnd','color','chr']].query('chr==@contig & chrEnd > @xmin & chrStart < @xmax ').itertuples():
                polygon = Polygon([[r[1],-4],[r[1],-5],[r[2],-5],[r[2],-4]], True, color=r[3],alpha=1)
                patches2.append(polygon)
            p = PatchCollection(patches2, alpha=1, match_original=True)
            ax.add_collection(p)
        ax.add_collection(PatchCollection(patches,color='lightgray',zorder=1,aa=True,edgecolor='w',linewidth=0.01))
        ax.add_collection(PatchCollection(ticks,color='k',zorder=7,linewidth=0.5,edgecolor='k',aa=True))
        ax.scatter(x=never_seen3,y=[-2.5]*len(never_seen3),color='k',s=55,marker="|",facecolors=None,linewidths=0.4,zorder=2)
        ax.scatter(x=kmerlist,y=[-1.5]*len(kmerlist),color='k',s=55,marker="|",facecolors=None,linewidths=0.4,zorder=2)
        fmt = '{x:,.0f}'
        tick = mtick.StrMethodFormatter(fmt)
        ax.xaxis.set_major_formatter(tick) 
        ax.set_ylabel("ONT Read Depth")
        ax.set_xlabel("Contig coordinate")
        plt.rcParams['svg.fonttype'] = 'none'
        print(plotdir + "/" + snakemake.wildcards.sample +"_" + snakemake.wildcards.hap + "_" +cont2+"_"+str(regstart)+"_"+str(regend)+".svg")
        plt.savefig(plotdir + "/" + snakemake.wildcards.sample +"_" + snakemake.wildcards.hap + "_" +cont2+"_"+str(regstart)+"_"+str(regend)+".svg",format="svg",pad_inches=0,bbox_inches='tight')
        plt.savefig(plotdir + "/" + snakemake.wildcards.sample +"_" + snakemake.wildcards.hap + "_" +cont2+"_"+str(regstart)+"_"+str(regend)+".png",pad_inches=0,bbox_inches='tight',dpi=300)
        plt.savefig(plotdir + "/" + snakemake.wildcards.sample +"_" + snakemake.wildcards.hap + "_" +cont2+"_"+str(regstart)+"_"+str(regend)+".pdf",pad_inches=0,bbox_inches='tight')
        plt.close()
        print("Plotting complete for " +contig+"_"+str(regstart)+"_"+str(regend))
URL: https://github.com/pdishuck/GAVISUNK
Name: gavisunk
Version: v1.0.0
License: MIT License