Analysis of multi-way chromatin interactions from very long sequencing reads


The thesis focuses on the concept of "chromosomal walks" (c-walks) introduced by Olivares-Chauvet et al., 2016. These chromosomal walks are derived from interactions identified de novo in whole-genome chromatin conformation data.
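For intuition, a c-walk can be pictured as an ordered chain of restriction fragments captured in one long read, where each adjacent pair of fragments is one ligation "hop". A minimal sketch with invented coordinates:

```python
# A toy c-walk: an ordered chain of restriction fragments (start, end, chromosome)
# captured in a single long read. Each adjacent pair is one ligation "hop".
cwalk = [(100, 500, "chr1"), (9000, 9400, "chr1"), (20000, 20350, "chr3")]

hops = len(cwalk) - 1                      # number of ligation events
chromosomes = {frag[2] for frag in cwalk}  # chromosomes the walk spans

print(hops)                 # 2
print(sorted(chromosomes))  # ['chr1', 'chr3']
```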

Code Snippets

shell:
	"src/py/digestion.py {input} {output}"
Snakemake, from line 63 of main/Snakefile
	shell:
		"bowtie2 -x {input.index}/hg19 -U {input.fastq} | samtools view -bS -> {output}" # human
		# "bowtie2 -x {input.index}/mm9 -U {input.fastq} | samtools view -bS -> {output}" # mouse



# rule bowtie2:  #  bedtools bamtofastq [OPTIONS] -i <BAM> -fq <FASTQ> - command to check (working)
# 	input:
# 		index = "data/Bowtie2Index/hg19",
# 		fastq = "data/digested/k562/{sample}.digested.fastq"
	shell:
		"src/py/filtering.py human {input} {output}"   # human
		# "src/py/filtering.py mouse {input} {output}"   # mouse

rule barh:
	input:
		"data/supportive/k562"  # human
		# "data/supportive/mesc"  # mouse
	output:
		"data/charts/k562/barh/human_barh.png"  # human
		# "data/charts/mesc/barh/mouse_barh.png"  # mouse
	shell:
	 	"src/py/barh.py human {input} {output}"  # human
		# "src/py/barh.py mouse {input} {output}"  # mouse

rule charts:
	input:
		"data/supportive/k562/{res}.tsv"  # human
		# "data/supportive/mesc/{res}.tsv"  # mouse
	output:
	# human
		RvsR_barplot = "data/charts/k562/{res}_RvsR.png",
		dist_0_5000_R = "data/charts/k562/{res}_0_5000_R.png",
		dist_500_10000_R = "data/charts/k562/{res}_500_10000_R.png",
		strand_1vs2_barplot = "data/charts/k562/{res}_strand_1vs2.png",
		dist_0_5000_S = "data/charts/k562/{res}_0_5000_S.png",
		dist_500_10000_S = "data/charts/k562/{res}_500_10000_S.png",
		log10 = "data/charts/k562/{res}_log10_500_1000.png"
Snakemake, from line 107 of main/Snakefile
shell:
	"src/py/charts.py human {input} {output.RvsR_barplot} {output.dist_0_5000_R} {output.dist_500_10000_R} \
	{output.strand_1vs2_barplot} {output.dist_0_5000_S} {output.dist_500_10000_S} {output.log10}"
Snakemake, from line 144 of main/Snakefile
	shell:
		"src/py/cwalk.py human {input.bedfile} {input.tsvfile} {output.cwalk_txt} {output.cwalk_bed}"  # human 
		# "src/py/cwalk.py mouse {input.bedfile} {input.tsvfile} {output.cwalk_txt} {output.cwalk_bed}"  # mouse

rule analysis:
	input:
		"data/cwalks/k562/txt"  # human
		# "data/cwalks/mesc/txt"  # mouse
	output:
		stats_intra = "data/charts/cwalks/k562/human_cwalk_hists.png",  # human
		fractions = "data/charts/cwalks/k562/human_fractions.png",  # human
		barh_all = "data/charts/cwalks/k562/cwalk_human_barh.png"  # human
		# stats = "data/analysis/cwalks/mesc/mouse_cwalk_hists.png",  # mouse
		# fractions = "data/analysis/cwalks/mesc/mouse_fractions.png",  # mouse
		# barh = "data/analysis/cwalks/mesc/cwalk_mouse_barh.png"  # mouse
	shell:
		"src/py/cwalk_analysis.py human {input} {output.stats_intra} {output.fractions} {output.barh_all}"  # human
		# "src/py/cwalk_analysis.py mouse {input} {output.stats} {output.fractions} {output.barh}"  # mouse

rule directionality:
	input:
		"data/cwalks/k562/txt"  # human
		# "data/cwalks/mesc/txt"  # mouse
	output:
		"data/charts/directionality/k562/human_directionality.png"  # human 
		# "data/charts/directionality/mesc/mouse_directionality.png"  # mouse
	shell:
		"src/py/directionality.py human data/cwalks/k562/txt {output}"  # human
		# "src/py/directionality.py mouse data/cwalks/mesc/txt {output}"  # mouse

rule TF:
	input:
		cwalk = "data/cwalks/k562/txt", 
		genome = "data/genomes/hg19.chrom.sizes",  
		chipseq = "data/factors/chip-seq_hg19_K562"
Snakemake, from line 164 of main/Snakefile
shell:
	"src/py/TF_human.py {input.cwalk} {input.genome} {input.chipseq} {output}"
Snakemake, from line 203 of main/Snakefile
shell:
	"src/py/TAD.py human {input.cwalk} {input.doms} {input.genome} {input.rest} \
	{output.frac_data} {output.frac_random} {output.types} {output.comparision} {output.counts} {output.doms}"
Snakemake, from line 231 of main/Snakefile
shell:
	"src/py/TF_mouse.py {input.cwalk} {input.genome} {input.chipseq}"
Snakemake, from line 245 of main/Snakefile
from os import listdir
from os.path import isfile, join
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import matplotlib.ticker as ticker
import sys


def load_tsvfiles():
    """ load all available .tsv files """
    return [sample for sample in listdir(sys.argv[1]) if isfile(join(sys.argv[1], sample))]


def count_aligns(infiles):
    """
    For each .tsv file, count all alignment pairs,
    those with abs_pos >= 500 (useful),
    the rejected remainder,
    and the sample labels.
    """
    raw_data, data = [], []
    for file in infiles:
        df = pd.read_csv(f"{sys.argv[1]}/{file}", sep='\t')  # load a single .tsv file
        df_clean = df.where(df.abs_pos >= 500).dropna()
        df_clean.reset_index(drop=True, inplace=True)
        raw_data.append(len(df.index))
        data.append(len(df_clean.index))
    rejected = [a - b for (a, b) in zip(raw_data, data)]
    labels = [file[:-4] for file in infiles]
    return [data, rejected, labels]


def barh(data, rejected, labels, name):
    sns.set_style("whitegrid")
    fig, ax = plt.subplots(figsize=(12, 16))
    y_pos = np.arange(len(labels))
    bar_1 = ax.barh(y_pos, data, color="tab:blue", edgecolor="black", height=1)
    bar_2 = ax.barh(y_pos, rejected, left=data, color="tab:red", edgecolor="black", height=1)
    ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000000) + "M"))
    plt.yticks(y_pos, labels)
    ax.set_title("Number of useful and rejected alignments in each sample", fontsize=18)
    plt.legend([bar_1, bar_2], ["Useful", "Rejected"], loc="upper right", prop={"size": 16})
    return plt.savefig(f"{name}"), plt.close()


if __name__ == '__main__':
    counted = count_aligns(load_tsvfiles())
    barh(counted[0], counted[1], counted[2], sys.argv[2])
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
import sys

organism = sys.argv[1]

""" load .tsv file """
df = pd.read_csv(sys.argv[2], sep='\t')
experiment_name = sys.argv[2][21:-4]

""" Initiation basic dependencies """
DIST_no_limit = df
DIST_500_inf = df.where(df.abs_pos >= 500)
DIST_0_5000 = df.where(df.abs_pos <= 5000)
DIST_500_10000 = df.where(df.abs_pos >= 500).where(df.abs_pos <= 10000)

df_RvsR_0_5000 = {x: y for x, y in DIST_0_5000.groupby("RvsR")}
df_RvsR_500_10000 = {x: y for x, y in DIST_500_10000.groupby("RvsR")}

df_strand_0_5000 = {x: y for x, y in DIST_0_5000.groupby("strand_1vs2")}
df_strand_500_10000 = {x: y for x, y in DIST_500_10000.groupby("strand_1vs2")}

RvsR_keys = list(df_RvsR_0_5000.keys())
strand_keys = list(df_strand_0_5000.keys())


def main():
    def distances_histplot(df, keys, experiment_name, name):
        sns.set_style("whitegrid")
        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex=False, sharey=False, figsize=(16, 12))
        plt.subplots_adjust(bottom=0.1, top=0.9, hspace=0.25, wspace=0.3, right=0.93, left=0.15)
        if "R1 vs R1" in keys:
            R1vsR1 = df["R1 vs R1"].reset_index(drop=True)
            sns.histplot(ax=ax1, data=df, x=R1vsR1["abs_pos"], color="red", edgecolor="black")
            ax1.set_title("R1 vs R1", fontsize=18)
        elif "forward vs forward" in keys:
            S1vsS1 = df["forward vs forward"].reset_index(drop=True)
            sns.histplot(ax=ax1, data=df, x=S1vsS1["abs_pos"], color="forestgreen", edgecolor="black")
            ax1.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
            ax1.set_title("forward vs forward", fontsize=18)
        else:
            fig.delaxes(ax1)
        if "R2 vs R2" in keys:
            R2vsR2 = df["R2 vs R2"].reset_index(drop=True)
            sns.histplot(ax=ax2, data=df, x=R2vsR2["abs_pos"], color="red", edgecolor="black")
            ax2.set_title("R2 vs R2", fontsize=18)
        elif "reverse vs reverse" in keys:
            S1vsS1 = df["reverse vs reverse"].reset_index(drop=True)
            sns.histplot(ax=ax2, data=df, x=S1vsS1["abs_pos"], color="forestgreen", edgecolor="black")
            ax2.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
            ax2.set_title("reverse vs reverse", fontsize=18)
        else:
            fig.delaxes(ax2)
        if "R1 vs R2" in keys:
            R1vsR2 = df["R1 vs R2"].reset_index(drop=True)
            sns.histplot(ax=ax3, data=df, x=R1vsR2["abs_pos"], color="red", edgecolor="black")
            ax3.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
            ax3.set_title("R1 vs R2", fontsize=18)
        elif "forward vs reverse" in keys:
            S1vsS1 = df["forward vs reverse"].reset_index(drop=True)
            sns.histplot(ax=ax3, data=df, x=S1vsS1["abs_pos"], color="forestgreen", edgecolor="black")
            ax3.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
            ax3.set_title("forward vs reverse", fontsize=18)
        else:
            fig.delaxes(ax3)
        if "R2 vs R1" in keys:
            R2vsR1 = df["R2 vs R1"].reset_index(drop=True)
            sns.histplot(ax=ax4, data=df, x=R2vsR1["abs_pos"], color="red", edgecolor="black")
            ax4.set_title("R2 vs R1", fontsize=18)
        elif "reverse vs forward" in keys:
            S1vsS1 = df["reverse vs forward"].reset_index(drop=True)
            sns.histplot(ax=ax4, data=df, x=S1vsS1["abs_pos"], color="forestgreen", edgecolor="black")
            ax4.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + 'K'))
            ax4.set_title("reverse vs forward", fontsize=18)
        else:
            fig.delaxes(ax4)
        fig.suptitle(f"Sample from {experiment_name} {organism} cells", fontsize=20)

        axes = [ax1, ax2, ax3, ax4]
        for ax in axes:
            ax.set_xlabel("Absolute value comparing each two aligns", fontsize=15)
            ax.set_ylabel("Number of compared aligns", fontsize=15)
        return plt.savefig(f"{name}"), plt.close()

    def feature_barplot(df, experiment_name, name, save_as):
        sns.set_style("whitegrid")
        plt.title(f"Sample from {experiment_name} {organism} cells")
        if name == "RvsR":
            fig = plt.figure(figsize=(10, 5))
            fig.suptitle(f"Sample from {experiment_name} {organism} cells")
            plt.subplots_adjust(wspace=0.3, top=0.85)
            fig.add_subplot(121)
            ax1 = sns.barplot(x=df[name].value_counts().index, y=df[name].value_counts(), color="royalblue", edgecolor="black")
            ax1.set(xlabel="Combinations", ylabel="Number of compared aligns")
            ax1.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000000) + "M"))
            fig.add_subplot(122)
            ax2 = sns.barplot(x=df[name].value_counts().iloc[1:].index, y=df[name].value_counts().iloc[1:], color="royalblue", edgecolor="black")
            ax2.set(xlabel="Combinations", ylabel="Number of compared aligns")
        elif name == "strand_1vs2":
            fig = plt.figure(figsize=(8, 5))
            fig.suptitle(f"Sample from {experiment_name} {organism} cells")
            ax = sns.barplot(x=df[name].value_counts().index, y=df[name].value_counts(), color="forestgreen", edgecolor="black")
            ax.set(xlabel="Combinations", ylabel="Number of compared aligns")
            ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
        return plt.savefig(f"{save_as}"), plt.close()

    def log_scale_histplot(df, experiment_name, name):
        sns.set(rc={'figure.figsize': (12, 8)})
        sns.set_style("whitegrid")
        ax = sns.histplot(data=df, x=df["abs_pos"], log_scale=True, color="red", edgecolor="black")
        ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
        ax.set(xlabel="Absolute value of position comparing each two aligns [log10]",
               ylabel="Number of compared aligns",
               title=f"Sample from {experiment_name} {organism} cells")
        return plt.savefig(f"{name}"), plt.close()


    """ Plotting """
    # RvsR
    feature_barplot(DIST_no_limit, experiment_name, "RvsR", sys.argv[3])  # "RvsR"
    distances_histplot(df_RvsR_0_5000, RvsR_keys, experiment_name, sys.argv[4])  # "0_5000_R"
    distances_histplot(df_RvsR_500_10000, RvsR_keys, experiment_name, sys.argv[5])  # "500_10000_R"

    # strand 1vs2
    feature_barplot(DIST_no_limit, experiment_name, "strand_1vs2", sys.argv[6])  # "strand_1vs2"
    distances_histplot(df_strand_0_5000, strand_keys, experiment_name, sys.argv[7])  # "0_5000_S"
    distances_histplot(df_strand_500_10000, strand_keys, experiment_name, sys.argv[8])  # "500_10000_S"

    # log10-scale
    log_scale_histplot(DIST_500_10000, experiment_name, sys.argv[9])  # "log10_500_1000"

if __name__ == '__main__':
    main()
import networkx as nx
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
from matplotlib.ticker import MaxNLocator
import numpy as np
import statistics
import sys


import matplotlib.pylab as pylab
params = {'legend.fontsize': 'x-large',
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'xx-large',
          'ytick.labelsize': 'xx-large'}
pylab.rcParams.update(params)


def load_cwalk_graph(cwalk_graph):
    return pickle.load(open(cwalk_graph, "rb"))


def load_files(directory, parser):
    """
    Load all available files from a directory
    and return lists of parsed objects and their labels
    """
    import os
    files, labels = [], []
    for file in os.listdir(directory):
        labels.append(file)
        files.append(parser(f"{directory}/{file}"))
    return files, labels


def fractions(one_chr, two_chr, many_chr, name):
    sns.set_style("whitegrid")
    plt.figure(figsize=(12, 10))
    bar1 = plt.bar([i for i in range(3, 16)], one_chr, width=1, edgecolor="black", color="tab:blue")
    bar2 = plt.bar([i for i in range(3, 16)], two_chr, bottom=one_chr, width=1, edgecolor="black", color="tab:cyan")
    bar3 = plt.bar([i for i in range(3, 16)], many_chr,
                   bottom=np.add(one_chr, two_chr), width=1, edgecolor="black", color="tab:gray")
    plt.xlabel("Number of hops", fontsize=16)
    plt.ylabel("Percentage [%]", fontsize=16)
    plt.title(f"Fraction of c-walks in each class in {sys.argv[1]} cells", fontsize=18)
    plt.legend([bar1, bar2, bar3], ["one chromosome (class I)",
                                    "two chromosomes (class II)",
                                    "many chromosomes (class III)"])
    return plt.savefig(f"{name}"), plt.close() 


def barh(graphs, labels, name):
    data = []
    for graph in graphs:
        data.append(nx.number_connected_components(graph))
    sns.set_style("whitegrid")
    fig, ax = plt.subplots(figsize=(12, 16))
    ax.barh([label[:-11] for label in labels], data, color="tab:blue", edgecolor="black", height=1)
    ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
    ax.set_title(f"Number of c-walks in each graph in {sys.argv[1]} cells", fontsize=18)
    return plt.savefig(f"{name}"), plt.close()


def stats(graphs, name):
    i, j, k = 0, 0, 0
    lengths = []
    for graph in graphs:
        for cwalk in list(nx.connected_components(graph)):
            cwalk = list(cwalk)
            k += 1
            if all(cwalk[i][2] == cwalk[0][2] for i in range(0, len(cwalk))):
                lengths.append(len(cwalk))
                i += 1
            else:
                j += 1


    print(f"Number of intra-chromosomal cwalks: {i}")
    print(f"Number of inter-chromosomal cwalks: {j}")
    print(f"Number of all cwalks: {k}")
    print(f"Percentage of intra-chromosomal cwalks is {round((i / k) * 100, 3)}%")
    print(f"Average: {round(statistics.mean(lengths), 2)}")
    print(f"Median: {round(statistics.median(lengths), 2)}")
    print(f"Mode: {statistics.mode(lengths)}")
    print(f"Standard deviation: {round(statistics.stdev(lengths), 2)}")
    sns.set_style("whitegrid")
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    plt.suptitle(f"Distribution of c-walks length in {sys.argv[1]} cells", fontsize=20)
    ax1.hist(lengths, color="tab:blue", edgecolor="black")
    ax2.hist([x for x in lengths if x <= 6], color="tab:blue", edgecolor="black", rwidth=1)
    ax1.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
    ax2.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
    ax2.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax1.set_xlabel("c-walks length", fontsize=16)
    ax1.set_ylabel("Number of c-walks", fontsize=16)
    ax2.set_xlabel("c-walks length", fontsize=16)
    ax2.set_ylabel("Number of c-walks", fontsize=16)
    return plt.savefig(f"{name}"), plt.close()


def identical(cwalk):
    """ Number of distinct chromosomes spanned by a c-walk """
    return len({node[2] for node in cwalk})


def counting(graphs, length):
    i = 0
    one, two, many = 0, 0, 0
    for graph in graphs:
        for cwalk in list(nx.connected_components(graph)):
            cwalk = list(cwalk)
            if len(cwalk) == length:
                i += 1
                span = identical(cwalk)
                if span == 1:
                    one += 1
                elif span == 2:
                    two += 1
                elif span >= 3:
                    many += 1

    return [round((one / i) * 100, 2), round((two / i) * 100, 2), round((many / i) * 100, 2)]


def main():
    graphs, labels = load_files(sys.argv[2], load_cwalk_graph)  # load cwalk graphs
    one_chr, two_chr, many_chr = [], [], []
    for hop in range(3, 16):
        one, two, many = counting(graphs, hop)
        one_chr.append(one)  # fraction of inter-chrs cwalks of particular length
        two_chr.append(two)
        many_chr.append(many)

    """ Plotting """
    stats(graphs, sys.argv[3])
    fractions(one_chr, two_chr, many_chr, sys.argv[4])
    barh(graphs, labels, sys.argv[5])


if __name__ == '__main__':
    main()
import networkx as nx
import pandas as pd
from filtering import typical_chromosomes, collect_data
from intervaltree import IntervalTree
import pickle
import sys


def parse_positions(tsvfile: str, abs_threshold: int) -> zip:
    """ Returns lists of positions of aligns """
    df = pd.read_csv(tsvfile, sep="\t")
    df = df.where(df.abs_pos >= abs_threshold).dropna().reset_index(drop=True)
    return zip(df.chr_R1.tolist(), df.chr_R2.tolist(), df.pos_R1.astype(int).tolist(), df.pos_R2.astype(int).tolist())


def parse_bedfile(bedfile: str, organism: str) -> dict:
    """ Return restriction-site positions grouped by the chromosomes where they were found """
    df = pd.read_csv(bedfile, sep="\t", header=None)
    df = df.iloc[:, :2]
    df = df.loc[df[0].isin(typical_chromosomes(organism))].reset_index(drop=True)
    return {x: y for x, y in df.groupby(0)}


def add_edge(u, v):
    """ Creating weighted edges """
    if G.has_edge(u, v):
        G[u][v]["weight"] += 1
    else:
        G.add_edge(u, v, weight=1)


def matching_edges(interval_tree_dict: dict, positions: zip):
    """ Graph construction:
    interval_tree: a dictionary storing the restriction intervals (as IntervalTree) for each chromosome
    positions: list of C-walk positions
    """
    for chr1, chr2, position_R1, position_R2 in positions:

        left_edge = interval_tree_dict[chr1][position_R1]
        right_edge = interval_tree_dict[chr2][position_R2]

        if left_edge and right_edge:  # prevention of empty sets

            left_edge = list(list(left_edge)[0])
            right_edge = list(list(right_edge)[0])
            left_edge[2], right_edge[2] = chr1, chr2
            add_edge(tuple(left_edge), tuple(right_edge))  # ex. (77366342, 77367727, "chr1")


def cwalk_construction(edges):
    """ Resolve the C-walk graph """
    P = nx.Graph()
    edge_index = 0
    for u, v, a in edges:
        if (u not in P or P.degree[u] < 2) and (v not in P or P.degree[v] < 2):
            P.add_edge(u, v, weight=a["weight"], index=edge_index)
            edge_index += 1

    for cwalk in list(nx.connected_components(P)):
        if len(cwalk) < 3:
            for node in cwalk:
                P.remove_node(node)  # remove c-walks that consist of only one hop
    return P


def save_as_bed(graph, experiment_name):
    numbers = list(range(1, nx.number_connected_components(graph) + 1))
    order = [2, 0, 1]
    for cwalk, number in zip(list(nx.connected_components(graph)), numbers):
        cwalk_length = range(1, len(cwalk) + 1)

        for node, length in zip(cwalk, cwalk_length):
            node = [node[i] for i in order]
            node.append(f"{length}_{number}")
            collect_data(node, f"{experiment_name}", "a")


def main():
    restrictions_dict = parse_bedfile(sys.argv[2], sys.argv[1])  # .bed file
    tree_dict = dict()  # ex. tree_dict["chr1"] will be an object of type IntervalTree
    for chr in typical_chromosomes(sys.argv[1]):
        """ Interval tree construction, separate for each chromosome """
        restrictions = restrictions_dict[chr][1].tolist()
        intervals = [(i, j) for i, j in zip(restrictions[:-1], restrictions[1:])]
        tree_dict[chr] = IntervalTree.from_tuples(intervals)

    """ Parse C-walk positions """
    positions = parse_positions(sys.argv[3], 500)  # .tsv file with selected absolute threshold
    matching_edges(tree_dict, positions)

    """  C-walks construction """
    sorted_edges = sorted(G.edges(data=True), key=lambda x: x[2]["weight"], reverse=True)  # Sort edges by read-coverage
    P = cwalk_construction(sorted_edges)

    save_as_bed(P, sys.argv[5])  # save as .bed outfile with cwalks
    pickle.dump(P, open(sys.argv[4], "wb"))  # save cwalks as .txt outfile in binary mode


if __name__ == '__main__':
    G = nx.Graph()
    main()
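The greedy resolution in cwalk_construction above can be illustrated on a toy edge list: edges are visited in decreasing order of weight (read coverage) and accepted only while both endpoints still have degree below 2, so every resolved component is a simple path or cycle. A self-contained sketch of the same degree-capping rule, without networkx (nodes and weights are invented):

```python
# Toy edges already sorted by descending weight, as in main() above.
edges = [
    ("a", "b", 5),
    ("b", "c", 4),
    ("b", "d", 3),  # rejected: "b" already has degree 2 by then
    ("c", "d", 2),
]

degree = {}
accepted = []
for u, v, w in edges:
    # accept an edge only while both endpoints have degree < 2
    if degree.get(u, 0) < 2 and degree.get(v, 0) < 2:
        accepted.append((u, v))
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

print(accepted)  # [('a', 'b'), ('b', 'c'), ('c', 'd')]
```

The accepted edges form the simple path a-b-c-d; the heavier but conflicting edge b-d is dropped.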
from Bio.SeqIO.QualityIO import FastqGeneralIterator
import sys


def division(i, seq, positions) -> str:
    if i == len(positions):
        sub_seq = seq[positions[i-1]:]
    elif i == 0:
        sub_seq = seq[:positions[i]+4]
    else:
        sub_seq = seq[positions[i-1]:positions[i]+4]
    return sub_seq


def digest(input_reads) -> zip:
    sub_title, sub_seq, sub_qual = [], [], []
    for title, seq, qual in input_reads:
        positions = [i for i in range(len(seq)) if seq.startswith("GATC", i) if i != 0]
        if positions:
            for i in range(0, len(positions)+1):
                sub_title.append(f"{i}.{title}")
                sub_seq.append(division(i, seq, positions))
                sub_qual.append(division(i, qual, positions))
        else:
            sub_title.append(title), sub_seq.append(seq), sub_qual.append(qual)
    return zip(sub_title, sub_seq, sub_qual)


def fastq_write(output_zip):
    with open(sys.argv[2], 'w') as output:
        for title, seq, qual in output_zip:
            output.write(f"@{title}\n{seq}\n+\n{qual}\n")


if __name__ == '__main__':
    reads = FastqGeneralIterator(sys.argv[1])
    fastq_write(digest(reads))
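The splitting rule implemented by division and digest can be checked on a toy read: the read is cut at every internal GATC occurrence, and each fragment retains the 4 bp motif at its end. A self-contained sketch of the same logic (not importing the script):

```python
def split_at_gatc(seq):
    """Cut a read at every internal GATC site, keeping the motif at the
    end of each fragment (mirrors division()/digest() above)."""
    positions = [i for i in range(len(seq)) if seq.startswith("GATC", i) and i != 0]
    if not positions:
        return [seq]
    fragments = [seq[:positions[0] + 4]]
    fragments += [seq[a:b + 4] for a, b in zip(positions[:-1], positions[1:])]
    fragments.append(seq[positions[-1]:])
    return fragments

print(split_at_gatc("AAGATCTTGATCCC"))  # ['AAGATC', 'GATCTTGATC', 'GATCCC']
```

A leading GATC (i == 0) is not a cut site, so a read starting with the motif stays whole.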
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import statistics
import matplotlib.ticker as ticker
from cwalk_analysis import load_cwalk_graph, load_files
import random
import pandas as pd


def directionality(path: list) -> list:
    path = [(node[0]+node[1])/2 for node in path]  # distances are measured between the midpoints of the restriction intervals
    dist = [b - a for a, b in zip(path[:-1], path[1:])]
    same_sign_dist = [(a * b >= 0) for a, b in zip(dist[:-1], dist[1:])]
    # a distance of 0 occurs when two nodes fall in the same restriction interval;
    # repeated nodes (and circles) are irrelevant to the directionality concept
    return same_sign_dist


def avg_directionality(signs: list) -> float:
    return round(statistics.mean(signs), 2)


def hist(data, data_random, avg, avg_random, length):
    sns.set_style("whitegrid")
    fig, axs = plt.subplots(1, 2, sharex=True, sharey=True, tight_layout=True, figsize=(15, 6))
    fig.suptitle(f"Distribution of directionality of c-walks of {length} length from {sys.argv[1]} cells", fontsize=14)
    axs[0].hist(data, color="tab:blue", edgecolor="black")
    axs[0].set_title(f"c-walks from data \n average value of directionality is {avg}", fontsize=14)
    axs[1].hist(data_random, color="tab:red", edgecolor="black")
    axs[1].set_title(f"randomize c-walks \n average value of directionality is {avg_random}", fontsize=14)
    axs[0].set_ylabel("number of c-walks", fontsize=14)
    fig.supxlabel("value of directionality", fontsize=14)
    axs[1].yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
    axs[0].yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
    return plt.savefig(f"{sys.argv[3]}"), plt.close()


def main():
    graphs, _ = load_files(sys.argv[2], load_cwalk_graph)  # load .txt cwalk graph
    cwalks = []

    for graph in graphs:
        S = [graph.subgraph(c).copy() for c in nx.connected_components(graph)]

        for subgraph in S:
            sorted_cwalk = sorted(subgraph.edges(data=True), key=lambda x: x[2]["index"])

            edges = []
            nodes = []
            for next_edge in sorted_cwalk:
                v1, v2, _ = next_edge
                edges.append([v1, v2])
                for i in range(1, len(edges)):
                    for edge in edges:
                        v1, v2 = edge
                        if v1 in (edges[i][0], edges[i][1]) and v1 not in nodes:
                            nodes.append(v1)
                        elif v2 not in nodes:
                            nodes.append(v2)
            if nodes[0] != edges[0][0]:
                nodes.insert(0, edges[0][0])
            elif nodes[0] != edges[0][1]:
                nodes.insert(0, edges[0][1])

            if nodes[-1] == nodes[0]:  # remove repetitive nodes (in case of circle)
                nodes = nodes[1:]  # remove first node in circle

            if all(nodes[i][2] == nodes[0][2] for i in range(0, len(nodes))):  # only intra-chrs cwalks
                cwalks.append(nodes)


    avg_signs = []
    avg_shuffle_signs = []

    for cwalk in cwalks:

        avg_signs.append(avg_directionality(directionality(cwalk)))

        random.shuffle(cwalk)
        avg_shuffle_signs.append(avg_directionality(directionality(cwalk)))

    hist(avg_signs, avg_shuffle_signs, round(statistics.mean(avg_signs), 2),
         round(statistics.mean(avg_shuffle_signs), 2), "all")

    from scipy.stats import wilcoxon
    res = wilcoxon(x=avg_signs, y=avg_shuffle_signs, zero_method="wilcox", alternative="less")
    print(f"The value of the test statistic: {res.statistic} and p-value: {res.pvalue}")


if __name__ =='__main__':
    main()
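The directionality measure can be sanity-checked on toy intra-chromosomal walks: fragment midpoints yield consecutive displacements, and the score is the fraction of adjacent displacement pairs sharing a sign. A self-contained sketch of the idea behind directionality and avg_directionality above (invented coordinates, without the rounding used there):

```python
def directionality_score(path):
    """Fraction of consecutive hops moving in the same genomic direction."""
    mids = [(start + end) / 2 for start, end, _chrom in path]
    dist = [b - a for a, b in zip(mids[:-1], mids[1:])]
    same_sign = [a * b >= 0 for a, b in zip(dist[:-1], dist[1:])]
    return sum(same_sign) / len(same_sign)

# A monotone walk along chr1: every hop moves in the same direction.
walk = [(0, 100, "chr1"), (500, 600, "chr1"), (900, 1000, "chr1"), (1500, 1600, "chr1")]
print(directionality_score(walk))  # 1.0

# A zig-zag walk: adjacent hops always reverse direction.
zigzag = [(0, 100, "chr1"), (900, 1000, "chr1"), (500, 600, "chr1"), (1500, 1600, "chr1")]
print(directionality_score(zigzag))  # 0.0
```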
import pysam
import sys


def parse_bam(inbamfile, read):
    """ Load a BAM file and create a list of alignment tuples """
    alignments = pysam.AlignmentFile(inbamfile, "rb")
    alignments_list = []
    for line in alignments:
        if line.is_reverse:
            strand = "reverse"
        else:
            strand = "forward"
        name = line.qname
        if line.qname[0].isdigit():
            # strip the "<i>." prefix added by the in-silico digestion step
            name = line.qname.split(".", 1)[1]
        align = (name, f"R{read}", line.reference_name, line.pos, line.mapq, strand)  # order of tuples
        alignments_list.append(align)
    return alignments_list


def cleaning(alignments):
    """ Discard alignments with mapping quality below 30 """
    return [align for align in alignments if align[4] >= 30]


def collect_data(data, experiment_name, WvsA):
    """ Write a row to the supportive file for further analysis """
    import csv
    with open(f"{experiment_name}", WvsA, newline='') as outfile:
        tsv_output = csv.writer(outfile, delimiter="\t")
        tsv_output.writerow(data)


def typical_chromosomes(organism) -> list:
    """ return list of typical chromosomes for human or mouse """
    chrs = [f"chr{i}" for i in range(1, 23)] if organism == "human" else [f"chr{i}" for i in range(1, 20)]
    chrs.extend(["chrX", "chrY"])
    return chrs


def main():

    organism = sys.argv[1]
    experiment_R1 = sys.argv[2]
    experiment_R2 = sys.argv[3]
    experiment_name = sys.argv[4]

    """ create .tsv file with field names """
    fieldnames = ["seqname", "chr_R1", "chr_R2", "pos_R1", "pos_R2", "strand_1vs2", "RvsR", "abs_pos"]
    collect_data(fieldnames, experiment_name, "w")

    alignments_R1 = parse_bam(f"{experiment_R1}", 1)
    alignments_R2 = parse_bam(f"{experiment_R2}", 2)

    iter_1 = iter(alignments_R1)
    iter_2 = iter(alignments_R2)

    next1 = next(iter_1)
    next2 = next(iter_2)

    while next1 is not None and next2 is not None:
        seqname = next1[0]
        assert(seqname == next2[0])

        align_1 = []
        try:
            while next1[0] == seqname:
                align_1.append(next1)
                next1 = next(iter_1)
        except StopIteration:
            next1 = None

        align_2 = []
        try:
            while next2[0] == seqname:
                align_2.append(next2)
                next2 = next(iter_2)
        except StopIteration:
            next2 = None

        all_alignments = align_1 + align_2
        filtered_alignments = cleaning(all_alignments)

        for i in range(len(filtered_alignments)):
            for j in range(i + 1, len(filtered_alignments)):
                # keep only pairs where both alignments map to typical chromosomes
                if filtered_alignments[i][2] in typical_chromosomes(organism) \
                        and filtered_alignments[j][2] in typical_chromosomes(organism):
                    statistics = [
                        filtered_alignments[i][0],  # seqname
                        filtered_alignments[i][2],  # chromosome R1
                        filtered_alignments[j][2],  # chromosome R2
                        filtered_alignments[i][3],  # position R1
                        filtered_alignments[j][3],  # position R2
                        f"{filtered_alignments[i][5]} vs {filtered_alignments[j][5]}",  # strand_1vs2
                        f"{filtered_alignments[i][1]} vs {filtered_alignments[j][1]}",  # RvsR
                        abs(filtered_alignments[i][3] - filtered_alignments[j][3])      # abs_pos
                    ]
                    # append the pair to the .tsv file
                    collect_data(statistics, experiment_name, "a")


if __name__ == '__main__':
    main()
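The pairing logic in `main()` walks the two name-sorted alignment streams in lockstep, collecting all alignments that share a read name and then applying the MAPQ cut from `cleaning()`. A minimal sketch of the same grouped-iteration pattern on plain tuples (toy read names and positions, not real BAM records):

```python
from itertools import groupby
from operator import itemgetter

# Toy alignment tuples in the shape parse_bam() emits:
# (seqname, read, chromosome, position, mapq, strand), sorted by seqname.
r1 = [("read1", "R1", "chr1", 100, 40, "forward"),
      ("read2", "R1", "chr2", 200, 10, "reverse")]
r2 = [("read1", "R2", "chr1", 900, 35, "forward"),
      ("read2", "R2", "chr3", 300, 42, "forward")]

def group_by_seqname(alignments):
    """Yield (seqname, alignments) groups from a name-sorted stream."""
    for name, group in groupby(alignments, key=itemgetter(0)):
        yield name, list(group)

pairs = {}
for (name1, group1), (name2, group2) in zip(group_by_seqname(r1), group_by_seqname(r2)):
    assert name1 == name2  # both streams are sorted the same way
    # keep only alignments with MAPQ >= 30, as cleaning() does
    pairs[name1] = [a for a in group1 + group2 if a[4] >= 30]

# read1 keeps both mates; read2 keeps only its R2 alignment (R1 has MAPQ 10)
```

`itertools.groupby` does the same job as the manual `next()`/`StopIteration` bookkeeping in the script, at the cost of materialising each group in memory.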
from cwalk import parse_bedfile
from cwalk_analysis import load_cwalk_graph, load_files
import pandas as pd
import networkx as nx
from intervaltree import Interval, IntervalTree
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
import numpy as np
from filtering import typical_chromosomes
import sys

import matplotlib.pylab as pylab
params = {'legend.fontsize': 'x-large',
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'xx-large',
          'ytick.labelsize': 'xx-large'}
pylab.rcParams.update(params)


def boundaries(bedfile: str):
    # parse bedfile with TAD domains boundaries
    # return DataFrame with "chrom", "start", "end", "mode", "size" columns
    df = pd.read_csv(bedfile, delimiter="\t")
    return df[["chrom", "start", "end", "mode", "size"]]


def identical(itv: list) -> int:
    # return the number of distinct intervals in a list of interval-query results
    span = 1
    while not all(itv[0] == element for element in itv):
        span += 1
        first = itv[0]
        itv = [value for value in itv if value != first]
    return span


def chrs_sizes(chr_sizes: str) -> dict:
    # dict where chrs are keys and values are list with size of this chr
    df_sizes = pd.read_csv(chr_sizes, sep="\t", header=None)
    df_sizes = df_sizes.loc[df_sizes[0].isin(typical_chromosomes(sys.argv[1]))].reset_index(
        drop=True)
    df_sizes.columns = df_sizes.columns.map(str)
    return df_sizes.groupby("0")["1"].agg(list).to_dict()


def random(cwalk, chrs_dict, rest_dict):
    # return random cwalk (mirror image relative to the center of the chromosome)
    center = [(((i[0] + i[1]) / 2), i[2]) for i in cwalk]  # ex. (139285246.5, 'chr3')
    reflection = [(chrs_dict[i[1]][0] - i[0], i[1]) for i in center]
    itv = [(rest_dict[i[1]][i[0]], i[1]) for i in reflection]
    itv = [i for i in itv if len(i[0]) != 0]  # remove empty sets
    reflected = [(list(i[0])[0][0], list(i[0])[0][1], i[1]) for i in itv]
    return reflected


def tad_tree(boundaries):
    # interval tree where keys are chrs and values are zip object (start, end, mode)
    chrs_dict = {x: y for x, y in boundaries.groupby("chrom")}
    keys, values = [], []
    for key in chrs_dict.keys():
        left = chrs_dict[key]["start"].tolist()
        right = chrs_dict[key]["end"].tolist()
        mode = chrs_dict[key]["mode"].tolist()
        sets_list = [(a, b, c) for a, b, c in zip(left, right, mode)]
        keys.append(key)
        values.append(sets_list)
    boundaries_dict = dict(zip(keys, values))
    tree = dict()  # ex. tree["chr1"] will be an object of type IntervalTree
    for key in boundaries_dict.keys():
        intervals = boundaries_dict[key]
        # Interval tree construction, separate for each chromosome
        tree[key] = IntervalTree.from_tuples(intervals)
    return tree


def counting(tad_dict, graphs, length):
    i = 0
    general_in, general_out = 0, 0
    active, passive, two, three, many = 0, 0, 0, 0, 0  # number of cwalks in each TAD type
    for graph in graphs:
        for cwalk in list(nx.connected_components(graph)):
            cwalk = list(cwalk)
            if len(cwalk) == length and all(
                    cwalk[i][2] == cwalk[0][2] for i in range(0, len(cwalk))):  # only intra-chrs c-walks
                i += 1  # number of intra-chrs c-walks

                # list of the middle of restriction intervals of each node in cwalk
                cwalk_boundaries = [((node[0] + node[1]) / 2) for node in cwalk]

                # find the list of TAD intervals (itv) for each node in the cwalk
                itv_tad = [tad_dict[cwalk[0][2]][bound] for bound in cwalk_boundaries]

                in_tad = identical(itv_tad)  # number of TAD visited by cwalk
                modes = [list(i)[0][2] for i in list(itv_tad)]  # list of TAD mode for each node in cwalk

                if in_tad == 1:
                    general_in += 1  # number of cwalks in one TAD (active or passive)
                else:
                    general_out += 1  # number of cwalks in more than one TAD
                if in_tad == 1 and modes[0] == "active" and all(modes[0] == element for element in modes):
                    active += 1  # number of cwalks in one active TAD
                elif in_tad == 1 and modes[0] == "passive" and all(modes[0] == element for element in modes):
                    passive += 1  # number of cwalks in one passive TAD
                elif in_tad == 2:
                    two += 1  # number of cwalks in two TADs (active or passive)
                elif in_tad == 3:
                    three += 1  # number of cwalks in three TADs (active or passive)
                elif in_tad > 3:
                    many += 1  # number of cwalks in many TADs (active or passive)

    return active, passive, two, three, many, i, general_in, general_out


def random_counting(tad_dict, graphs, length, chrs_dict, rest_dict):
    i = 0
    general_in, general_out = 0, 0
    active, passive, two, three, many = 0, 0, 0, 0, 0  # number of random cwalks in each TAD type
    for graph in graphs:
        for cwalk in list(nx.connected_components(graph)):
            cwalk = list(cwalk)

            if len(cwalk) == length and all(
                    cwalk[i][2] == cwalk[0][2] for i in range(0, len(cwalk))):  # only intra-chrs c-walks

                random_cwalk = random(cwalk, chrs_dict, rest_dict)

                if len(cwalk) == len(random_cwalk):
                    i += 1  # number of random intra-chrs c-walks

                    # list of the middle of restriction intervals of each node in cwalk
                    cwalk_boundaries = [((node[0] + node[1]) / 2) for node in random_cwalk]

                    # find the list of TAD intervals (itv) for each node in the cwalk
                    itv_tad = [tad_dict[random_cwalk[0][2]][bound] for bound in cwalk_boundaries]

                    in_tad = identical(itv_tad)  # number of TAD visited by cwalk
                    modes = [list(i)[0][2] for i in list(itv_tad)]  # list of TAD mode for each node in random cwalk

                    if in_tad == 1:
                        general_in += 1  # number of random cwalks in one TAD (active or passive)
                    else:
                        general_out += 1  # number of random cwalks in more than one TAD
                    if in_tad == 1 and modes[0] == "active" and all(modes[0] == element for element in modes):
                        active += 1  # number of cwalks in one active TAD
                    elif in_tad == 1 and modes[0] == "passive" and all(modes[0] == element for element in modes):
                        passive += 1  # number of cwalks in one passive TAD
                    elif in_tad == 2:
                        two += 1  # number of random cwalks in two TADs (active or passive)
                    elif in_tad == 3:
                        three += 1  # number of random cwalks in three TADs (active or passive)
                    elif in_tad > 3:
                        many += 1  # number of random cwalks in many TADs (active or passive)

    return active, passive, two, three, many, i, general_in, general_out


def tad_count(graphs, tree_dict, name):
    active, passive, two_active, two_passive, three_active, three_passive, other = 0, 0, 0, 0, 0, 0, 0
    labels = ["one active", "one passive", "two active", "two passive", "three active", "three passive", "other"]
    for graph in graphs:
        for cwalk in list(nx.connected_components(graph)):
            cwalk = list(cwalk)
            if all(cwalk[i][2] == cwalk[0][2] for i in range(0, len(cwalk))):  # only intra chrs cwalks
                cwalk_boundaries = [((node[0] + node[1]) / 2) for node in cwalk]
                itv_tad = [tree_dict[cwalk[0][2]][bound] for bound in cwalk_boundaries]
                in_tad = identical(itv_tad)
                modes = [list(i)[0][2] for i in list(itv_tad)]
                if in_tad == 1 and modes[0] == "active" and all(modes[0] == element for element in modes):
                    active += 1
                elif in_tad == 1 and modes[0] == "passive" and all(modes[0] == element for element in modes):
                    passive += 1
                elif in_tad == 2 and modes[0] == "active" and all(modes[0] == element for element in modes):
                    two_active += 1
                elif in_tad == 2 and modes[0] == "passive" and all(modes[0] == element for element in modes):
                    two_passive += 1
                elif in_tad == 3 and modes[0] == "active" and all(modes[0] == element for element in modes):
                    three_active += 1
                elif in_tad == 3 and modes[0] == "passive" and all(modes[0] == element for element in modes):
                    three_passive += 1
                else:
                    other += 1
    data = [active, passive, two_active, two_passive, three_active, three_passive, other]

    sns.set_style("whitegrid")
    fig, ax = plt.subplots(figsize=(15, 12))
    ax.bar(labels, data, color="tab:blue", edgecolor="black", width=1)
    plt.xticks(rotation=30)
    ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
    plt.ylabel("Number of c-walks", fontsize=15)
    ax.set_title(f"Number of c-walks based on location in TAD types in {sys.argv[1]} cells", fontsize=20)
    return plt.savefig(f"{name}"), plt.close()


def chart_comparison(general_in, general_out, name):
    labels = ["data", "random"]
    sns.set_style("whitegrid")
    fig, ax = plt.subplots(figsize=(10, 8))
    bar1 = plt.bar(labels, general_in, label="in a single TAD", color="tab:blue", edgecolor="black", width=1)
    bar2 = plt.bar(labels, general_out, bottom=general_in, label="in more than one TAD", color="tab:red", edgecolor="black", width=1)
    ax.set_title(f"Fraction of c-walks contained in one TAD and c-walks in more than one TAD \n"
                 f"Comparison c-walks from data and random in {sys.argv[1]} cells", fontsize=16)
    plt.ylabel("Percentage")
    plt.legend([bar1, bar2], ["in a single TAD", "in more than one TAD"], loc="upper right")
    return plt.savefig(f"{name}"), plt.close()


def tad_fractions(one_tad_active, one_tad_passive, two_tad, three_tad, many_tad, name):
    sns.set_style("whitegrid")
    plt.figure(figsize=(12, 10))
    bar1 = plt.bar([i for i in range(3, 16)], one_tad_active, width=1, edgecolor="black", color="royalblue")
    bar2 = plt.bar([i for i in range(3, 16)], one_tad_passive, bottom=one_tad_active, width=1, edgecolor="black",
                   color="cornflowerblue")
    bar3 = plt.bar([i for i in range(3, 16)], two_tad,
                   bottom=np.add(one_tad_active, one_tad_passive), width=1, edgecolor="black", color="tab:cyan")
    bar4 = plt.bar([i for i in range(3, 16)], three_tad,
                   bottom=np.add(np.add(one_tad_active, one_tad_passive), two_tad), width=1, edgecolor="black",
                   color="lightsteelblue")
    bar5 = plt.bar([i for i in range(3, 16)], many_tad,
                   bottom=np.add(np.add(np.add(one_tad_active, one_tad_passive), two_tad), three_tad), width=1,
                   edgecolor="black", color="slategrey")
    plt.xlabel("Number of hops")
    plt.ylabel("Percentage")
    plt.title(f"Intra- and Inter-TAD c-walks in {sys.argv[1]} cells", fontsize=18)
    plt.legend([bar1, bar2, bar3, bar4, bar5], ["1 TAD active", "1 TAD passive", "2 TAD", "3 TAD", "more TADs"],
               loc="upper right",
               prop={"size": 16})
    return plt.savefig(f"{name}"), plt.close()


def hop_count(active, passive, two, three, many, name):
    active = active[:6]
    passive = passive[:6]
    two = two[:6]
    three = three[:6]
    many = many[:6]
    sns.set_style("whitegrid")
    fig, ax = plt.subplots(figsize=(10, 8))
    bar1 = plt.bar([i for i in range(3, 9)], active, width=1, edgecolor="black", color="royalblue")
    bar2 = plt.bar([i for i in range(3, 9)], passive, bottom=active, width=1, edgecolor="black",
                   color="cornflowerblue")
    bar3 = plt.bar([i for i in range(3, 9)], two,
                   bottom=np.add(active, passive), width=1, edgecolor="black", color="tab:cyan")
    bar4 = plt.bar([i for i in range(3, 9)], three,
                   bottom=np.add(np.add(active, passive), two), width=1, edgecolor="black",
                   color="lightsteelblue")
    bar5 = plt.bar([i for i in range(3, 9)], many,
                   bottom=np.add(np.add(np.add(active, passive), two), three), width=1,
                   edgecolor="black", color="slategrey")
    ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))
    plt.xlabel("Number of hops")
    plt.ylabel("Number of c-walks")
    plt.title(f"Intra- and inter-TAD c-walks in {sys.argv[1]} cells", fontsize=18)
    plt.legend([bar1, bar2, bar3, bar4, bar5], ["1 TAD active", "1 TAD passive", "2 TAD", "3 TAD", "more TADs"],
               loc="upper right",
               prop={"size": 12})
    return plt.savefig(f"{name}"), plt.close()


def domains(boundaries, name):
    active_boundaries = boundaries.loc[boundaries["mode"] == "active"].dropna().reset_index(drop=True)
    passive_boundaries = boundaries.loc[boundaries["mode"] == "passive"].dropna().reset_index(drop=True)
    sns.set_style("whitegrid")
    plt.figure(figsize=(10, 8))
    sns.kdeplot(active_boundaries["size"], color="red")  # distplot is deprecated in recent seaborn
    sns.kdeplot(passive_boundaries["size"], color="black")
    plt.title(f"{sys.argv[1]} domains size distribution")
    plt.legend(labels=["Active", "Inactive"])
    plt.xlabel("Domain size [Mb]")
    return plt.savefig(f"{name}"), plt.close()


def main():
    graphs, _ = load_files(sys.argv[2], load_cwalk_graph)  # load .txt cwalk graph
    tad_dict = tad_tree(boundaries(sys.argv[3]))  # interval tree dict with TAD domains
    chrs_dict = chrs_sizes(sys.argv[4])

    # plot domains size distribution
    domains(boundaries(sys.argv[3]), sys.argv[11])

    # Return dict where chrs are keys and values are list of restriction itvs
    restrictions_dict = parse_bedfile(sys.argv[5], sys.argv[1])

    # Interval tree construction, separate for each chromosome
    rest_dict = dict()  # ex. rest_dict["chr1"] will be an object of type IntervalTree
    for chr in typical_chromosomes(sys.argv[1]):
        restrictions = restrictions_dict[chr][1].tolist()
        restrictions[0] = 0
        restrictions[-1] = chrs_dict[chr][0]
        intervals = [(i, j) for i, j in zip(restrictions[:-1], restrictions[1:])]
        rest_dict[chr] = IntervalTree.from_tuples(intervals)

    lst_active, lst_passive, lst_two, lst_three, lst_many, lst_i, lst_general_in, lst_general_out = \
        [], [], [], [], [], [], [], []
    lst_rdm_active, lst_rdm_passive, lst_rdm_two, lst_rdm_three, lst_rdm_many, lst_rdm_i, lst_rdm_general_in, lst_rdm_general_out = \
        [], [], [], [], [], [], [], []

    count_active, count_passive, count_two, count_three, count_many, count_general_in, count_general_out = \
        [], [], [], [], [], [], []
    rdm_count_active, rdm_count_passive, rdm_count_two, rdm_count_three, rdm_count_many, rdm_count_general_in, rdm_count_general_out = \
        [], [], [], [], [], [], []

    for length in range(3, 30):
        active, passive, two, three, many, i, general_in, general_out = counting(tad_dict, graphs, length)
        # list of % of intra-chrs cwalks of each length
        if i != 0:
            count_general_in.append(general_in)
            count_general_out.append(general_out)

            lst_general_in.append(round((general_in / i) * 100, 2))  # % of intra-chrs cwalks in a single TAD
            lst_general_out.append(round((general_out / i) * 100, 2))  # % of intra-chrs cwalks in more than one TAD
            lst_i.append(i)  # number of cwalks of each length

            rdm_active, rdm_passive, rdm_two, rdm_three, rdm_many, rdm_i, rdm_general_in, rdm_general_out = \
                random_counting(tad_dict, graphs, length, chrs_dict, rest_dict)

            rdm_count_general_in.append(rdm_general_in)
            rdm_count_general_out.append(rdm_general_out)

            lst_rdm_general_in.append(round((rdm_general_in / i) * 100, 2))  # % of intra-chrs cwalks in a single TAD
            lst_rdm_general_out.append(round((rdm_general_out / i) * 100, 2))  # % of intra-chrs cwalks in more than one TAD
            lst_rdm_i.append(rdm_i)  # number of random cwalks of each length

    """ Plotting """
    chart_comparison([round((sum(count_general_in) / sum(lst_i)) * 100, 2),
                      round((sum(rdm_count_general_in) / sum(lst_rdm_i)) * 100, 2)],
                     [round((sum(count_general_out) / sum(lst_i)) * 100, 2),
                      round((sum(rdm_count_general_out) / sum(lst_rdm_i)) * 100, 2)], sys.argv[9])

    print(f"Total number of c-walks: {sum(lst_i)}")
    print(f"Total number of random c-walks: {sum(lst_rdm_i)}")

    print(f"Percentage of intra-chrs cwalks in a single TAD: {round((sum(count_general_in) / sum(lst_i)) * 100, 2)}%")
    print(f"Percentage of intra-chrs cwalks in more than one TAD: {round((sum(count_general_out) / sum(lst_i)) * 100, 2)}%")

    print(f"Percentage of random intra-chrs cwalks in a single TAD: {round((sum(rdm_count_general_in) / sum(lst_rdm_i)) * 100, 2)}%")
    print(f"Percentage of random intra-chrs cwalks in more than one TAD: {round((sum(rdm_count_general_out) / sum(lst_rdm_i)) * 100, 2)}%")

    print(
        f"Number of c-walks from data inside one TAD: {sum(count_general_in)} and in more than one TAD:"
        f" {sum(count_general_out)}")
    print(
        f"Number of random c-walks inside one TAD: {sum(rdm_count_general_in)} and in more than one TAD: "
        f"{sum(rdm_count_general_out)}")
    print(f"We consider a sample of {sum(lst_i) + sum(lst_rdm_i)} c-walks "
          f"(c-walks from data and random) for which we examine two features."
          f" Significance level: 0.05")

    table = np.array([[sum(count_general_in), sum(count_general_out)],
                      [sum(rdm_count_general_in), sum(rdm_count_general_out)]])

    print("Fisher test:")
    from scipy.stats import fisher_exact

    oddsr, p = fisher_exact(table, alternative="two-sided")
    print(f"Two-sided Fisher: {oddsr} and p-value: {p}")

    oddsr, p = fisher_exact(table, alternative="greater")
    print(f"One-sided Fisher: {oddsr} and p-value: {p}")

    for length in range(3, 16):
        active, passive, two, three, many, i, general_in, general_out = counting(tad_dict, graphs, length)
        # list of % of intra-chrs cwalks of each length
        if i != 0:
            lst_active.append(round((active / i) * 100, 2))  # % of intra-chrs cwalks in a one active TAD
            lst_passive.append(round((passive / i) * 100, 2))  # % of intra-chrs cwalks in a one passive TAD
            lst_two.append(round((two / i) * 100, 2))  # % of intra-chrs cwalks in two TADs
            lst_three.append(round((three / i) * 100, 2))  # % of intra-chrs cwalks three TADs
            lst_many.append(round((many / i) * 100, 2))  # % of intra-chrs cwalks in many TADs

            # list of intra-chrs cwalks counts of each length
            count_active.append(active)
            count_passive.append(passive)
            count_two.append(two)
            count_three.append(three)
            count_many.append(many)

            rdm_active, rdm_passive, rdm_two, rdm_three, rdm_many, rdm_i, rdm_general_in, rdm_general_out = \
                random_counting(tad_dict, graphs, length, chrs_dict, rest_dict)
            # list of % of intra-chrs cwalks of each length
            lst_rdm_active.append(round((rdm_active / rdm_i) * 100, 2))  # % of intra-chrs cwalks in a one active TAD
            lst_rdm_passive.append(round((rdm_passive / rdm_i) * 100, 2))  # % of intra-chrs cwalks in a one passive TAD
            lst_rdm_two.append(round((rdm_two / rdm_i) * 100, 2))  # % of intra-chrs cwalks in two TADs
            lst_rdm_three.append(round((rdm_three / rdm_i) * 100, 2))  # % of intra-chrs cwalks three TADs
            lst_rdm_many.append(round((rdm_many / rdm_i) * 100, 2))  # % of intra-chrs cwalks in many TADs

            # list of intra-chrs random cwalks counts of each length
            rdm_count_active.append(rdm_active)
            rdm_count_passive.append(rdm_passive)
            rdm_count_two.append(rdm_two)
            rdm_count_three.append(rdm_three)
            rdm_count_many.append(rdm_many)

    """ Plotting """
    hop_count(count_active, count_passive, count_two, count_three, count_many, sys.argv[10])
    tad_count(graphs, tad_dict, sys.argv[8])
    tad_fractions(lst_active, lst_passive, lst_two, lst_three, lst_many, sys.argv[6])
    tad_fractions(lst_rdm_active, lst_rdm_passive, lst_rdm_two, lst_rdm_three, lst_rdm_many,
                  sys.argv[7])


if __name__ == '__main__':
    main()
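The core of `counting()` is deciding how many distinct TADs the fragment midpoints of a c-walk fall into. A toy sketch of that test with hypothetical TAD coordinates and a linear scan in place of the `IntervalTree` lookup:

```python
# Hypothetical TADs on one chromosome: (start, end, mode), half-open intervals.
TADS = [(0, 100, "active"), (100, 250, "passive"), (250, 400, "active")]

def tads_visited(midpoints, tads=TADS):
    """Return the distinct TADs hit by a list of fragment midpoints."""
    hit = []
    for pos in midpoints:
        for tad in tads:
            start, end, _ = tad
            if start <= pos < end:
                if tad not in hit:
                    hit.append(tad)
                break
    return hit

walk_in = tads_visited([10, 50, 90])     # all midpoints fall in the first TAD
walk_out = tads_visited([10, 120, 300])  # midpoints span three TADs
```

A c-walk counts as "in a single TAD" when this list has length 1 (and as active/passive according to the shared mode), mirroring the `in_tad == 1` branches above.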
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import statistics
import matplotlib.ticker as ticker
from cwalk_analysis import load_cwalk_graph, load_files
import random
from filtering import typical_chromosomes
import pandas as pd

import matplotlib.pylab as pylab

params = {'legend.fontsize': 'x-large',
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'xx-large',
          'ytick.labelsize': 'xx-large'}
pylab.rcParams.update(params)


def parse_tf(tf: str) -> dict:
    """
    load transcription factor binding sites from a BED-like file
    return: dict where chromosomes are keys and values are lists of peak midpoints
    """
    tf = pd.read_csv(tf, sep='\t', header=None)
    tf = tf.iloc[:, 0:3]
    tf[3] = ((tf[1] + tf[2]) / 2).astype(float)  # peak midpoint
    tf.columns = tf.columns.map(str)
    tf_dict = tf.groupby("0")["3"].agg(list).to_dict()
    return tf_dict


def mirror_peaks(chr_sizes: str, peaks_dict: dict) -> dict:
    """
    Mirror image of peaks
    load .tsv file with chromosomes sizes
    return: dict where chromosomes are keys and values are "mirror peaks" within them
    """
    df_sizes = pd.read_csv(chr_sizes, sep='\t', header=None)
    df_sizes = df_sizes.loc[df_sizes[0].isin(typical_chromosomes("human"))].reset_index(drop=True)  # organism is hard-coded here; could be passed via sys.argv
    df_sizes.columns = df_sizes.columns.map(str)
    sizes_dict = df_sizes.groupby("0")["1"].agg(list).to_dict()

    keys, mirrors = [], []
    for key in peaks_dict.keys():
        mirror = [sizes_dict[key][0] - peak for peak in peaks_dict[key]]
        keys.append(key)
        mirrors.append(mirror)
    return dict(zip(keys, mirrors))


def count_boundaries(subgraph) -> list:
    # edge boundaries for one cwalk (subgraph is one cwalk)
    edge_boundaries = []
    sorted_cwalk = sorted(subgraph.edges(data=True), key=lambda x: x[2]["index"])
    if all(i[2] == j[2] for i, j, _ in sorted_cwalk):  # only intra-chrs cwalks
        for next_edge in sorted_cwalk:
            v1, v2, _ = next_edge
            # interval spanning both restriction fragments of the edge
            edge_bound = pd.Interval(min(v1[0], v1[1], v2[0], v2[1]),
                                     max(v1[0], v1[1], v2[0], v2[1]), closed="both")
            edge_boundaries.append((edge_bound, v1[2]))
    return edge_boundaries


def peak_in_edges(bounds: list, chromosome: str, tf_dict: dict) -> int:
    # number of edges in one cwalk that contain at least one peak
    isin = 0
    for edge in bounds:
        if chromosome != "chrY":
            for peak in tf_dict[chromosome]:
                if peak in edge:
                    isin += 1
                    break
    return isin


def histogram(labels, fractions, mirror_fractions):

    sns.set_style("whitegrid")
    fig, axs = plt.subplots(4, 3, sharex=True, sharey=True, tight_layout=True, figsize=(16, 18))

    if statistics.mean(fractions[0]) < statistics.mean(mirror_fractions[0]): 
        axs[0, 0].hist(fractions[0], color="tab:blue", edgecolor="black")
        axs[0, 0].hist(mirror_fractions[0], color="tab:red", edgecolor="black")
    else:
        axs[0, 0].hist(mirror_fractions[0], color="tab:red", edgecolor="black")
        axs[0, 0].hist(fractions[0], color="tab:blue", edgecolor="black")

    axs[0, 0].set_title(f"Overlap with {labels[0]} binding sites", fontsize=14)

    if statistics.mean(fractions[1]) < statistics.mean(mirror_fractions[1]):
        axs[0, 1].hist(fractions[1], color="tab:blue", edgecolor="black")
        axs[0, 1].hist(mirror_fractions[1], color="tab:red", edgecolor="black")
    else:
        axs[0, 1].hist(mirror_fractions[1], color="tab:red", edgecolor="black")
        axs[0, 1].hist(fractions[1], color="tab:blue", edgecolor="black")

    axs[0, 1].set_title(f"Overlap with {labels[1]} binding sites", fontsize=14)

    if statistics.mean(fractions[2]) < statistics.mean(mirror_fractions[2]):
        axs[0, 2].hist(fractions[2], color="tab:blue", edgecolor="black")
        axs[0, 2].hist(mirror_fractions[2], color="tab:red", edgecolor="black")
    else:
        axs[0, 2].hist(mirror_fractions[2], color="tab:red", edgecolor="black")
        axs[0, 2].hist(fractions[2], color="tab:blue", edgecolor="black")

    axs[0, 2].set_title(f"Overlap with {labels[2]} binding sites", fontsize=14)

    # Panels 3..10: same per-panel logic as the panels above, collapsed into a
    # loop. The distribution with the smaller mean is drawn first, so the other
    # one is layered on top of it, matching the original per-panel conditionals.
    for idx in range(3, 11):
        row, col = divmod(idx, 3)
        if statistics.mean(fractions[idx]) < statistics.mean(mirror_fractions[idx]):
            axs[row, col].hist(fractions[idx], color="tab:blue", edgecolor="black")
            axs[row, col].hist(mirror_fractions[idx], color="tab:red", edgecolor="black")
        else:
            axs[row, col].hist(mirror_fractions[idx], color="tab:red", edgecolor="black")
            axs[row, col].hist(fractions[idx], color="tab:blue", edgecolor="black")

        axs[row, col].set_title(f"Overlap with {labels[idx]} binding sites", fontsize=14)

    fig.delaxes(axs[3, 2])
    fig.supxlabel("Fraction of edges in c-walk having a peak", fontsize=18)

    # Middle-column y-tick labels in thousands (e.g. 2000 -> "2k")
    for row in range(4):
        axs[row, 1].yaxis.set_major_formatter(
            ticker.FuncFormatter(lambda x, pos: "{0:g}".format(x / 1000) + "k"))

    plt.savefig(sys.argv[4])
    plt.close()
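
# Sketch of the "thousands" tick formatter used in histogram() above:
# values are divided by 1000 and the result is suffixed with "k".
fmt = lambda x, pos: "{0:g}".format(x / 1000) + "k"
# fmt(2000, None) -> "2k"; fmt(1500, None) -> "1.5k"
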


def counting(boundaries, tf_dict, tf_mirror_dict):
    boundaries = filter(None, boundaries)  # remove empty lists
    fractions, mirror_fractions = [], []
    for cwalk_bounds in boundaries:
        # iterate over the bounds of each c-walk edge (edge bounds are pandas Intervals)
        bounds = [i[0] for i in cwalk_bounds]
        chromosome = [i[1] for i in cwalk_bounds][0]
        isin = peak_in_edges(bounds, chromosome, tf_dict)
        mirror_isin = peak_in_edges(bounds, chromosome, tf_mirror_dict)
        # fraction of edges in the c-walk that contain at least one peak, normalized by c-walk length (number of nodes)
        fraction = round(isin / (len(cwalk_bounds) + 1), 2)
        mirror_fraction = round(mirror_isin / (len(cwalk_bounds) + 1), 2)

        fractions.append(fraction)
        mirror_fractions.append(mirror_fraction)

    return fractions, mirror_fractions
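
# A minimal, self-contained sketch (hypothetical helper) of the normalization
# used in counting() above: the number of edges containing at least one peak
# is divided by the number of nodes in the c-walk, i.e. edges + 1.
def overlap_fraction(edge_has_peak):
    """edge_has_peak: one boolean per c-walk edge."""
    return round(sum(edge_has_peak) / (len(edge_has_peak) + 1), 2)
# overlap_fraction([True, False, True]) -> 0.5
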


def main():
    graphs, _ = load_files(sys.argv[1], load_cwalk_graph)  # load folder with cwalks

    lst_tf_peaks, labels = load_files(sys.argv[3], parse_tf)  # load folder with TF peaks
    lst_tf_mirror = []
    for tf_dict in lst_tf_peaks:
        lst_tf_mirror.append(mirror_peaks(sys.argv[2], tf_dict))

    boundaries = []
    for graph in graphs:
        S = [graph.subgraph(c).copy() for c in nx.connected_components(graph)]
        for subgraph in S:  # subgraph is one cwalk
            edge_bounds = count_boundaries(subgraph)  # edge bounds for one cwalk
            boundaries.append(edge_bounds)

    tf_labels = []
    tfs_fractions = []
    tfs_mirror_fractions = []
    for label, tf_dict, tf_mirror_dict in zip(labels, lst_tf_peaks, lst_tf_mirror):
        print(label[:-14])
        tf_labels.append(label[:-14])
        fractions, mirror_fractions = counting(boundaries, tf_dict, tf_mirror_dict)

        print(f"Average value of {label[:-14]} distribution is {statistics.mean(fractions)}")
        print(f"Average value of {label[:-14]} with randomized peaks distribution is {statistics.mean(mirror_fractions)}")

        from scipy.stats import wilcoxon
        if fractions and mirror_fractions:

            print("two-sided")
            res = wilcoxon(x=fractions, y=mirror_fractions, zero_method="zsplit", alternative="two-sided")
            print(f"The value of the test statistic: {res.statistic} and p-value: {res.pvalue}")

            res = wilcoxon(x=fractions, y=mirror_fractions, zero_method="zsplit", alternative="greater")
            print("greater")
            print(f"The value of the test statistic: {res.statistic} and p-value: {res.pvalue}")

            res = wilcoxon(x=fractions, y=mirror_fractions, zero_method="zsplit", alternative="less")
            print("less")
            print(f"The value of the test statistic: {res.statistic} and p-value: {res.pvalue}")

            print("wilcox")
            res = wilcoxon(x=fractions, y=mirror_fractions, zero_method="wilcox", alternative="greater")
            print("greater")
            print(f"The value of the test statistic: {res.statistic} and p-value: {res.pvalue}")

            res = wilcoxon(x=fractions, y=mirror_fractions, zero_method="wilcox", alternative="less")
            print("less")
            print(f"The value of the test statistic: {res.statistic} and p-value: {res.pvalue}")

        zeros = [each for each in fractions if each == 0.0]
        mirror_zeros = [each for each in mirror_fractions if each == 0.0]

        if fractions and mirror_fractions:
            print(
                f"Percentage of c-walks from the data whose edges contain no peaks: {round(len(zeros) / len(fractions) * 100, 2)}%")
            print(
                f"Percentage of c-walks whose edges contain no randomized peaks: {round(len(mirror_zeros) / len(mirror_fractions) * 100, 2)}%")

        tfs_fractions.append(fractions)
        tfs_mirror_fractions.append(mirror_fractions)

    histogram(tf_labels, tfs_fractions, tfs_mirror_fractions)


if __name__ == '__main__':
    main()
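
The script compares each observed distribution of overlap fractions with its randomized counterpart using the paired Wilcoxon signed-rank test. A minimal sketch with made-up paired samples (all values hypothetical), mirroring the `zero_method="zsplit"`, `alternative="greater"` call in `main()`:

```python
from scipy.stats import wilcoxon

# Hypothetical paired samples: observed overlap fractions vs. fractions
# recomputed with randomized ("mirror") peak positions.
observed = [0.51, 0.40, 0.58, 0.31, 0.70, 0.54]
mirrored = [0.30, 0.35, 0.40, 0.25, 0.45, 0.50]

# "greater" asks whether observed fractions tend to exceed the mirrored ones;
# zero_method="zsplit" splits zero differences between both sides.
res = wilcoxon(observed, mirrored, zero_method="zsplit", alternative="greater")
print(res.statistic, res.pvalue)
```

Because all six hypothetical differences are positive, the one-sided test rejects at the 5% level here; on real fractions the result depends on the data.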



Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/pd410668/Analysis_of_multi_way_chromatin_interactions_from_very_long_sequencing_reads
Name: analysis_of_multi_way_chromatin_interactions_from_
Version: 1
Copyright: Public Domain
License: None
