circDetector (CD): circular RNAs detection and annotation workflow

public public 1yr ago Version: 2 0 bookmarks

Description:

circDetector (CD) is a computational tool for detecting and annotation of circular RNAs (circRNAs) from Total RNA-Seq data. This tool is implemented with the Snakemake workflow management system allowing reproducible and scalable data analyses. CD was developped to identify circRNAs from reads mapped to the reference genome with the STAR tool (Spliced Transcripts Alignment to a Reference) in Single-End (SE). It consists of identifying reads with a circular junction ( chimeric reads ) from the chimeric.out.junction files provided by STAR. CD also provides a file reporting all statistics of STAR-SE mapping. Mapping informations were extracted from the STAR file Log.final.out .

Note: The documentation of the CD tool is in progress.

Table of contents:

  1. Installation

  2. Snakemake workflow

  3. Example Commands

  4. Contact & Support

Installation:

The source code can be downloaded from GitHub .

Compilation and configuration:

To install this tool, you need to first fetch the repository git repository or download the corresponding compressed files.

git clone https://github.com/GenEpi-GenPhySE/circRNA.git
cd circRNA

Required tools:

Local installation:

sudo apt-get install bedtools

Installation on the Genotoul cluster:

module load bioinfo/bedtools2-2.29.0

Setting the environment:

python3 -m venv circrnaenv
source circrnaenv/bin/activate
pip install Cython
pip install pybedtools
pip install snakemake
pip install pandas
pip install natsort
pip install tqdm
pip install networkx

Preparing the config file:

Snakemake provides a config file mechanism in which the path of the several files (for example annotation file, sample file, data...) can be specified. Note: The paths indicated in this file must be adapted to the data.

Preparing the sample file:

The script prepare.py takes as input an tabulated file containing the paths to all the mapping files and generates a simplified tabular file which will be taken at the entrance of the snakemake workflow.

python scripts/prepare.py -i metadata.tsv -o samples.tsv
  • Input: metadata.tsv is a tabulated file containing the following fields:
Column Type Description
1 string species (bos_taurus, sus_scrofa)
2 string tissue (testis, liver)
3 string sex (male, female)
4 string sample (single, orient)
5 string sample_unit (sample uniq name)
6 string animal_name (specific name for each individual)
7 string mapdir (file path to mapping files)

Note: The table must contain all these fields (column names).

  • Output: samples.tsv is a tabulated file containing the following fields:
Column Type Description
1 string sample (concatenation of species, tissue and animal_name)
2 string sample_unit (sample uniq name)
3 string species (bos_taurus = cow, sus_scrofa = pig)
4 string sex (male, female)
5 string mapdir (file path to mapping files)

Snakemake worflow:

The Snakefile is composed of the main following rules:

  • rule detection : circRNAs detection

  • rule mappingstat : analyses of statistics of STAR-SE mapping

  • rule annotation : circRNAs annotation

alt text

Rule detection:

 -r1 R1_INPUT_FILE, --r1_input_file R1_INPUT_FILE
 R1 input file name
 -r2 R2_INPUT_FILE, --r2_input_file R2_INPUT_FILE
 R2 input file name
 -min_cr CR_THRESHOLD, --cr_threshold CR_THRESHOLD
 Minimum number of chimeric reads
 -tol TOLERANCE, --tolerance TOLERANCE
 Tolerance of different positions to start and end
 -fmt {bed,gtf}, --output_file_format {bed,gtf}
 Output file format : 0-based (bed) or 1-based (gtf)
 -o OUTPUT_FILE, --output_file OUTPUT_FILE
 Output file path
 --verbose Print more info

Example of a command executed by Snakemake:

python3 scripts/circRNA_detection.py -r1 -r2 -o circ_rnas.bed

Rule mappingstat:

CD provides a file reporting all statistics of STAR-SE mapping. Mapping informations were extracted from the STAR file Log.final.out .

  • Input: Example of a Log.final.out file:
Started job on | Oct 26 05:21:11
 Started mapping on | Oct 26 05:29:19
 Finished on | Oct 26 05:52:39
 Mapping speed, Million of reads per hour | 121.89
 Number of input reads | 47400458
 Average input read length | 150
 UNIQUE READS:
 Uniquely mapped reads number | 33941471
 Uniquely mapped reads % | 71.61%
 Average mapped length | 149.32
 Number of splices: Total | 8294615
 Number of splices: Annotated (sjdb) | 7882009
 Number of splices: GT/AG | 8204363
 Number of splices: GC/AG | 58632
 Number of splices: AT/AC | 5362
 Number of splices: Non-canonical | 26258
 Mismatch rate per base, % | 0.57%
 Deletion rate per base | 0.02%
 Deletion average length | 2.11
 Insertion rate per base | 0.02%
 Insertion average length | 1.74
 MULTI-MAPPING READS:
 Number of reads mapped to multiple loci | 3381777
 % of reads mapped to multiple loci | 7.13%
 Number of reads mapped to too many loci | 28860
 % of reads mapped to too many loci | 0.06%
 UNMAPPED READS:
 Number of reads unmapped: too many mismatches | 0
 % of reads unmapped: too many mismatches | 0.00%
 Number of reads unmapped: too short | 9943085
 % of reads unmapped: too short | 20.98%
 Number of reads unmapped: other | 18180
 % of reads unmapped: other | 0.04%
 CHIMERIC READS:
 Number of chimeric reads | 129065
 % of chimeric reads | 0.27%
  • Output: reports/mapping_stat.tsv is a tabulated file containing all the informations of the input file (column) for each sample (row).

Rule annotation:

-circ CIRC_RNA_FILE, --circ_rna_file CIRC_RNA_FILE
 Circular RNA file path
 -fmt {bed,gtf}, --annot_format {bed,gtf}
 Annotation file format
 -annot ANNOTATION_FILE, --annotation_file ANNOTATION_FILE
 Annotation file path
 -oe EXONIC_OUTPUT_FILE, --exonic_output_file EXONIC_OUTPUT_FILE
 Exonic output file path
 -oi INTRONIC_OUTPUT_FILE, --intronic_output_file INTRONIC_OUTPUT_FILE
 Intronic output file path
 -o OUTPUT, --output OUTPUT
 Annotation output file path
 --verbose Print more info

Example of a command executed by Snakemake:

python3 scripts/circRNA_annotation.py -circ circ_rnas.bed -annot -o annotation_circRNAs.tsv

Note: Manual post-processing of data is required to remove short circRNAs.

Example Commands:

Launching the pipeline locally:

snakemake -p --jobs 8

Launching the pipeline on the Genotoul cluster:

sbatch circ_rnas.sh

Testing the detection rule locally for a single sample:

snakemake pig_testis_neonat-1/auzeville.bed -p --cores 1

Note: Make sure the circrnaenv environment is in this folder.

Contact & Support

For any bug report, feature request, or questions please fill out an issue through circRNA_project's issue page .

Copyright and License

This software is released under GNU General Public License (v3.0).

Code Snippets

  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
                                     # [-fmt {bed,gtf}] [-o OUTPUT_FILE]

# Imports
import circRNA as circ
from collections import defaultdict
import sys, os, re, argparse, natsort
import numpy as np
import csv
import pybedtools as pybed

# Utility functions
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)
    dsp_control = circ.DisplayControl.instance()
    if dsp_control.verbose:
        print(*args,  file=sys.stderr, **kwargs)


def exonic_annotations(circ_rnas, exons):
    """
    Annotate the circrnas according to the exon annotations
    circ_rnas: an array of cirRNAs object
    exons: an array of circRNAs objects
    returns the circ_rnas array with the circRNAs annotated

    We annotate each circRNA's splice junction with the corresponding
    exons spice junctions using a dict of exons splice junctions
    This means that splice junction correspondances must be exact
    """
    eprint("\nCompute junction dictionary...\n")
    annotation_junctions = compute_junction_dict(exons)
    eprint("\nCircRNA annotation...\n")
    set_annotation(circ_rnas, annotation_junctions)

    return circ_rnas


def junction_key(chrom, pos):
    return "%s:%s" %(chrom, pos)


def compute_junction_dict(annot_array):
    """
    Compute a dictionnary of splic junctions:
    keys: chrom:pos
    values: list of tuples
    tuple: (exon, end_type)
    end_type: 3 (resp 5) for donor (resp acceptor)
    """
    junction_dict = defaultdict(list)
    seen_exons = defaultdict()
    for annot in annot_array:
        if annot.get_att("exon_id") in seen_exons:
            continue
        seen_exons[annot.get_att("exon_id")] = 1
        key_start = junction_key(annot.chrom, annot.start)
        key_end = junction_key(annot.chrom, annot.end)
        if annot.strand == "+":
            start_end_type = "5"
            end_end_type = "3"
        else:
            start_end_type = "3"
            end_end_type = "5"
        # annotating start
        junction_dict[key_start].append((annot, start_end_type))
        # annotating end
        junction_dict[key_end].append((annot, end_end_type))
    return junction_dict


def set_annotation(circ_rnas, annotation_junctions):
    """
    An annotation string would be very much appreciated
    """
    for circ_rna in circ_rnas:
        junction_start = junction_key(circ_rna.chrom, circ_rna.start)
        junction_end = junction_key(circ_rna.chrom, circ_rna.end)
        if junction_start in annotation_junctions:
            circ_rna.add_start_annotation(annotation_junctions[junction_start])
            circ_rna.add_start_transcript_id(annotation_junctions[junction_start])
            circ_rna.add_start_gene_id(annotation_junctions[junction_start])
        if junction_end in annotation_junctions:
            circ_rna.add_end_annotation(annotation_junctions[junction_end])
            circ_rna.add_end_transcript_id(annotation_junctions[junction_end])
            circ_rna.add_end_gene_id(annotation_junctions[junction_end])


def intronic_annotations(circ_rnas, exons):
    """
    Annotate the circrnas according to the introns annotations
    circ_rnas: an array of cirRNAs object
    exons: an array of circRNAs objects
    returns the circ_rnas array with the circRNAs intrno-annotated

    Because at least on circRNA junction is away from existing
    splice junctions, the annotation is solely based on overlap between
    the circRNA and an intron :
      - [start, end] interval must be within the intron
      - this interval must cover at least 90% of the intron
    """

    #eprint("\nCompute introns from exons...\n")
    transcripts = get_exons_per_transcript(exons)
    introns = circ.compute_intronic_positions(transcripts)

    #eprint("\nCompute intersection with circRNAs...\n")
    circ_rnas = annotate_intron_intersection(circ_rnas, introns)

    return circ_rnas


def annotate_intron_intersection(circ_rnas, introns):
    pybed_introns = annot_to_pybed(introns)
    pybed_circrnas = annot_to_pybed(circ_rnas)
    intersections = pybed_circrnas.intersect(pybed_introns, f=0.95, wo=True) 

    circrna_d = {circ.name: circ for circ in circ_rnas}
    introns_d = {intron.name: intron for intron in introns}
    for i in intersections:
        intron = introns_d[i[9]]
        circ = circrna_d[i[3]]
        overlap = i[-1]
        p_overlap = float(overlap)/intron.length # taille circ sur taille intron
        circ.annotate_intron(intron, p_overlap)

    return circ_rnas


def annot_to_pybed(annots):
    pybed_annot = []
    for a in annots:
        pybed_annot.append(a.topybed())
    return pybed.BedTool(pybed_annot).sort()


def infra_exonic_annotations(circ_rnas, exons):
    """
    """
    pybed_exons = annot_to_pybed_ife(exons)
    pybed_circrnas = annot_to_pybed_ife(circ_rnas)  
    intersections = pybed_circrnas.intersect(pybed_exons, f=0.95, wo=True)

    circrna_d = {circ.name: circ for circ in circ_rnas}
    exons_d = {exon.name: exon for exon in exons}
    for i in intersections:
        exon = exons_d[i[9]]
        circ = circrna_d[i[3]]
        circ.annotate_ife(exon)

    return circ_rnas


def annot_to_pybed_ife(annots):
    pybed_annot = []
    for a in annots:
        pybed_annot.append(a.topybed_ife())
    return pybed.BedTool(pybed_annot).sort()


def get_exons_per_transcript(annots):
    """
    returns a dictionnary
    key: ...
    """
    d_transcript = defaultdict(list)
    for exon in annots:
        transcript_id = exon.get_att("transcript_id")
        d_transcript[transcript_id].append(exon)
    return d_transcript


def write_annotated_circrnas(circ_annotated, outputfile):
    circ_rnas = natsort.natsorted(circ_annotated, key=lambda e: (e.chrom, e.start))
    header = "\t".join(["chrom", "start", "end", "strand", "nb_ccr", "circ_rna_name", "exons_id_start", "exons_id_end", 
                        "transcript_id_start", "transcript_id_end", "gene_id_start", "gene_id_end", 
                        "intron_name", "start_i", "end_i", "gene_id_i", "exon_id_i", 
                        'exon_id_ife', "exons_start_end_ife", "gene_id_ife"])
    with open(outputfile, "w") as fout:
        fout.write(header+"\n")
        for circ_rna in circ_rnas:
            intron_annot = circ_rna.intron_annotation 
            if len(intron_annot) == 0:
                start_i = ""
                end_i = ""
                name_i = ""
            else:
                for annot_i in intron_annot:
                    intron = annot_i[0]
                    start_i = intron.start
                    end_i = intron.end
                    name_i = intron.name
            att_d = dict(re.split(" |=", item) for item in circ_rna.attributes.split("; "))
            nb_ccr = att_d["nb_ccr"].replace('\"', '')
            s = [circ_rna.chrom, circ_rna.start, circ_rna.end, 
                circ_rna.strand, nb_ccr, circ_rna.name,
                circ_rna.get_start_annotation_str(), 
                circ_rna.get_end_annotation_str(),
                circ_rna.get_start_transcript_id_str(), 
                circ_rna.get_end_transcript_id_str(), 
                circ_rna.get_start_gene_id_str(), 
                circ_rna.get_end_gene_id_str(),
                name_i, start_i, end_i, 
                circ_rna.get_intron_annot_gene_str('gene_id'),
                circ_rna.get_intron_annot_str('exon_id'), 
                circ_rna.get_infra_exonic_annot_str("exon_id"),
                circ_rna.get_infra_exonic_exons_start_end(),
                circ_rna.get_infra_exonic_gene_annot_str("gene_id")]
            fout.write( "%s\n" % "\t".join(map(str, s)))


def main():

    # Exonic:
    eprint("\nIdentification of EXONIC circRNAs:\n")
    eprint("\nReading circRNA file...\n")
    circ_rnas = circ.read_annotation(args.circ_rna_file, fmt=args.annot_format)
    eprint("\nReading annotation file...\n")
    annots = circ.read_annotation(args.annotation_file, fmt=args.annot_format)

    eprint("\nExonic annotations..\n")
    circ_rnas = exonic_annotations(circ_rnas, annots)

    eprint("\nIntronic annotations..\n")
    circ_rnas = intronic_annotations(circ_rnas, annots)

    #####
    eprint("\nInfra-exonic annotations..\n")
    circ_rnas = infra_exonic_annotations(circ_rnas, annots)
    #####

    eprint("\nWrite annotations..\n")
    write_annotated_circrnas(circ_rnas, args.output)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Annotation')
    parser.add_argument('-circ', '--circ_rna_file',
                        required=True, help='Circular RNA file path')
    parser.add_argument('-fmt', '--annot_format',
                        required=False, default="bed", choices=["bed","gtf"],
                        help='Annotation file format')
    parser.add_argument('-annot', '--annotation_file',
                        required=True, help='Annotataion file path')
    parser.add_argument('-oe', '--exonic_output_file',
                        required=False, default='exonic_circ_rnas_annot.out',
                        help='Exonic output file path')
    parser.add_argument('-oi', '--intronic_output_file',
                        required=False, default='annotation_introns.bed',
                        help='Intronic output file path')
    parser.add_argument('-o', '--output',
                        required=True,
                        help='Annotation output file path')
    parser.add_argument('--verbose', help='Print more info', action='store_true')

    args = parser.parse_args()

    dsp_control = circ.DisplayControl.instance()
    dsp_control.verbose = args.verbose

    return args


if __name__ == '__main__':
    args = parse_arguments()
    main()
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
                                    # [-min_cr CR_THRESHOLD] [-tol TOLERANCE]
                                    # [-fmt {bed,gtf}] [-o OUTPUT_FILE]

# Imports
import sys, os, re, argparse, natsort, time
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
from collections import defaultdict
import pandas as pd
import numpy as np
from networkx import (draw, DiGraph, Graph)
import networkx as netwx
import statistics
from statistics import mode
from time import sleep
from tqdm import tqdm
import circRNA as circ


# Utility functions
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)
    dsp_control = circ.DisplayControl.instance()
    if dsp_control.verbose:
        print(*args,  file=sys.stderr, **kwargs)


def merge_junctions(df_a):
    """ Merge two dataframes
    :df_r1: dataframe of R1 reads
    :df_r2: dataframe of R2 reads
    :return the merged dataframe
    """
    df_merged = pd.concat(df_a,ignore_index=True)
    return df_merged


def read_junction_file(file, read_type=None):
    """ Read the data, adds the header of the table
        and adds a column with the corresponding read type.
    :file: file to read
    :read_type: read type (R1 or R2)
    :return the dataframe of reads
    """
    names = ["chr_donor", "pos_donor","strand_donor","chr_acceptor",
            "pos_acceptor","strand_acceptor","junction_type",
            "repeat_left","repeat_right","read_name","pos_s1",
            "CIGAR_s1","pos_s2","CIGAR_s2"] # Columns names of the input data
    # Read data as table with pandas:
    df = pd.read_table(file, sep = '\t', names=names, dtype=str,
                       converters = {'pos_acceptor':int, 'pos_donor':int})
    df['read_type'] = read_type # Add column "read" to specify the read type
    return df


def get_valid_circjunctions(df):
    """ Selects valid circular chimeric reads (ccr)
        according to the conditions of the "valid_ccr" function.
    :df: datatframe to filter
    return: the filtered dataframe
    """
    # Parsing dataframe to select valid rows
    # Boolean array with true if the read is a valid ccr:
    valid_ccrs = []
    #for index, row in tqdm(df.iterrows(), total=df.shape[0]):
    for index, row in circ.tqdm_wrapper(df):
        valid_ccrs.append(valid_ccr(row))
    df_circ_cr = df[valid_ccrs]
    ccr_a = [] # Array of circular junctions
    eprint(" "+"\n"+"Generating array of selected ccr..."+"\n")
    #for index, row in tqdm(df_circ_cr.iterrows(), total=df_circ_cr.shape[0]):
    for index, row in circ.tqdm_wrapper(df_circ_cr):
        ccr =  circ.CCR(row) # Create CCR object
        ccr_a.append(ccr) # Add chimeric reads into array
    eprint("\n")
    return ccr_a


def valid_ccr(row):
    """ Select ccr according to several conditions.
    :row: row of the dataframe
    :return boolean "False" for not valid ccr
            or boolean "True" for valid ccr
    """
    # Regular expression of the left dangling CIGAR:
    dangling_left = re.compile('^\d+S\d+M$')
    # Regular expression of the rigth dangling CIGAR:
    dangling_right = re.compile('^\d+M\d+S$')
    if (row['chr_donor']!=row['chr_acceptor']
            or row['strand_donor']!=row['strand_acceptor']):
        return False
    elif abs(row['pos_donor'] - row['pos_acceptor']) > 10000000:
        return False
    else:
        if row['strand_donor'] == '-' :
            if (row['pos_donor']>row['pos_acceptor']
                    or row['pos_s1']>row['pos_s2']):
                return False
            else:
                if (not dangling_left.match(row['CIGAR_s1'])
                        or not dangling_right.match(row['CIGAR_s2'])):
                    return False
                else:
                    return True
        elif row['strand_donor'] == '+' :
            if (row['pos_donor']<row['pos_acceptor']
                    or row['pos_s1']<row['pos_s2']):
                return False
            else:
                if (not dangling_right.match(row['CIGAR_s1'])
                        or not dangling_left.match(row['CIGAR_s2'])):
                    return False
                else:
                    return True


def get_exact_circrnas(circ_junctions, cr_threshold):
    """
    Detects circ RNAs as CR sharing exactly the same circ signature
    chrom start end strand
    """
    eprint("Exact circular RNA detection")
    circ_rna_dict = defaultdict(list)
    for circ_junc in circ_junctions:
        circ_rna_dict[circ_junc.key].append(circ_junc)

    circ_array = []
    for key, cr_array in circ_rna_dict.items():
        if len(cr_array) >= cr_threshold:
            circ_array.append(circ.CircRNA(args.output_file_format, ccr_array=cr_array))
    return circ_array


def same_circ_signature(cr_a, cr_b, tol):
    """ Two-by-two comparison of chimeric reads
    :cr_a: chimeric read a
    :cr_b: chimeric read b
    :tol: number of different start-end positions tolerated
    :return boolean "False" for not valid conditions
    and "True" for valid conditions
    """
    if cr_a.chrom != cr_b.chrom:
        return False
    if abs(cr_a.start - cr_b.start) > tol:
        return False
    if abs(cr_a.end - cr_b.end) > tol:
        return False
    if cr_a.read_type == cr_b.read_type:
        if cr_a.strand != cr_b.strand:
            return False
    else:
        if cr_a.strand == cr_b.strand:
            return False
    return True


def merge_circ_rnas(component):
    if len(component) == 1:
        return component[0]
    elements = []
    for circ_rna in component:
        elements.extend(circ_rna.ccr_array)
    return circ.CircRNA(args.output_file_format, ccr_array=elements)


def connected_components(circ_a, neighbours, cr_threshold):
    """ Generate an undirected graph and
        cluters ccr into connected components
    :ccr_a: array of ccr object
    :neighbours: dictionary of neighbours chimeric reads
    :return an array of circular RNAs
    """
    undirected = netwx.Graph()
    undirected.add_nodes_from(range(len(circ_a)))
    undirected.add_edges_from(neighbours)
    circ_rnas = []  # Circular RNAs array
    # Parsing of connected components composed of chimeric reads:
    for comp in netwx.connected_components(undirected):
        circ_rna = merge_circ_rnas([circ_a[i] for i in comp])
        if circ_rna.nb_ccr >= cr_threshold:
            circ_rnas.append(circ_rna)
    return circ_rnas


def cumulative_cr(circ_a):
    return sum([c.nb_ccr for c in circ_a])


def compute_fuzzy_circ_rnas(ccr_a, cr_threshold, tolerance):
    """ Get the circular RNA corresponding to each
        connected component of chimeric reads
    :ccr_a: array of ccr object
    :return array of circular RNAs
    """
    if cumulative_cr(ccr_a) < cr_threshold:
        return []
    n = len(ccr_a)
    # Neighbour cr share sign the same circular RNA
    neighbours = []
    circ_rnas = []
    for i in range(n):
        for j in range(i+1, n):
            if ccr_a[j].start > ccr_a[i].end:
                break
            # Two-by-two comparison of chimeric reads:
            if same_circ_signature(ccr_a[i], ccr_a[j], tolerance):
                neighbours.append((i, j))
    circ_rnas = connected_components(ccr_a, neighbours, cr_threshold)
    return circ_rnas


def get_independent_intervals(circ_rnas):
    # Get independant intervals
    independant_intervals = []
    circ_interval = []
    chrom = None
    for circ_rna in circ_rnas:
        if not chrom:
            chrom = circ_rna.chrom
            max_end = circ_rna.end
        if circ_rna.chrom != chrom:
            independant_intervals.append(circ_interval)
            circ_interval = []
            chrom = circ_rna.chrom
            max_end = circ_rna.start
        if circ_rna.chrom != chrom or circ_rna.start > max_end:
            independant_intervals.append(circ_interval)
            circ_interval = []
            chrom = circ_rna.chrom
            max_end = circ_rna.start
        max_end = max(max_end, circ_rna.end)
        circ_interval.append(circ_rna)
    independant_intervals.append(circ_interval)
    return independant_intervals


def get_fuzzy_circrnas(circ_rnas, cr_threshold, tolerance):
    eprint("Fuzzy circular RNA detection")
    circ_rnas = natsort.natsorted(circ_rnas, key=lambda e: (e.chrom,
                                                            e.start))
    intervals = get_independent_intervals(circ_rnas)
    circ_rnas = []
    for interval in intervals:
        fuzzy_circ_rnas = compute_fuzzy_circ_rnas(interval, cr_threshold,
                                                  tolerance)
        if fuzzy_circ_rnas:
            circ_rnas.extend(fuzzy_circ_rnas)
    return circ_rnas


def circrna_detection(circ_juntcions, cr_threshold, tolerance):
    """ Detecte circular RNAs
    :df_ccr: dataframe of circular chimeric reads
    :return the array of circular RNAs detected
    """
    # First pass, detect exact circular RNAs (no tolerance)
    circ_rnas = get_exact_circrnas(circ_juntcions, cr_threshold)
    eprint("nb of tol %d circRNAS %d" % (tolerance, len(circ_rnas)))#########
    # if tolerance > 0 compute fuzzy circrnas from previous list
    if tolerance > 0:
        circ_rnas = get_fuzzy_circrnas(circ_rnas, cr_threshold, tolerance)
    else:
        circ_rnas = [c for c in circ_rnas if c.nb_ccr >= cr_threshold]
    return circ_rnas


def write_tab_circ_rnas(circ_rnas, outputfile, fmt):
    """ Write the table of circ RNA retained
    :circ_rnas: array of circular RNAs object
    :outputfile: the output file path
    :return the table of circular RNAs retained
    :start Chimeric.out.junction file: last intron base in
    the reference genome order in the circular transcript
    :end Chimeric.out.junction file: first intron base in the
    reference genome order in the circular transcript
    :start GTF: first exon position
    :end GTF: last exon position
    """
    nb_line = 0 # name bed file
    circ_rnas = natsort.natsorted(circ_rnas, key=lambda e: (e.chrom, e.start))
    eprint(" "+"\n"+"Writing the table of circular RNAs..."+"\n")
    with open(outputfile, "w") as fout:
        eprint("\nConvertion the Chimeric.out.junction format to the %s format...\n" % fmt)
        for circ_rna in circ_rnas:
            # Convert Chimeric.out.junction format to out format:
            circ_rna.start += 1
            circ_rna.end -= 1
            # Convert out format to the chosen format:
            circ_rna.write_annot(nb_line, fout, fmt)
            nb_line += 1


def stats(min_cr, tol, nb_ccr_tot, nb_ccr, nb_circ_rna_detected, circ_rnas):
          nb_tot_cr_circ = sum([c.nb_ccr for c in circ_rnas])
          return "%d\t%d\t%d\t%d\t%d\t%d" % (min_cr, tol, nb_ccr_tot, nb_ccr,
                                             nb_circ_rna_detected, nb_tot_cr_circ)


def stats2(df):
    return df.groupby('nb_ccr')['nb_ccr'] \
           .count().reset_index(name='nb_circ_rnas')


def main(r1_input_file, r2_input_file, cr_threshold, tolerance,
         output_file):

    eprint("\nReading junction files...\n")
    df_junctions_a = []
    if r1_input_file:
        df_junctions_a.append(read_junction_file(r1_input_file, read_type="R1"))
    if r2_input_file:
        df_junctions_a.append(read_junction_file(r2_input_file, read_type="R2"))

    eprint("Merging junction files...\n")
    df = merge_junctions(df_junctions_a)
    eprint("%d chimeric reads \n" % len(df.index))

    # Filtering data:
    eprint("Selecting valid ccr...\n")
    circ_junctions = get_valid_circjunctions(df)
    eprint("%d circular chimeric reads" % len(circ_junctions))

    # Circular RNAs detection:
    circ_rnas = circrna_detection(circ_junctions, cr_threshold, tolerance) #### 'circRNA' object has no attribute 'ccr_array'
    eprint("%d circRNAs detected" % len(circ_rnas))

    #for cir in circ_rnas:
    #    print(cir.var_info_str())

    # Write the table of circular RNAs retained:
    circ.write_annotation(circ_rnas, output_file, fmt=args.output_file_format,
                          attributes=["nb_ccr", "genomic_size", "left",
                                      "right", "complete", "nb_distinct"])


def parse_arguments():
    parser = argparse.ArgumentParser(description='Identification of chimeric reads and circular RNAs')
    parser.add_argument('-r1', '--r1_input_file',
                        required=False, help='R1 input file name')
    parser.add_argument('-r2', '--r2_input_file',
                        required=False, help='R2 input file name')
    parser.add_argument('-min_cr', '--cr_threshold', type=int,
                        required=False, default=5,
                        help='Minimum number of chimeric reads')
    parser.add_argument('-tol', '--tolerance', type=int, required=False,
                        default=3,
                        help='Tolerance of different positions to start and end')
    parser.add_argument('-fmt', '--output_file_format',
                        required=False, default='gtf', choices=["bed","gtf"], # bed ou gtf
                        help='Output file format : 0-based (bed) or 1-based (gtf)')
    parser.add_argument('-o', '--output_file', required=False, default='circ_rnas_detected.out',
                        help='Output file path') # gft par défaut # attribut : nb ccr, min_cr_distinct
    parser.add_argument('--verbose', help='Print more info', action='store_true')

    args = parser.parse_args()

    dsp_control = circ.DisplayControl.instance()
    dsp_control.verbose = args.verbose

    if not (args.r1_input_file or args.r2_input_file):
        parser.error('At least one input file is requested')
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main(args.r1_input_file, args.r2_input_file,
         args.cr_threshold, args.tolerance,
         args.output_file)
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
from collections import defaultdict
import operator, pandas, re
import pybedtools as pybed
import natsort
from tqdm import tqdm
import pandas as pd


from Singleton import Singleton

@Singleton
class DisplayControl():
     def __init__(self):
         self.verbose = False


def tqdm_wrapper(df):
    display_control = DisplayControl.instance()
    if display_control.verbose:
        return tqdm(df.iterrows(), total=df.shape[0])
    else:
        return df.iterrows()


class CircJunction():
    def __init__(self, chrom, start, end, transcription_strand):
        """Initialize parameters of the CCR object"""
        self.chrom = chrom
        self.start = start
        self.end = end
        self.transcript_strand = transcription_strand

    @property
    def key(self):
        return "-".join(map(str,[self.chrom, self.start, self.end,
                                 self.transcript_strand]))


class CCR(CircJunction):
    def __init__(self, record):
        """
        Initialize parameters of the CCR object from junction record of the
        data frame
        """
        chrom = record['chr_donor']
        start, end, transcript_strand = self._setjunction(record)
        super(CCR, self).__init__(chrom, start, end, transcript_strand)
        self.record = record

    @property
    def CIGAR_s1(self):
        return self.record['CIGAR_s1']

    @property
    def CIGAR_s2(self):
        return self.record['CIGAR_s2']

    @property
    def read_type(self):
        return self.record['read_type']

    def _setjunction(self, record):
        strand = record['strand_donor']  # == record['strand_acptor']
        if strand == "+":
            start = record['pos_acceptor']
            end = record['pos_donor']
        else:
            start = record['pos_donor']
            end = record['pos_acceptor']
        if record['read_type'] == "R2":
            transcript_strand = strand
        else:
            transcript_strand = complement(strand)
        return start, end, transcript_strand


class Annotation():

    # initialise class variable
    counter = 0

    def __init__(self, chrom, start, end, strand, feature, attributes, source,
                 name=None):
        Annotation.counter += 1
        self.chrom = chrom
        self.start = int(start)
        self.end = int(end)
        self.strand = strand
        self.feature = feature
        self.source = source
        self.attributes = attributes
        if name:
            self.name = name
        else:
            self.name = "circRNA_%d" % Annotation.counter
        if isinstance(attributes, dict):
            if "name" not in attributes:
                attributes["name"] = name
            self.attributes_d = attributes
            self.attribues = "; ".join(["%s = %s " %(key, value) for key, value in attributes.items()])
        else:
            self.attributes += "; name=%s" % str(self.name)
            self.attributes_d = dict(re.split(" |=", item) for item in self.attributes.split("; "))
            self.clean_quotes()


    def _to_gtf_record(self, attributes=["name"]):
        s = [
            self.chrom, self.source, self.feature,
            self.start, self.end, self._get_score(),
            self.strand, self._get_frame(),
            self._get_attributes(attributes)
        ]
        return "\t".join(map(str, s))


    def _to_bed_record(self, attributes=["name"]):
        start_b, end_b = (self.start - 1, self.end)
        s = [
            self.chrom, start_b, end_b,
            self.name, self._get_score(),
            self.strand, self.source,
            self.feature, self._get_phase(),
            self._get_attributes(attributes)
        ]
        return "\t".join(map(str, s))

    def topybed(self):
        start_b, end_b = (self.start - 1, self.end)
        record = [self.chrom, start_b, end_b, self.name, ".", self.strand]
        return pybed.create_interval_from_list(record)

    def topybed_ife(self):
        record = [self.chrom, self.start, self.end, self.name, ".", self.strand]
        return pybed.create_interval_from_list(record)

    def _get_attributes(self, attributes):
        attribute_str = "; ".join([ "%s=%s" % (key, self.get_att(key)) for key in attributes]) + ";"
        return attribute_str


    @property
    def biotype(self):
        long_biotype = self.get_att("transcript_biotype")
        biotype_keys = ["protein_coding", "pseudogene", "snRNA", "snoRNA", "miRNA", "lncRNA"]
        biotype_values = ["c", "pseudo", "snc", "sno", "mi", "lnc"]
        succint_biotype = dict(zip(biotype_keys, biotype_values))
        if long_biotype in succint_biotype:
            return succint_biotype[long_biotype]
        else:
            # ribozyme, Y_RNA, scaRNA
            return long_biotype # "misc" = miscRNA: Miscellaneous RNA. A non-coding RNA that cannot be classified.

    @property
    def length(self):
        return self.end - self.start + 1

    def _get_score(self):
        return "."

    def _get_frame(self):
        return "."

    def _get_phase(self):
        return "."

    def get_att(self, key):
        if key not in self.attributes_d:
            print("Warning %s is not a valid attribute" % key)
            exit(1)
        return self.attributes_d[key]

    def clean_quotes(self):
        d = self.attributes_d
        for key in d:
            d[key] = d[key].replace('\"', '')

    def get_annotation_tag(self, end_type):
        return "_".join([self.get_att("exon_id"), end_type, self.biotype, self.strand])

    def annot_str(self, fmt="gtf", attributes=["name"]):
        if fmt=="gtf":
            return self._to_gtf_record(attributes)
        elif fmt=="bed":
            return self._to_bed_record(attributes)
        else:
            return "ERROR: Unknown format..."

    def __str__(self):
        return self.annot_str()


class CircRNA(Annotation):
    def __init__(self, fmt, *args, **kwargs):
        """ Initialize parameters of the circRNA object
        :ccr_array: array of circular chimeric reads
        """
        for key, value in kwargs.items():###
            if "ccr_array" in kwargs:
                self._init_from_ccr(kwargs.get("ccr_array"))
            if len(args) > 0:
                self._init_default(fmt, args[0], args[1], args[2], args[3], args[4], args[5], name=value) ###
            self.start_annotation = []
            self.end_annotation = []
            self.start_transcript_id = []
            self.end_transcript_id = []
            self.start_gene_id = []
            self.end_gene_id = []
            self.intron_annotation = []
            self.infra_exonic_annotation = []


    def _init_from_ccr(self, ccr_array):
        # Number of distinct CIGARs:
        self.ccr_array = ccr_array  # Array of cicular chimeric reads
        #rep_ccr = self._consensus_start_end(ccr_array)
        rep_ccr = ccr_array[0]
        chrom = rep_ccr.chrom
        start, end, start_var, end_var = self._compute_consensus_start_end(ccr_array)
        start = rep_ccr.start + 1 ###
        end = rep_ccr.end - 1 ###
        strand = rep_ccr.transcript_strand
        feature = "circRNA"
        source = "circRNA_detection"
        attributes = "nb_ccr=%d" % self._compute_nb_cr(ccr_array)
        super(CircRNA, self).__init__(chrom, start, end, strand, feature,
                                      attributes, source) # name=None
        self.source = source
        self.feature = feature


        self.CIGAR_s1 = rep_ccr.CIGAR_s1  # CIGAR of the first segment
        self.CIGAR_s2 = rep_ccr.CIGAR_s2  # CIGAR of the second segment
        self.read_type = rep_ccr.read_type  # Read type (R1 or R2)

        self.attributes_d['nb_ccr'] = self._compute_nb_cr(ccr_array)
        self.attributes_d['nb_distinct'] = compute_distinct_cr(ccr_array)
        self.attributes_d['genomic_size'] = self._compute_genomic_size()
        self.attributes_d['left'] = self._compute_left()
        self.attributes_d['right'] = self._compute_right()
        self.attributes_d['complete'] = self._compute_complete()

        self.start_var = start_var
        self.end_var = end_var

    @property
    def nb_ccr(self):
        return self.get_att("nb_ccr")

    @property
    def nccr(self):
        return int(self.get_att("nb_ccr"))

    @property
    def complete(self):
        return self.get_att("complete")

    @property
    def left(self):
        return self.get_att("left")

    @property
    def right(self):
        return self.get_att("right")

    @property
    def nb_distinct(self):
        return self.get_att("nb_distinct")

    def _get_score(self):
        return self.nb_ccr


    def _init_default(self, chrom, start, end, strand, feature, attributes,
                      source, name):
        super(CircRNA, self).__init__(chrom, start, end, strand,
                                      feature, attributes, source, name)

    @property
    def read_names(self):
        """ Return an array of chimeric reads names """
        reads_str = []
        for cr in self.ccr_array:
            reads_str.append(cr.name)
        return reads_str


    def _compute_nb_cr(self, ccr_array):
        return len(self.ccr_array)

    def _compute_genomic_size(self):
        """ Compute the genomic size """
        return self.end - self.start + 1

    def _compute_left(self):
        """ Compute the coordinates of genomic regions to
            left [chr:start strand] format
        """
        return str(self.chrom)+":"+str(self.start)+self.strand

    def _compute_right(self):
        """ Compute the coordinates of genomic regions to
            right [chr:end strand] format
        """
        return str(self.chrom)+":"+str(self.end)+self.strand

    def _compute_complete(self):
        """ Compute the coordinates of genomic regions to
            complete [chr:start-end strand] format
        """
        return str(self.chrom)+":"+str(self.start)+":"+str(self.end)+self.strand


    def _compute_consensus_start_end(self, ccr_array):
        """
           Some documentation would be welcome
        """
        d_start = defaultdict(list)
        d_end = defaultdict(list)
        for ccr in ccr_array:
            d_start[ccr.start].append(ccr)
            d_end[ccr.end].append(ccr)
        dd_start = defaultdict(list)
        dd_end = defaultdict(list)
        for key_start, value_start in d_start.items():
            dd_start[key_start] = compute_distinct_cr(value_start)
        for key_end, value_end in d_end.items():
            dd_end[key_end] = compute_distinct_cr(value_end)
        cons_start = max(dd_start.items(), key=operator.itemgetter(1))[0]
        cons_end = max(dd_end.items(), key=operator.itemgetter(1))[0]
        # Now variability arround consensus start and end
        start_var = compute_variability(dd_start, cons_start)
        end_var = compute_variability(dd_end, cons_end)
        # for cr in ccr_array: I don't understand the reason for this loop
        #     cr.start = max_start
        #     cr.end = max_end
        #     return cr
        return cons_start, cons_end, start_var, end_var

    def var_info_str(self):
        starts = ",".join(map(str,[cr.start for cr in self.ccr_array]))
        ends =  ",".join(map(str,[cr.end for cr in self.ccr_array]))
        return "\t".join(map(str, [self.chrom, self.start, self.end, self.nb_ccr,
                                   self.start_var,
                                   self.end_var]))

    def str4diffexp(self):
        key = "%s:%d-%d:%s" % (self.chrom, self.start, self.end, self.strand)
        return "%s\t%d" %(key, self.nccr)

    def add_start_annotation(self, annot):
        self.start_annotation.extend(annot)

    def add_end_annotation(self, annot):
        self.end_annotation.extend(annot)

    def add_start_transcript_id(self, annot):
        self.start_transcript_id.extend(annot)

    def add_end_transcript_id(self, annot):
        self.end_transcript_id.extend(annot)

    def add_start_gene_id(self, annot):
        self.start_gene_id.extend(annot)

    def add_end_gene_id(self, annot):
        self.end_gene_id.extend(annot)

    def get_start_annotation_str(self):
        annotations = []
        for exon_tuple in self.start_annotation:
            annotations.append(exon_tuple[0].get_annotation_tag(exon_tuple[1]))
        return ",".join(annotations)

    def get_end_annotation_str(self):
        annotations = []
        for exon_tuple in self.end_annotation:
            annotations.append(exon_tuple[0].get_annotation_tag(exon_tuple[1]))
        return ",".join(annotations)

    def get_start_transcript_id_str(self):
        annotations = []
        for transcript_tuple in self.start_transcript_id:
            annotations.append(transcript_tuple[0].attributes_d["transcript_id"])
        return ",".join(annotations)

    def get_end_transcript_id_str(self):
        annotations = []
        for transcript_tuple in self.end_transcript_id:
            annotations.append(transcript_tuple[0].attributes_d["transcript_id"])
        return ",".join(annotations)

    def get_start_gene_id_str(self):
        annotations = []
        for gene_tuple in self.start_gene_id:
            annotations.append(gene_tuple[0].attributes_d["gene_id"])
        return ",".join(annotations)

    def get_end_gene_id_str(self):
        annotations = []
        for gene_tuple in self.end_gene_id:
            annotations.append(gene_tuple[0].attributes_d["gene_id"])
        return ",".join(annotations)

    def get_intron_annot_str(self, att):
        annotations = []
        for intron, p_overlap in self.intron_annotation:
            upstream = intron.get_att("up_exon").get_att(att)
            downstream = intron.get_att("down_exon").get_att(att)
            annotations.append("%s-%s" % (upstream, downstream))
        return ",".join(annotations)


    def get_intron_annot_gene_str(self, att):
        annotations = []
        for intron, p_overlap in self.intron_annotation:
            upstream = intron.get_att("up_exon").get_att(att)
            strand_up = intron.get_att("up_exon").strand
            biotype_up = intron.get_att("up_exon").biotype
            downstream = intron.get_att("down_exon").get_att(att)
            strand_d = intron.get_att("down_exon").strand
            biotype_d = intron.get_att("down_exon").biotype
            annotations.append("%s_%s_%s-%s_%s_%s" % (upstream, strand_up, biotype_up,
                                                      downstream, strand_d, biotype_d))
        return ",".join(annotations)


    def annotate_intron(self, intron, p_overlap):
        self.intron_annotation.append((intron, p_overlap))


    def get_infra_exonic_annot_str(self, att):
        annotations = []
        for exon in self.infra_exonic_annotation:
            id_ife = exon.get_att(att)
            annotations.append("%s" % (id_ife))
        return ",".join(annotations)


    def get_infra_exonic_gene_annot_str(self, att):
        annotations = []
        for exon in self.infra_exonic_annotation:
            id_ife = exon.get_att(att)
            strand = exon.strand
            biotype = exon.biotype
            annotations.append("%s_%s_%s" % (id_ife, strand, biotype))
        return ",".join(annotations)


    def get_infra_exonic_exons_start_end(self):
        annotations = []
        for exon in self.infra_exonic_annotation:
            start_exon_ife = exon.start
            end_exon_ife = exon.end
            annotations.append("%s_%s" % (start_exon_ife, end_exon_ife))
        return ",".join(annotations)


    def annotate_ife(self, exon):
        self.infra_exonic_annotation.append(exon)


def complement(strand):
    """ molecular biology complement """
    return "-" if strand == "+" else "+"


def compute_distinct_cr(ccr_array):
    """ From a list of CR compute the number of distincts CR
    Read type with different alignments and different CIGAR
    :ccr_array: array of circular chimeric reads
    """
    cigar_dict = defaultdict(list)
    for cr in ccr_array:
        key = "-".join(map(str, [cr.read_type, cr.CIGAR_s1]))
        cigar_dict[key].append(cr)
    return len(cigar_dict.keys())

def compute_variability(d_values, center):
    deviations = []
    occurences = 0
    for pos, occ in d_values.items():
        deviations.append(occ * abs(pos - center))
        occurences += occ
    return sum(deviations)/occurences


def static_vars(**kwargs):
    def decorate(func):
        for k in kwargs:
            setattr(func, k, kwargs[k])
        return func
    return decorate


@static_vars(counter=0, introns_key_dict=defaultdict())
def get_intron_name(chrom, start, end):
    key = "-".join(map(str, [chrom, start, end]))
    if key not in get_intron_name.introns_key_dict:
        get_intron_name.counter += 1
        name = "INTRON_%d" % get_intron_name.counter
        get_intron_name.introns_key_dict[key] = name
    return get_intron_name.introns_key_dict[key]


@static_vars(introns_name_dict=defaultdict())
def new_intron(upstream_exon, donwstreem_exon):
    chrom = upstream_exon.chrom
    start_i = int(upstream_exon.end) + 1
    end_i = int(donwstreem_exon.start) - 1
    strand = upstream_exon.strand
    name = get_intron_name(chrom, start_i, end_i)
    if name in new_intron.introns_name_dict:
        return None
    new_intron.introns_name_dict[name] = 1
    source = "exon_annot"
    feature = "intron"
    attributes = {"up_exon": upstream_exon, "down_exon":  donwstreem_exon}
    intron = Annotation(chrom, start_i, end_i, strand,
                        feature, attributes, source, name=name)
    return intron


def compute_intronic_positions(d_transcripts):
    """
    returns an array of intron Annotation objects
    """
    introns_a = []
    for transcript_exons in d_transcripts.values():
        # sort by transcript
        sorted_exons = natsort.natsorted(transcript_exons,
                                         key=lambda e: (e.chrom, e.start))
        previous = None
        for exon in sorted_exons:
            if previous:
                intron = new_intron(previous, exon)
                if intron is not None:
                    introns_a.append(intron)
            previous = exon
    return introns_a


def write_annotation(annot_a, outputfile, fmt, attributes):
    with open(outputfile, "w") as fout:
        for record in annot_a:
            fout.write(record.annot_str(fmt, attributes)+"\n")


def read_annotation(file, fmt):
    if fmt=="gtf":
        header = ["chrom", "source", "feature", "start", "end",
                  "score", "strand", "phase", "attributes"]
    elif fmt=="bed":
        header = ["chrom", "start", "end", "gene_id", "score",
                  "strand", "source", "feature", "phase", "attributes"]
    else:
        print("Error: Unknown format...")
    return read_annotation_from_file(file, header)


def read_annotation_from_file(file, header):
    annot_df = pd.read_table(file, sep = '\t', names=header,
                             dtype={"chrom":str, "start":str, "end":str})
    annot_a = []
    for index, row in tqdm_wrapper(annot_df):
        chrom = row.chrom
        start = row.start
        end = row.end
        strand = row.strand
        feature = row.feature
        source = row.source if "source" in header else "."
        attributes = row.attributes
        name = row["gene_id"]
        if feature == "circRNA":
            annot = CircRNA(chrom, start, end, strand, feature, attributes,
                            source, name=name)
        else:
            annot = Annotation(chrom, start, end, strand, feature, attributes,
                               source)
        annot_a.append(annot)
    return annot_a
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
import os
import argparse
import pandas as pd
import numpy as np
import re
import sys
from collections import defaultdict


def read_file(file):
    """ Read the circ_rnas_annotation.out file and return a pandas dataframe"""
    header = ["chrom", "start", "end", "circ_name", "nb_ccr", 
              "strand", "source", "feature", "phase", "attributes"]
    df = pd.read_table(file, sep='\t', names=header)
    df.replace(np.nan, "", inplace=True)
    return df


def get_folder_to_create(df): 
    folders_name = []
    dict_spl_sp = dict(zip(df["sample"], df["species"]))
    split_keys = [x.split("_") for x in dict_spl_sp.keys()]   

    for key in split_keys:
        folders_name.append(key[0]+"_"+key[1])
    folders = list(set(folders_name))
    return folders


def create_folders(folders):
    for folder in folders:
        if not os.path.exists(folder):
            os.makedirs(folder)


def filter_samples(df):
    df = df.drop(df[(df["sample"].isin(["ssc_liver_3", "ssc_liver_4", 
                                        "ssc_muscle_1", "ssc_testis_1"]))].index)
    df = df.reset_index(drop=True)
    return df


def main():

    # Read samples.tsv file:
    df_samples = pd.read_csv(args.sample_file, sep='\t')

    # Remove ssc_liver_3/4, ssc_muscle_1 and ssc_testis_1 samples:
    df_filtered = filter_samples(df_samples)

    samples_folders = df_filtered["sample"].tolist()
    df_filtered = df_filtered.loc[df_filtered['sample'].isin(samples_folders)] 

    # Get the folder list to create:
    folders = get_folder_to_create(df_filtered)

    # Create folders:
    create_folders(folders)

    # Group folders by species and tissue:
    df_grouped = df_filtered.groupby(["species", "tissue"])

    for key,item in df_grouped:

        group = df_grouped.get_group(key) 
        # Create col with path of folder:
        group["path_file_bed"] = group["sample"]+"/auzeville.bed"

        samples = group["sample"].tolist()
        samples_split = [s.split("_") for s in samples][0]
        folder_name = samples_split[0]+"_"+samples_split[1]

        paths = group["path_file_bed"].tolist()

        df_concat = pd.concat(read_file(f) for f in paths)
        df_concat = df_concat.sort_values(by=['chrom', 'start', 'end'])
        df_final = df_concat.groupby(['chrom', 'start', 'end', 'strand'], as_index=False)['nb_ccr'].sum()
        df_final.to_csv(folder_name+"/"+"auzeville.bed", sep="\t", index=False)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Annotation')
    parser.add_argument('-sp', '--sample_file',
                        required=True, help='Sample file path')
    parser.add_argument('-circ', '--circ_rna_file',
                        required=False, help='Circular RNA file path')
    args = parser.parse_args()
    return args

if __name__ == '__main__':
    args = parse_arguments()
    main()
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
import os
import argparse
import pandas as pd
import numpy as np
import re
import sys

def read_file(file):
    """ Read the sample file containing path files and return a pandas dataframe"""
    sample = pd.read_csv(file, sep='\t', dtype=str)
    return sample

def get_log_stats(log_file):
    """ Reading, parsing the log file and returning a dict"""
    log_statistics = []
    with open(log_file) as fin:
        for line in fin:
            line = line.lstrip().rstrip()
            if not line or line.endswith(":") :  # if the line is empty we simply skip it
                continue
            key, value = line.split("|")
            value = value.replace("%", "")
            stat = (key.lstrip().rstrip(), value.lstrip().rstrip())
            log_statistics.append(stat)
    return log_statistics


def get_mapping_stats(runs):
    """
        Parse the STAR Log.final.out for each run to gather alignments
        statistics and returns a pandas dataframe
    """
    statistics = []
    reads = ["R1", "R2"]
    for idx, row in runs.iterrows():
        sample_info = [('sample', row['sample']),
                        ('sample_unit',row['sample_unit'] )]
        for read in reads:
            mapdir = os.path.join(row["mapdir"],"se", read)
            if not os.path.exists(mapdir):
                raise Exception("WARNING mapdir %s is missing" % mapdir)
            log_file = os.path.join(mapdir, "Log.final.out")
            if not os.path.exists(log_file):
                raise Exception("WARNING logfile %s is missing" % log_file)
            sample_info.append(('read', read ))
            log_stats = get_log_stats(log_file)
            sample_stats = sample_info + log_stats

            statistics.append(dict(sample_stats))
    return pd.DataFrame(statistics)


def main(input, output):

    # Read the sample file:
    runs = read_file(input)

    # Reading the log files
    stats = get_mapping_stats(runs)

    # Writing statistics to a file in tsv format
    stats.to_csv(output, sep="\t", index=False)



def parse_arguments():
    parser = argparse.ArgumentParser(description='Sample file')
    parser.add_argument('-i', '--input_file',
                        required=True, help='Sample file')
    parser.add_argument('-o', '--output_file',
                        required=True, help='Statuistics in tsv format')
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main(args.input_file, args.output_file)
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
import os
import argparse
import pandas as pd


def read_file(file):
    """ Read the metadata file containing path files and return a pandas dataframe"""
    metadata = pd.read_csv(file, sep='\t', dtype=str)
    return metadata


def check_mapdirs(mapdirs):
    """ Check if the mapdir exists or not"""
    for mapdir in mapdirs:
        if not os.path.exists(mapdir):
            raise Exception("WARNING following mapdir is missing %s" % mapdir)


def unique(liste):
    return sorted(set(liste))


def get_path_files(df):
    """ Group the samples by "sample".
    Return a dataframe with one line per individual.
    """
    df = df.groupby('sample').agg(lambda x: unique(list(x))).reset_index()
    return df


def get_species_short(species):
    animal_keys = ["bos_taurus", "sus_scrofa"]
    animal_values = ["cow", "pig"]
    succint_name = dict(zip(animal_keys, animal_values))
    if species in succint_name:
        return succint_name[species]
    else:
        return species


def write_sample_file(metadata, output_file):
    """Read the metadate file and return a pandas object"""
    sample_ids = []
    species_names = []
    for index, row in metadata.iterrows():
        short_name = get_species_short(row['species'])
        sample = "_".join([short_name, row['tissue'],
                           row['animal_name']])
        species = short_name
        sample_ids.append(sample)
        species_names.append(species)
    metadata['sample'] = sample_ids
    metadata['species'] = species_names
    samples = metadata[['sample', 'sample_unit', 'species', 'sex', 'mapdir']]
    samples.to_csv(output_file, sep="\t", index=False)


def main():

    # Read the metadata file:
    metadata = read_file(args.input_file)

    # Check the mapdir of each sample exists or not:
    check_mapdirs(metadata["mapdir"])

    # Write a new output file 'samples.tsv' with identical key per individual:
    write_sample_file(metadata, args.output_file)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Metadata file')
    parser.add_argument('-i', '--input_file',
                        required=True, help='Metadata file')
    parser.add_argument('-o', '--output_file',
                        required=True, help='Metadata file')
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main()
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
class Singleton:
    """
    A non-thread-safe helper class to ease implementing singletons.
    This should be used as a decorator -- not a metaclass -- to the
    class that should be a singleton.
    The decorated class can define one `__init__` function that
    takes only the `self` argument. Also, the decorated class cannot be
    inherited from. Other than that, there are no restrictions that apply
    to the decorated class.
    To get the singleton instance, use the `instance` method. Trying
    to use `__call__` will result in a `TypeError` being raised.
    """

    def __init__(self, decorated):
        self._decorated = decorated

    def instance(self):
        """
        Returns the singleton instance. Upon its first call, it creates a
        new instance of the decorated class and calls its `__init__` method.
        On all subsequent calls, the already created instance is returned.
        """
        try:
            return self._instance
        except AttributeError:
            self._instance = self._decorated()
            return self._instance

    def __call__(self):
        raise TypeError('Singletons must be accessed through `instance()`.')

    def __instancecheck__(self, inst):
        return isinstance(inst, self._decorated)


@Singleton
class Foo:
   def __init__(self):
       print('Foo created')


if __name__ == '__main__':
    # This enables to use this module as a script for testing
    f = Foo() # Error, this isn't how you get the instance of a singleton

    f = Foo.instance() # Good. Being explicit is in line with the Python Zen
    g = Foo.instance() # Returns already created instance

    print(f is g) # True
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
import os, re, sys, csv, argparse
import circRNA as circ
import pandas as pd
import numpy as np

# Utility functions:
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)

def get_sample(file):
    """ Get the sample name from the file path"""
    sample_name = os.path.split(os.path.dirname(file))[1]
    return sample_name

def read_file(file):
    """ Read the circ_rnas_annotation.out file and return a pandas dataframe"""
    df = pd.read_table(file, sep='\t')
    df.replace(np.nan, "", inplace=True)
    return df

def intersection(lst1, lst2):
    """ Return the intersection between two lists"""
    return list(set(lst1) & set(lst2))


def get_exonic_circrnas(df):
    """ Get statistics about exonic circRNAs from the annotation_circRNA.out file"""
    # Arrays, dicts:
    exonic_circ_names, monoexonic_circ_names, true_exonic, single_end_circ_names = [], [], [], []
    ccr_start_end_exonic, ccr_single_annot = dict(), dict()
    # Counter exonic circRNAs:
    nb_monoexonic, nb_start_end_exonic, nb_antisens_exonic, nb_single_annotated_junction = 0, 0, 0, 0

    for index, row in df.iterrows():

        exon_id_start = row.exons_id_start.split(",")
        exon_start = str(exon_id_start[0])
        exon_start_tmp = exon_start.split("_")
        exon_start = exon_start_tmp[0]
        exon_id_end = row.exons_id_end.split(",")
        exon_end = str(exon_id_end[0])
        exon_end_tmp = exon_end.split("_")
        exon_end = exon_end_tmp[0]

        if row.nb_ccr >= 5:
            # if ((len(row.exons_id_start) > 0 or len(row.exons_id_end) > 0)
            #     and len(row.intron_name) == 0):
            if len(row.exons_id_start) > 0 or len(row.exons_id_end) > 0:
                if row.strand == "+":
                    if ("5" and "+") in row.exons_id_start:
                        if len(row.exons_id_end)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                        if ("3" and "+") in row.exons_id_end:
                            nb_start_end_exonic += 1
                            true_exonic.append(row)
                            exonic_circ_names.append(row.circ_rna_name)
                            ccr_start_end_exonic[row.circ_rna_name] = row.nb_ccr
                            if ((len(exon_id_start)==1 and len(exon_id_end)==1) and (exon_start == exon_end)):
                                nb_monoexonic += 1
                                monoexonic_circ_names.append(row.circ_rna_name)
                    if ("3" and "+") in row.exons_id_end:
                        if len(row.exons_id_start)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                    if ("3" and "-") in row.exons_id_start:
                        if ("5" and "-") in row.exons_id_end:
                            nb_antisens_exonic += 1
                elif row.strand == "-":
                    if ("5" and "-") in row.exons_id_end:
                        if len(row.exons_id_start)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                        if ("3" and "-") in row.exons_id_start:
                            nb_start_end_exonic += 1
                            true_exonic.append(row)
                            exonic_circ_names.append(row.circ_rna_name)
                            ccr_start_end_exonic[row.circ_rna_name] = row.nb_ccr
                            if ((len(exon_id_start)==1 and len(exon_id_end)==1) and (exon_start == exon_end)):
                                nb_monoexonic += 1
                                monoexonic_circ_names.append(row.circ_rna_name)
                    if ("3" and "-") in row.exons_id_start:
                        if len(row.exons_id_end)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                    if ("5" and "+") in row.exons_id_start:
                        if ("3" and "+") in row.exons_id_end:
                            nb_antisens_exonic += 1
    return true_exonic, nb_start_end_exonic, nb_single_annotated_junction, single_end_circ_names, ccr_single_annot, monoexonic_circ_names, exonic_circ_names


def get_intronic_circrnas(df, exonic_circrnas):
    # Arrays, dicts:
    intronic_circ_names, true_intronic = [], []
    ccr_true_intronic = dict()
    # Counter intronic circRNAs:
    nb_true_intronic = 0

    for index, row in df.iterrows():
        if row.circ_rna_name not in exonic_circrnas:
            if row.nb_ccr >= 5:
                if len(row.intron_name) > 0:
                    if row.strand == "+":
                        if (row.end_i - row.end) in range(-5,60):
                            if (row.start - row.start_i) in range(-5,5) or (row.start==row.start_i):
                                nb_true_intronic += 1
                                true_intronic.append(row)
                                ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                                intronic_circ_names.append(row.circ_rna_name)
                        elif (row.start == row.start_i and (row.end_i - row.end) > 32):
                            nb_true_intronic += 1
                            true_intronic.append(row)
                            ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                            intronic_circ_names.append(row.circ_rna_name)
                    elif row.strand == "-":
                        if (row.start - row.start_i) in range(-5,60):
                            if ((row.end - row.end_i) in range(-5,5) or (row.end == row.end_i)):
                                nb_true_intronic += 1
                                true_intronic.append(row)
                                ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                                intronic_circ_names.append(row.circ_rna_name)
                        elif (row.end == row.end_i and (row.start - row.start_i) > 60):
                            nb_true_intronic += 1
                            true_intronic.append(row)
                            ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                            intronic_circ_names.append(row.circ_rna_name)
    return true_intronic, nb_true_intronic, ccr_true_intronic, intronic_circ_names


def get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names):
    # Subexonic_meg : sens + antisens
    # Subexonic_pleg : sens
    # Arrays, dicts:
    infraexonic_tot_names, infraexonic_circ_names, subexonic, subexonic_antisens_names = [], [], [], []
    ccr_infraexonic = dict()
    # Counter subexonic circRNAs:
    nb_infraexonic, nb_infraexonic_tot, nb_infraexonic_sens, nb_infraexonic_antisens = 0, 0, 0, 0

    for index, row in df.iterrows():
        if row.nb_ccr >= 5:
            if ((len(row.gene_id_ife) > 0) and (row.circ_rna_name not in monoexonic_circ_names)):
                if (("_5_c_+" not in row.exons_id_start) and ("_5_lnc_+" not in row.exons_id_start) and
                    ("_3_c_-" not in row.exons_id_start) and ("_3_lnc_-" not in row.exons_id_start) and
                    ("_3_c_+" not in row.exons_id_end) and ("_3_lnc_+" not in row.exons_id_end) and
                    ("_5_c_-" not in row.exons_id_end) and ("_5_lnc_-" not in row.exons_id_end)):
                    if len(row.gene_id_start) == 0 and len(row.gene_id_end) == 0:
                        nb_infraexonic_tot += 1
                        infraexonic_tot_names.append(row.circ_rna_name)
                        if row.circ_rna_name not in intronic_circ_names:
                            if row.strand == "+":
                                if "+" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                elif "-" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
                            elif row.strand == "-":
                                if "-" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                elif "+" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
                    elif ((len(row.gene_id_start) > 0 and len(row.gene_id_end) == 0) or
                            (len(row.gene_id_start) == 0  and len(row.gene_id_end) > 0)):
                        nb_infraexonic_tot += 1
                        infraexonic_tot_names.append(row.circ_rna_name)
                        if row.circ_rna_name not in intronic_circ_names:
                            if row.strand == "+":
                                if "+" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                elif "-" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
                            elif row.strand == "-":
                                if "-" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                elif "+" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
    return subexonic, infraexonic_circ_names, nb_infraexonic_sens, ccr_infraexonic, subexonic_antisens_names


def get_circrnas(df):
    """ Read the annotation_circRNA.out dataframe and return all statistics to tabular
        format about circRNAs"""
    # Get exonic circRNAs:
    stats_exonic_circrnas = get_exonic_circrnas(df)
    exonic = stats_exonic_circrnas[0]
    monoexonic_names = stats_exonic_circrnas[5]
    exonic_names = stats_exonic_circrnas[6]
    # Get intronic circRNAs:
    stats_intronic_circrnas = get_intronic_circrnas(df, exonic_names)
    intronic = stats_intronic_circrnas[0]
    intronic_names = stats_intronic_circrnas[3]
    # Get subexonic circRNAs:
    stats_subexonic_circrnas = get_subexonic_circrnas(df, monoexonic_names, intronic_names)
    subexonic = stats_subexonic_circrnas[0]
    subexonic_antisens = stats_subexonic_circrnas[4]

    return exonic, subexonic, intronic, subexonic_antisens


def get_stats_circrnas(sample, df, output_file_name):
    """ Read the annotation_circRNA.out dataframe and return all statistics to tabular
        format about circRNAs"""
    nb_circ_tot = len(df)
    # Get circRNAs:
    exonic = get_circrnas(df)[0]
    subexonic = get_circrnas(df)[1]
    subexonic_antisens = get_circrnas(df)[3]
    intronic = get_circrnas(df)[2]

    # Get exonic circRNAs stats:
    stats_exonic = write_comparison_exonic_table(sample, exonic, output_file_name)

    # Get subexonic circRNAs stats:
    nb_subexonic = get_nb_type(sample, subexonic, subexonic_antisens)[0]
    nb_gene_subexonic = get_nb_type(sample, subexonic, subexonic_antisens)[1]
    nb_subexonic_pleg = get_nb_type(sample, subexonic, subexonic_antisens)[2]
    nb_subexonic_meg = get_nb_type(sample, subexonic, subexonic_antisens)[3]
    nb_ccr_pleg = get_nb_type(sample, subexonic, subexonic_antisens)[4]
    nb_ccr_meg = get_nb_type(sample, subexonic, subexonic_antisens)[5]


    if len(stats_exonic) > 5:
        nb_c = stats_exonic[0]
        nb_lnc = stats_exonic[1]
        nb_autres = stats_exonic[2]
        nb_start_end_exonic_selected = stats_exonic[3]
        nb_ccr_start_end_exonic = stats_exonic[4]
        nb_exonic = write_comparison_exonic_table(sample, exonic, args.output_comp_exonic_file)[5]
    else:
        nb_c = stats_exonic[0]
        nb_lnc = 0
        nb_autres = stats_exonic[1]
        nb_start_end_exonic_selected = stats_exonic[2]
        nb_ccr_start_end_exonic = stats_exonic[3]
        nb_exonic = write_comparison_exonic_table(sample, exonic, args.output_comp_exonic_file)[4]

    nb_start_end_exonic = get_exonic_circrnas(df)[1]
    nb_single_annotated_junction = get_exonic_circrnas(df)[2]
    monoexonic_circ_names =  get_exonic_circrnas(df)[5]
    exonic_names = get_exonic_circrnas(df)[6]
    intronic_circ_names = get_intronic_circrnas(df, exonic_names)[3]
    infraexonic_circ_names = get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names)[1]
    single_end_circ_names = get_exonic_circrnas(df)[3]
    nb_infraexonic_sens = get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names)[2]
    nb_true_intronic = get_intronic_circrnas(df, exonic_names)[1]
    ccr_single_annot = get_exonic_circrnas(df)[4]
    ccr_true_intronic = get_intronic_circrnas(df, exonic_names)[2]
    ccr_infraexonic = get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names)[3]

    nb_start_end_false_exonic = nb_start_end_exonic - nb_start_end_exonic_selected
    nb_tot_exonic = nb_start_end_exonic + nb_single_annotated_junction
    nb_common_infra_single = len(intersection(infraexonic_circ_names, single_end_circ_names))
    nb_infraexonic = nb_infraexonic_sens - nb_common_infra_single
    nb_circ_non_annotated = nb_circ_tot - (nb_tot_exonic + nb_infraexonic + nb_true_intronic - nb_start_end_false_exonic)
    nb_ccr_single_annot = sum(ccr_single_annot.values())
    nb_ccr_intronic = sum(ccr_true_intronic.values())
    nb_ccr_infraexonic = sum(ccr_infraexonic.values())

    return "\t".join(map(str,[sample, nb_circ_tot, nb_exonic, nb_lnc, nb_autres, nb_ccr_start_end_exonic,
                              nb_subexonic_pleg, nb_ccr_pleg, nb_subexonic_meg, nb_ccr_meg, nb_gene_subexonic,
                              nb_true_intronic, nb_ccr_intronic]))+"\n"


def write_stats_table(stats, output_file):
    with open(output_file, "w") as fout:
        fout.write(stats)


def write_circ_table(self, header, path, index=None, sep="\t", na_rep='', float_format=None,
                     index_label=None, mode='w', encoding=None, date_format=None, decimal='.'):
    """
    Write a circRNAs list to a tabular-separated file (tsv).
    """
    from pandas.core.frame import DataFrame
    df = DataFrame(self)
    # result is only a string if no path provided, otherwise None
    result = df.to_csv(path, index=index, sep=sep, na_rep=na_rep, float_format=float_format,
                       header=header, index_label=index_label, mode=mode, encoding=encoding,
                       date_format=date_format, decimal=decimal)
    if path is None:
        return result


def write_comparison_exonic_table(sample, circ_rnas, output_file_name):
    # Write exonic table for comparaison between tissues/species:
    df_circ_rnas = pd.DataFrame(circ_rnas, index=None)
    header = ["chrom:start-end,strand", "nb_ccr", "gene_id", "biotype", "genomic_size"]
    nb_start_end_annotated, nb_exonic = 0, 0
    nb_ccr_exonics, biotypes = [], []

    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        for index, row in df_circ_rnas.iterrows():
            exons_id_start_a, exons_id_end_a = [], []
            # Select well-annotated true exonic circRNAs bases on the gene_id:
            # Get key:
            chrom_start = ":".join(map(str,[row.chrom, row.start]))
            chrom_start_end = "-".join(map(str, [chrom_start , row.end]))
            chrom_start_end_strand = ",".join(map(str, [chrom_start_end , row.strand]))
            # Get the ccr number:
            nb_ccr = row.nb_ccr
            nb_ccr_exonics.append(nb_ccr)
            # Get gene_id:
            genes_id_start = list(set(list(row.gene_id_start.split(","))))
            genes_id_end = list(set(list(row.gene_id_end.split(","))))
            # Get biotype:
            exons_id_start = list(set(list(row.exons_id_start.split(","))))
            exons_id_end = list(set(list(row.exons_id_end.split(","))))
            exons_id_start = list(i.split("_") for i in exons_id_start)
            exons_id_end = list(i.split("_") for i in exons_id_end)
            intersect_gene_id = ",".join(list(set(genes_id_start).intersection(genes_id_end)))
            if len(intersect_gene_id) > 0:
                nb_start_end_annotated += 1
                # Get biotype:
                biotypes_start = list(set(list(i[2] for i in exons_id_start)))
                biotypes_end = list(set(list(i[2] for i in exons_id_end)))
                intersect_biotypes = "".join(list(set(biotypes_start).intersection(biotypes_end)))
                # Write line into the output file:
                nb_exonic += 1
                genomic_size = row.end - row.start
                s = [chrom_start_end_strand, nb_ccr, intersect_gene_id, intersect_biotypes, genomic_size]
                tsv_writer.writerow(s)
                biotypes.append(intersect_biotypes)
    # Dictionary with key = biotype and value = counter:
    d = {}
    d["sample"] = sample
    for biotype in biotypes:
        d[biotype] = d.get(biotype, 0) + 1
    values = []
    valid_keys = ["c", "lnc", "sample"]
    for key, value in d.items():
        if key not in valid_keys:
            values.append(value)
    d["autres"] = sum(values)
    nb_ccr_exonic = sum(nb_ccr_exonics)
    for key, value in d.items():
        if ("c" in d.keys()) and ("lnc" in d.keys()):
            return d["c"], d["lnc"], d["autres"], nb_start_end_annotated, nb_ccr_exonic, nb_exonic
        elif ("c" in d.keys() and "lnc" not in d.keys()):
            return d["c"], d["autres"], nb_start_end_annotated, nb_ccr_exonic, nb_exonic


def compute_size(exons_start_end):
    pos_start_end = list(exons_start_end.split("_"))
    pos_start = int(pos_start_end[0])
    pos_end = int(pos_start_end[1])
    size = pos_end - pos_start
    return size

def get_true_exons_gene_id(exons_start_end, exons_id, gene_id):
    if len(exons_start_end)==1:
        exon_start_end = ",".join(exons_start_end)
        exon_id = ",".join(exons_id)
        gene_id = gene_id
    else:
        annot = dict(zip(exons_start_end, exons_id))
        sizes = []
        for key in annot:
            size = compute_size(key)
            sizes.append(size)
        values = zip(exons_start_end, sizes)
        annot_f = dict(zip(exons_id, values))
        min_size = min([i[1] for i in list(annot_f.values())])
        for key, values in annot_f.items():
            if values[1] == min_size:
                exon_id = key
                exon_start_end = values[0]
    return exon_start_end, exon_id, gene_id


def get_nb_type(sample, circ_rnas, subexonic_antisens):
    # Write exonic table for comparaison between tissues/species:
    df_circ_rnas = pd.DataFrame(circ_rnas, index=None)
    nb_sub_exonic, nb_sub_exonic_pleg, nb_sub_exonic_meg = 0, 0, 0
    subexonic_genes, nb_ccr_pleg, nb_ccr_meg = [], [], []

    for index, row in df_circ_rnas.iterrows():
        # Select well-annotated subexonic circRNAs according to the gene_id:
        # Get gene_id:
        genes_id = list(set(list(row.gene_id_ife.split(","))))
        genes_id = list(i.split("_") for i in genes_id)
        exons_start_end = sorted(set(row.exons_start_end_ife.split(",")), key=row.exons_start_end_ife.split(',').index)
        exon_id = sorted(set(row.exon_id_ife.split(",")), key=row.exon_id_ife.split(',').index)
        gene_id = ",".join(list(set(list(row.gene_id_ife.split(",")))))
        biotypes = []

        exon_start_end = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[0]
        exon_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[1]
        gene_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[2]

        row.exons_start_end_ife = exon_start_end
        row.exon_id_ife = exon_id
        row.gene_id_ife = gene_id

        if len(genes_id)==1:
            for i in genes_id:
                # Subexonic pleg circRNAs:
                if ('c' in i or 'lnc' in i or 'pseudo' in i) and (row.circ_rna_name not in subexonic_antisens):
                    nb_sub_exonic += 1
                    nb_sub_exonic_pleg += 1
                    nb_ccr_pleg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                    # Subexonic meg circRNAs:
                elif 'c' not in i and 'lnc' not in i and 'pseudo' not in i and 'rRNA' not in i:
                    nb_sub_exonic += 1
                    nb_sub_exonic_meg += 1
                    nb_ccr_meg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                elif 'rRNA' in i:
                    pass

        if len(genes_id) > 1:
            for i in genes_id:
                if len(i) == 3:
                    biotypes.append(i[2])
                    biotypes = list(set(biotypes))
                elif len(i) > 3:
                    p_biotypes = [i[2],i[3]]
                    biotypes = [p for p in p_biotypes]
            if 'rRNA' in biotypes:
                pass
            # Subexonic pleg circRNAs:
            elif ((biotypes == ['c'] or biotypes == ['lnc'] or biotypes == ['pseudo']
                or ('c' and 'pseudo' in biotypes) or ('c' and 'lnc' in biotypes)
                or ('lnc' and 'pseudo' in biotypes)) and (row.circ_rna_name not in subexonic_antisens)):
                nb_sub_exonic += 1
                nb_sub_exonic_pleg += 1
                nb_ccr_pleg.append(row.nb_ccr)
                subexonic_genes.append(gene_id)
            # Subexonic meg circRNAs:
            else:
                nb_sub_exonic += 1
                nb_sub_exonic_meg += 1
                nb_ccr_meg.append(row.nb_ccr)
                subexonic_genes.append(gene_id)

    nb_subexonic_genes = len(list(set([val for sublist in subexonic_genes for val in sublist])))
    nb_ccr_p = sum(nb_ccr_pleg)
    nb_ccr_m = sum(nb_ccr_meg)
    return nb_sub_exonic, nb_subexonic_genes, nb_sub_exonic_pleg, nb_sub_exonic_meg, nb_ccr_p, nb_ccr_m


def write_subexonic_tables(sample, circ_rnas, subexonic_antisens, output_file_name_meg, output_file_name_pleg):
    # Write exonic table for comparaison between tissues/species:
    df_circ_rnas = pd.DataFrame(circ_rnas, index=None)
    nb_sub_exonic, nb_sub_exonic_pleg, nb_sub_exonic_meg = 0, 0, 0
    subexonic_genes, nb_ccr_pleg, nb_ccr_meg = [], [], []
    header = list(df_circ_rnas.columns.values.tolist())

    with open(output_file_name_meg, 'w') as fout_meg, open(output_file_name_pleg, 'w') as fout_pleg:
        # Subexonic pleg circRNAs:
        tsv_writer_pleg = csv.writer(fout_pleg, delimiter='\t')
        tsv_writer_pleg.writerow(header)
        # Subexonic meg circRNAs:
        tsv_writer_meg = csv.writer(fout_meg, delimiter='\t')
        tsv_writer_meg.writerow(header)

        for index, row in df_circ_rnas.iterrows():
            # Select well-annotated subexonic circRNAs according to the gene_id:
            # Get gene_id:
            genes_id = list(set(list(row.gene_id_ife.split(","))))
            genes_id = list(i.split("_") for i in genes_id)
            exons_start_end = sorted(set(row.exons_start_end_ife.split(",")), key=row.exons_start_end_ife.split(',').index)
            exon_id = sorted(set(row.exon_id_ife.split(",")), key=row.exon_id_ife.split(',').index)
            gene_id = ",".join(list(set(list(row.gene_id_ife.split(",")))))
            biotypes = []

            exon_start_end = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[0]
            exon_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[1]
            gene_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[2]

            row.exons_start_end_ife = exon_start_end
            row.exon_id_ife = exon_id
            row.gene_id_ife = gene_id

            if len(genes_id)==1:
                for i in genes_id:
                    # Subexonic pleg circRNAs:
                    if ('c' in i or 'lnc' in i or 'pseudo' in i) and (row.circ_rna_name not in subexonic_antisens):
                        nb_sub_exonic += 1
                        nb_sub_exonic_pleg += 1
                        nb_ccr_pleg.append(row.nb_ccr)
                        subexonic_genes.append(gene_id)
                        s = row
                        tsv_writer_pleg.writerow(s)
                    # Subexonic meg circRNAs:
                    elif 'c' not in i and 'lnc' not in i and 'pseudo' not in i and 'rRNA' not in i:
                        nb_sub_exonic += 1
                        nb_sub_exonic_meg += 1
                        nb_ccr_meg.append(row.nb_ccr)
                        subexonic_genes.append(gene_id)
                        s = row
                        tsv_writer_meg.writerow(s)
                    elif 'rRNA' in i:
                        pass

            if len(genes_id) > 1:
                for i in genes_id:
                    if len(i) == 3:
                        biotypes.append(i[2])
                        biotypes = list(set(biotypes))
                    elif len(i) > 3:
                        p_biotypes = [i[2],i[3]]
                        biotypes = [p for p in p_biotypes]
                if 'rRNA' in biotypes:
                    pass
                # Subexonic pleg circRNAs:
                elif ((biotypes == ['c'] or biotypes == ['lnc'] or biotypes == ['pseudo']
                    or ('c' and 'pseudo' in biotypes) or ('c' and 'lnc' in biotypes)
                    or ('lnc' and 'pseudo' in biotypes)) and (row.circ_rna_name not in subexonic_antisens)):
                    nb_sub_exonic += 1
                    nb_sub_exonic_pleg += 1
                    nb_ccr_pleg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                    s = row
                    tsv_writer_pleg.writerow(s)
                # Subexonic meg circRNAs:
                else:
                    nb_sub_exonic += 1
                    nb_sub_exonic_meg += 1
                    nb_ccr_meg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                    s = row
                    tsv_writer_meg.writerow(s)

    nb_subexonic_genes = len(list(set([val for sublist in subexonic_genes for val in sublist])))
    nb_ccr_p = sum(nb_ccr_pleg)
    nb_ccr_m = sum(nb_ccr_meg)
    return nb_sub_exonic, nb_subexonic_genes, nb_sub_exonic_pleg, nb_sub_exonic_meg, nb_ccr_p, nb_ccr_m


def write_circrnas_tables(sample, df, exonic_circrnas, intronic_circrnas, subexonic_circrnas, subexonic_antisens):
    header = list(df.columns.values.tolist())
    if len(exonic_circrnas)>0:
        # Write the exonic circRNAs table:
        write_circ_table(exonic_circrnas, header, args.output_exonic_file)
    else:
        with open(args.output_exonic_file, 'w') as fout:
            tsv_writer = csv.writer(fout, delimiter='\t')
            tsv_writer.writerow(header)
    if len(intronic_circrnas)>0:
        # Write the intronic circRNAs table:
        write_circ_table(intronic_circrnas, header, args.output_intronic_file)
    else:
        with open(args.output_intronic_file, 'w') as fout:
            tsv_writer = csv.writer(fout, delimiter='\t')
            tsv_writer.writerow(header)

    eprint("subexonic_circrnas", len(subexonic_circrnas))
    eprint("subexonic_antisens", len(subexonic_antisens))
    if len(subexonic_circrnas)>0 or len(subexonic_antisens)>0:
        # Write the subexonic circRNAs tables:
        write_subexonic_tables(sample, subexonic_circrnas, subexonic_antisens, args.output_subexonic_meg_file,
                               args.output_subexonic_pleg_file)
    else:
        eprint("ok")
        open(args.output_subexonic_meg_file, 'a').close()
        open(args.output_subexonic_pleg_file, 'a').close()

def main():
    # Get sample name:
    sample = get_sample(args.input_file)

    # Read the circRNAs annotation file:
    df_circ_annot = read_file(args.input_file)

    # Get the list of exonic, subexonic and intronic circRNAs:
    exonic_circrnas = get_circrnas(df_circ_annot)[0]
    subexonic_circrnas = get_circrnas(df_circ_annot)[1]
    intronic_circrnas = get_circrnas(df_circ_annot)[2]
    subexonic_antisens = get_circrnas(df_circ_annot)[3]

    # Write exonic, intronic and subexonic circRNAs tables:
    circrnas_tables = write_circrnas_tables(sample, df_circ_annot, exonic_circrnas,
                                            intronic_circrnas, subexonic_circrnas, subexonic_antisens)

    # Compute statistics about exonic, subexonic and intronic circRNAs:
    stats = get_stats_circrnas(sample, df_circ_annot, args.output_comp_exonic_file)

    # Write the circRNAs statistics table:
    write_stats_table(stats, args.output_stats_file)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Return annotation statistics tables')
    parser.add_argument('-i', '--input_file', required=True,
                        help='Annotation circRNAs file')
    parser.add_argument('-o_stats', '--output_stats_file', required=False,
                        default="stats_annotation.tsv",
                        help='Table containing statistics of annotation')
    parser.add_argument('-oi', '--output_intronic_file', required=False,
                        default="intronic_circRNAs.tsv",
                        help='Table containing intronic circRNAs')
    parser.add_argument('-oe', '--output_exonic_file', required=False,
                        default="exonic_circRNAs.tsv",
                        help='Table containing exonic circRNAs')
    parser.add_argument('-oce', '--output_comp_exonic_file', required=False,
                        default="exonic_summary.tsv",
                        help='Table containing exonics circRNAs for comparisons')
    parser.add_argument('-osepleg', '--output_subexonic_pleg_file', required=False,
                        default="subexonic_pleg_circRNAs.tsv",
                        help='Table containing subexonic circRNAs')
    parser.add_argument('-osemeg', '--output_subexonic_meg_file', required=False,
                        default="subexonic_meg_circRNAs.tsv",
                        help='Table containing subexonic circRNAs')
    args = parser.parse_args()
    return args

if __name__ == '__main__':
    args = parse_arguments()
    main()
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
import os
import argparse
import pandas as pd
import numpy as np
import re
import sys


# Utility functions
def create_folder(path):
    if not os.path.exists(path):
        os.mkdir(path)

def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)

def read_file(file):
    """ Read the sample file containing path files and return a pandas dataframe"""
    sample = pd.read_csv(file, sep='\t', dtype=str)
    return sample

def check_mapdirs(mapdirs):
    """ Check if the mapdir exists or not"""
    for mapdir in mapdirs:
        if not os.path.exists(mapdir):
            raise Exception("WARNING following mapdir is missing %s" % mapdir)

def read_log_file(sample):
    """ Read and clean the Log.final.out files containing the summary mapping statistics
        after mapping job is complete and return an array of the pandas dataframe
        containing statistics"""
    stats = []
    d = {}
    reads = ["R1", "R2"]
    paths_out = []
    for index, row in sample.iterrows():
        sample_name = row["sample"]
        sample_unit_name = row["sample_unit"]
        path = row["mapdir"]
        for read in reads:
            path_file = path+"/se/"+read+"/Log.final.out"
            f = open(path_file, "r")
            for line in f:
                l = line.split("|")
                stats.append(l)
            for stat in stats:
                if len(stat) == 1:
                    stats.remove(stat)
            for stat in stats:
                stat[0] = ' '.join(stat[0].split("\t"))
                stat[0] = ' '.join(stat[0].split())
                stat[1] = ' '.join(stat[1].split())
                stat[1] = stat[1].replace("%", "")
                d['sample'] = sample_name
                d['sample_unit'] = sample_unit_name
                d['read'] = read
                d[stat[0]] = stat[1]
            f.close()
            create_folder("reports/"+sample_unit_name+"/")
            create_folder("reports/"+sample_unit_name+"/"+read+"/")
            path_out = "reports/"+sample_unit_name+"/"+read+"/"+args.output_file
            paths_out.append(path_out)
            write_sample_file(d, path_out)
    return paths_out


def write_sample_file(dict, output_file):
    """Get value from a statistic dictionnary and return a pandas object"""
    df = pd.DataFrame([dict])
    df.to_csv(output_file, sep="\t", index=False)


def write_final_stat_tab(paths_out, output_file):
    df_final = pd.DataFrame()
    for path in paths_out:
        df = pd.read_csv(path, sep="\t")
        df_final = pd.concat([df_final, df], ignore_index=True, sort =False)
    df_final.to_csv(output_file, sep="\t", index=False)


def main():

    # Read the sample file:
    sample = read_file(args.input_file)

    # Check the mapdir of each sample exists or not:
    check_mapdirs(sample["mapdir"])

    create_folder("reports")

    # Read log files and get the path of a table containing statistics:
    paths_out = read_log_file(sample)

    # Write the statistic table with all samples:
    write_final_stat_tab(paths_out, args.output_file)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Sample file')
    parser.add_argument('-i', '--input_file',
                        required=True, help='Sample file')
    parser.add_argument('-o', '--output_file', required=False, 
                        default="mapping_stat.tsv", help='Sample file')
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main()
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
import os, re, sys, csv, argparse
import pandas as pd
import numpy as np

# Utility functions:
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)


def read_file(file):
    """ Read the circRNAs annotation file and return a pandas dataframe"""
    df = pd.read_table(file, sep='\t')
    df.replace(np.nan, "", inplace=True)
    return df


def get_item(item):
    return ''.join(list(set(item)))


def write_summary_meg_table(df, output_file_name, min_size):
    header = ["gene_name", "biotype", "nb_circ_sens", "nb_circ_antisens", "nb_ccr", "exon_name", "start_exon", "end_exon"]
    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        if len(df) > 0:
            for key, item in df:
                exon_start = get_item(item.exons_start_end_ife).split("_")[0]
                exon_end = get_item(item.exons_start_end_ife).split("_")[1]
                if int(exon_end) - int(exon_start) > min_size:
                    strand = list(item.strand)
                    gene_id = list(key.split("_"))[0]
                    strand_gene_ife = list(key.split("_"))[1]
                    # Get biotype:
                    if len(list(key.split("_"))) == 3:
                        biotype = list(key.split("_"))[2]
                    else:
                        biotype = list(key.split("_"))[2]+"_"+list(key.split("_"))[3]
                    nb_circ_rna = len(df.get_group(key))
                    nb_circ_sens = strand.count(strand_gene_ife)
                    nb_circ_antisens = nb_circ_rna - nb_circ_sens
                    nb_ccr = sum(item.nb_ccr)
                    exon_name = get_item(item.exon_id_ife)
                    # print(df.get_group(key), "\n\n")
                    row = [gene_id, biotype, nb_circ_sens, nb_circ_antisens, nb_ccr, exon_name, exon_start, exon_end]
                    tsv_writer.writerow(row)


def write_summary_intronic_table(df, output_file_name, min_size):
    header = ["gene_name", "biotype", "intron_name", "nb_circ_rna", "nb_ccr", "start_intron", "end_intron"]
    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        if len(df) > 0:
            for key, item in df:
                start_i = ''.join([str(round(x)) for x in list(set(item.start_i))])
                end_i = ''.join([str(round(x)) for x in list(set(item.end_i))])
                if int(end_i) - int(start_i) > min_size:
                    intron_name = key
                    gene_id = get_item(item.gene_id_i).split("_")[0]
                    # Get biotype:
                    biotype = get_item(item.gene_id_i).split(",")
                    if len(biotype)>1:
                        biotype = biotype[0].split("_")[4]
                    else:
                        biotype = ''.join(biotype).split("_")[4]
                    nb_ccr = sum(item.nb_ccr)
                    nb_circ_rna = len(df.get_group(key))
                    row = [gene_id, biotype, intron_name, nb_circ_rna, nb_ccr, start_i, end_i]
                    tsv_writer.writerow(row)


def write_summary_pleg_table(df, output_file_name, min_size):
    header = ["gene_name", "nb_gene", "biotype", "nb_circ_sens", "nb_circ_antisens", "nb_ccr", "exon_name", "start_exon", "end_exon"]
    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        if len(df) > 0:
            for key, item in df:
                exon_start = get_item(item.exons_start_end_ife).split("_")[0]
                exon_end = get_item(item.exons_start_end_ife).split("_")[1]
                if int(exon_end) - int(exon_start) > min_size:
                    # Get gene_id and biotype:
                    gene_id_ife = get_item(item.gene_id_ife).split(",")
                    nb_gene = len(gene_id_ife)
                    if len(gene_id_ife) > 1:
                        gene_id = gene_id_ife[0].split("_")[0]+","+gene_id_ife[1].split("_")[0]
                        strand_gene_ife = get_item(gene_id_ife[0].split("_")[1]+gene_id_ife[1].split("_")[1])
                        biotype = get_item(gene_id_ife[0].split("_")[2]+gene_id_ife[1].split("_")[2])
                    else:
                        gene_id = ''.join(gene_id_ife).split("_")[0]
                        strand_gene_ife = ''.join(gene_id_ife).split("_")[1]
                        biotype = ''.join(gene_id_ife).split("_")[2]
                    strand = list(item.strand)
                    nb_circ_rna = len(df.get_group(key))
                    nb_circ_sens = strand.count(strand_gene_ife)
                    nb_circ_antisens = nb_circ_rna - nb_circ_sens
                    nb_ccr = sum(item.nb_ccr)
                    exon_name = key
                    row = [gene_id, nb_gene, biotype, nb_circ_sens, nb_circ_antisens, nb_ccr, exon_name, exon_start, exon_end]
                    tsv_writer.writerow(row)

def check_file(file):
    with open(file, 'r') as f:
        if len(f.read()) <= 1:
            result="Empty"
        else:
            result="Not empty"
    return result


def main():

    # Get min_size:
    min_size = int(args.min_size)

    # Read input annotation circRNAs files:
    if check_file(args.input_meg_file) == "Not empty":
        df_meg = read_file(args.input_meg_file)
        # Meg circRNAs:
        ## Group by gene:
        df_meg = df_meg.groupby('gene_id_ife')
        write_summary_meg_table(df_meg, args.output_meg_file, min_size)
    else:
        open(args.output_meg_file, 'a').close()

    if check_file(args.input_intronic_file) == "Not empty":
        df_intronic = read_file(args.input_intronic_file)
        # Intronic circRNAs:
        ## Group by intron name:
        df_intronic = df_intronic.groupby('intron_name')
        write_summary_intronic_table(df_intronic, args.output_intronic_file, min_size)
    else:
        open(args.output_intronic_file, 'a').close()

    if check_file(args.input_pleg_file) == "Not empty":
        df_pleg = read_file(args.input_pleg_file)
        # Pleg circRNAs:
        ## Group by exon:
        df_pleg = df_pleg.groupby('exon_id_ife')
        write_summary_pleg_table(df_pleg, args.output_pleg_file, min_size)
    else:
        open(args.output_pleg_file, 'a').close()


def parse_arguments():
    parser = argparse.ArgumentParser(description='Return summary circRNAs tables')
    parser.add_argument('-ip', '--input_pleg_file', required=True,
                        help='Annotation pleg circRNAs file')
    parser.add_argument('-im', '--input_meg_file', required=True,
                        help='Annotation meg circRNAs file')
    parser.add_argument('-ii', '--input_intronic_file', required=True,
                        help='Annotation intronic circRNAs file')
    parser.add_argument('-op', '--output_pleg_file', required=False,
                        default="pleg_summary.tsv",
                        help='Table containing summary of pleg circRNAs annotation')
    parser.add_argument('-om', '--output_meg_file', required=False,
                        default="meg_summary.tsv",
                        help='Table containing summary of meg circRNAs annotation')
    parser.add_argument('-oi', '--output_intronic_file', required=False,
                        default="intronic_summary.tsv",
                        help='Table containing summary of intronic circRNAs annotation')
    parser.add_argument('-ms', '--min_size', required=False, default=55,
                        help='Minimum size of circRNAs')
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main()
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
                                     # [-fmt {bed,gtf}] [-o OUTPUT_FILE]

# Imports
import circRNA as circ
from collections import defaultdict
import sys, os, re, argparse, natsort
import numpy as np
import csv
import pybedtools as pybed

# Utility functions
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)
    dsp_control = circ.DisplayControl.instance()
    if dsp_control.verbose:
        print(*args,  file=sys.stderr, **kwargs)


def exonic_annotations(circ_rnas, exons):
    """
    Annotate the circrnas according to the exon annotations
    circ_rnas: an array of cirRNAs object
    exons: an array of circRNAs objects
    returns the circ_rnas array with the circRNAs annotated

    We annotate each circRNA's splice junction with the corresponding
    exons spice junctions using a dict of exons splice junctions
    This means that splice junction correspondances must be exact
    """
    eprint("\nCompute junction dictionary...\n")
    annotation_junctions = compute_junction_dict(exons)
    eprint("\nCircRNA annotation...\n")
    set_annotation(circ_rnas, annotation_junctions)

    return circ_rnas


def junction_key(chrom, pos):
    return "%s:%s" %(chrom, pos)


def compute_junction_dict(annot_array):
    """
    Compute a dictionnary of splic junctions:
    keys: chrom:pos
    values: list of tuples
    tuple: (exon, end_type)
    end_type: 3 (resp 5) for donor (resp acceptor)
    """
    junction_dict = defaultdict(list)
    seen_exons = defaultdict()
    for annot in annot_array:
        if annot.get_att("exon_id") in seen_exons:
            continue
        seen_exons[annot.get_att("exon_id")] = 1
        key_start = junction_key(annot.chrom, annot.start)
        key_end = junction_key(annot.chrom, annot.end)
        if annot.strand == "+":
            start_end_type = "5"
            end_end_type = "3"
        else:
            start_end_type = "3"
            end_end_type = "5"
        # annotating start
        junction_dict[key_start].append((annot, start_end_type))
        # annotating end
        junction_dict[key_end].append((annot, end_end_type))
    return junction_dict


def set_annotation(circ_rnas, annotation_junctions):
    """
    An annotation string would be very much appreciated
    """
    for circ_rna in circ_rnas:
        junction_start = junction_key(circ_rna.chrom, circ_rna.start)
        junction_end = junction_key(circ_rna.chrom, circ_rna.end)
        if junction_start in annotation_junctions:
            circ_rna.add_start_annotation(annotation_junctions[junction_start])
            circ_rna.add_start_transcript_id(annotation_junctions[junction_start])
            circ_rna.add_start_gene_id(annotation_junctions[junction_start])
        if junction_end in annotation_junctions:
            circ_rna.add_end_annotation(annotation_junctions[junction_end])
            circ_rna.add_end_transcript_id(annotation_junctions[junction_end])
            circ_rna.add_end_gene_id(annotation_junctions[junction_end])


def intronic_annotations(circ_rnas, exons):
    """
    Annotate the circrnas according to the introns annotations
    circ_rnas: an array of cirRNAs object
    exons: an array of circRNAs objects
    returns the circ_rnas array with the circRNAs intrno-annotated

    Because at least on circRNA junction is away from existing
    splice junctions, the annotation is solely based on overlap between
    the circRNA and an intron :
      - [start, end] interval must be within the intron
      - this interval must cover at least 90% of the intron
    """

    #eprint("\nCompute introns from exons...\n")
    transcripts = get_exons_per_transcript(exons)
    introns = circ.compute_intronic_positions(transcripts)

    #eprint("\nCompute intersection with circRNAs...\n")
    circ_rnas = annotate_intron_intersection(circ_rnas, introns)

    return circ_rnas


def annotate_intron_intersection(circ_rnas, introns):
    pybed_introns = annot_to_pybed(introns)
    pybed_circrnas = annot_to_pybed(circ_rnas)
    intersections = pybed_circrnas.intersect(pybed_introns, f=0.95, wo=True) 

    circrna_d = {circ.name: circ for circ in circ_rnas}
    introns_d = {intron.name: intron for intron in introns}
    for i in intersections:
        intron = introns_d[i[9]]
        circ = circrna_d[i[3]]
        overlap = i[-1]
        p_overlap = float(overlap)/intron.length # taille circ sur taille intron
        circ.annotate_intron(intron, p_overlap)

    return circ_rnas


def annot_to_pybed(annots):
    pybed_annot = []
    for a in annots:
        pybed_annot.append(a.topybed())
    return pybed.BedTool(pybed_annot).sort()


def infra_exonic_annotations(circ_rnas, exons):
    """
    """
    pybed_exons = annot_to_pybed_ife(exons)
    pybed_circrnas = annot_to_pybed_ife(circ_rnas)  
    intersections = pybed_circrnas.intersect(pybed_exons, f=0.95, wo=True)

    circrna_d = {circ.name: circ for circ in circ_rnas}
    exons_d = {exon.name: exon for exon in exons}
    for i in intersections:
        exon = exons_d[i[9]]
        circ = circrna_d[i[3]]
        circ.annotate_ife(exon)

    return circ_rnas


def annot_to_pybed_ife(annots):
    pybed_annot = []
    for a in annots:
        pybed_annot.append(a.topybed_ife())
    return pybed.BedTool(pybed_annot).sort()


def get_exons_per_transcript(annots):
    """
    returns a dictionnary
    key: ...
    """
    d_transcript = defaultdict(list)
    for exon in annots:
        transcript_id = exon.get_att("transcript_id")
        d_transcript[transcript_id].append(exon)
    return d_transcript


def write_annotated_circrnas(circ_annotated, outputfile):
    circ_rnas = natsort.natsorted(circ_annotated, key=lambda e: (e.chrom, e.start))
    header = "\t".join(["chrom", "start", "end", "strand", "nb_ccr", "circ_rna_name", "exons_id_start", "exons_id_end", 
                        "transcript_id_start", "transcript_id_end", "gene_id_start", "gene_id_end", 
                        "intron_name", "start_i", "end_i", "gene_id_i", "exon_id_i", 
                        'exon_id_ife', "exons_start_end_ife", "gene_id_ife"])
    with open(outputfile, "w") as fout:
        fout.write(header+"\n")
        for circ_rna in circ_rnas:
            intron_annot = circ_rna.intron_annotation 
            if len(intron_annot) == 0:
                start_i = ""
                end_i = ""
                name_i = ""
            else:
                for annot_i in intron_annot:
                    intron = annot_i[0]
                    start_i = intron.start
                    end_i = intron.end
                    name_i = intron.name
            att_d = dict(re.split(" |=", item) for item in circ_rna.attributes.split("; "))
            nb_ccr = att_d["nb_ccr"].replace('\"', '')
            s = [circ_rna.chrom, circ_rna.start, circ_rna.end, 
                circ_rna.strand, nb_ccr, circ_rna.name,
                circ_rna.get_start_annotation_str(), 
                circ_rna.get_end_annotation_str(),
                circ_rna.get_start_transcript_id_str(), 
                circ_rna.get_end_transcript_id_str(), 
                circ_rna.get_start_gene_id_str(), 
                circ_rna.get_end_gene_id_str(),
                name_i, start_i, end_i, 
                circ_rna.get_intron_annot_gene_str('gene_id'),
                circ_rna.get_intron_annot_str('exon_id'), 
                circ_rna.get_infra_exonic_annot_str("exon_id"),
                circ_rna.get_infra_exonic_exons_start_end(),
                circ_rna.get_infra_exonic_gene_annot_str("gene_id")]
            fout.write( "%s\n" % "\t".join(map(str, s)))


def main():

    # Exonic:
    eprint("\nIdentification of EXONIC circRNAs:\n")
    eprint("\nReading circRNA file...\n")
    circ_rnas = circ.read_annotation(args.circ_rna_file, fmt=args.annot_format)
    eprint("\nReading annotation file...\n")
    annots = circ.read_annotation(args.annotation_file, fmt=args.annot_format)

    eprint("\nExonic annotations..\n")
    circ_rnas = exonic_annotations(circ_rnas, annots)

    eprint("\nIntronic annotations..\n")
    circ_rnas = intronic_annotations(circ_rnas, annots)

    #####
    eprint("\nInfra-exonic annotations..\n")
    circ_rnas = infra_exonic_annotations(circ_rnas, annots)
    #####

    eprint("\nWrite annotations..\n")
    write_annotated_circrnas(circ_rnas, args.output)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Annotation')
    parser.add_argument('-circ', '--circ_rna_file',
                        required=True, help='Circular RNA file path')
    parser.add_argument('-fmt', '--annot_format',
                        required=False, default="bed", choices=["bed","gtf"],
                        help='Annotation file format')
    parser.add_argument('-annot', '--annotation_file',
                        required=True, help='Annotataion file path')
    parser.add_argument('-oe', '--exonic_output_file',
                        required=False, default='exonic_circ_rnas_annot.out',
                        help='Exonic output file path')
    parser.add_argument('-oi', '--intronic_output_file',
                        required=False, default='annotation_introns.bed',
                        help='Intronic output file path')
    parser.add_argument('-o', '--output',
                        required=True,
                        help='Annotation output file path')
    parser.add_argument('--verbose', help='Print more info', action='store_true')

    args = parser.parse_args()

    dsp_control = circ.DisplayControl.instance()
    dsp_control.verbose = args.verbose

    return args


if __name__ == '__main__':
    args = parse_arguments()
    main()
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
                                    # [-min_cr CR_THRESHOLD] [-tol TOLERANCE]
                                    # [-fmt {bed,gtf}] [-o OUTPUT_FILE]

# Imports
import sys, os, re, argparse, natsort, time
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
from collections import defaultdict
import pandas as pd
import numpy as np
from networkx import (draw, DiGraph, Graph)
import networkx as netwx
import statistics
from statistics import mode
from time import sleep
from tqdm import tqdm
import circRNA as circ


# Utility functions
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)
    dsp_control = circ.DisplayControl.instance()
    if dsp_control.verbose:
        print(*args,  file=sys.stderr, **kwargs)


def merge_junctions(df_a):
    """ Merge two dataframes
    :df_r1: dataframe of R1 reads
    :df_r2: dataframe of R2 reads
    :return the merged dataframe
    """
    df_merged = pd.concat(df_a,ignore_index=True)
    return df_merged


def read_junction_file(file, read_type=None):
    """ Read the data, adds the header of the table
        and adds a column with the corresponding read type.
    :file: file to read
    :read_type: read type (R1 or R2)
    :return the dataframe of reads
    """
    names = ["chr_donor", "pos_donor","strand_donor","chr_acceptor",
            "pos_acceptor","strand_acceptor","junction_type",
            "repeat_left","repeat_right","read_name","pos_s1",
            "CIGAR_s1","pos_s2","CIGAR_s2"] # Columns names of the input data
    # Read data as table with pandas:
    df = pd.read_table(file, sep = '\t', names=names, dtype=str,
                       converters = {'pos_acceptor':int, 'pos_donor':int})
    df['read_type'] = read_type # Add column "read" to specify the read type
    return df


def get_valid_circjunctions(df):
    """ Selects valid circular chimeric reads (ccr)
        according to the conditions of the "valid_ccr" function.
    :df: datatframe to filter
    return: the filtered dataframe
    """
    # Parsing dataframe to select valid rows
    # Boolean array with true if the read is a valid ccr:
    valid_ccrs = []
    #for index, row in tqdm(df.iterrows(), total=df.shape[0]):
    for index, row in circ.tqdm_wrapper(df):
        valid_ccrs.append(valid_ccr(row))
    df_circ_cr = df[valid_ccrs]
    ccr_a = [] # Array of circular junctions
    eprint(" "+"\n"+"Generating array of selected ccr..."+"\n")
    #for index, row in tqdm(df_circ_cr.iterrows(), total=df_circ_cr.shape[0]):
    for index, row in circ.tqdm_wrapper(df_circ_cr):
        ccr =  circ.CCR(row) # Create CCR object
        ccr_a.append(ccr) # Add chimeric reads into array
    eprint("\n")
    return ccr_a


def valid_ccr(row):
    """ Select ccr according to several conditions.
    :row: row of the dataframe
    :return boolean "False" for not valid ccr
            or boolean "True" for valid ccr
    """
    # Regular expression of the left dangling CIGAR:
    dangling_left = re.compile('^\d+S\d+M$')
    # Regular expression of the rigth dangling CIGAR:
    dangling_right = re.compile('^\d+M\d+S$')
    if (row['chr_donor']!=row['chr_acceptor']
            or row['strand_donor']!=row['strand_acceptor']):
        return False
    elif abs(row['pos_donor'] - row['pos_acceptor']) > 10000000:
        return False
    else:
        if row['strand_donor'] == '-' :
            if (row['pos_donor']>row['pos_acceptor']
                    or row['pos_s1']>row['pos_s2']):
                return False
            else:
                if (not dangling_left.match(row['CIGAR_s1'])
                        or not dangling_right.match(row['CIGAR_s2'])):
                    return False
                else:
                    return True
        elif row['strand_donor'] == '+' :
            if (row['pos_donor']<row['pos_acceptor']
                    or row['pos_s1']<row['pos_s2']):
                return False
            else:
                if (not dangling_right.match(row['CIGAR_s1'])
                        or not dangling_left.match(row['CIGAR_s2'])):
                    return False
                else:
                    return True


def get_exact_circrnas(circ_junctions, cr_threshold):
    """
    Detects circ RNAs as CR sharing exactly the same circ signature
    chrom start end strand
    """
    eprint("Exact circular RNA detection")
    circ_rna_dict = defaultdict(list)
    for circ_junc in circ_junctions:
        circ_rna_dict[circ_junc.key].append(circ_junc)

    circ_array = []
    for key, cr_array in circ_rna_dict.items():
        if len(cr_array) >= cr_threshold:
            circ_array.append(circ.CircRNA(args.output_file_format, ccr_array=cr_array))
    return circ_array


def same_circ_signature(cr_a, cr_b, tol):
    """ Two-by-two comparison of chimeric reads
    :cr_a: chimeric read a
    :cr_b: chimeric read b
    :tol: number of different start-end positions tolerated
    :return boolean "False" for not valid conditions
    and "True" for valid conditions
    """
    if cr_a.chrom != cr_b.chrom:
        return False
    if abs(cr_a.start - cr_b.start) > tol:
        return False
    if abs(cr_a.end - cr_b.end) > tol:
        return False
    if cr_a.read_type == cr_b.read_type:
        if cr_a.strand != cr_b.strand:
            return False
    else:
        if cr_a.strand == cr_b.strand:
            return False
    return True


def merge_circ_rnas(component):
    if len(component) == 1:
        return component[0]
    elements = []
    for circ_rna in component:
        elements.extend(circ_rna.ccr_array)
    return circ.CircRNA(args.output_file_format, ccr_array=elements)


def connected_components(circ_a, neighbours, cr_threshold):
    """ Generate an undirected graph and
        cluters ccr into connected components
    :ccr_a: array of ccr object
    :neighbours: dictionary of neighbours chimeric reads
    :return an array of circular RNAs
    """
    undirected = netwx.Graph()
    undirected.add_nodes_from(range(len(circ_a)))
    undirected.add_edges_from(neighbours)
    circ_rnas = []  # Circular RNAs array
    # Parsing of connected components composed of chimeric reads:
    for comp in netwx.connected_components(undirected):
        circ_rna = merge_circ_rnas([circ_a[i] for i in comp])
        if circ_rna.nb_ccr >= cr_threshold:
            circ_rnas.append(circ_rna)
    return circ_rnas


def cumulative_cr(circ_a):
    return sum([c.nb_ccr for c in circ_a])


def compute_fuzzy_circ_rnas(ccr_a, cr_threshold, tolerance):
    """ Get the circular RNA corresponding to each
        connected component of chimeric reads
    :ccr_a: array of ccr object
    :return array of circular RNAs
    """
    if cumulative_cr(ccr_a) < cr_threshold:
        return []
    n = len(ccr_a)
    # Neighbour cr share sign the same circular RNA
    neighbours = []
    circ_rnas = []
    for i in range(n):
        for j in range(i+1, n):
            if ccr_a[j].start > ccr_a[i].end:
                break
            # Two-by-two comparison of chimeric reads:
            if same_circ_signature(ccr_a[i], ccr_a[j], tolerance):
                neighbours.append((i, j))
    circ_rnas = connected_components(ccr_a, neighbours, cr_threshold)
    return circ_rnas


def get_independent_intervals(circ_rnas):
    # Get independant intervals
    independant_intervals = []
    circ_interval = []
    chrom = None
    for circ_rna in circ_rnas:
        if not chrom:
            chrom = circ_rna.chrom
            max_end = circ_rna.end
        if circ_rna.chrom != chrom:
            independant_intervals.append(circ_interval)
            circ_interval = []
            chrom = circ_rna.chrom
            max_end = circ_rna.start
        if circ_rna.chrom != chrom or circ_rna.start > max_end:
            independant_intervals.append(circ_interval)
            circ_interval = []
            chrom = circ_rna.chrom
            max_end = circ_rna.start
        max_end = max(max_end, circ_rna.end)
        circ_interval.append(circ_rna)
    independant_intervals.append(circ_interval)
    return independant_intervals


def get_fuzzy_circrnas(circ_rnas, cr_threshold, tolerance):
    eprint("Fuzzy circular RNA detection")
    circ_rnas = natsort.natsorted(circ_rnas, key=lambda e: (e.chrom,
                                                            e.start))
    intervals = get_independent_intervals(circ_rnas)
    circ_rnas = []
    for interval in intervals:
        fuzzy_circ_rnas = compute_fuzzy_circ_rnas(interval, cr_threshold,
                                                  tolerance)
        if fuzzy_circ_rnas:
            circ_rnas.extend(fuzzy_circ_rnas)
    return circ_rnas


def circrna_detection(circ_juntcions, cr_threshold, tolerance):
    """ Detecte circular RNAs
    :df_ccr: dataframe of circular chimeric reads
    :return the array of circular RNAs detected
    """
    # First pass, detect exact circular RNAs (no tolerance)
    circ_rnas = get_exact_circrnas(circ_juntcions, cr_threshold)
    eprint("nb of tol %d circRNAS %d" % (tolerance, len(circ_rnas)))#########
    # if tolerance > 0 compute fuzzy circrnas from previous list
    if tolerance > 0:
        circ_rnas = get_fuzzy_circrnas(circ_rnas, cr_threshold, tolerance)
    else:
        circ_rnas = [c for c in circ_rnas if c.nb_ccr >= cr_threshold]
    return circ_rnas


def write_tab_circ_rnas(circ_rnas, outputfile, fmt):
    """ Write the table of circ RNA retained
    :circ_rnas: array of circular RNAs object
    :outputfile: the output file path
    :return the table of circular RNAs retained
    :start Chimeric.out.junction file: last intron base in
    the reference genome order in the circular transcript
    :end Chimeric.out.junction file: first intron base in the
    reference genome order in the circular transcript
    :start GTF: first exon position
    :end GTF: last exon position
    """
    nb_line = 0 # name bed file
    circ_rnas = natsort.natsorted(circ_rnas, key=lambda e: (e.chrom, e.start))
    eprint(" "+"\n"+"Writing the table of circular RNAs..."+"\n")
    with open(outputfile, "w") as fout:
        eprint("\nConvertion the Chimeric.out.junction format to the %s format...\n" % fmt)
        for circ_rna in circ_rnas:
            # Convert Chimeric.out.junction format to out format:
            circ_rna.start += 1
            circ_rna.end -= 1
            # Convert out format to the chosen format:
            circ_rna.write_annot(nb_line, fout, fmt)
            nb_line += 1


def stats(min_cr, tol, nb_ccr_tot, nb_ccr, nb_circ_rna_detected, circ_rnas):
          nb_tot_cr_circ = sum([c.nb_ccr for c in circ_rnas])
          return "%d\t%d\t%d\t%d\t%d\t%d" % (min_cr, tol, nb_ccr_tot, nb_ccr,
                                             nb_circ_rna_detected, nb_tot_cr_circ)


def stats2(df):
    return df.groupby('nb_ccr')['nb_ccr'] \
           .count().reset_index(name='nb_circ_rnas')


def main(r1_input_file, r2_input_file, cr_threshold, tolerance,
         output_file):

    eprint("\nReading junction files...\n")
    df_junctions_a = []
    if r1_input_file:
        df_junctions_a.append(read_junction_file(r1_input_file, read_type="R1"))
    if r2_input_file:
        df_junctions_a.append(read_junction_file(r2_input_file, read_type="R2"))

    eprint("Merging junction files...\n")
    df = merge_junctions(df_junctions_a)
    eprint("%d chimeric reads \n" % len(df.index))

    # Filtering data:
    eprint("Selecting valid ccr...\n")
    circ_junctions = get_valid_circjunctions(df)
    eprint("%d circular chimeric reads" % len(circ_junctions))

    # Circular RNAs detection:
    circ_rnas = circrna_detection(circ_junctions, cr_threshold, tolerance) #### 'circRNA' object has no attribute 'ccr_array'
    eprint("%d circRNAs detected" % len(circ_rnas))

    #for cir in circ_rnas:
    #    print(cir.var_info_str())

    # Write the table of circular RNAs retained:
    circ.write_annotation(circ_rnas, output_file, fmt=args.output_file_format,
                          attributes=["nb_ccr", "genomic_size", "left",
                                      "right", "complete", "nb_distinct"])


def parse_arguments():
    parser = argparse.ArgumentParser(description='Identification of chimeric reads and circular RNAs')
    parser.add_argument('-r1', '--r1_input_file',
                        required=False, help='R1 input file name')
    parser.add_argument('-r2', '--r2_input_file',
                        required=False, help='R2 input file name')
    parser.add_argument('-min_cr', '--cr_threshold', type=int,
                        required=False, default=5,
                        help='Minimum number of chimeric reads')
    parser.add_argument('-tol', '--tolerance', type=int, required=False,
                        default=3,
                        help='Tolerance of different positions to start and end')
    parser.add_argument('-fmt', '--output_file_format',
                        required=False, default='gtf', choices=["bed","gtf"], # bed ou gtf
                        help='Output file format : 0-based (bed) or 1-based (gtf)')
    parser.add_argument('-o', '--output_file', required=False, default='circ_rnas_detected.out',
                        help='Output file path') # gft par défaut # attribut : nb ccr, min_cr_distinct
    parser.add_argument('--verbose', help='Print more info', action='store_true')

    args = parser.parse_args()

    dsp_control = circ.DisplayControl.instance()
    dsp_control.verbose = args.verbose

    if not (args.r1_input_file or args.r2_input_file):
        parser.error('At least one input file is requested')
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main(args.r1_input_file, args.r2_input_file,
         args.cr_threshold, args.tolerance,
         args.output_file)
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
import os
import argparse
import pandas as pd
import numpy as np
import re
import sys
from collections import defaultdict


def read_file(file):
    """ Read the circ_rnas_annotation.out file and return a pandas dataframe"""
    header = ["chrom", "start", "end", "circ_name", "nb_ccr", 
              "strand", "source", "feature", "phase", "attributes"]
    df = pd.read_table(file, sep='\t', names=header)
    df.replace(np.nan, "", inplace=True)
    return df


def get_folder_to_create(df): 
    folders_name = []
    dict_spl_sp = dict(zip(df["sample"], df["species"]))
    split_keys = [x.split("_") for x in dict_spl_sp.keys()]   

    for key in split_keys:
        folders_name.append(key[0]+"_"+key[1])
    folders = list(set(folders_name))
    return folders


def create_folders(folders):
    for folder in folders:
        if not os.path.exists(folder):
            os.makedirs(folder)


def filter_samples(df):
    df = df.drop(df[(df["sample"].isin(["ssc_liver_3", "ssc_liver_4", 
                                        "ssc_muscle_1", "ssc_testis_1"]))].index)
    df = df.reset_index(drop=True)
    return df


def main():

    # Read samples.tsv file:
    df_samples = pd.read_csv(args.sample_file, sep='\t')

    # Remove ssc_liver_3/4, ssc_muscle_1 and ssc_testis_1 samples:
    df_filtered = filter_samples(df_samples)

    samples_folders = df_filtered["sample"].tolist()
    df_filtered = df_filtered.loc[df_filtered['sample'].isin(samples_folders)] 

    # Get the folder list to create:
    folders = get_folder_to_create(df_filtered)

    # Create folders:
    create_folders(folders)

    # Group folders by species and tissue:
    df_grouped = df_filtered.groupby(["species", "tissue"])

    for key,item in df_grouped:

        group = df_grouped.get_group(key) 
        # Create col with path of folder:
        group["path_file_bed"] = group["sample"]+"/auzeville.bed"

        samples = group["sample"].tolist()
        samples_split = [s.split("_") for s in samples][0]
        folder_name = samples_split[0]+"_"+samples_split[1]

        paths = group["path_file_bed"].tolist()

        df_concat = pd.concat(read_file(f) for f in paths)
        df_concat = df_concat.sort_values(by=['chrom', 'start', 'end'])
        df_final = df_concat.groupby(['chrom', 'start', 'end', 'strand'], as_index=False)['nb_ccr'].sum()
        df_final.to_csv(folder_name+"/"+"auzeville.bed", sep="\t", index=False)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Annotation')
    parser.add_argument('-sp', '--sample_file',
                        required=True, help='Sample file path')
    parser.add_argument('-circ', '--circ_rna_file',
                        required=False, help='Circular RNA file path')
    args = parser.parse_args()
    return args

if __name__ == '__main__':
    args = parse_arguments()
    main()
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
import os, re, sys, csv, argparse
import circRNA as circ
import pandas as pd
import numpy as np

# Utility functions:
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)

def get_sample(file):
    """ Get the sample name from the file path"""
    sample_name = os.path.split(os.path.dirname(file))[1]
    return sample_name

def read_file(file):
    """ Read the circ_rnas_annotation.out file and return a pandas dataframe"""
    df = pd.read_table(file, sep='\t')
    df.replace(np.nan, "", inplace=True)
    return df

def intersection(lst1, lst2):
    """ Return the intersection between two lists"""
    return list(set(lst1) & set(lst2))


def get_exonic_circrnas(df):
    """ Get statistics about exonic circRNAs from the annotation_circRNA.out file"""
    # Arrays, dicts:
    exonic_circ_names, monoexonic_circ_names, true_exonic, single_end_circ_names = [], [], [], []
    ccr_start_end_exonic, ccr_single_annot = dict(), dict()
    # Counter exonic circRNAs:
    nb_monoexonic, nb_start_end_exonic, nb_antisens_exonic, nb_single_annotated_junction = 0, 0, 0, 0

    for index, row in df.iterrows():

        exon_id_start = row.exons_id_start.split(",")
        exon_start = str(exon_id_start[0])
        exon_start_tmp = exon_start.split("_")
        exon_start = exon_start_tmp[0]
        exon_id_end = row.exons_id_end.split(",")
        exon_end = str(exon_id_end[0])
        exon_end_tmp = exon_end.split("_")
        exon_end = exon_end_tmp[0]

        if row.nb_ccr >= 5:
            # if ((len(row.exons_id_start) > 0 or len(row.exons_id_end) > 0)
            #     and len(row.intron_name) == 0):
            if len(row.exons_id_start) > 0 or len(row.exons_id_end) > 0:
                if row.strand == "+":
                    if ("5" and "+") in row.exons_id_start:
                        if len(row.exons_id_end)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                        if ("3" and "+") in row.exons_id_end:
                            nb_start_end_exonic += 1
                            true_exonic.append(row)
                            exonic_circ_names.append(row.circ_rna_name)
                            ccr_start_end_exonic[row.circ_rna_name] = row.nb_ccr
                            if ((len(exon_id_start)==1 and len(exon_id_end)==1) and (exon_start == exon_end)):
                                nb_monoexonic += 1
                                monoexonic_circ_names.append(row.circ_rna_name)
                    if ("3" and "+") in row.exons_id_end:
                        if len(row.exons_id_start)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                    if ("3" and "-") in row.exons_id_start:
                        if ("5" and "-") in row.exons_id_end:
                            nb_antisens_exonic += 1
                elif row.strand == "-":
                    if ("5" and "-") in row.exons_id_end:
                        if len(row.exons_id_start)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                        if ("3" and "-") in row.exons_id_start:
                            nb_start_end_exonic += 1
                            true_exonic.append(row)
                            exonic_circ_names.append(row.circ_rna_name)
                            ccr_start_end_exonic[row.circ_rna_name] = row.nb_ccr
                            if ((len(exon_id_start)==1 and len(exon_id_end)==1) and (exon_start == exon_end)):
                                nb_monoexonic += 1
                                monoexonic_circ_names.append(row.circ_rna_name)
                    if ("3" and "-") in row.exons_id_start:
                        if len(row.exons_id_end)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                    if ("5" and "+") in row.exons_id_start:
                        if ("3" and "+") in row.exons_id_end:
                            nb_antisens_exonic += 1
    return true_exonic, nb_start_end_exonic, nb_single_annotated_junction, single_end_circ_names, ccr_single_annot, monoexonic_circ_names, exonic_circ_names


def get_intronic_circrnas(df, exonic_circrnas):
    # Arrays, dicts:
    intronic_circ_names, true_intronic = [], []
    ccr_true_intronic = dict()
    # Counter intronic circRNAs:
    nb_true_intronic = 0

    for index, row in df.iterrows():
        if row.circ_rna_name not in exonic_circrnas:
            if row.nb_ccr >= 5:
                if len(row.intron_name) > 0:
                    if row.strand == "+":
                        if (row.end_i - row.end) in range(-5,60):
                            if (row.start - row.start_i) in range(-5,5) or (row.start==row.start_i):
                                nb_true_intronic += 1
                                true_intronic.append(row)
                                ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                                intronic_circ_names.append(row.circ_rna_name)
                        elif (row.start == row.start_i and (row.end_i - row.end) > 32):
                            nb_true_intronic += 1
                            true_intronic.append(row)
                            ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                            intronic_circ_names.append(row.circ_rna_name)
                    elif row.strand == "-":
                        if (row.start - row.start_i) in range(-5,60):
                            if ((row.end - row.end_i) in range(-5,5) or (row.end == row.end_i)):
                                nb_true_intronic += 1
                                true_intronic.append(row)
                                ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                                intronic_circ_names.append(row.circ_rna_name)
                        elif (row.end == row.end_i and (row.start - row.start_i) > 60):
                            nb_true_intronic += 1
                            true_intronic.append(row)
                            ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                            intronic_circ_names.append(row.circ_rna_name)
    return true_intronic, nb_true_intronic, ccr_true_intronic, intronic_circ_names


def get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names):
    # Subexonic_meg : sens + antisens
    # Subexonic_pleg : sens
    # Arrays, dicts:
    infraexonic_tot_names, infraexonic_circ_names, subexonic, subexonic_antisens_names = [], [], [], []
    ccr_infraexonic = dict()
    # Counter subexonic circRNAs:
    nb_infraexonic, nb_infraexonic_tot, nb_infraexonic_sens, nb_infraexonic_antisens = 0, 0, 0, 0

    for index, row in df.iterrows():
        if row.nb_ccr >= 5:
            if ((len(row.gene_id_ife) > 0) and (row.circ_rna_name not in monoexonic_circ_names)):
                if (("_5_c_+" not in row.exons_id_start) and ("_5_lnc_+" not in row.exons_id_start) and
                    ("_3_c_-" not in row.exons_id_start) and ("_3_lnc_-" not in row.exons_id_start) and
                    ("_3_c_+" not in row.exons_id_end) and ("_3_lnc_+" not in row.exons_id_end) and
                    ("_5_c_-" not in row.exons_id_end) and ("_5_lnc_-" not in row.exons_id_end)):
                    if len(row.gene_id_start) == 0 and len(row.gene_id_end) == 0:
                        nb_infraexonic_tot += 1
                        infraexonic_tot_names.append(row.circ_rna_name)
                        if row.circ_rna_name not in intronic_circ_names:
                            if row.strand == "+":
                                if "+" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                elif "-" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
                            elif row.strand == "-":
                                if "-" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                elif "+" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
                    elif ((len(row.gene_id_start) > 0 and len(row.gene_id_end) == 0) or
                            (len(row.gene_id_start) == 0  and len(row.gene_id_end) > 0)):
                        nb_infraexonic_tot += 1
                        infraexonic_tot_names.append(row.circ_rna_name)
                        if row.circ_rna_name not in intronic_circ_names:
                            if row.strand == "+":
                                if "+" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                elif "-" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
                            elif row.strand == "-":
                                if "-" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                elif "+" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
    return subexonic, infraexonic_circ_names, nb_infraexonic_sens, ccr_infraexonic, subexonic_antisens_names


def get_circrnas(df):
    """ Read the annotation_circRNA.out dataframe and return all statistics to tabular
        format about circRNAs"""
    # Get exonic circRNAs:
    stats_exonic_circrnas = get_exonic_circrnas(df)
    exonic = stats_exonic_circrnas[0]
    monoexonic_names = stats_exonic_circrnas[5]
    exonic_names = stats_exonic_circrnas[6]
    # Get intronic circRNAs:
    stats_intronic_circrnas = get_intronic_circrnas(df, exonic_names)
    intronic = stats_intronic_circrnas[0]
    intronic_names = stats_intronic_circrnas[3]
    # Get subexonic circRNAs:
    stats_subexonic_circrnas = get_subexonic_circrnas(df, monoexonic_names, intronic_names)
    subexonic = stats_subexonic_circrnas[0]
    subexonic_antisens = stats_subexonic_circrnas[4]

    return exonic, subexonic, intronic, subexonic_antisens


def get_stats_circrnas(sample, df, output_file_name):
    """ Read the annotation_circRNA.out dataframe and return all statistics to tabular
        format about circRNAs"""
    nb_circ_tot = len(df)
    # Get circRNAs:
    exonic = get_circrnas(df)[0]
    subexonic = get_circrnas(df)[1]
    subexonic_antisens = get_circrnas(df)[3]
    intronic = get_circrnas(df)[2]

    # Get exonic circRNAs stats:
    stats_exonic = write_comparison_exonic_table(sample, exonic, output_file_name)

    # Get subexonic circRNAs stats:
    nb_subexonic = get_nb_type(sample, subexonic, subexonic_antisens)[0]
    nb_gene_subexonic = get_nb_type(sample, subexonic, subexonic_antisens)[1]
    nb_subexonic_pleg = get_nb_type(sample, subexonic, subexonic_antisens)[2]
    nb_subexonic_meg = get_nb_type(sample, subexonic, subexonic_antisens)[3]
    nb_ccr_pleg = get_nb_type(sample, subexonic, subexonic_antisens)[4]
    nb_ccr_meg = get_nb_type(sample, subexonic, subexonic_antisens)[5]


    if len(stats_exonic) > 5:
        nb_c = stats_exonic[0]
        nb_lnc = stats_exonic[1]
        nb_autres = stats_exonic[2]
        nb_start_end_exonic_selected = stats_exonic[3]
        nb_ccr_start_end_exonic = stats_exonic[4]
        nb_exonic = write_comparison_exonic_table(sample, exonic, args.output_comp_exonic_file)[5]
    else:
        nb_c = stats_exonic[0]
        nb_lnc = 0
        nb_autres = stats_exonic[1]
        nb_start_end_exonic_selected = stats_exonic[2]
        nb_ccr_start_end_exonic = stats_exonic[3]
        nb_exonic = write_comparison_exonic_table(sample, exonic, args.output_comp_exonic_file)[4]

    nb_start_end_exonic = get_exonic_circrnas(df)[1]
    nb_single_annotated_junction = get_exonic_circrnas(df)[2]
    monoexonic_circ_names =  get_exonic_circrnas(df)[5]
    exonic_names = get_exonic_circrnas(df)[6]
    intronic_circ_names = get_intronic_circrnas(df, exonic_names)[3]
    infraexonic_circ_names = get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names)[1]
    single_end_circ_names = get_exonic_circrnas(df)[3]
    nb_infraexonic_sens = get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names)[2]
    nb_true_intronic = get_intronic_circrnas(df, exonic_names)[1]
    ccr_single_annot = get_exonic_circrnas(df)[4]
    ccr_true_intronic = get_intronic_circrnas(df, exonic_names)[2]
    ccr_infraexonic = get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names)[3]

    nb_start_end_false_exonic = nb_start_end_exonic - nb_start_end_exonic_selected
    nb_tot_exonic = nb_start_end_exonic + nb_single_annotated_junction
    nb_common_infra_single = len(intersection(infraexonic_circ_names, single_end_circ_names))
    nb_infraexonic = nb_infraexonic_sens - nb_common_infra_single
    nb_circ_non_annotated = nb_circ_tot - (nb_tot_exonic + nb_infraexonic + nb_true_intronic - nb_start_end_false_exonic)
    nb_ccr_single_annot = sum(ccr_single_annot.values())
    nb_ccr_intronic = sum(ccr_true_intronic.values())
    nb_ccr_infraexonic = sum(ccr_infraexonic.values())

    return "\t".join(map(str,[sample, nb_circ_tot, nb_exonic, nb_lnc, nb_autres, nb_ccr_start_end_exonic,
                              nb_subexonic_pleg, nb_ccr_pleg, nb_subexonic_meg, nb_ccr_meg, nb_gene_subexonic,
                              nb_true_intronic, nb_ccr_intronic]))+"\n"


def write_stats_table(stats, output_file):
    with open(output_file, "w") as fout:
        fout.write(stats)


def write_circ_table(self, header, path, index=None, sep="\t", na_rep='', float_format=None,
                     index_label=None, mode='w', encoding=None, date_format=None, decimal='.'):
    """
    Write a circRNAs list to a tabular-separated file (tsv).
    """
    from pandas.core.frame import DataFrame
    df = DataFrame(self)
    # result is only a string if no path provided, otherwise None
    result = df.to_csv(path, index=index, sep=sep, na_rep=na_rep, float_format=float_format,
                       header=header, index_label=index_label, mode=mode, encoding=encoding,
                       date_format=date_format, decimal=decimal)
    if path is None:
        return result


def write_comparison_exonic_table(sample, circ_rnas, output_file_name):
    # Write exonic table for comparaison between tissues/species:
    df_circ_rnas = pd.DataFrame(circ_rnas, index=None)
    header = ["chrom:start-end,strand", "nb_ccr", "gene_id", "biotype", "genomic_size"]
    nb_start_end_annotated, nb_exonic = 0, 0
    nb_ccr_exonics, biotypes = [], []

    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        for index, row in df_circ_rnas.iterrows():
            exons_id_start_a, exons_id_end_a = [], []
            # Select well-annotated true exonic circRNAs bases on the gene_id:
            # Get key:
            chrom_start = ":".join(map(str,[row.chrom, row.start]))
            chrom_start_end = "-".join(map(str, [chrom_start , row.end]))
            chrom_start_end_strand = ",".join(map(str, [chrom_start_end , row.strand]))
            # Get the ccr number:
            nb_ccr = row.nb_ccr
            nb_ccr_exonics.append(nb_ccr)
            # Get gene_id:
            genes_id_start = list(set(list(row.gene_id_start.split(","))))
            genes_id_end = list(set(list(row.gene_id_end.split(","))))
            # Get biotype:
            exons_id_start = list(set(list(row.exons_id_start.split(","))))
            exons_id_end = list(set(list(row.exons_id_end.split(","))))
            exons_id_start = list(i.split("_") for i in exons_id_start)
            exons_id_end = list(i.split("_") for i in exons_id_end)
            intersect_gene_id = ",".join(list(set(genes_id_start).intersection(genes_id_end)))
            if len(intersect_gene_id) > 0:
                nb_start_end_annotated += 1
                # Get biotype:
                biotypes_start = list(set(list(i[2] for i in exons_id_start)))
                biotypes_end = list(set(list(i[2] for i in exons_id_end)))
                intersect_biotypes = "".join(list(set(biotypes_start).intersection(biotypes_end)))
                # Write line into the output file:
                nb_exonic += 1
                genomic_size = row.end - row.start
                s = [chrom_start_end_strand, nb_ccr, intersect_gene_id, intersect_biotypes, genomic_size]
                tsv_writer.writerow(s)
                biotypes.append(intersect_biotypes)
    # Dictionary with key = biotype and value = counter:
    d = {}
    d["sample"] = sample
    for biotype in biotypes:
        d[biotype] = d.get(biotype, 0) + 1
    values = []
    valid_keys = ["c", "lnc", "sample"]
    for key, value in d.items():
        if key not in valid_keys:
            values.append(value)
    d["autres"] = sum(values)
    nb_ccr_exonic = sum(nb_ccr_exonics)
    for key, value in d.items():
        if ("c" in d.keys()) and ("lnc" in d.keys()):
            return d["c"], d["lnc"], d["autres"], nb_start_end_annotated, nb_ccr_exonic, nb_exonic
        elif ("c" in d.keys() and "lnc" not in d.keys()):
            return d["c"], d["autres"], nb_start_end_annotated, nb_ccr_exonic, nb_exonic


def compute_size(exons_start_end):
    pos_start_end = list(exons_start_end.split("_"))
    pos_start = int(pos_start_end[0])
    pos_end = int(pos_start_end[1])
    size = pos_end - pos_start
    return size

def get_true_exons_gene_id(exons_start_end, exons_id, gene_id):
    if len(exons_start_end)==1:
        exon_start_end = ",".join(exons_start_end)
        exon_id = ",".join(exons_id)
        gene_id = gene_id
    else:
        annot = dict(zip(exons_start_end, exons_id))
        sizes = []
        for key in annot:
            size = compute_size(key)
            sizes.append(size)
        values = zip(exons_start_end, sizes)
        annot_f = dict(zip(exons_id, values))
        min_size = min([i[1] for i in list(annot_f.values())])
        for key, values in annot_f.items():
            if values[1] == min_size:
                exon_id = key
                exon_start_end = values[0]
    return exon_start_end, exon_id, gene_id


def get_nb_type(sample, circ_rnas, subexonic_antisens):
    # Write exonic table for comparaison between tissues/species:
    df_circ_rnas = pd.DataFrame(circ_rnas, index=None)
    nb_sub_exonic, nb_sub_exonic_pleg, nb_sub_exonic_meg = 0, 0, 0
    subexonic_genes, nb_ccr_pleg, nb_ccr_meg = [], [], []

    for index, row in df_circ_rnas.iterrows():
        # Select well-annotated subexonic circRNAs according to the gene_id:
        # Get gene_id:
        genes_id = list(set(list(row.gene_id_ife.split(","))))
        genes_id = list(i.split("_") for i in genes_id)
        exons_start_end = sorted(set(row.exons_start_end_ife.split(",")), key=row.exons_start_end_ife.split(',').index)
        exon_id = sorted(set(row.exon_id_ife.split(",")), key=row.exon_id_ife.split(',').index)
        gene_id = ",".join(list(set(list(row.gene_id_ife.split(",")))))
        biotypes = []

        exon_start_end = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[0]
        exon_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[1]
        gene_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[2]

        row.exons_start_end_ife = exon_start_end
        row.exon_id_ife = exon_id
        row.gene_id_ife = gene_id

        if len(genes_id)==1:
            for i in genes_id:
                # Subexonic pleg circRNAs:
                if ('c' in i or 'lnc' in i or 'pseudo' in i) and (row.circ_rna_name not in subexonic_antisens):
                    nb_sub_exonic += 1
                    nb_sub_exonic_pleg += 1
                    nb_ccr_pleg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                    # Subexonic meg circRNAs:
                elif 'c' not in i and 'lnc' not in i and 'pseudo' not in i and 'rRNA' not in i:
                    nb_sub_exonic += 1
                    nb_sub_exonic_meg += 1
                    nb_ccr_meg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                elif 'rRNA' in i:
                    pass

        if len(genes_id) > 1:
            for i in genes_id:
                if len(i) == 3:
                    biotypes.append(i[2])
                    biotypes = list(set(biotypes))
                elif len(i) > 3:
                    p_biotypes = [i[2],i[3]]
                    biotypes = [p for p in p_biotypes]
            if 'rRNA' in biotypes:
                pass
            # Subexonic pleg circRNAs:
            elif ((biotypes == ['c'] or biotypes == ['lnc'] or biotypes == ['pseudo']
                or ('c' and 'pseudo' in biotypes) or ('c' and 'lnc' in biotypes)
                or ('lnc' and 'pseudo' in biotypes)) and (row.circ_rna_name not in subexonic_antisens)):
                nb_sub_exonic += 1
                nb_sub_exonic_pleg += 1
                nb_ccr_pleg.append(row.nb_ccr)
                subexonic_genes.append(gene_id)
            # Subexonic meg circRNAs:
            else:
                nb_sub_exonic += 1
                nb_sub_exonic_meg += 1
                nb_ccr_meg.append(row.nb_ccr)
                subexonic_genes.append(gene_id)

    nb_subexonic_genes = len(list(set([val for sublist in subexonic_genes for val in sublist])))
    nb_ccr_p = sum(nb_ccr_pleg)
    nb_ccr_m = sum(nb_ccr_meg)
    return nb_sub_exonic, nb_subexonic_genes, nb_sub_exonic_pleg, nb_sub_exonic_meg, nb_ccr_p, nb_ccr_m


def write_subexonic_tables(sample, circ_rnas, subexonic_antisens, output_file_name_meg, output_file_name_pleg):
    # Write exonic table for comparaison between tissues/species:
    df_circ_rnas = pd.DataFrame(circ_rnas, index=None)
    nb_sub_exonic, nb_sub_exonic_pleg, nb_sub_exonic_meg = 0, 0, 0
    subexonic_genes, nb_ccr_pleg, nb_ccr_meg = [], [], []
    header = list(df_circ_rnas.columns.values.tolist())

    with open(output_file_name_meg, 'w') as fout_meg, open(output_file_name_pleg, 'w') as fout_pleg:
        # Subexonic pleg circRNAs:
        tsv_writer_pleg = csv.writer(fout_pleg, delimiter='\t')
        tsv_writer_pleg.writerow(header)
        # Subexonic meg circRNAs:
        tsv_writer_meg = csv.writer(fout_meg, delimiter='\t')
        tsv_writer_meg.writerow(header)

        for index, row in df_circ_rnas.iterrows():
            # Select well-annotated subexonic circRNAs according to the gene_id:
            # Get gene_id:
            genes_id = list(set(list(row.gene_id_ife.split(","))))
            genes_id = list(i.split("_") for i in genes_id)
            exons_start_end = sorted(set(row.exons_start_end_ife.split(",")), key=row.exons_start_end_ife.split(',').index)
            exon_id = sorted(set(row.exon_id_ife.split(",")), key=row.exon_id_ife.split(',').index)
            gene_id = ",".join(list(set(list(row.gene_id_ife.split(",")))))
            biotypes = []

            exon_start_end = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[0]
            exon_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[1]
            gene_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[2]

            row.exons_start_end_ife = exon_start_end
            row.exon_id_ife = exon_id
            row.gene_id_ife = gene_id

            if len(genes_id)==1:
                for i in genes_id:
                    # Subexonic pleg circRNAs:
                    if ('c' in i or 'lnc' in i or 'pseudo' in i) and (row.circ_rna_name not in subexonic_antisens):
                        nb_sub_exonic += 1
                        nb_sub_exonic_pleg += 1
                        nb_ccr_pleg.append(row.nb_ccr)
                        subexonic_genes.append(gene_id)
                        s = row
                        tsv_writer_pleg.writerow(s)
                    # Subexonic meg circRNAs:
                    elif 'c' not in i and 'lnc' not in i and 'pseudo' not in i and 'rRNA' not in i:
                        nb_sub_exonic += 1
                        nb_sub_exonic_meg += 1
                        nb_ccr_meg.append(row.nb_ccr)
                        subexonic_genes.append(gene_id)
                        s = row
                        tsv_writer_meg.writerow(s)
                    elif 'rRNA' in i:
                        pass

            if len(genes_id) > 1:
                for i in genes_id:
                    if len(i) == 3:
                        biotypes.append(i[2])
                        biotypes = list(set(biotypes))
                    elif len(i) > 3:
                        p_biotypes = [i[2],i[3]]
                        biotypes = [p for p in p_biotypes]
                if 'rRNA' in biotypes:
                    pass
                # Subexonic pleg circRNAs:
                elif ((biotypes == ['c'] or biotypes == ['lnc'] or biotypes == ['pseudo']
                    or ('c' and 'pseudo' in biotypes) or ('c' and 'lnc' in biotypes)
                    or ('lnc' and 'pseudo' in biotypes)) and (row.circ_rna_name not in subexonic_antisens)):
                    nb_sub_exonic += 1
                    nb_sub_exonic_pleg += 1
                    nb_ccr_pleg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                    s = row
                    tsv_writer_pleg.writerow(s)
                # Subexonic meg circRNAs:
                else:
                    nb_sub_exonic += 1
                    nb_sub_exonic_meg += 1
                    nb_ccr_meg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                    s = row
                    tsv_writer_meg.writerow(s)

    nb_subexonic_genes = len(list(set([val for sublist in subexonic_genes for val in sublist])))
    nb_ccr_p = sum(nb_ccr_pleg)
    nb_ccr_m = sum(nb_ccr_meg)
    return nb_sub_exonic, nb_subexonic_genes, nb_sub_exonic_pleg, nb_sub_exonic_meg, nb_ccr_p, nb_ccr_m


def write_circrnas_tables(sample, df, exonic_circrnas, intronic_circrnas, subexonic_circrnas, subexonic_antisens):
    header = list(df.columns.values.tolist())
    if len(exonic_circrnas)>0:
        # Write the exonic circRNAs table:
        write_circ_table(exonic_circrnas, header, args.output_exonic_file)
    else:
        with open(args.output_exonic_file, 'w') as fout:
            tsv_writer = csv.writer(fout, delimiter='\t')
            tsv_writer.writerow(header)
    if len(intronic_circrnas)>0:
        # Write the intronic circRNAs table:
        write_circ_table(intronic_circrnas, header, args.output_intronic_file)
    else:
        with open(args.output_intronic_file, 'w') as fout:
            tsv_writer = csv.writer(fout, delimiter='\t')
            tsv_writer.writerow(header)

    eprint("subexonic_circrnas", len(subexonic_circrnas))
    eprint("subexonic_antisens", len(subexonic_antisens))
    if len(subexonic_circrnas)>0 or len(subexonic_antisens)>0:
        # Write the subexonic circRNAs tables:
        write_subexonic_tables(sample, subexonic_circrnas, subexonic_antisens, args.output_subexonic_meg_file,
                               args.output_subexonic_pleg_file)
    else:
        eprint("ok")
        open(args.output_subexonic_meg_file, 'a').close()
        open(args.output_subexonic_pleg_file, 'a').close()

def main():
    # Get sample name:
    sample = get_sample(args.input_file)

    # Read the circRNAs annotation file:
    df_circ_annot = read_file(args.input_file)

    # Get the list of exonic, subexonic and intronic circRNAs:
    exonic_circrnas = get_circrnas(df_circ_annot)[0]
    subexonic_circrnas = get_circrnas(df_circ_annot)[1]
    intronic_circrnas = get_circrnas(df_circ_annot)[2]
    subexonic_antisens = get_circrnas(df_circ_annot)[3]

    # Write exonic, intronic and subexonic circRNAs tables:
    circrnas_tables = write_circrnas_tables(sample, df_circ_annot, exonic_circrnas,
                                            intronic_circrnas, subexonic_circrnas, subexonic_antisens)

    # Compute statistics about exonic, subexonic and intronic circRNAs:
    stats = get_stats_circrnas(sample, df_circ_annot, args.output_comp_exonic_file)

    # Write the circRNAs statistics table:
    write_stats_table(stats, args.output_stats_file)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Return annotation statistics tables')
    parser.add_argument('-i', '--input_file', required=True,
                        help='Annotation circRNAs file')
    parser.add_argument('-o_stats', '--output_stats_file', required=False,
                        default="stats_annotation.tsv",
                        help='Table containing statistics of annotation')
    parser.add_argument('-oi', '--output_intronic_file', required=False,
                        default="intronic_circRNAs.tsv",
                        help='Table containing intronic circRNAs')
    parser.add_argument('-oe', '--output_exonic_file', required=False,
                        default="exonic_circRNAs.tsv",
                        help='Table containing exonic circRNAs')
    parser.add_argument('-oce', '--output_comp_exonic_file', required=False,
                        default="exonic_summary.tsv",
                        help='Table containing exonics circRNAs for comparisons')
    parser.add_argument('-osepleg', '--output_subexonic_pleg_file', required=False,
                        default="subexonic_pleg_circRNAs.tsv",
                        help='Table containing subexonic circRNAs')
    parser.add_argument('-osemeg', '--output_subexonic_meg_file', required=False,
                        default="subexonic_meg_circRNAs.tsv",
                        help='Table containing subexonic circRNAs')
    args = parser.parse_args()
    return args

if __name__ == '__main__':
    args = parse_arguments()
    main()
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
import os
import argparse
import pandas as pd
import numpy as np
import re
import sys


# Utility functions
def create_folder(path):
    if not os.path.exists(path):
        os.mkdir(path)

def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)

def read_file(file):
    """ Read the sample file containing path files and return a pandas dataframe"""
    sample = pd.read_csv(file, sep='\t', dtype=str)
    return sample

def check_mapdirs(mapdirs):
    """ Check if the mapdir exists or not"""
    for mapdir in mapdirs:
        if not os.path.exists(mapdir):
            raise Exception("WARNING following mapdir is missing %s" % mapdir)

def read_log_file(sample):
    """ Read and clean the Log.final.out files containing the summary mapping statistics
        after mapping job is complete and return an array of the pandas dataframe
        containing statistics"""
    stats = []
    d = {}
    reads = ["R1", "R2"]
    paths_out = []
    for index, row in sample.iterrows():
        sample_name = row["sample"]
        sample_unit_name = row["sample_unit"]
        path = row["mapdir"]
        for read in reads:
            path_file = path+"/se/"+read+"/Log.final.out"
            f = open(path_file, "r")
            for line in f:
                l = line.split("|")
                stats.append(l)
            for stat in stats:
                if len(stat) == 1:
                    stats.remove(stat)
            for stat in stats:
                stat[0] = ' '.join(stat[0].split("\t"))
                stat[0] = ' '.join(stat[0].split())
                stat[1] = ' '.join(stat[1].split())
                stat[1] = stat[1].replace("%", "")
                d['sample'] = sample_name
                d['sample_unit'] = sample_unit_name
                d['read'] = read
                d[stat[0]] = stat[1]
            f.close()
            create_folder("reports/"+sample_unit_name+"/")
            create_folder("reports/"+sample_unit_name+"/"+read+"/")
            path_out = "reports/"+sample_unit_name+"/"+read+"/"+args.output_file
            paths_out.append(path_out)
            write_sample_file(d, path_out)
    return paths_out


def write_sample_file(dict, output_file):
    """Get value from a statistic dictionnary and return a pandas object"""
    df = pd.DataFrame([dict])
    df.to_csv(output_file, sep="\t", index=False)


def write_final_stat_tab(paths_out, output_file):
    df_final = pd.DataFrame()
    for path in paths_out:
        df = pd.read_csv(path, sep="\t")
        df_final = pd.concat([df_final, df], ignore_index=True, sort =False)
    df_final.to_csv(output_file, sep="\t", index=False)


def main():

    # Read the sample file:
    sample = read_file(args.input_file)

    # Check the mapdir of each sample exists or not:
    check_mapdirs(sample["mapdir"])

    create_folder("reports")

    # Read log files and get the path of a table containing statistics:
    paths_out = read_log_file(sample)

    # Write the statistic table with all samples:
    write_final_stat_tab(paths_out, args.output_file)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Sample file')
    parser.add_argument('-i', '--input_file',
                        required=True, help='Sample file')
    parser.add_argument('-o', '--output_file', required=False, 
                        default="mapping_stat.tsv", help='Sample file')
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main()
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
import os, re, sys, csv, argparse
import pandas as pd
import numpy as np

# Utility functions:
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)


def read_file(file):
    """ Read the circRNAs annotation file and return a pandas dataframe"""
    df = pd.read_table(file, sep='\t')
    df.replace(np.nan, "", inplace=True)
    return df


def get_item(item):
    return ''.join(list(set(item)))


def write_summary_meg_table(df, output_file_name, min_size):
    header = ["gene_name", "biotype", "nb_circ_sens", "nb_circ_antisens", "nb_ccr", "exon_name", "start_exon", "end_exon"]
    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        if len(df) > 0:
            for key, item in df:
                exon_start = get_item(item.exons_start_end_ife).split("_")[0]
                exon_end = get_item(item.exons_start_end_ife).split("_")[1]
                if int(exon_end) - int(exon_start) > min_size:
                    strand = list(item.strand)
                    gene_id = list(key.split("_"))[0]
                    strand_gene_ife = list(key.split("_"))[1]
                    # Get biotype:
                    if len(list(key.split("_"))) == 3:
                        biotype = list(key.split("_"))[2]
                    else:
                        biotype = list(key.split("_"))[2]+"_"+list(key.split("_"))[3]
                    nb_circ_rna = len(df.get_group(key))
                    nb_circ_sens = strand.count(strand_gene_ife)
                    nb_circ_antisens = nb_circ_rna - nb_circ_sens
                    nb_ccr = sum(item.nb_ccr)
                    exon_name = get_item(item.exon_id_ife)
                    # print(df.get_group(key), "\n\n")
                    row = [gene_id, biotype, nb_circ_sens, nb_circ_antisens, nb_ccr, exon_name, exon_start, exon_end]
                    tsv_writer.writerow(row)


def write_summary_intronic_table(df, output_file_name, min_size):
    header = ["gene_name", "biotype", "intron_name", "nb_circ_rna", "nb_ccr", "start_intron", "end_intron"]
    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        if len(df) > 0:
            for key, item in df:
                start_i = ''.join([str(round(x)) for x in list(set(item.start_i))])
                end_i = ''.join([str(round(x)) for x in list(set(item.end_i))])
                if int(end_i) - int(start_i) > min_size:
                    intron_name = key
                    gene_id = get_item(item.gene_id_i).split("_")[0]
                    # Get biotype:
                    biotype = get_item(item.gene_id_i).split(",")
                    if len(biotype)>1:
                        biotype = biotype[0].split("_")[4]
                    else:
                        biotype = ''.join(biotype).split("_")[4]
                    nb_ccr = sum(item.nb_ccr)
                    nb_circ_rna = len(df.get_group(key))
                    row = [gene_id, biotype, intron_name, nb_circ_rna, nb_ccr, start_i, end_i]
                    tsv_writer.writerow(row)


def write_summary_pleg_table(df, output_file_name, min_size):
    header = ["gene_name", "nb_gene", "biotype", "nb_circ_sens", "nb_circ_antisens", "nb_ccr", "exon_name", "start_exon", "end_exon"]
    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        if len(df) > 0:
            for key, item in df:
                exon_start = get_item(item.exons_start_end_ife).split("_")[0]
                exon_end = get_item(item.exons_start_end_ife).split("_")[1]
                if int(exon_end) - int(exon_start) > min_size:
                    # Get gene_id and biotype:
                    gene_id_ife = get_item(item.gene_id_ife).split(",")
                    nb_gene = len(gene_id_ife)
                    if len(gene_id_ife) > 1:
                        gene_id = gene_id_ife[0].split("_")[0]+","+gene_id_ife[1].split("_")[0]
                        strand_gene_ife = get_item(gene_id_ife[0].split("_")[1]+gene_id_ife[1].split("_")[1])
                        biotype = get_item(gene_id_ife[0].split("_")[2]+gene_id_ife[1].split("_")[2])
                    else:
                        gene_id = ''.join(gene_id_ife).split("_")[0]
                        strand_gene_ife = ''.join(gene_id_ife).split("_")[1]
                        biotype = ''.join(gene_id_ife).split("_")[2]
                    strand = list(item.strand)
                    nb_circ_rna = len(df.get_group(key))
                    nb_circ_sens = strand.count(strand_gene_ife)
                    nb_circ_antisens = nb_circ_rna - nb_circ_sens
                    nb_ccr = sum(item.nb_ccr)
                    exon_name = key
                    row = [gene_id, nb_gene, biotype, nb_circ_sens, nb_circ_antisens, nb_ccr, exon_name, exon_start, exon_end]
                    tsv_writer.writerow(row)

def check_file(file):
    with open(file, 'r') as f:
        if len(f.read()) <= 1:
            result="Empty"
        else:
            result="Not empty"
    return result


def main():

    # Get min_size:
    min_size = int(args.min_size)

    # Read input annotation circRNAs files:
    if check_file(args.input_meg_file) == "Not empty":
        df_meg = read_file(args.input_meg_file)
        # Meg circRNAs:
        ## Group by gene:
        df_meg = df_meg.groupby('gene_id_ife')
        write_summary_meg_table(df_meg, args.output_meg_file, min_size)
    else:
        open(args.output_meg_file, 'a').close()

    if check_file(args.input_intronic_file) == "Not empty":
        df_intronic = read_file(args.input_intronic_file)
        # Intronic circRNAs:
        ## Group by intron name:
        df_intronic = df_intronic.groupby('intron_name')
        write_summary_intronic_table(df_intronic, args.output_intronic_file, min_size)
    else:
        open(args.output_intronic_file, 'a').close()

    if check_file(args.input_pleg_file) == "Not empty":
        df_pleg = read_file(args.input_pleg_file)
        # Pleg circRNAs:
        ## Group by exon:
        df_pleg = df_pleg.groupby('exon_id_ife')
        write_summary_pleg_table(df_pleg, args.output_pleg_file, min_size)
    else:
        open(args.output_pleg_file, 'a').close()


def parse_arguments():
    parser = argparse.ArgumentParser(description='Return summary circRNAs tables')
    parser.add_argument('-ip', '--input_pleg_file', required=True,
                        help='Annotation pleg circRNAs file')
    parser.add_argument('-im', '--input_meg_file', required=True,
                        help='Annotation meg circRNAs file')
    parser.add_argument('-ii', '--input_intronic_file', required=True,
                        help='Annotation intronic circRNAs file')
    parser.add_argument('-op', '--output_pleg_file', required=False,
                        default="pleg_summary.tsv",
                        help='Table containing summary of pleg circRNAs annotation')
    parser.add_argument('-om', '--output_meg_file', required=False,
                        default="meg_summary.tsv",
                        help='Table containing summary of meg circRNAs annotation')
    parser.add_argument('-oi', '--output_intronic_file', required=False,
                        default="intronic_summary.tsv",
                        help='Table containing summary of intronic circRNAs annotation')
    parser.add_argument('-ms', '--min_size', required=False, default=55,
                        help='Minimum size of circRNAs')
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main()
86
87
shell:
    "find {input} -type f -empty -delete"
95
96
shell:
    "cat {{bta,oar,ssc}}_*/*_exonic_summary.tsv | cut -f1,3,4 |tail -n+2|sort|uniq > {output}"
113
114
115
116
shell:
    "python3 ../scripts/summary_table.py -ip {input.pleg} -im {input.meg} -ii {input.intronic}"
    " -op {output.pleg} -om {output.meg} -oi {output.intronic} -ms {params.min_size}"
    " 1>{log.stdout} 2>{log.stderr}"
SnakeMake From line 113 of master/Snakefile
124
125
shell:
    "cat {input} >> {output}"
SnakeMake From line 124 of master/Snakefile
141
142
143
144
145
shell:
    "python3 ../scripts/stats_annotation.py -i {input} -o_stats {output.stats}"
    " -oi {output.intronic} -oe {output.exonic} -oce {output.comp_exonic} -osepleg {output.sub_exonic_pleg}"
    " -osemeg {output.sub_exonic_meg}"
    " 1>{log.stdout} 2>{log.stderr}"
SnakeMake From line 141 of master/Snakefile
157
158
159
shell:
    "python3 ../scripts/circRNA_annotation.py -circ {input.circ_detected}"
    " -annot {input.annot_exon} -fmt bed -o {output} 1>{log.stdout} 2>{log.stderr}"
SnakeMake From line 157 of master/Snakefile
167
168
shell:
    "python3 ../scripts/cumul_bed.py -sp {input} > {log}"
SnakeMake From line 167 of master/Snakefile
180
181
182
183
184
shell:
    "if grep -q 'Nreads' {input.R1}; then head -n -2 {input.R1} > {input.R1}2; mv {input.R1}2 {input.R1}; fi ;"
    " if grep -q 'Nreads' {input.R2}; then head -n -2 {input.R2} > {input.R2}2; mv {input.R2}2 {input.R2}; fi ;"
    " python3 ../scripts/circRNA_detection.py -r1 {input.R1} -r2 {input.R2}"
    " -min_cr {params.min_ccr} -tol 0 -fmt bed -o {output} 1>{log.stdout} 2>{log.stderr}"
SnakeMake From line 180 of master/Snakefile
191
192
shell:
    "python3 ../scripts/stats_mapping.py -i {input} -o {output}"
SnakeMake From line 191 of master/Snakefile
ShowHide 21 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/ccerutti88/circRNA
Name: circrna
Version: 2
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Other Versions:
Downloaded: 0
Copyright: Public Domain
License: GNU General Public License v3.0
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...