circDetector (CD): circular RNAs detection and annotation workflow

public 1yr ago Version: 2 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Description:

circDetector (CD) is a computational tool for detecting and annotation of circular RNAs (circRNAs) from Total RNA-Seq data. This tool is implemented with the Snakemake workflow management system allowing reproducible and scalable data analyses. CD was developped to identify circRNAs from reads mapped to the reference genome with the STAR tool (Spliced Transcripts Alignment to a Reference) in Single-End (SE). It consists of identifying reads with a circular junction ( chimeric reads ) from the chimeric.out.junction files provided by STAR. CD also provides a file reporting all statistics of STAR-SE mapping. Mapping informations were extracted from the STAR file Log.final.out .

Note: The documentation of the CD tool is in progress.

Column	Type	Description
1	string	species (bos_taurus, sus_scrofa)
2	string	tissue (testis, liver)
3	string	sex (male, female)
4	string	sample (single, orient)
5	string	sample_unit (sample uniq name)
6	string	animal_name (specific name for each individual)
7	string	mapdir (file path to mapping files)

Column	Type	Description
1	string	sample (concatenation of species, tissue and animal_name)
2	string	sample_unit (sample uniq name)
3	string	species (bos_taurus = cow, sus_scrofa = pig)
4	string	sex (male, female)
5	string	mapdir (file path to mapping files)

 -r1 R1_INPUT_FILE, --r1_input_file R1_INPUT_FILE
 R1 input file name
 -r2 R2_INPUT_FILE, --r2_input_file R2_INPUT_FILE
 R2 input file name
 -min_cr CR_THRESHOLD, --cr_threshold CR_THRESHOLD
 Minimum number of chimeric reads
 -tol TOLERANCE, --tolerance TOLERANCE
 Tolerance of different positions to start and end
 -fmt {bed,gtf}, --output_file_format {bed,gtf}
 Output file format : 0-based (bed) or 1-based (gtf)
 -o OUTPUT_FILE, --output_file OUTPUT_FILE
 Output file path
 --verbose Print more info

Example of a command executed by Snakemake:

python3 scripts/circRNA_detection.py -r1 -r2 -o circ_rnas.bed

Rule mappingstat:

CD provides a file reporting all statistics of STAR-SE mapping. Mapping informations were extracted from the STAR file Log.final.out .

Input: Example of a Log.final.out file:

Started job on | Oct 26 05:21:11
 Started mapping on | Oct 26 05:29:19
 Finished on | Oct 26 05:52:39
 Mapping speed, Million of reads per hour | 121.89
 Number of input reads | 47400458
 Average input read length | 150
 UNIQUE READS:
 Uniquely mapped reads number | 33941471
 Uniquely mapped reads % | 71.61%
 Average mapped length | 149.32
 Number of splices: Total | 8294615
 Number of splices: Annotated (sjdb) | 7882009
 Number of splices: GT/AG | 8204363
 Number of splices: GC/AG | 58632
 Number of splices: AT/AC | 5362
 Number of splices: Non-canonical | 26258
 Mismatch rate per base, % | 0.57%
 Deletion rate per base | 0.02%
 Deletion average length | 2.11
 Insertion rate per base | 0.02%
 Insertion average length | 1.74
 MULTI-MAPPING READS:
 Number of reads mapped to multiple loci | 3381777
 % of reads mapped to multiple loci | 7.13%
 Number of reads mapped to too many loci | 28860
 % of reads mapped to too many loci | 0.06%
 UNMAPPED READS:
 Number of reads unmapped: too many mismatches | 0
 % of reads unmapped: too many mismatches | 0.00%
 Number of reads unmapped: too short | 9943085
 % of reads unmapped: too short | 20.98%
 Number of reads unmapped: other | 18180
 % of reads unmapped: other | 0.04%
 CHIMERIC READS:
 Number of chimeric reads | 129065
 % of chimeric reads | 0.27%

Output: reports/mapping_stat.tsv is a tabulated file containing all the informations of the input file (column) for each sample (row).

Rule annotation:

-circ CIRC_RNA_FILE, --circ_rna_file CIRC_RNA_FILE
 Circular RNA file path
 -fmt {bed,gtf}, --annot_format {bed,gtf}
 Annotation file format
 -annot ANNOTATION_FILE, --annotation_file ANNOTATION_FILE
 Annotation file path
 -oe EXONIC_OUTPUT_FILE, --exonic_output_file EXONIC_OUTPUT_FILE
 Exonic output file path
 -oi INTRONIC_OUTPUT_FILE, --intronic_output_file INTRONIC_OUTPUT_FILE
 Intronic output file path
 -o OUTPUT, --output OUTPUT
 Annotation output file path
 --verbose Print more info

Example of a command executed by Snakemake:

python3 scripts/circRNA_annotation.py -circ circ_rnas.bed -annot -o annotation_circRNAs.tsv

Note: Manual post-processing of data is required to remove short circRNAs.

Example Commands:

Launching the pipeline locally:

snakemake -p --jobs 8

Launching the pipeline on the Genotoul cluster:

sbatch circ_rnas.sh

Testing the detection rule locally for a single sample:

snakemake pig_testis_neonat-1/auzeville.bed -p --cores 1

Note: Make sure the circrnaenv environment is in this folder.

Contact & Support

For any bug report, feature request, or questions please fill out an issue through circRNA_project's issue page .

Copyright and License

This software is released under GNU General Public License (v3.0).

Code Snippets

                                     # [-fmt {bed,gtf}] [-o OUTPUT_FILE]

# Imports
import circRNA as circ
from collections import defaultdict
import sys, os, re, argparse, natsort
import numpy as np
import csv
import pybedtools as pybed

# Utility functions
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)
    dsp_control = circ.DisplayControl.instance()
    if dsp_control.verbose:
        print(*args,  file=sys.stderr, **kwargs)


def exonic_annotations(circ_rnas, exons):
    """
    Annotate the circrnas according to the exon annotations
    circ_rnas: an array of cirRNAs object
    exons: an array of circRNAs objects
    returns the circ_rnas array with the circRNAs annotated

    We annotate each circRNA's splice junction with the corresponding
    exons spice junctions using a dict of exons splice junctions
    This means that splice junction correspondances must be exact
    """
    eprint("\nCompute junction dictionary...\n")
    annotation_junctions = compute_junction_dict(exons)
    eprint("\nCircRNA annotation...\n")
    set_annotation(circ_rnas, annotation_junctions)

    return circ_rnas


def junction_key(chrom, pos):
    return "%s:%s" %(chrom, pos)


def compute_junction_dict(annot_array):
    """
    Compute a dictionnary of splic junctions:
    keys: chrom:pos
    values: list of tuples
    tuple: (exon, end_type)
    end_type: 3 (resp 5) for donor (resp acceptor)
    """
    junction_dict = defaultdict(list)
    seen_exons = defaultdict()
    for annot in annot_array:
        if annot.get_att("exon_id") in seen_exons:
            continue
        seen_exons[annot.get_att("exon_id")] = 1
        key_start = junction_key(annot.chrom, annot.start)
        key_end = junction_key(annot.chrom, annot.end)
        if annot.strand == "+":
            start_end_type = "5"
            end_end_type = "3"
        else:
            start_end_type = "3"
            end_end_type = "5"
        # annotating start
        junction_dict[key_start].append((annot, start_end_type))
        # annotating end
        junction_dict[key_end].append((annot, end_end_type))
    return junction_dict


def set_annotation(circ_rnas, annotation_junctions):
    """
    An annotation string would be very much appreciated
    """
    for circ_rna in circ_rnas:
        junction_start = junction_key(circ_rna.chrom, circ_rna.start)
        junction_end = junction_key(circ_rna.chrom, circ_rna.end)
        if junction_start in annotation_junctions:
            circ_rna.add_start_annotation(annotation_junctions[junction_start])
            circ_rna.add_start_transcript_id(annotation_junctions[junction_start])
            circ_rna.add_start_gene_id(annotation_junctions[junction_start])
        if junction_end in annotation_junctions:
            circ_rna.add_end_annotation(annotation_junctions[junction_end])
            circ_rna.add_end_transcript_id(annotation_junctions[junction_end])
            circ_rna.add_end_gene_id(annotation_junctions[junction_end])


def intronic_annotations(circ_rnas, exons):
    """
    Annotate the circrnas according to the introns annotations
    circ_rnas: an array of cirRNAs object
    exons: an array of circRNAs objects
    returns the circ_rnas array with the circRNAs intrno-annotated

    Because at least on circRNA junction is away from existing
    splice junctions, the annotation is solely based on overlap between
    the circRNA and an intron :
      - [start, end] interval must be within the intron
      - this interval must cover at least 90% of the intron
    """

    #eprint("\nCompute introns from exons...\n")
    transcripts = get_exons_per_transcript(exons)
    introns = circ.compute_intronic_positions(transcripts)

    #eprint("\nCompute intersection with circRNAs...\n")
    circ_rnas = annotate_intron_intersection(circ_rnas, introns)

    return circ_rnas


def annotate_intron_intersection(circ_rnas, introns):
    pybed_introns = annot_to_pybed(introns)
    pybed_circrnas = annot_to_pybed(circ_rnas)
    intersections = pybed_circrnas.intersect(pybed_introns, f=0.95, wo=True) 

    circrna_d = {circ.name: circ for circ in circ_rnas}
    introns_d = {intron.name: intron for intron in introns}
    for i in intersections:
        intron = introns_d[i[9]]
        circ = circrna_d[i[3]]
        overlap = i[-1]
        p_overlap = float(overlap)/intron.length # taille circ sur taille intron
        circ.annotate_intron(intron, p_overlap)

    return circ_rnas


def annot_to_pybed(annots):
    pybed_annot = []
    for a in annots:
        pybed_annot.append(a.topybed())
    return pybed.BedTool(pybed_annot).sort()


def infra_exonic_annotations(circ_rnas, exons):
    """
    """
    pybed_exons = annot_to_pybed_ife(exons)
    pybed_circrnas = annot_to_pybed_ife(circ_rnas)  
    intersections = pybed_circrnas.intersect(pybed_exons, f=0.95, wo=True)

    circrna_d = {circ.name: circ for circ in circ_rnas}
    exons_d = {exon.name: exon for exon in exons}
    for i in intersections:
        exon = exons_d[i[9]]
        circ = circrna_d[i[3]]
        circ.annotate_ife(exon)

    return circ_rnas


def annot_to_pybed_ife(annots):
    pybed_annot = []
    for a in annots:
        pybed_annot.append(a.topybed_ife())
    return pybed.BedTool(pybed_annot).sort()


def get_exons_per_transcript(annots):
    """
    returns a dictionnary
    key: ...
    """
    d_transcript = defaultdict(list)
    for exon in annots:
        transcript_id = exon.get_att("transcript_id")
        d_transcript[transcript_id].append(exon)
    return d_transcript


def write_annotated_circrnas(circ_annotated, outputfile):
    circ_rnas = natsort.natsorted(circ_annotated, key=lambda e: (e.chrom, e.start))
    header = "\t".join(["chrom", "start", "end", "strand", "nb_ccr", "circ_rna_name", "exons_id_start", "exons_id_end", 
                        "transcript_id_start", "transcript_id_end", "gene_id_start", "gene_id_end", 
                        "intron_name", "start_i", "end_i", "gene_id_i", "exon_id_i", 
                        'exon_id_ife', "exons_start_end_ife", "gene_id_ife"])
    with open(outputfile, "w") as fout:
        fout.write(header+"\n")
        for circ_rna in circ_rnas:
            intron_annot = circ_rna.intron_annotation 
            if len(intron_annot) == 0:
                start_i = ""
                end_i = ""
                name_i = ""
            else:
                for annot_i in intron_annot:
                    intron = annot_i[0]
                    start_i = intron.start
                    end_i = intron.end
                    name_i = intron.name
            att_d = dict(re.split(" |=", item) for item in circ_rna.attributes.split("; "))
            nb_ccr = att_d["nb_ccr"].replace('\"', '')
            s = [circ_rna.chrom, circ_rna.start, circ_rna.end, 
                circ_rna.strand, nb_ccr, circ_rna.name,
                circ_rna.get_start_annotation_str(), 
                circ_rna.get_end_annotation_str(),
                circ_rna.get_start_transcript_id_str(), 
                circ_rna.get_end_transcript_id_str(), 
                circ_rna.get_start_gene_id_str(), 
                circ_rna.get_end_gene_id_str(),
                name_i, start_i, end_i, 
                circ_rna.get_intron_annot_gene_str('gene_id'),
                circ_rna.get_intron_annot_str('exon_id'), 
                circ_rna.get_infra_exonic_annot_str("exon_id"),
                circ_rna.get_infra_exonic_exons_start_end(),
                circ_rna.get_infra_exonic_gene_annot_str("gene_id")]
            fout.write( "%s\n" % "\t".join(map(str, s)))


def main():

    # Exonic:
    eprint("\nIdentification of EXONIC circRNAs:\n")
    eprint("\nReading circRNA file...\n")
    circ_rnas = circ.read_annotation(args.circ_rna_file, fmt=args.annot_format)
    eprint("\nReading annotation file...\n")
    annots = circ.read_annotation(args.annotation_file, fmt=args.annot_format)

    eprint("\nExonic annotations..\n")
    circ_rnas = exonic_annotations(circ_rnas, annots)

    eprint("\nIntronic annotations..\n")
    circ_rnas = intronic_annotations(circ_rnas, annots)

    #####
    eprint("\nInfra-exonic annotations..\n")
    circ_rnas = infra_exonic_annotations(circ_rnas, annots)
    #####

    eprint("\nWrite annotations..\n")
    write_annotated_circrnas(circ_rnas, args.output)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Annotation')
    parser.add_argument('-circ', '--circ_rna_file',
                        required=True, help='Circular RNA file path')
    parser.add_argument('-fmt', '--annot_format',
                        required=False, default="bed", choices=["bed","gtf"],
                        help='Annotation file format')
    parser.add_argument('-annot', '--annotation_file',
                        required=True, help='Annotataion file path')
    parser.add_argument('-oe', '--exonic_output_file',
                        required=False, default='exonic_circ_rnas_annot.out',
                        help='Exonic output file path')
    parser.add_argument('-oi', '--intronic_output_file',
                        required=False, default='annotation_introns.bed',
                        help='Intronic output file path')
    parser.add_argument('-o', '--output',
                        required=True,
                        help='Annotation output file path')
    parser.add_argument('--verbose', help='Print more info', action='store_true')

    args = parser.parse_args()

    dsp_control = circ.DisplayControl.instance()
    dsp_control.verbose = args.verbose

    return args


if __name__ == '__main__':
    args = parse_arguments()
    main()

Python numpy Pybedtools Circr From line 3 of scripts/circRNA_annotation.py

                                    # [-min_cr CR_THRESHOLD] [-tol TOLERANCE]
                                    # [-fmt {bed,gtf}] [-o OUTPUT_FILE]

# Imports
import sys, os, re, argparse, natsort, time
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
from collections import defaultdict
import pandas as pd
import numpy as np
from networkx import (draw, DiGraph, Graph)
import networkx as netwx
import statistics
from statistics import mode
from time import sleep
from tqdm import tqdm
import circRNA as circ


# Utility functions
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)
    dsp_control = circ.DisplayControl.instance()
    if dsp_control.verbose:
        print(*args,  file=sys.stderr, **kwargs)


def merge_junctions(df_a):
    """ Merge two dataframes
    :df_r1: dataframe of R1 reads
    :df_r2: dataframe of R2 reads
    :return the merged dataframe
    """
    df_merged = pd.concat(df_a,ignore_index=True)
    return df_merged


def read_junction_file(file, read_type=None):
    """ Read the data, adds the header of the table
        and adds a column with the corresponding read type.
    :file: file to read
    :read_type: read type (R1 or R2)
    :return the dataframe of reads
    """
    names = ["chr_donor", "pos_donor","strand_donor","chr_acceptor",
            "pos_acceptor","strand_acceptor","junction_type",
            "repeat_left","repeat_right","read_name","pos_s1",
            "CIGAR_s1","pos_s2","CIGAR_s2"] # Columns names of the input data
    # Read data as table with pandas:
    df = pd.read_table(file, sep = '\t', names=names, dtype=str,
                       converters = {'pos_acceptor':int, 'pos_donor':int})
    df['read_type'] = read_type # Add column "read" to specify the read type
    return df


def get_valid_circjunctions(df):
    """ Selects valid circular chimeric reads (ccr)
        according to the conditions of the "valid_ccr" function.
    :df: datatframe to filter
    return: the filtered dataframe
    """
    # Parsing dataframe to select valid rows
    # Boolean array with true if the read is a valid ccr:
    valid_ccrs = []
    #for index, row in tqdm(df.iterrows(), total=df.shape[0]):
    for index, row in circ.tqdm_wrapper(df):
        valid_ccrs.append(valid_ccr(row))
    df_circ_cr = df[valid_ccrs]
    ccr_a = [] # Array of circular junctions
    eprint(" "+"\n"+"Generating array of selected ccr..."+"\n")
    #for index, row in tqdm(df_circ_cr.iterrows(), total=df_circ_cr.shape[0]):
    for index, row in circ.tqdm_wrapper(df_circ_cr):
        ccr =  circ.CCR(row) # Create CCR object
        ccr_a.append(ccr) # Add chimeric reads into array
    eprint("\n")
    return ccr_a


def valid_ccr(row):
    """ Select ccr according to several conditions.
    :row: row of the dataframe
    :return boolean "False" for not valid ccr
            or boolean "True" for valid ccr
    """
    # Regular expression of the left dangling CIGAR:
    dangling_left = re.compile('^\d+S\d+M$')
    # Regular expression of the rigth dangling CIGAR:
    dangling_right = re.compile('^\d+M\d+S$')
    if (row['chr_donor']!=row['chr_acceptor']
            or row['strand_donor']!=row['strand_acceptor']):
        return False
    elif abs(row['pos_donor'] - row['pos_acceptor']) > 10000000:
        return False
    else:
        if row['strand_donor'] == '-' :
            if (row['pos_donor']>row['pos_acceptor']
                    or row['pos_s1']>row['pos_s2']):
                return False
            else:
                if (not dangling_left.match(row['CIGAR_s1'])
                        or not dangling_right.match(row['CIGAR_s2'])):
                    return False
                else:
                    return True
        elif row['strand_donor'] == '+' :
            if (row['pos_donor']<row['pos_acceptor']
                    or row['pos_s1']<row['pos_s2']):
                return False
            else:
                if (not dangling_right.match(row['CIGAR_s1'])
                        or not dangling_left.match(row['CIGAR_s2'])):
                    return False
                else:
                    return True


def get_exact_circrnas(circ_junctions, cr_threshold):
    """
    Detects circ RNAs as CR sharing exactly the same circ signature
    chrom start end strand
    """
    eprint("Exact circular RNA detection")
    circ_rna_dict = defaultdict(list)
    for circ_junc in circ_junctions:
        circ_rna_dict[circ_junc.key].append(circ_junc)

    circ_array = []
    for key, cr_array in circ_rna_dict.items():
        if len(cr_array) >= cr_threshold:
            circ_array.append(circ.CircRNA(args.output_file_format, ccr_array=cr_array))
    return circ_array


def same_circ_signature(cr_a, cr_b, tol):
    """ Two-by-two comparison of chimeric reads
    :cr_a: chimeric read a
    :cr_b: chimeric read b
    :tol: number of different start-end positions tolerated
    :return boolean "False" for not valid conditions
    and "True" for valid conditions
    """
    if cr_a.chrom != cr_b.chrom:
        return False
    if abs(cr_a.start - cr_b.start) > tol:
        return False
    if abs(cr_a.end - cr_b.end) > tol:
        return False
    if cr_a.read_type == cr_b.read_type:
        if cr_a.strand != cr_b.strand:
            return False
    else:
        if cr_a.strand == cr_b.strand:
            return False
    return True


def merge_circ_rnas(component):
    if len(component) == 1:
        return component[0]
    elements = []
    for circ_rna in component:
        elements.extend(circ_rna.ccr_array)
    return circ.CircRNA(args.output_file_format, ccr_array=elements)


def connected_components(circ_a, neighbours, cr_threshold):
    """ Generate an undirected graph and
        cluters ccr into connected components
    :ccr_a: array of ccr object
    :neighbours: dictionary of neighbours chimeric reads
    :return an array of circular RNAs
    """
    undirected = netwx.Graph()
    undirected.add_nodes_from(range(len(circ_a)))
    undirected.add_edges_from(neighbours)
    circ_rnas = []  # Circular RNAs array
    # Parsing of connected components composed of chimeric reads:
    for comp in netwx.connected_components(undirected):
        circ_rna = merge_circ_rnas([circ_a[i] for i in comp])
        if circ_rna.nb_ccr >= cr_threshold:
            circ_rnas.append(circ_rna)
    return circ_rnas


def cumulative_cr(circ_a):
    return sum([c.nb_ccr for c in circ_a])


def compute_fuzzy_circ_rnas(ccr_a, cr_threshold, tolerance):
    """ Get the circular RNA corresponding to each
        connected component of chimeric reads
    :ccr_a: array of ccr object
    :return array of circular RNAs
    """
    if cumulative_cr(ccr_a) < cr_threshold:
        return []
    n = len(ccr_a)
    # Neighbour cr share sign the same circular RNA
    neighbours = []
    circ_rnas = []
    for i in range(n):
        for j in range(i+1, n):
            if ccr_a[j].start > ccr_a[i].end:
                break
            # Two-by-two comparison of chimeric reads:
            if same_circ_signature(ccr_a[i], ccr_a[j], tolerance):
                neighbours.append((i, j))
    circ_rnas = connected_components(ccr_a, neighbours, cr_threshold)
    return circ_rnas


def get_independent_intervals(circ_rnas):
    # Get independant intervals
    independant_intervals = []
    circ_interval = []
    chrom = None
    for circ_rna in circ_rnas:
        if not chrom:
            chrom = circ_rna.chrom
            max_end = circ_rna.end
        if circ_rna.chrom != chrom:
            independant_intervals.append(circ_interval)
            circ_interval = []
            chrom = circ_rna.chrom
            max_end = circ_rna.start
        if circ_rna.chrom != chrom or circ_rna.start > max_end:
            independant_intervals.append(circ_interval)
            circ_interval = []
            chrom = circ_rna.chrom
            max_end = circ_rna.start
        max_end = max(max_end, circ_rna.end)
        circ_interval.append(circ_rna)
    independant_intervals.append(circ_interval)
    return independant_intervals


def get_fuzzy_circrnas(circ_rnas, cr_threshold, tolerance):
    eprint("Fuzzy circular RNA detection")
    circ_rnas = natsort.natsorted(circ_rnas, key=lambda e: (e.chrom,
                                                            e.start))
    intervals = get_independent_intervals(circ_rnas)
    circ_rnas = []
    for interval in intervals:
        fuzzy_circ_rnas = compute_fuzzy_circ_rnas(interval, cr_threshold,
                                                  tolerance)
        if fuzzy_circ_rnas:
            circ_rnas.extend(fuzzy_circ_rnas)
    return circ_rnas


def circrna_detection(circ_juntcions, cr_threshold, tolerance):
    """ Detecte circular RNAs
    :df_ccr: dataframe of circular chimeric reads
    :return the array of circular RNAs detected
    """
    # First pass, detect exact circular RNAs (no tolerance)
    circ_rnas = get_exact_circrnas(circ_juntcions, cr_threshold)
    eprint("nb of tol %d circRNAS %d" % (tolerance, len(circ_rnas)))#########
    # if tolerance > 0 compute fuzzy circrnas from previous list
    if tolerance > 0:
        circ_rnas = get_fuzzy_circrnas(circ_rnas, cr_threshold, tolerance)
    else:
        circ_rnas = [c for c in circ_rnas if c.nb_ccr >= cr_threshold]
    return circ_rnas


def write_tab_circ_rnas(circ_rnas, outputfile, fmt):
    """ Write the table of circ RNA retained
    :circ_rnas: array of circular RNAs object
    :outputfile: the output file path
    :return the table of circular RNAs retained
    :start Chimeric.out.junction file: last intron base in
    the reference genome order in the circular transcript
    :end Chimeric.out.junction file: first intron base in the
    reference genome order in the circular transcript
    :start GTF: first exon position
    :end GTF: last exon position
    """
    nb_line = 0 # name bed file
    circ_rnas = natsort.natsorted(circ_rnas, key=lambda e: (e.chrom, e.start))
    eprint(" "+"\n"+"Writing the table of circular RNAs..."+"\n")
    with open(outputfile, "w") as fout:
        eprint("\nConvertion the Chimeric.out.junction format to the %s format...\n" % fmt)
        for circ_rna in circ_rnas:
            # Convert Chimeric.out.junction format to out format:
            circ_rna.start += 1
            circ_rna.end -= 1
            # Convert out format to the chosen format:
            circ_rna.write_annot(nb_line, fout, fmt)
            nb_line += 1


def stats(min_cr, tol, nb_ccr_tot, nb_ccr, nb_circ_rna_detected, circ_rnas):
          nb_tot_cr_circ = sum([c.nb_ccr for c in circ_rnas])
          return "%d\t%d\t%d\t%d\t%d\t%d" % (min_cr, tol, nb_ccr_tot, nb_ccr,
                                             nb_circ_rna_detected, nb_tot_cr_circ)


def stats2(df):
    return df.groupby('nb_ccr')['nb_ccr'] \
           .count().reset_index(name='nb_circ_rnas')


def main(r1_input_file, r2_input_file, cr_threshold, tolerance,
         output_file):

    eprint("\nReading junction files...\n")
    df_junctions_a = []
    if r1_input_file:
        df_junctions_a.append(read_junction_file(r1_input_file, read_type="R1"))
    if r2_input_file:
        df_junctions_a.append(read_junction_file(r2_input_file, read_type="R2"))

    eprint("Merging junction files...\n")
    df = merge_junctions(df_junctions_a)
    eprint("%d chimeric reads \n" % len(df.index))

    # Filtering data:
    eprint("Selecting valid ccr...\n")
    circ_junctions = get_valid_circjunctions(df)
    eprint("%d circular chimeric reads" % len(circ_junctions))

    # Circular RNAs detection:
    circ_rnas = circrna_detection(circ_junctions, cr_threshold, tolerance) #### 'circRNA' object has no attribute 'ccr_array'
    eprint("%d circRNAs detected" % len(circ_rnas))

    #for cir in circ_rnas:
    #    print(cir.var_info_str())

    # Write the table of circular RNAs retained:
    circ.write_annotation(circ_rnas, output_file, fmt=args.output_file_format,
                          attributes=["nb_ccr", "genomic_size", "left",
                                      "right", "complete", "nb_distinct"])


def parse_arguments():
    parser = argparse.ArgumentParser(description='Identification of chimeric reads and circular RNAs')
    parser.add_argument('-r1', '--r1_input_file',
                        required=False, help='R1 input file name')
    parser.add_argument('-r2', '--r2_input_file',
                        required=False, help='R2 input file name')
    parser.add_argument('-min_cr', '--cr_threshold', type=int,
                        required=False, default=5,
                        help='Minimum number of chimeric reads')
    parser.add_argument('-tol', '--tolerance', type=int, required=False,
                        default=3,
                        help='Tolerance of different positions to start and end')
    parser.add_argument('-fmt', '--output_file_format',
                        required=False, default='gtf', choices=["bed","gtf"], # bed ou gtf
                        help='Output file format : 0-based (bed) or 1-based (gtf)')
    parser.add_argument('-o', '--output_file', required=False, default='circ_rnas_detected.out',
                        help='Output file path') # gft par défaut # attribut : nb ccr, min_cr_distinct
    parser.add_argument('--verbose', help='Print more info', action='store_true')

    args = parser.parse_args()

    dsp_control = circ.DisplayControl.instance()
    dsp_control.verbose = args.verbose

    if not (args.r1_input_file or args.r2_input_file):
        parser.error('At least one input file is requested')
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main(args.r1_input_file, args.r2_input_file,
         args.cr_threshold, args.tolerance,
         args.output_file)

Python Pandas numpy tqdm networkx Circr From line 3 of scripts/circRNA_detection.py

from collections import defaultdict
import operator, pandas, re
import pybedtools as pybed
import natsort
from tqdm import tqdm
import pandas as pd


from Singleton import Singleton

@Singleton
class DisplayControl():
     def __init__(self):
         self.verbose = False


def tqdm_wrapper(df):
    display_control = DisplayControl.instance()
    if display_control.verbose:
        return tqdm(df.iterrows(), total=df.shape[0])
    else:
        return df.iterrows()


class CircJunction():
    def __init__(self, chrom, start, end, transcription_strand):
        """Initialize parameters of the CCR object"""
        self.chrom = chrom
        self.start = start
        self.end = end
        self.transcript_strand = transcription_strand

    @property
    def key(self):
        return "-".join(map(str,[self.chrom, self.start, self.end,
                                 self.transcript_strand]))


class CCR(CircJunction):
    def __init__(self, record):
        """
        Initialize parameters of the CCR object from junction record of the
        data frame
        """
        chrom = record['chr_donor']
        start, end, transcript_strand = self._setjunction(record)
        super(CCR, self).__init__(chrom, start, end, transcript_strand)
        self.record = record

    @property
    def CIGAR_s1(self):
        return self.record['CIGAR_s1']

    @property
    def CIGAR_s2(self):
        return self.record['CIGAR_s2']

    @property
    def read_type(self):
        return self.record['read_type']

    def _setjunction(self, record):
        strand = record['strand_donor']  # == record['strand_acptor']
        if strand == "+":
            start = record['pos_acceptor']
            end = record['pos_donor']
        else:
            start = record['pos_donor']
            end = record['pos_acceptor']
        if record['read_type'] == "R2":
            transcript_strand = strand
        else:
            transcript_strand = complement(strand)
        return start, end, transcript_strand


class Annotation():

    # initialise class variable
    counter = 0

    def __init__(self, chrom, start, end, strand, feature, attributes, source,
                 name=None):
        Annotation.counter += 1
        self.chrom = chrom
        self.start = int(start)
        self.end = int(end)
        self.strand = strand
        self.feature = feature
        self.source = source
        self.attributes = attributes
        if name:
            self.name = name
        else:
            self.name = "circRNA_%d" % Annotation.counter
        if isinstance(attributes, dict):
            if "name" not in attributes:
                attributes["name"] = name
            self.attributes_d = attributes
            self.attribues = "; ".join(["%s = %s " %(key, value) for key, value in attributes.items()])
        else:
            self.attributes += "; name=%s" % str(self.name)
            self.attributes_d = dict(re.split(" |=", item) for item in self.attributes.split("; "))
            self.clean_quotes()


    def _to_gtf_record(self, attributes=["name"]):
        s = [
            self.chrom, self.source, self.feature,
            self.start, self.end, self._get_score(),
            self.strand, self._get_frame(),
            self._get_attributes(attributes)
        ]
        return "\t".join(map(str, s))


    def _to_bed_record(self, attributes=["name"]):
        start_b, end_b = (self.start - 1, self.end)
        s = [
            self.chrom, start_b, end_b,
            self.name, self._get_score(),
            self.strand, self.source,
            self.feature, self._get_phase(),
            self._get_attributes(attributes)
        ]
        return "\t".join(map(str, s))

    def topybed(self):
        start_b, end_b = (self.start - 1, self.end)
        record = [self.chrom, start_b, end_b, self.name, ".", self.strand]
        return pybed.create_interval_from_list(record)

    def topybed_ife(self):
        record = [self.chrom, self.start, self.end, self.name, ".", self.strand]
        return pybed.create_interval_from_list(record)

    def _get_attributes(self, attributes):
        attribute_str = "; ".join([ "%s=%s" % (key, self.get_att(key)) for key in attributes]) + ";"
        return attribute_str


    @property
    def biotype(self):
        long_biotype = self.get_att("transcript_biotype")
        biotype_keys = ["protein_coding", "pseudogene", "snRNA", "snoRNA", "miRNA", "lncRNA"]
        biotype_values = ["c", "pseudo", "snc", "sno", "mi", "lnc"]
        succint_biotype = dict(zip(biotype_keys, biotype_values))
        if long_biotype in succint_biotype:
            return succint_biotype[long_biotype]
        else:
            # ribozyme, Y_RNA, scaRNA
            return long_biotype # "misc" = miscRNA: Miscellaneous RNA. A non-coding RNA that cannot be classified.

    @property
    def length(self):
        return self.end - self.start + 1

    def _get_score(self):
        return "."

    def _get_frame(self):
        return "."

    def _get_phase(self):
        return "."

    def get_att(self, key):
        if key not in self.attributes_d:
            print("Warning %s is not a valid attribute" % key)
            exit(1)
        return self.attributes_d[key]

    def clean_quotes(self):
        d = self.attributes_d
        for key in d:
            d[key] = d[key].replace('\"', '')

    def get_annotation_tag(self, end_type):
        return "_".join([self.get_att("exon_id"), end_type, self.biotype, self.strand])

    def annot_str(self, fmt="gtf", attributes=["name"]):
        if fmt=="gtf":
            return self._to_gtf_record(attributes)
        elif fmt=="bed":
            return self._to_bed_record(attributes)
        else:
            return "ERROR: Unknown format..."

    def __str__(self):
        return self.annot_str()


class CircRNA(Annotation):
    def __init__(self, fmt, *args, **kwargs):
        """ Initialize parameters of the circRNA object
        :ccr_array: array of circular chimeric reads
        """
        for key, value in kwargs.items():###
            if "ccr_array" in kwargs:
                self._init_from_ccr(kwargs.get("ccr_array"))
            if len(args) > 0:
                self._init_default(fmt, args[0], args[1], args[2], args[3], args[4], args[5], name=value) ###
            self.start_annotation = []
            self.end_annotation = []
            self.start_transcript_id = []
            self.end_transcript_id = []
            self.start_gene_id = []
            self.end_gene_id = []
            self.intron_annotation = []
            self.infra_exonic_annotation = []


    def _init_from_ccr(self, ccr_array):
        # Number of distinct CIGARs:
        self.ccr_array = ccr_array  # Array of cicular chimeric reads
        #rep_ccr = self._consensus_start_end(ccr_array)
        rep_ccr = ccr_array[0]
        chrom = rep_ccr.chrom
        start, end, start_var, end_var = self._compute_consensus_start_end(ccr_array)
        start = rep_ccr.start + 1 ###
        end = rep_ccr.end - 1 ###
        strand = rep_ccr.transcript_strand
        feature = "circRNA"
        source = "circRNA_detection"
        attributes = "nb_ccr=%d" % self._compute_nb_cr(ccr_array)
        super(CircRNA, self).__init__(chrom, start, end, strand, feature,
                                      attributes, source) # name=None
        self.source = source
        self.feature = feature


        self.CIGAR_s1 = rep_ccr.CIGAR_s1  # CIGAR of the first segment
        self.CIGAR_s2 = rep_ccr.CIGAR_s2  # CIGAR of the second segment
        self.read_type = rep_ccr.read_type  # Read type (R1 or R2)

        self.attributes_d['nb_ccr'] = self._compute_nb_cr(ccr_array)
        self.attributes_d['nb_distinct'] = compute_distinct_cr(ccr_array)
        self.attributes_d['genomic_size'] = self._compute_genomic_size()
        self.attributes_d['left'] = self._compute_left()
        self.attributes_d['right'] = self._compute_right()
        self.attributes_d['complete'] = self._compute_complete()

        self.start_var = start_var
        self.end_var = end_var

    @property
    def nb_ccr(self):
        return self.get_att("nb_ccr")

    @property
    def nccr(self):
        return int(self.get_att("nb_ccr"))

    @property
    def complete(self):
        return self.get_att("complete")

    @property
    def left(self):
        return self.get_att("left")

    @property
    def right(self):
        return self.get_att("right")

    @property
    def nb_distinct(self):
        return self.get_att("nb_distinct")

    def _get_score(self):
        return self.nb_ccr


    def _init_default(self, chrom, start, end, strand, feature, attributes,
                      source, name):
        super(CircRNA, self).__init__(chrom, start, end, strand,
                                      feature, attributes, source, name)

    @property
    def read_names(self):
        """ Return an array of chimeric reads names """
        reads_str = []
        for cr in self.ccr_array:
            reads_str.append(cr.name)
        return reads_str


    def _compute_nb_cr(self, ccr_array):
        return len(self.ccr_array)

    def _compute_genomic_size(self):
        """ Compute the genomic size """
        return self.end - self.start + 1

    def _compute_left(self):
        """ Compute the coordinates of genomic regions to
            left [chr:start strand] format
        """
        return str(self.chrom)+":"+str(self.start)+self.strand

    def _compute_right(self):
        """ Compute the coordinates of genomic regions to
            right [chr:end strand] format
        """
        return str(self.chrom)+":"+str(self.end)+self.strand

    def _compute_complete(self):
        """ Compute the coordinates of genomic regions to
            complete [chr:start-end strand] format
        """
        return str(self.chrom)+":"+str(self.start)+":"+str(self.end)+self.strand


    def _compute_consensus_start_end(self, ccr_array):
        """
           Some documentation would be welcome
        """
        d_start = defaultdict(list)
        d_end = defaultdict(list)
        for ccr in ccr_array:
            d_start[ccr.start].append(ccr)
            d_end[ccr.end].append(ccr)
        dd_start = defaultdict(list)
        dd_end = defaultdict(list)
        for key_start, value_start in d_start.items():
            dd_start[key_start] = compute_distinct_cr(value_start)
        for key_end, value_end in d_end.items():
            dd_end[key_end] = compute_distinct_cr(value_end)
        cons_start = max(dd_start.items(), key=operator.itemgetter(1))[0]
        cons_end = max(dd_end.items(), key=operator.itemgetter(1))[0]
        # Now variability arround consensus start and end
        start_var = compute_variability(dd_start, cons_start)
        end_var = compute_variability(dd_end, cons_end)
        # for cr in ccr_array: I don't understand the reason for this loop
        #     cr.start = max_start
        #     cr.end = max_end
        #     return cr
        return cons_start, cons_end, start_var, end_var

    def var_info_str(self):
        starts = ",".join(map(str,[cr.start for cr in self.ccr_array]))
        ends =  ",".join(map(str,[cr.end for cr in self.ccr_array]))
        return "\t".join(map(str, [self.chrom, self.start, self.end, self.nb_ccr,
                                   self.start_var,
                                   self.end_var]))

    def str4diffexp(self):
        key = "%s:%d-%d:%s" % (self.chrom, self.start, self.end, self.strand)
        return "%s\t%d" %(key, self.nccr)

    def add_start_annotation(self, annot):
        self.start_annotation.extend(annot)

    def add_end_annotation(self, annot):
        self.end_annotation.extend(annot)

    def add_start_transcript_id(self, annot):
        self.start_transcript_id.extend(annot)

    def add_end_transcript_id(self, annot):
        self.end_transcript_id.extend(annot)

    def add_start_gene_id(self, annot):
        self.start_gene_id.extend(annot)

    def add_end_gene_id(self, annot):
        self.end_gene_id.extend(annot)

    def get_start_annotation_str(self):
        annotations = []
        for exon_tuple in self.start_annotation:
            annotations.append(exon_tuple[0].get_annotation_tag(exon_tuple[1]))
        return ",".join(annotations)

    def get_end_annotation_str(self):
        annotations = []
        for exon_tuple in self.end_annotation:
            annotations.append(exon_tuple[0].get_annotation_tag(exon_tuple[1]))
        return ",".join(annotations)

    def get_start_transcript_id_str(self):
        annotations = []
        for transcript_tuple in self.start_transcript_id:
            annotations.append(transcript_tuple[0].attributes_d["transcript_id"])
        return ",".join(annotations)

    def get_end_transcript_id_str(self):
        annotations = []
        for transcript_tuple in self.end_transcript_id:
            annotations.append(transcript_tuple[0].attributes_d["transcript_id"])
        return ",".join(annotations)

    def get_start_gene_id_str(self):
        annotations = []
        for gene_tuple in self.start_gene_id:
            annotations.append(gene_tuple[0].attributes_d["gene_id"])
        return ",".join(annotations)

    def get_end_gene_id_str(self):
        annotations = []
        for gene_tuple in self.end_gene_id:
            annotations.append(gene_tuple[0].attributes_d["gene_id"])
        return ",".join(annotations)

    def get_intron_annot_str(self, att):
        annotations = []
        for intron, p_overlap in self.intron_annotation:
            upstream = intron.get_att("up_exon").get_att(att)
            downstream = intron.get_att("down_exon").get_att(att)
            annotations.append("%s-%s" % (upstream, downstream))
        return ",".join(annotations)


    def get_intron_annot_gene_str(self, att):
        annotations = []
        for intron, p_overlap in self.intron_annotation:
            upstream = intron.get_att("up_exon").get_att(att)
            strand_up = intron.get_att("up_exon").strand
            biotype_up = intron.get_att("up_exon").biotype
            downstream = intron.get_att("down_exon").get_att(att)
            strand_d = intron.get_att("down_exon").strand
            biotype_d = intron.get_att("down_exon").biotype
            annotations.append("%s_%s_%s-%s_%s_%s" % (upstream, strand_up, biotype_up,
                                                      downstream, strand_d, biotype_d))
        return ",".join(annotations)


    def annotate_intron(self, intron, p_overlap):
        self.intron_annotation.append((intron, p_overlap))


    def get_infra_exonic_annot_str(self, att):
        annotations = []
        for exon in self.infra_exonic_annotation:
            id_ife = exon.get_att(att)
            annotations.append("%s" % (id_ife))
        return ",".join(annotations)


    def get_infra_exonic_gene_annot_str(self, att):
        annotations = []
        for exon in self.infra_exonic_annotation:
            id_ife = exon.get_att(att)
            strand = exon.strand
            biotype = exon.biotype
            annotations.append("%s_%s_%s" % (id_ife, strand, biotype))
        return ",".join(annotations)


    def get_infra_exonic_exons_start_end(self):
        annotations = []
        for exon in self.infra_exonic_annotation:
            start_exon_ife = exon.start
            end_exon_ife = exon.end
            annotations.append("%s_%s" % (start_exon_ife, end_exon_ife))
        return ",".join(annotations)


    def annotate_ife(self, exon):
        self.infra_exonic_annotation.append(exon)


def complement(strand):
    """ molecular biology complement """
    return "-" if strand == "+" else "+"


def compute_distinct_cr(ccr_array):
    """ From a list of CR compute the number of distincts CR
    Read type with different alignments and different CIGAR
    :ccr_array: array of circular chimeric reads
    """
    cigar_dict = defaultdict(list)
    for cr in ccr_array:
        key = "-".join(map(str, [cr.read_type, cr.CIGAR_s1]))
        cigar_dict[key].append(cr)
    return len(cigar_dict.keys())

def compute_variability(d_values, center):
    deviations = []
    occurences = 0
    for pos, occ in d_values.items():
        deviations.append(occ * abs(pos - center))
        occurences += occ
    return sum(deviations)/occurences


def static_vars(**kwargs):
    def decorate(func):
        for k in kwargs:
            setattr(func, k, kwargs[k])
        return func
    return decorate


@static_vars(counter=0, introns_key_dict=defaultdict())
def get_intron_name(chrom, start, end):
    key = "-".join(map(str, [chrom, start, end]))
    if key not in get_intron_name.introns_key_dict:
        get_intron_name.counter += 1
        name = "INTRON_%d" % get_intron_name.counter
        get_intron_name.introns_key_dict[key] = name
    return get_intron_name.introns_key_dict[key]


@static_vars(introns_name_dict=defaultdict())
def new_intron(upstream_exon, donwstreem_exon):
    chrom = upstream_exon.chrom
    start_i = int(upstream_exon.end) + 1
    end_i = int(donwstreem_exon.start) - 1
    strand = upstream_exon.strand
    name = get_intron_name(chrom, start_i, end_i)
    if name in new_intron.introns_name_dict:
        return None
    new_intron.introns_name_dict[name] = 1
    source = "exon_annot"
    feature = "intron"
    attributes = {"up_exon": upstream_exon, "down_exon":  donwstreem_exon}
    intron = Annotation(chrom, start_i, end_i, strand,
                        feature, attributes, source, name=name)
    return intron


def compute_intronic_positions(d_transcripts):
    """
    returns an array of intron Annotation objects
    """
    introns_a = []
    for transcript_exons in d_transcripts.values():
        # sort by transcript
        sorted_exons = natsort.natsorted(transcript_exons,
                                         key=lambda e: (e.chrom, e.start))
        previous = None
        for exon in sorted_exons:
            if previous:
                intron = new_intron(previous, exon)
                if intron is not None:
                    introns_a.append(intron)
            previous = exon
    return introns_a


def write_annotation(annot_a, outputfile, fmt, attributes):
    with open(outputfile, "w") as fout:
        for record in annot_a:
            fout.write(record.annot_str(fmt, attributes)+"\n")


def read_annotation(file, fmt):
    if fmt=="gtf":
        header = ["chrom", "source", "feature", "start", "end",
                  "score", "strand", "phase", "attributes"]
    elif fmt=="bed":
        header = ["chrom", "start", "end", "gene_id", "score",
                  "strand", "source", "feature", "phase", "attributes"]
    else:
        print("Error: Unknown format...")
    return read_annotation_from_file(file, header)


def read_annotation_from_file(file, header):
    annot_df = pd.read_table(file, sep = '\t', names=header,
                             dtype={"chrom":str, "start":str, "end":str})
    annot_a = []
    for index, row in tqdm_wrapper(annot_df):
        chrom = row.chrom
        start = row.start
        end = row.end
        strand = row.strand
        feature = row.feature
        source = row.source if "source" in header else "."
        attributes = row.attributes
        name = row["gene_id"]
        if feature == "circRNA":
            annot = CircRNA(chrom, start, end, strand, feature, attributes,
                            source, name=name)
        else:
            annot = Annotation(chrom, start, end, strand, feature, attributes,
                               source)
        annot_a.append(annot)
    return annot_a

Python Pandas Consensus tqdm Pybedtools natsort singleton From line 4 of scripts/circRNA.py

import os
import argparse
import pandas as pd
import numpy as np
import re
import sys
from collections import defaultdict


def read_file(file):
    """ Read the circ_rnas_annotation.out file and return a pandas dataframe"""
    header = ["chrom", "start", "end", "circ_name", "nb_ccr", 
              "strand", "source", "feature", "phase", "attributes"]
    df = pd.read_table(file, sep='\t', names=header)
    df.replace(np.nan, "", inplace=True)
    return df


def get_folder_to_create(df): 
    folders_name = []
    dict_spl_sp = dict(zip(df["sample"], df["species"]))
    split_keys = [x.split("_") for x in dict_spl_sp.keys()]   

    for key in split_keys:
        folders_name.append(key[0]+"_"+key[1])
    folders = list(set(folders_name))
    return folders


def create_folders(folders):
    for folder in folders:
        if not os.path.exists(folder):
            os.makedirs(folder)


def filter_samples(df):
    df = df.drop(df[(df["sample"].isin(["ssc_liver_3", "ssc_liver_4", 
                                        "ssc_muscle_1", "ssc_testis_1"]))].index)
    df = df.reset_index(drop=True)
    return df


def main():

    # Read samples.tsv file:
    df_samples = pd.read_csv(args.sample_file, sep='\t')

    # Remove ssc_liver_3/4, ssc_muscle_1 and ssc_testis_1 samples:
    df_filtered = filter_samples(df_samples)

    samples_folders = df_filtered["sample"].tolist()
    df_filtered = df_filtered.loc[df_filtered['sample'].isin(samples_folders)] 

    # Get the folder list to create:
    folders = get_folder_to_create(df_filtered)

    # Create folders:
    create_folders(folders)

    # Group folders by species and tissue:
    df_grouped = df_filtered.groupby(["species", "tissue"])

    for key,item in df_grouped:

        group = df_grouped.get_group(key) 
        # Create col with path of folder:
        group["path_file_bed"] = group["sample"]+"/auzeville.bed"

        samples = group["sample"].tolist()
        samples_split = [s.split("_") for s in samples][0]
        folder_name = samples_split[0]+"_"+samples_split[1]

        paths = group["path_file_bed"].tolist()

        df_concat = pd.concat(read_file(f) for f in paths)
        df_concat = df_concat.sort_values(by=['chrom', 'start', 'end'])
        df_final = df_concat.groupby(['chrom', 'start', 'end', 'strand'], as_index=False)['nb_ccr'].sum()
        df_final.to_csv(folder_name+"/"+"auzeville.bed", sep="\t", index=False)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Annotation')
    parser.add_argument('-sp', '--sample_file',
                        required=True, help='Sample file path')
    parser.add_argument('-circ', '--circ_rna_file',
                        required=False, help='Circular RNA file path')
    args = parser.parse_args()
    return args

if __name__ == '__main__':
    args = parse_arguments()
    main()

Python Pandas numpy From line 4 of scripts/cumul_bed.py

import os
import argparse
import pandas as pd
import numpy as np
import re
import sys

def read_file(file):
    """ Read the sample file containing path files and return a pandas dataframe"""
    sample = pd.read_csv(file, sep='\t', dtype=str)
    return sample

def get_log_stats(log_file):
    """ Reading, parsing the log file and returning a dict"""
    log_statistics = []
    with open(log_file) as fin:
        for line in fin:
            line = line.lstrip().rstrip()
            if not line or line.endswith(":") :  # if the line is empty we simply skip it
                continue
            key, value = line.split("|")
            value = value.replace("%", "")
            stat = (key.lstrip().rstrip(), value.lstrip().rstrip())
            log_statistics.append(stat)
    return log_statistics


def get_mapping_stats(runs):
    """
        Parse the STAR Log.final.out for each run to gather alignments
        statistics and returns a pandas dataframe
    """
    statistics = []
    reads = ["R1", "R2"]
    for idx, row in runs.iterrows():
        sample_info = [('sample', row['sample']),
                        ('sample_unit',row['sample_unit'] )]
        for read in reads:
            mapdir = os.path.join(row["mapdir"],"se", read)
            if not os.path.exists(mapdir):
                raise Exception("WARNING mapdir %s is missing" % mapdir)
            log_file = os.path.join(mapdir, "Log.final.out")
            if not os.path.exists(log_file):
                raise Exception("WARNING logfile %s is missing" % log_file)
            sample_info.append(('read', read ))
            log_stats = get_log_stats(log_file)
            sample_stats = sample_info + log_stats

            statistics.append(dict(sample_stats))
    return pd.DataFrame(statistics)


def main(input, output):

    # Read the sample file:
    runs = read_file(input)

    # Reading the log files
    stats = get_mapping_stats(runs)

    # Writing statistics to a file in tsv format
    stats.to_csv(output, sep="\t", index=False)



def parse_arguments():
    parser = argparse.ArgumentParser(description='Sample file')
    parser.add_argument('-i', '--input_file',
                        required=True, help='Sample file')
    parser.add_argument('-o', '--output_file',
                        required=True, help='Statuistics in tsv format')
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main(args.input_file, args.output_file)

Python Pandas numpy STAR From line 5 of scripts/mapping_stats.py

import os
import argparse
import pandas as pd


def read_file(file):
    """ Read the metadata file containing path files and return a pandas dataframe"""
    metadata = pd.read_csv(file, sep='\t', dtype=str)
    return metadata


def check_mapdirs(mapdirs):
    """ Check if the mapdir exists or not"""
    for mapdir in mapdirs:
        if not os.path.exists(mapdir):
            raise Exception("WARNING following mapdir is missing %s" % mapdir)


def unique(liste):
    return sorted(set(liste))


def get_path_files(df):
    """ Group the samples by "sample".
    Return a dataframe with one line per individual.
    """
    df = df.groupby('sample').agg(lambda x: unique(list(x))).reset_index()
    return df


def get_species_short(species):
    animal_keys = ["bos_taurus", "sus_scrofa"]
    animal_values = ["cow", "pig"]
    succint_name = dict(zip(animal_keys, animal_values))
    if species in succint_name:
        return succint_name[species]
    else:
        return species


def write_sample_file(metadata, output_file):
    """Read the metadate file and return a pandas object"""
    sample_ids = []
    species_names = []
    for index, row in metadata.iterrows():
        short_name = get_species_short(row['species'])
        sample = "_".join([short_name, row['tissue'],
                           row['animal_name']])
        species = short_name
        sample_ids.append(sample)
        species_names.append(species)
    metadata['sample'] = sample_ids
    metadata['species'] = species_names
    samples = metadata[['sample', 'sample_unit', 'species', 'sex', 'mapdir']]
    samples.to_csv(output_file, sep="\t", index=False)


def main():

    # Read the metadata file:
    metadata = read_file(args.input_file)

    # Check the mapdir of each sample exists or not:
    check_mapdirs(metadata["mapdir"])

    # Write a new output file 'samples.tsv' with identical key per individual:
    write_sample_file(metadata, args.output_file)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Metadata file')
    parser.add_argument('-i', '--input_file',
                        required=True, help='Metadata file')
    parser.add_argument('-o', '--output_file',
                        required=True, help='Metadata file')
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main()

Python Pandas From line 5 of scripts/prepare.py

class Singleton:
    """
    A non-thread-safe helper class to ease implementing singletons.
    This should be used as a decorator -- not a metaclass -- to the
    class that should be a singleton.
    The decorated class can define one `__init__` function that
    takes only the `self` argument. Also, the decorated class cannot be
    inherited from. Other than that, there are no restrictions that apply
    to the decorated class.
    To get the singleton instance, use the `instance` method. Trying
    to use `__call__` will result in a `TypeError` being raised.
    """

    def __init__(self, decorated):
        self._decorated = decorated

    def instance(self):
        """
        Returns the singleton instance. Upon its first call, it creates a
        new instance of the decorated class and calls its `__init__` method.
        On all subsequent calls, the already created instance is returned.
        """
        try:
            return self._instance
        except AttributeError:
            self._instance = self._decorated()
            return self._instance

    def __call__(self):
        raise TypeError('Singletons must be accessed through `instance()`.')

    def __instancecheck__(self, inst):
        return isinstance(inst, self._decorated)


@Singleton
class Foo:
   def __init__(self):
       print('Foo created')


if __name__ == '__main__':
    # This enables to use this module as a script for testing
    f = Foo() # Error, this isn't how you get the instance of a singleton

    f = Foo.instance() # Good. Being explicit is in line with the Python Zen
    g = Foo.instance() # Returns already created instance

    print(f is g) # True

Python From line 4 of scripts/Singleton.py

import os, re, sys, csv, argparse
import circRNA as circ
import pandas as pd
import numpy as np

# Utility functions:
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)

def get_sample(file):
    """ Get the sample name from the file path"""
    sample_name = os.path.split(os.path.dirname(file))[1]
    return sample_name

def read_file(file):
    """ Read the circ_rnas_annotation.out file and return a pandas dataframe"""
    df = pd.read_table(file, sep='\t')
    df.replace(np.nan, "", inplace=True)
    return df

def intersection(lst1, lst2):
    """ Return the intersection between two lists"""
    return list(set(lst1) & set(lst2))


def get_exonic_circrnas(df):
    """ Get statistics about exonic circRNAs from the annotation_circRNA.out file"""
    # Arrays, dicts:
    exonic_circ_names, monoexonic_circ_names, true_exonic, single_end_circ_names = [], [], [], []
    ccr_start_end_exonic, ccr_single_annot = dict(), dict()
    # Counter exonic circRNAs:
    nb_monoexonic, nb_start_end_exonic, nb_antisens_exonic, nb_single_annotated_junction = 0, 0, 0, 0

    for index, row in df.iterrows():

        exon_id_start = row.exons_id_start.split(",")
        exon_start = str(exon_id_start[0])
        exon_start_tmp = exon_start.split("_")
        exon_start = exon_start_tmp[0]
        exon_id_end = row.exons_id_end.split(",")
        exon_end = str(exon_id_end[0])
        exon_end_tmp = exon_end.split("_")
        exon_end = exon_end_tmp[0]

        if row.nb_ccr >= 5:
            # if ((len(row.exons_id_start) > 0 or len(row.exons_id_end) > 0)
            #     and len(row.intron_name) == 0):
            if len(row.exons_id_start) > 0 or len(row.exons_id_end) > 0:
                if row.strand == "+":
                    if ("5" and "+") in row.exons_id_start:
                        if len(row.exons_id_end)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                        if ("3" and "+") in row.exons_id_end:
                            nb_start_end_exonic += 1
                            true_exonic.append(row)
                            exonic_circ_names.append(row.circ_rna_name)
                            ccr_start_end_exonic[row.circ_rna_name] = row.nb_ccr
                            if ((len(exon_id_start)==1 and len(exon_id_end)==1) and (exon_start == exon_end)):
                                nb_monoexonic += 1
                                monoexonic_circ_names.append(row.circ_rna_name)
                    if ("3" and "+") in row.exons_id_end:
                        if len(row.exons_id_start)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                    if ("3" and "-") in row.exons_id_start:
                        if ("5" and "-") in row.exons_id_end:
                            nb_antisens_exonic += 1
                elif row.strand == "-":
                    if ("5" and "-") in row.exons_id_end:
                        if len(row.exons_id_start)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                        if ("3" and "-") in row.exons_id_start:
                            nb_start_end_exonic += 1
                            true_exonic.append(row)
                            exonic_circ_names.append(row.circ_rna_name)
                            ccr_start_end_exonic[row.circ_rna_name] = row.nb_ccr
                            if ((len(exon_id_start)==1 and len(exon_id_end)==1) and (exon_start == exon_end)):
                                nb_monoexonic += 1
                                monoexonic_circ_names.append(row.circ_rna_name)
                    if ("3" and "-") in row.exons_id_start:
                        if len(row.exons_id_end)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                    if ("5" and "+") in row.exons_id_start:
                        if ("3" and "+") in row.exons_id_end:
                            nb_antisens_exonic += 1
    return true_exonic, nb_start_end_exonic, nb_single_annotated_junction, single_end_circ_names, ccr_single_annot, monoexonic_circ_names, exonic_circ_names


def get_intronic_circrnas(df, exonic_circrnas):
    # Arrays, dicts:
    intronic_circ_names, true_intronic = [], []
    ccr_true_intronic = dict()
    # Counter intronic circRNAs:
    nb_true_intronic = 0

    for index, row in df.iterrows():
        if row.circ_rna_name not in exonic_circrnas:
            if row.nb_ccr >= 5:
                if len(row.intron_name) > 0:
                    if row.strand == "+":
                        if (row.end_i - row.end) in range(-5,60):
                            if (row.start - row.start_i) in range(-5,5) or (row.start==row.start_i):
                                nb_true_intronic += 1
                                true_intronic.append(row)
                                ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                                intronic_circ_names.append(row.circ_rna_name)
                        elif (row.start == row.start_i and (row.end_i - row.end) > 32):
                            nb_true_intronic += 1
                            true_intronic.append(row)
                            ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                            intronic_circ_names.append(row.circ_rna_name)
                    elif row.strand == "-":
                        if (row.start - row.start_i) in range(-5,60):
                            if ((row.end - row.end_i) in range(-5,5) or (row.end == row.end_i)):
                                nb_true_intronic += 1
                                true_intronic.append(row)
                                ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                                intronic_circ_names.append(row.circ_rna_name)
                        elif (row.end == row.end_i and (row.start - row.start_i) > 60):
                            nb_true_intronic += 1
                            true_intronic.append(row)
                            ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                            intronic_circ_names.append(row.circ_rna_name)
    return true_intronic, nb_true_intronic, ccr_true_intronic, intronic_circ_names


def get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names):
    # Subexonic_meg : sens + antisens
    # Subexonic_pleg : sens
    # Arrays, dicts:
    infraexonic_tot_names, infraexonic_circ_names, subexonic, subexonic_antisens_names = [], [], [], []
    ccr_infraexonic = dict()
    # Counter subexonic circRNAs:
    nb_infraexonic, nb_infraexonic_tot, nb_infraexonic_sens, nb_infraexonic_antisens = 0, 0, 0, 0

    for index, row in df.iterrows():
        if row.nb_ccr >= 5:
            if ((len(row.gene_id_ife) > 0) and (row.circ_rna_name not in monoexonic_circ_names)):
                if (("_5_c_+" not in row.exons_id_start) and ("_5_lnc_+" not in row.exons_id_start) and
                    ("_3_c_-" not in row.exons_id_start) and ("_3_lnc_-" not in row.exons_id_start) and
                    ("_3_c_+" not in row.exons_id_end) and ("_3_lnc_+" not in row.exons_id_end) and
                    ("_5_c_-" not in row.exons_id_end) and ("_5_lnc_-" not in row.exons_id_end)):
                    if len(row.gene_id_start) == 0 and len(row.gene_id_end) == 0:
                        nb_infraexonic_tot += 1
                        infraexonic_tot_names.append(row.circ_rna_name)
                        if row.circ_rna_name not in intronic_circ_names:
                            if row.strand == "+":
                                if "+" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                elif "-" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
                            elif row.strand == "-":
                                if "-" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                elif "+" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
                    elif ((len(row.gene_id_start) > 0 and len(row.gene_id_end) == 0) or
                            (len(row.gene_id_start) == 0  and len(row.gene_id_end) > 0)):
                        nb_infraexonic_tot += 1
                        infraexonic_tot_names.append(row.circ_rna_name)
                        if row.circ_rna_name not in intronic_circ_names:
                            if row.strand == "+":
                                if "+" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                elif "-" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
                            elif row.strand == "-":
                                if "-" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                elif "+" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
    return subexonic, infraexonic_circ_names, nb_infraexonic_sens, ccr_infraexonic, subexonic_antisens_names


def get_circrnas(df):
    """ Read the annotation_circRNA.out dataframe and return all statistics to tabular
        format about circRNAs"""
    # Get exonic circRNAs:
    stats_exonic_circrnas = get_exonic_circrnas(df)
    exonic = stats_exonic_circrnas[0]
    monoexonic_names = stats_exonic_circrnas[5]
    exonic_names = stats_exonic_circrnas[6]
    # Get intronic circRNAs:
    stats_intronic_circrnas = get_intronic_circrnas(df, exonic_names)
    intronic = stats_intronic_circrnas[0]
    intronic_names = stats_intronic_circrnas[3]
    # Get subexonic circRNAs:
    stats_subexonic_circrnas = get_subexonic_circrnas(df, monoexonic_names, intronic_names)
    subexonic = stats_subexonic_circrnas[0]
    subexonic_antisens = stats_subexonic_circrnas[4]

    return exonic, subexonic, intronic, subexonic_antisens


def get_stats_circrnas(sample, df, output_file_name):
    """ Read the annotation_circRNA.out dataframe and return all statistics to tabular
        format about circRNAs"""
    nb_circ_tot = len(df)
    # Get circRNAs:
    exonic = get_circrnas(df)[0]
    subexonic = get_circrnas(df)[1]
    subexonic_antisens = get_circrnas(df)[3]
    intronic = get_circrnas(df)[2]

    # Get exonic circRNAs stats:
    stats_exonic = write_comparison_exonic_table(sample, exonic, output_file_name)

    # Get subexonic circRNAs stats:
    nb_subexonic = get_nb_type(sample, subexonic, subexonic_antisens)[0]
    nb_gene_subexonic = get_nb_type(sample, subexonic, subexonic_antisens)[1]
    nb_subexonic_pleg = get_nb_type(sample, subexonic, subexonic_antisens)[2]
    nb_subexonic_meg = get_nb_type(sample, subexonic, subexonic_antisens)[3]
    nb_ccr_pleg = get_nb_type(sample, subexonic, subexonic_antisens)[4]
    nb_ccr_meg = get_nb_type(sample, subexonic, subexonic_antisens)[5]


    if len(stats_exonic) > 5:
        nb_c = stats_exonic[0]
        nb_lnc = stats_exonic[1]
        nb_autres = stats_exonic[2]
        nb_start_end_exonic_selected = stats_exonic[3]
        nb_ccr_start_end_exonic = stats_exonic[4]
        nb_exonic = write_comparison_exonic_table(sample, exonic, args.output_comp_exonic_file)[5]
    else:
        nb_c = stats_exonic[0]
        nb_lnc = 0
        nb_autres = stats_exonic[1]
        nb_start_end_exonic_selected = stats_exonic[2]
        nb_ccr_start_end_exonic = stats_exonic[3]
        nb_exonic = write_comparison_exonic_table(sample, exonic, args.output_comp_exonic_file)[4]

    nb_start_end_exonic = get_exonic_circrnas(df)[1]
    nb_single_annotated_junction = get_exonic_circrnas(df)[2]
    monoexonic_circ_names =  get_exonic_circrnas(df)[5]
    exonic_names = get_exonic_circrnas(df)[6]
    intronic_circ_names = get_intronic_circrnas(df, exonic_names)[3]
    infraexonic_circ_names = get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names)[1]
    single_end_circ_names = get_exonic_circrnas(df)[3]
    nb_infraexonic_sens = get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names)[2]
    nb_true_intronic = get_intronic_circrnas(df, exonic_names)[1]
    ccr_single_annot = get_exonic_circrnas(df)[4]
    ccr_true_intronic = get_intronic_circrnas(df, exonic_names)[2]
    ccr_infraexonic = get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names)[3]

    nb_start_end_false_exonic = nb_start_end_exonic - nb_start_end_exonic_selected
    nb_tot_exonic = nb_start_end_exonic + nb_single_annotated_junction
    nb_common_infra_single = len(intersection(infraexonic_circ_names, single_end_circ_names))
    nb_infraexonic = nb_infraexonic_sens - nb_common_infra_single
    nb_circ_non_annotated = nb_circ_tot - (nb_tot_exonic + nb_infraexonic + nb_true_intronic - nb_start_end_false_exonic)
    nb_ccr_single_annot = sum(ccr_single_annot.values())
    nb_ccr_intronic = sum(ccr_true_intronic.values())
    nb_ccr_infraexonic = sum(ccr_infraexonic.values())

    return "\t".join(map(str,[sample, nb_circ_tot, nb_exonic, nb_lnc, nb_autres, nb_ccr_start_end_exonic,
                              nb_subexonic_pleg, nb_ccr_pleg, nb_subexonic_meg, nb_ccr_meg, nb_gene_subexonic,
                              nb_true_intronic, nb_ccr_intronic]))+"\n"


def write_stats_table(stats, output_file):
    with open(output_file, "w") as fout:
        fout.write(stats)


def write_circ_table(self, header, path, index=None, sep="\t", na_rep='', float_format=None,
                     index_label=None, mode='w', encoding=None, date_format=None, decimal='.'):
    """
    Write a circRNAs list to a tabular-separated file (tsv).
    """
    from pandas.core.frame import DataFrame
    df = DataFrame(self)
    # result is only a string if no path provided, otherwise None
    result = df.to_csv(path, index=index, sep=sep, na_rep=na_rep, float_format=float_format,
                       header=header, index_label=index_label, mode=mode, encoding=encoding,
                       date_format=date_format, decimal=decimal)
    if path is None:
        return result


def write_comparison_exonic_table(sample, circ_rnas, output_file_name):
    # Write exonic table for comparaison between tissues/species:
    df_circ_rnas = pd.DataFrame(circ_rnas, index=None)
    header = ["chrom:start-end,strand", "nb_ccr", "gene_id", "biotype", "genomic_size"]
    nb_start_end_annotated, nb_exonic = 0, 0
    nb_ccr_exonics, biotypes = [], []

    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        for index, row in df_circ_rnas.iterrows():
            exons_id_start_a, exons_id_end_a = [], []
            # Select well-annotated true exonic circRNAs bases on the gene_id:
            # Get key:
            chrom_start = ":".join(map(str,[row.chrom, row.start]))
            chrom_start_end = "-".join(map(str, [chrom_start , row.end]))
            chrom_start_end_strand = ",".join(map(str, [chrom_start_end , row.strand]))
            # Get the ccr number:
            nb_ccr = row.nb_ccr
            nb_ccr_exonics.append(nb_ccr)
            # Get gene_id:
            genes_id_start = list(set(list(row.gene_id_start.split(","))))
            genes_id_end = list(set(list(row.gene_id_end.split(","))))
            # Get biotype:
            exons_id_start = list(set(list(row.exons_id_start.split(","))))
            exons_id_end = list(set(list(row.exons_id_end.split(","))))
            exons_id_start = list(i.split("_") for i in exons_id_start)
            exons_id_end = list(i.split("_") for i in exons_id_end)
            intersect_gene_id = ",".join(list(set(genes_id_start).intersection(genes_id_end)))
            if len(intersect_gene_id) > 0:
                nb_start_end_annotated += 1
                # Get biotype:
                biotypes_start = list(set(list(i[2] for i in exons_id_start)))
                biotypes_end = list(set(list(i[2] for i in exons_id_end)))
                intersect_biotypes = "".join(list(set(biotypes_start).intersection(biotypes_end)))
                # Write line into the output file:
                nb_exonic += 1
                genomic_size = row.end - row.start
                s = [chrom_start_end_strand, nb_ccr, intersect_gene_id, intersect_biotypes, genomic_size]
                tsv_writer.writerow(s)
                biotypes.append(intersect_biotypes)
    # Dictionary with key = biotype and value = counter:
    d = {}
    d["sample"] = sample
    for biotype in biotypes:
        d[biotype] = d.get(biotype, 0) + 1
    values = []
    valid_keys = ["c", "lnc", "sample"]
    for key, value in d.items():
        if key not in valid_keys:
            values.append(value)
    d["autres"] = sum(values)
    nb_ccr_exonic = sum(nb_ccr_exonics)
    for key, value in d.items():
        if ("c" in d.keys()) and ("lnc" in d.keys()):
            return d["c"], d["lnc"], d["autres"], nb_start_end_annotated, nb_ccr_exonic, nb_exonic
        elif ("c" in d.keys() and "lnc" not in d.keys()):
            return d["c"], d["autres"], nb_start_end_annotated, nb_ccr_exonic, nb_exonic


def compute_size(exons_start_end):
    pos_start_end = list(exons_start_end.split("_"))
    pos_start = int(pos_start_end[0])
    pos_end = int(pos_start_end[1])
    size = pos_end - pos_start
    return size

def get_true_exons_gene_id(exons_start_end, exons_id, gene_id):
    if len(exons_start_end)==1:
        exon_start_end = ",".join(exons_start_end)
        exon_id = ",".join(exons_id)
        gene_id = gene_id
    else:
        annot = dict(zip(exons_start_end, exons_id))
        sizes = []
        for key in annot:
            size = compute_size(key)
            sizes.append(size)
        values = zip(exons_start_end, sizes)
        annot_f = dict(zip(exons_id, values))
        min_size = min([i[1] for i in list(annot_f.values())])
        for key, values in annot_f.items():
            if values[1] == min_size:
                exon_id = key
                exon_start_end = values[0]
    return exon_start_end, exon_id, gene_id


def get_nb_type(sample, circ_rnas, subexonic_antisens):
    # Write exonic table for comparaison between tissues/species:
    df_circ_rnas = pd.DataFrame(circ_rnas, index=None)
    nb_sub_exonic, nb_sub_exonic_pleg, nb_sub_exonic_meg = 0, 0, 0
    subexonic_genes, nb_ccr_pleg, nb_ccr_meg = [], [], []

    for index, row in df_circ_rnas.iterrows():
        # Select well-annotated subexonic circRNAs according to the gene_id:
        # Get gene_id:
        genes_id = list(set(list(row.gene_id_ife.split(","))))
        genes_id = list(i.split("_") for i in genes_id)
        exons_start_end = sorted(set(row.exons_start_end_ife.split(",")), key=row.exons_start_end_ife.split(',').index)
        exon_id = sorted(set(row.exon_id_ife.split(",")), key=row.exon_id_ife.split(',').index)
        gene_id = ",".join(list(set(list(row.gene_id_ife.split(",")))))
        biotypes = []

        exon_start_end = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[0]
        exon_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[1]
        gene_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[2]

        row.exons_start_end_ife = exon_start_end
        row.exon_id_ife = exon_id
        row.gene_id_ife = gene_id

        if len(genes_id)==1:
            for i in genes_id:
                # Subexonic pleg circRNAs:
                if ('c' in i or 'lnc' in i or 'pseudo' in i) and (row.circ_rna_name not in subexonic_antisens):
                    nb_sub_exonic += 1
                    nb_sub_exonic_pleg += 1
                    nb_ccr_pleg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                    # Subexonic meg circRNAs:
                elif 'c' not in i and 'lnc' not in i and 'pseudo' not in i and 'rRNA' not in i:
                    nb_sub_exonic += 1
                    nb_sub_exonic_meg += 1
                    nb_ccr_meg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                elif 'rRNA' in i:
                    pass

        if len(genes_id) > 1:
            for i in genes_id:
                if len(i) == 3:
                    biotypes.append(i[2])
                    biotypes = list(set(biotypes))
                elif len(i) > 3:
                    p_biotypes = [i[2],i[3]]
                    biotypes = [p for p in p_biotypes]
            if 'rRNA' in biotypes:
                pass
            # Subexonic pleg circRNAs:
            elif ((biotypes == ['c'] or biotypes == ['lnc'] or biotypes == ['pseudo']
                or ('c' and 'pseudo' in biotypes) or ('c' and 'lnc' in biotypes)
                or ('lnc' and 'pseudo' in biotypes)) and (row.circ_rna_name not in subexonic_antisens)):
                nb_sub_exonic += 1
                nb_sub_exonic_pleg += 1
                nb_ccr_pleg.append(row.nb_ccr)
                subexonic_genes.append(gene_id)
            # Subexonic meg circRNAs:
            else:
                nb_sub_exonic += 1
                nb_sub_exonic_meg += 1
                nb_ccr_meg.append(row.nb_ccr)
                subexonic_genes.append(gene_id)

    nb_subexonic_genes = len(list(set([val for sublist in subexonic_genes for val in sublist])))
    nb_ccr_p = sum(nb_ccr_pleg)
    nb_ccr_m = sum(nb_ccr_meg)
    return nb_sub_exonic, nb_subexonic_genes, nb_sub_exonic_pleg, nb_sub_exonic_meg, nb_ccr_p, nb_ccr_m


def write_subexonic_tables(sample, circ_rnas, subexonic_antisens, output_file_name_meg, output_file_name_pleg):
    # Write exonic table for comparaison between tissues/species:
    df_circ_rnas = pd.DataFrame(circ_rnas, index=None)
    nb_sub_exonic, nb_sub_exonic_pleg, nb_sub_exonic_meg = 0, 0, 0
    subexonic_genes, nb_ccr_pleg, nb_ccr_meg = [], [], []
    header = list(df_circ_rnas.columns.values.tolist())

    with open(output_file_name_meg, 'w') as fout_meg, open(output_file_name_pleg, 'w') as fout_pleg:
        # Subexonic pleg circRNAs:
        tsv_writer_pleg = csv.writer(fout_pleg, delimiter='\t')
        tsv_writer_pleg.writerow(header)
        # Subexonic meg circRNAs:
        tsv_writer_meg = csv.writer(fout_meg, delimiter='\t')
        tsv_writer_meg.writerow(header)

        for index, row in df_circ_rnas.iterrows():
            # Select well-annotated subexonic circRNAs according to the gene_id:
            # Get gene_id:
            genes_id = list(set(list(row.gene_id_ife.split(","))))
            genes_id = list(i.split("_") for i in genes_id)
            exons_start_end = sorted(set(row.exons_start_end_ife.split(",")), key=row.exons_start_end_ife.split(',').index)
            exon_id = sorted(set(row.exon_id_ife.split(",")), key=row.exon_id_ife.split(',').index)
            gene_id = ",".join(list(set(list(row.gene_id_ife.split(",")))))
            biotypes = []

            exon_start_end = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[0]
            exon_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[1]
            gene_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[2]

            row.exons_start_end_ife = exon_start_end
            row.exon_id_ife = exon_id
            row.gene_id_ife = gene_id

            if len(genes_id)==1:
                for i in genes_id:
                    # Subexonic pleg circRNAs:
                    if ('c' in i or 'lnc' in i or 'pseudo' in i) and (row.circ_rna_name not in subexonic_antisens):
                        nb_sub_exonic += 1
                        nb_sub_exonic_pleg += 1
                        nb_ccr_pleg.append(row.nb_ccr)
                        subexonic_genes.append(gene_id)
                        s = row
                        tsv_writer_pleg.writerow(s)
                    # Subexonic meg circRNAs:
                    elif 'c' not in i and 'lnc' not in i and 'pseudo' not in i and 'rRNA' not in i:
                        nb_sub_exonic += 1
                        nb_sub_exonic_meg += 1
                        nb_ccr_meg.append(row.nb_ccr)
                        subexonic_genes.append(gene_id)
                        s = row
                        tsv_writer_meg.writerow(s)
                    elif 'rRNA' in i:
                        pass

            if len(genes_id) > 1:
                for i in genes_id:
                    if len(i) == 3:
                        biotypes.append(i[2])
                        biotypes = list(set(biotypes))
                    elif len(i) > 3:
                        p_biotypes = [i[2],i[3]]
                        biotypes = [p for p in p_biotypes]
                if 'rRNA' in biotypes:
                    pass
                # Subexonic pleg circRNAs:
                elif ((biotypes == ['c'] or biotypes == ['lnc'] or biotypes == ['pseudo']
                    or ('c' and 'pseudo' in biotypes) or ('c' and 'lnc' in biotypes)
                    or ('lnc' and 'pseudo' in biotypes)) and (row.circ_rna_name not in subexonic_antisens)):
                    nb_sub_exonic += 1
                    nb_sub_exonic_pleg += 1
                    nb_ccr_pleg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                    s = row
                    tsv_writer_pleg.writerow(s)
                # Subexonic meg circRNAs:
                else:
                    nb_sub_exonic += 1
                    nb_sub_exonic_meg += 1
                    nb_ccr_meg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                    s = row
                    tsv_writer_meg.writerow(s)

    nb_subexonic_genes = len(list(set([val for sublist in subexonic_genes for val in sublist])))
    nb_ccr_p = sum(nb_ccr_pleg)
    nb_ccr_m = sum(nb_ccr_meg)
    return nb_sub_exonic, nb_subexonic_genes, nb_sub_exonic_pleg, nb_sub_exonic_meg, nb_ccr_p, nb_ccr_m


def write_circrnas_tables(sample, df, exonic_circrnas, intronic_circrnas, subexonic_circrnas, subexonic_antisens):
    header = list(df.columns.values.tolist())
    if len(exonic_circrnas)>0:
        # Write the exonic circRNAs table:
        write_circ_table(exonic_circrnas, header, args.output_exonic_file)
    else:
        with open(args.output_exonic_file, 'w') as fout:
            tsv_writer = csv.writer(fout, delimiter='\t')
            tsv_writer.writerow(header)
    if len(intronic_circrnas)>0:
        # Write the intronic circRNAs table:
        write_circ_table(intronic_circrnas, header, args.output_intronic_file)
    else:
        with open(args.output_intronic_file, 'w') as fout:
            tsv_writer = csv.writer(fout, delimiter='\t')
            tsv_writer.writerow(header)

    eprint("subexonic_circrnas", len(subexonic_circrnas))
    eprint("subexonic_antisens", len(subexonic_antisens))
    if len(subexonic_circrnas)>0 or len(subexonic_antisens)>0:
        # Write the subexonic circRNAs tables:
        write_subexonic_tables(sample, subexonic_circrnas, subexonic_antisens, args.output_subexonic_meg_file,
                               args.output_subexonic_pleg_file)
    else:
        eprint("ok")
        open(args.output_subexonic_meg_file, 'a').close()
        open(args.output_subexonic_pleg_file, 'a').close()

def main():
    # Get sample name:
    sample = get_sample(args.input_file)

    # Read the circRNAs annotation file:
    df_circ_annot = read_file(args.input_file)

    # Get the list of exonic, subexonic and intronic circRNAs:
    exonic_circrnas = get_circrnas(df_circ_annot)[0]
    subexonic_circrnas = get_circrnas(df_circ_annot)[1]
    intronic_circrnas = get_circrnas(df_circ_annot)[2]
    subexonic_antisens = get_circrnas(df_circ_annot)[3]

    # Write exonic, intronic and subexonic circRNAs tables:
    circrnas_tables = write_circrnas_tables(sample, df_circ_annot, exonic_circrnas,
                                            intronic_circrnas, subexonic_circrnas, subexonic_antisens)

    # Compute statistics about exonic, subexonic and intronic circRNAs:
    stats = get_stats_circrnas(sample, df_circ_annot, args.output_comp_exonic_file)

    # Write the circRNAs statistics table:
    write_stats_table(stats, args.output_stats_file)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Return annotation statistics tables')
    parser.add_argument('-i', '--input_file', required=True,
                        help='Annotation circRNAs file')
    parser.add_argument('-o_stats', '--output_stats_file', required=False,
                        default="stats_annotation.tsv",
                        help='Table containing statistics of annotation')
    parser.add_argument('-oi', '--output_intronic_file', required=False,
                        default="intronic_circRNAs.tsv",
                        help='Table containing intronic circRNAs')
    parser.add_argument('-oe', '--output_exonic_file', required=False,
                        default="exonic_circRNAs.tsv",
                        help='Table containing exonic circRNAs')
    parser.add_argument('-oce', '--output_comp_exonic_file', required=False,
                        default="exonic_summary.tsv",
                        help='Table containing exonics circRNAs for comparisons')
    parser.add_argument('-osepleg', '--output_subexonic_pleg_file', required=False,
                        default="subexonic_pleg_circRNAs.tsv",
                        help='Table containing subexonic circRNAs')
    parser.add_argument('-osemeg', '--output_subexonic_meg_file', required=False,
                        default="subexonic_meg_circRNAs.tsv",
                        help='Table containing subexonic circRNAs')
    args = parser.parse_args()
    return args

if __name__ == '__main__':
    args = parse_arguments()
    main()

Python Pandas numpy Circr From line 4 of scripts/stats_annotation.py

import os
import argparse
import pandas as pd
import numpy as np
import re
import sys


# Utility functions
def create_folder(path):
    if not os.path.exists(path):
        os.mkdir(path)

def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)

def read_file(file):
    """ Read the sample file containing path files and return a pandas dataframe"""
    sample = pd.read_csv(file, sep='\t', dtype=str)
    return sample

def check_mapdirs(mapdirs):
    """ Check if the mapdir exists or not"""
    for mapdir in mapdirs:
        if not os.path.exists(mapdir):
            raise Exception("WARNING following mapdir is missing %s" % mapdir)

def read_log_file(sample):
    """ Read and clean the Log.final.out files containing the summary mapping statistics
        after mapping job is complete and return an array of the pandas dataframe
        containing statistics"""
    stats = []
    d = {}
    reads = ["R1", "R2"]
    paths_out = []
    for index, row in sample.iterrows():
        sample_name = row["sample"]
        sample_unit_name = row["sample_unit"]
        path = row["mapdir"]
        for read in reads:
            path_file = path+"/se/"+read+"/Log.final.out"
            f = open(path_file, "r")
            for line in f:
                l = line.split("|")
                stats.append(l)
            for stat in stats:
                if len(stat) == 1:
                    stats.remove(stat)
            for stat in stats:
                stat[0] = ' '.join(stat[0].split("\t"))
                stat[0] = ' '.join(stat[0].split())
                stat[1] = ' '.join(stat[1].split())
                stat[1] = stat[1].replace("%", "")
                d['sample'] = sample_name
                d['sample_unit'] = sample_unit_name
                d['read'] = read
                d[stat[0]] = stat[1]
            f.close()
            create_folder("reports/"+sample_unit_name+"/")
            create_folder("reports/"+sample_unit_name+"/"+read+"/")
            path_out = "reports/"+sample_unit_name+"/"+read+"/"+args.output_file
            paths_out.append(path_out)
            write_sample_file(d, path_out)
    return paths_out


def write_sample_file(dict, output_file):
    """Get value from a statistic dictionnary and return a pandas object"""
    df = pd.DataFrame([dict])
    df.to_csv(output_file, sep="\t", index=False)


def write_final_stat_tab(paths_out, output_file):
    df_final = pd.DataFrame()
    for path in paths_out:
        df = pd.read_csv(path, sep="\t")
        df_final = pd.concat([df_final, df], ignore_index=True, sort =False)
    df_final.to_csv(output_file, sep="\t", index=False)


def main():

    # Read the sample file:
    sample = read_file(args.input_file)

    # Check the mapdir of each sample exists or not:
    check_mapdirs(sample["mapdir"])

    create_folder("reports")

    # Read log files and get the path of a table containing statistics:
    paths_out = read_log_file(sample)

    # Write the statistic table with all samples:
    write_final_stat_tab(paths_out, args.output_file)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Sample file')
    parser.add_argument('-i', '--input_file',
                        required=True, help='Sample file')
    parser.add_argument('-o', '--output_file', required=False, 
                        default="mapping_stat.tsv", help='Sample file')
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main()

Python Pandas numpy From line 5 of scripts/stats_mapping.py

import os, re, sys, csv, argparse
import pandas as pd
import numpy as np

# Utility functions:
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)


def read_file(file):
    """ Read the circRNAs annotation file and return a pandas dataframe"""
    df = pd.read_table(file, sep='\t')
    df.replace(np.nan, "", inplace=True)
    return df


def get_item(item):
    return ''.join(list(set(item)))


def write_summary_meg_table(df, output_file_name, min_size):
    header = ["gene_name", "biotype", "nb_circ_sens", "nb_circ_antisens", "nb_ccr", "exon_name", "start_exon", "end_exon"]
    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        if len(df) > 0:
            for key, item in df:
                exon_start = get_item(item.exons_start_end_ife).split("_")[0]
                exon_end = get_item(item.exons_start_end_ife).split("_")[1]
                if int(exon_end) - int(exon_start) > min_size:
                    strand = list(item.strand)
                    gene_id = list(key.split("_"))[0]
                    strand_gene_ife = list(key.split("_"))[1]
                    # Get biotype:
                    if len(list(key.split("_"))) == 3:
                        biotype = list(key.split("_"))[2]
                    else:
                        biotype = list(key.split("_"))[2]+"_"+list(key.split("_"))[3]
                    nb_circ_rna = len(df.get_group(key))
                    nb_circ_sens = strand.count(strand_gene_ife)
                    nb_circ_antisens = nb_circ_rna - nb_circ_sens
                    nb_ccr = sum(item.nb_ccr)
                    exon_name = get_item(item.exon_id_ife)
                    # print(df.get_group(key), "\n\n")
                    row = [gene_id, biotype, nb_circ_sens, nb_circ_antisens, nb_ccr, exon_name, exon_start, exon_end]
                    tsv_writer.writerow(row)


def write_summary_intronic_table(df, output_file_name, min_size):
    header = ["gene_name", "biotype", "intron_name", "nb_circ_rna", "nb_ccr", "start_intron", "end_intron"]
    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        if len(df) > 0:
            for key, item in df:
                start_i = ''.join([str(round(x)) for x in list(set(item.start_i))])
                end_i = ''.join([str(round(x)) for x in list(set(item.end_i))])
                if int(end_i) - int(start_i) > min_size:
                    intron_name = key
                    gene_id = get_item(item.gene_id_i).split("_")[0]
                    # Get biotype:
                    biotype = get_item(item.gene_id_i).split(",")
                    if len(biotype)>1:
                        biotype = biotype[0].split("_")[4]
                    else:
                        biotype = ''.join(biotype).split("_")[4]
                    nb_ccr = sum(item.nb_ccr)
                    nb_circ_rna = len(df.get_group(key))
                    row = [gene_id, biotype, intron_name, nb_circ_rna, nb_ccr, start_i, end_i]
                    tsv_writer.writerow(row)


def write_summary_pleg_table(df, output_file_name, min_size):
    header = ["gene_name", "nb_gene", "biotype", "nb_circ_sens", "nb_circ_antisens", "nb_ccr", "exon_name", "start_exon", "end_exon"]
    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        if len(df) > 0:
            for key, item in df:
                exon_start = get_item(item.exons_start_end_ife).split("_")[0]
                exon_end = get_item(item.exons_start_end_ife).split("_")[1]
                if int(exon_end) - int(exon_start) > min_size:
                    # Get gene_id and biotype:
                    gene_id_ife = get_item(item.gene_id_ife).split(",")
                    nb_gene = len(gene_id_ife)
                    if len(gene_id_ife) > 1:
                        gene_id = gene_id_ife[0].split("_")[0]+","+gene_id_ife[1].split("_")[0]
                        strand_gene_ife = get_item(gene_id_ife[0].split("_")[1]+gene_id_ife[1].split("_")[1])
                        biotype = get_item(gene_id_ife[0].split("_")[2]+gene_id_ife[1].split("_")[2])
                    else:
                        gene_id = ''.join(gene_id_ife).split("_")[0]
                        strand_gene_ife = ''.join(gene_id_ife).split("_")[1]
                        biotype = ''.join(gene_id_ife).split("_")[2]
                    strand = list(item.strand)
                    nb_circ_rna = len(df.get_group(key))
                    nb_circ_sens = strand.count(strand_gene_ife)
                    nb_circ_antisens = nb_circ_rna - nb_circ_sens
                    nb_ccr = sum(item.nb_ccr)
                    exon_name = key
                    row = [gene_id, nb_gene, biotype, nb_circ_sens, nb_circ_antisens, nb_ccr, exon_name, exon_start, exon_end]
                    tsv_writer.writerow(row)

def check_file(file):
    with open(file, 'r') as f:
        if len(f.read()) <= 1:
            result="Empty"
        else:
            result="Not empty"
    return result


def main():

    # Get min_size:
    min_size = int(args.min_size)

    # Read input annotation circRNAs files:
    if check_file(args.input_meg_file) == "Not empty":
        df_meg = read_file(args.input_meg_file)
        # Meg circRNAs:
        ## Group by gene:
        df_meg = df_meg.groupby('gene_id_ife')
        write_summary_meg_table(df_meg, args.output_meg_file, min_size)
    else:
        open(args.output_meg_file, 'a').close()

    if check_file(args.input_intronic_file) == "Not empty":
        df_intronic = read_file(args.input_intronic_file)
        # Intronic circRNAs:
        ## Group by intron name:
        df_intronic = df_intronic.groupby('intron_name')
        write_summary_intronic_table(df_intronic, args.output_intronic_file, min_size)
    else:
        open(args.output_intronic_file, 'a').close()

    if check_file(args.input_pleg_file) == "Not empty":
        df_pleg = read_file(args.input_pleg_file)
        # Pleg circRNAs:
        ## Group by exon:
        df_pleg = df_pleg.groupby('exon_id_ife')
        write_summary_pleg_table(df_pleg, args.output_pleg_file, min_size)
    else:
        open(args.output_pleg_file, 'a').close()


def parse_arguments():
    parser = argparse.ArgumentParser(description='Return summary circRNAs tables')
    parser.add_argument('-ip', '--input_pleg_file', required=True,
                        help='Annotation pleg circRNAs file')
    parser.add_argument('-im', '--input_meg_file', required=True,
                        help='Annotation meg circRNAs file')
    parser.add_argument('-ii', '--input_intronic_file', required=True,
                        help='Annotation intronic circRNAs file')
    parser.add_argument('-op', '--output_pleg_file', required=False,
                        default="pleg_summary.tsv",
                        help='Table containing summary of pleg circRNAs annotation')
    parser.add_argument('-om', '--output_meg_file', required=False,
                        default="meg_summary.tsv",
                        help='Table containing summary of meg circRNAs annotation')
    parser.add_argument('-oi', '--output_intronic_file', required=False,
                        default="intronic_summary.tsv",
                        help='Table containing summary of intronic circRNAs annotation')
    parser.add_argument('-ms', '--min_size', required=False, default=55,
                        help='Minimum size of circRNAs')
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main()

Python Pandas numpy From line 4 of scripts/summary_table.py

                                     # [-fmt {bed,gtf}] [-o OUTPUT_FILE]

# Imports
import circRNA as circ
from collections import defaultdict
import sys, os, re, argparse, natsort
import numpy as np
import csv
import pybedtools as pybed

# Utility functions
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)
    dsp_control = circ.DisplayControl.instance()
    if dsp_control.verbose:
        print(*args,  file=sys.stderr, **kwargs)


def exonic_annotations(circ_rnas, exons):
    """
    Annotate the circrnas according to the exon annotations
    circ_rnas: an array of cirRNAs object
    exons: an array of circRNAs objects
    returns the circ_rnas array with the circRNAs annotated

    We annotate each circRNA's splice junction with the corresponding
    exons spice junctions using a dict of exons splice junctions
    This means that splice junction correspondances must be exact
    """
    eprint("\nCompute junction dictionary...\n")
    annotation_junctions = compute_junction_dict(exons)
    eprint("\nCircRNA annotation...\n")
    set_annotation(circ_rnas, annotation_junctions)

    return circ_rnas


def junction_key(chrom, pos):
    return "%s:%s" %(chrom, pos)


def compute_junction_dict(annot_array):
    """
    Compute a dictionnary of splic junctions:
    keys: chrom:pos
    values: list of tuples
    tuple: (exon, end_type)
    end_type: 3 (resp 5) for donor (resp acceptor)
    """
    junction_dict = defaultdict(list)
    seen_exons = defaultdict()
    for annot in annot_array:
        if annot.get_att("exon_id") in seen_exons:
            continue
        seen_exons[annot.get_att("exon_id")] = 1
        key_start = junction_key(annot.chrom, annot.start)
        key_end = junction_key(annot.chrom, annot.end)
        if annot.strand == "+":
            start_end_type = "5"
            end_end_type = "3"
        else:
            start_end_type = "3"
            end_end_type = "5"
        # annotating start
        junction_dict[key_start].append((annot, start_end_type))
        # annotating end
        junction_dict[key_end].append((annot, end_end_type))
    return junction_dict


def set_annotation(circ_rnas, annotation_junctions):
    """
    An annotation string would be very much appreciated
    """
    for circ_rna in circ_rnas:
        junction_start = junction_key(circ_rna.chrom, circ_rna.start)
        junction_end = junction_key(circ_rna.chrom, circ_rna.end)
        if junction_start in annotation_junctions:
            circ_rna.add_start_annotation(annotation_junctions[junction_start])
            circ_rna.add_start_transcript_id(annotation_junctions[junction_start])
            circ_rna.add_start_gene_id(annotation_junctions[junction_start])
        if junction_end in annotation_junctions:
            circ_rna.add_end_annotation(annotation_junctions[junction_end])
            circ_rna.add_end_transcript_id(annotation_junctions[junction_end])
            circ_rna.add_end_gene_id(annotation_junctions[junction_end])


def intronic_annotations(circ_rnas, exons):
    """
    Annotate the circrnas according to the introns annotations
    circ_rnas: an array of cirRNAs object
    exons: an array of circRNAs objects
    returns the circ_rnas array with the circRNAs intrno-annotated

    Because at least on circRNA junction is away from existing
    splice junctions, the annotation is solely based on overlap between
    the circRNA and an intron :
      - [start, end] interval must be within the intron
      - this interval must cover at least 90% of the intron
    """

    #eprint("\nCompute introns from exons...\n")
    transcripts = get_exons_per_transcript(exons)
    introns = circ.compute_intronic_positions(transcripts)

    #eprint("\nCompute intersection with circRNAs...\n")
    circ_rnas = annotate_intron_intersection(circ_rnas, introns)

    return circ_rnas


def annotate_intron_intersection(circ_rnas, introns):
    pybed_introns = annot_to_pybed(introns)
    pybed_circrnas = annot_to_pybed(circ_rnas)
    intersections = pybed_circrnas.intersect(pybed_introns, f=0.95, wo=True) 

    circrna_d = {circ.name: circ for circ in circ_rnas}
    introns_d = {intron.name: intron for intron in introns}
    for i in intersections:
        intron = introns_d[i[9]]
        circ = circrna_d[i[3]]
        overlap = i[-1]
        p_overlap = float(overlap)/intron.length # taille circ sur taille intron
        circ.annotate_intron(intron, p_overlap)

    return circ_rnas


def annot_to_pybed(annots):
    pybed_annot = []
    for a in annots:
        pybed_annot.append(a.topybed())
    return pybed.BedTool(pybed_annot).sort()


def infra_exonic_annotations(circ_rnas, exons):
    """
    """
    pybed_exons = annot_to_pybed_ife(exons)
    pybed_circrnas = annot_to_pybed_ife(circ_rnas)  
    intersections = pybed_circrnas.intersect(pybed_exons, f=0.95, wo=True)

    circrna_d = {circ.name: circ for circ in circ_rnas}
    exons_d = {exon.name: exon for exon in exons}
    for i in intersections:
        exon = exons_d[i[9]]
        circ = circrna_d[i[3]]
        circ.annotate_ife(exon)

    return circ_rnas


def annot_to_pybed_ife(annots):
    pybed_annot = []
    for a in annots:
        pybed_annot.append(a.topybed_ife())
    return pybed.BedTool(pybed_annot).sort()


def get_exons_per_transcript(annots):
    """
    returns a dictionnary
    key: ...
    """
    d_transcript = defaultdict(list)
    for exon in annots:
        transcript_id = exon.get_att("transcript_id")
        d_transcript[transcript_id].append(exon)
    return d_transcript


def write_annotated_circrnas(circ_annotated, outputfile):
    circ_rnas = natsort.natsorted(circ_annotated, key=lambda e: (e.chrom, e.start))
    header = "\t".join(["chrom", "start", "end", "strand", "nb_ccr", "circ_rna_name", "exons_id_start", "exons_id_end", 
                        "transcript_id_start", "transcript_id_end", "gene_id_start", "gene_id_end", 
                        "intron_name", "start_i", "end_i", "gene_id_i", "exon_id_i", 
                        'exon_id_ife', "exons_start_end_ife", "gene_id_ife"])
    with open(outputfile, "w") as fout:
        fout.write(header+"\n")
        for circ_rna in circ_rnas:
            intron_annot = circ_rna.intron_annotation 
            if len(intron_annot) == 0:
                start_i = ""
                end_i = ""
                name_i = ""
            else:
                for annot_i in intron_annot:
                    intron = annot_i[0]
                    start_i = intron.start
                    end_i = intron.end
                    name_i = intron.name
            att_d = dict(re.split(" |=", item) for item in circ_rna.attributes.split("; "))
            nb_ccr = att_d["nb_ccr"].replace('\"', '')
            s = [circ_rna.chrom, circ_rna.start, circ_rna.end, 
                circ_rna.strand, nb_ccr, circ_rna.name,
                circ_rna.get_start_annotation_str(), 
                circ_rna.get_end_annotation_str(),
                circ_rna.get_start_transcript_id_str(), 
                circ_rna.get_end_transcript_id_str(), 
                circ_rna.get_start_gene_id_str(), 
                circ_rna.get_end_gene_id_str(),
                name_i, start_i, end_i, 
                circ_rna.get_intron_annot_gene_str('gene_id'),
                circ_rna.get_intron_annot_str('exon_id'), 
                circ_rna.get_infra_exonic_annot_str("exon_id"),
                circ_rna.get_infra_exonic_exons_start_end(),
                circ_rna.get_infra_exonic_gene_annot_str("gene_id")]
            fout.write( "%s\n" % "\t".join(map(str, s)))


def main():

    # Exonic:
    eprint("\nIdentification of EXONIC circRNAs:\n")
    eprint("\nReading circRNA file...\n")
    circ_rnas = circ.read_annotation(args.circ_rna_file, fmt=args.annot_format)
    eprint("\nReading annotation file...\n")
    annots = circ.read_annotation(args.annotation_file, fmt=args.annot_format)

    eprint("\nExonic annotations..\n")
    circ_rnas = exonic_annotations(circ_rnas, annots)

    eprint("\nIntronic annotations..\n")
    circ_rnas = intronic_annotations(circ_rnas, annots)

    #####
    eprint("\nInfra-exonic annotations..\n")
    circ_rnas = infra_exonic_annotations(circ_rnas, annots)
    #####

    eprint("\nWrite annotations..\n")
    write_annotated_circrnas(circ_rnas, args.output)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Annotation')
    parser.add_argument('-circ', '--circ_rna_file',
                        required=True, help='Circular RNA file path')
    parser.add_argument('-fmt', '--annot_format',
                        required=False, default="bed", choices=["bed","gtf"],
                        help='Annotation file format')
    parser.add_argument('-annot', '--annotation_file',
                        required=True, help='Annotataion file path')
    parser.add_argument('-oe', '--exonic_output_file',
                        required=False, default='exonic_circ_rnas_annot.out',
                        help='Exonic output file path')
    parser.add_argument('-oi', '--intronic_output_file',
                        required=False, default='annotation_introns.bed',
                        help='Intronic output file path')
    parser.add_argument('-o', '--output',
                        required=True,
                        help='Annotation output file path')
    parser.add_argument('--verbose', help='Print more info', action='store_true')

    args = parser.parse_args()

    dsp_control = circ.DisplayControl.instance()
    dsp_control.verbose = args.verbose

    return args


if __name__ == '__main__':
    args = parse_arguments()
    main()

Python numpy Pybedtools Circr From line 3 of scripts/circRNA_annotation.py

                                    # [-min_cr CR_THRESHOLD] [-tol TOLERANCE]
                                    # [-fmt {bed,gtf}] [-o OUTPUT_FILE]

# Imports
import sys, os, re, argparse, natsort, time
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
from collections import defaultdict
import pandas as pd
import numpy as np
from networkx import (draw, DiGraph, Graph)
import networkx as netwx
import statistics
from statistics import mode
from time import sleep
from tqdm import tqdm
import circRNA as circ


# Utility functions
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)
    dsp_control = circ.DisplayControl.instance()
    if dsp_control.verbose:
        print(*args,  file=sys.stderr, **kwargs)


def merge_junctions(df_a):
    """ Merge two dataframes
    :df_r1: dataframe of R1 reads
    :df_r2: dataframe of R2 reads
    :return the merged dataframe
    """
    df_merged = pd.concat(df_a,ignore_index=True)
    return df_merged


def read_junction_file(file, read_type=None):
    """ Read the data, adds the header of the table
        and adds a column with the corresponding read type.
    :file: file to read
    :read_type: read type (R1 or R2)
    :return the dataframe of reads
    """
    names = ["chr_donor", "pos_donor","strand_donor","chr_acceptor",
            "pos_acceptor","strand_acceptor","junction_type",
            "repeat_left","repeat_right","read_name","pos_s1",
            "CIGAR_s1","pos_s2","CIGAR_s2"] # Columns names of the input data
    # Read data as table with pandas:
    df = pd.read_table(file, sep = '\t', names=names, dtype=str,
                       converters = {'pos_acceptor':int, 'pos_donor':int})
    df['read_type'] = read_type # Add column "read" to specify the read type
    return df


def get_valid_circjunctions(df):
    """ Selects valid circular chimeric reads (ccr)
        according to the conditions of the "valid_ccr" function.
    :df: datatframe to filter
    return: the filtered dataframe
    """
    # Parsing dataframe to select valid rows
    # Boolean array with true if the read is a valid ccr:
    valid_ccrs = []
    #for index, row in tqdm(df.iterrows(), total=df.shape[0]):
    for index, row in circ.tqdm_wrapper(df):
        valid_ccrs.append(valid_ccr(row))
    df_circ_cr = df[valid_ccrs]
    ccr_a = [] # Array of circular junctions
    eprint(" "+"\n"+"Generating array of selected ccr..."+"\n")
    #for index, row in tqdm(df_circ_cr.iterrows(), total=df_circ_cr.shape[0]):
    for index, row in circ.tqdm_wrapper(df_circ_cr):
        ccr =  circ.CCR(row) # Create CCR object
        ccr_a.append(ccr) # Add chimeric reads into array
    eprint("\n")
    return ccr_a


def valid_ccr(row):
    """ Select ccr according to several conditions.
    :row: row of the dataframe
    :return boolean "False" for not valid ccr
            or boolean "True" for valid ccr
    """
    # Regular expression of the left dangling CIGAR:
    dangling_left = re.compile('^\d+S\d+M$')
    # Regular expression of the rigth dangling CIGAR:
    dangling_right = re.compile('^\d+M\d+S$')
    if (row['chr_donor']!=row['chr_acceptor']
            or row['strand_donor']!=row['strand_acceptor']):
        return False
    elif abs(row['pos_donor'] - row['pos_acceptor']) > 10000000:
        return False
    else:
        if row['strand_donor'] == '-' :
            if (row['pos_donor']>row['pos_acceptor']
                    or row['pos_s1']>row['pos_s2']):
                return False
            else:
                if (not dangling_left.match(row['CIGAR_s1'])
                        or not dangling_right.match(row['CIGAR_s2'])):
                    return False
                else:
                    return True
        elif row['strand_donor'] == '+' :
            if (row['pos_donor']<row['pos_acceptor']
                    or row['pos_s1']<row['pos_s2']):
                return False
            else:
                if (not dangling_right.match(row['CIGAR_s1'])
                        or not dangling_left.match(row['CIGAR_s2'])):
                    return False
                else:
                    return True


def get_exact_circrnas(circ_junctions, cr_threshold):
    """
    Detects circ RNAs as CR sharing exactly the same circ signature
    chrom start end strand
    """
    eprint("Exact circular RNA detection")
    circ_rna_dict = defaultdict(list)
    for circ_junc in circ_junctions:
        circ_rna_dict[circ_junc.key].append(circ_junc)

    circ_array = []
    for key, cr_array in circ_rna_dict.items():
        if len(cr_array) >= cr_threshold:
            circ_array.append(circ.CircRNA(args.output_file_format, ccr_array=cr_array))
    return circ_array


def same_circ_signature(cr_a, cr_b, tol):
    """ Two-by-two comparison of chimeric reads
    :cr_a: chimeric read a
    :cr_b: chimeric read b
    :tol: number of different start-end positions tolerated
    :return boolean "False" for not valid conditions
    and "True" for valid conditions
    """
    if cr_a.chrom != cr_b.chrom:
        return False
    if abs(cr_a.start - cr_b.start) > tol:
        return False
    if abs(cr_a.end - cr_b.end) > tol:
        return False
    if cr_a.read_type == cr_b.read_type:
        if cr_a.strand != cr_b.strand:
            return False
    else:
        if cr_a.strand == cr_b.strand:
            return False
    return True


def merge_circ_rnas(component):
    if len(component) == 1:
        return component[0]
    elements = []
    for circ_rna in component:
        elements.extend(circ_rna.ccr_array)
    return circ.CircRNA(args.output_file_format, ccr_array=elements)


def connected_components(circ_a, neighbours, cr_threshold):
    """ Generate an undirected graph and
        cluters ccr into connected components
    :ccr_a: array of ccr object
    :neighbours: dictionary of neighbours chimeric reads
    :return an array of circular RNAs
    """
    undirected = netwx.Graph()
    undirected.add_nodes_from(range(len(circ_a)))
    undirected.add_edges_from(neighbours)
    circ_rnas = []  # Circular RNAs array
    # Parsing of connected components composed of chimeric reads:
    for comp in netwx.connected_components(undirected):
        circ_rna = merge_circ_rnas([circ_a[i] for i in comp])
        if circ_rna.nb_ccr >= cr_threshold:
            circ_rnas.append(circ_rna)
    return circ_rnas


def cumulative_cr(circ_a):
    return sum([c.nb_ccr for c in circ_a])


def compute_fuzzy_circ_rnas(ccr_a, cr_threshold, tolerance):
    """ Get the circular RNA corresponding to each
        connected component of chimeric reads
    :ccr_a: array of ccr object
    :return array of circular RNAs
    """
    if cumulative_cr(ccr_a) < cr_threshold:
        return []
    n = len(ccr_a)
    # Neighbour cr share sign the same circular RNA
    neighbours = []
    circ_rnas = []
    for i in range(n):
        for j in range(i+1, n):
            if ccr_a[j].start > ccr_a[i].end:
                break
            # Two-by-two comparison of chimeric reads:
            if same_circ_signature(ccr_a[i], ccr_a[j], tolerance):
                neighbours.append((i, j))
    circ_rnas = connected_components(ccr_a, neighbours, cr_threshold)
    return circ_rnas


def get_independent_intervals(circ_rnas):
    # Get independant intervals
    independant_intervals = []
    circ_interval = []
    chrom = None
    for circ_rna in circ_rnas:
        if not chrom:
            chrom = circ_rna.chrom
            max_end = circ_rna.end
        if circ_rna.chrom != chrom:
            independant_intervals.append(circ_interval)
            circ_interval = []
            chrom = circ_rna.chrom
            max_end = circ_rna.start
        if circ_rna.chrom != chrom or circ_rna.start > max_end:
            independant_intervals.append(circ_interval)
            circ_interval = []
            chrom = circ_rna.chrom
            max_end = circ_rna.start
        max_end = max(max_end, circ_rna.end)
        circ_interval.append(circ_rna)
    independant_intervals.append(circ_interval)
    return independant_intervals


def get_fuzzy_circrnas(circ_rnas, cr_threshold, tolerance):
    eprint("Fuzzy circular RNA detection")
    circ_rnas = natsort.natsorted(circ_rnas, key=lambda e: (e.chrom,
                                                            e.start))
    intervals = get_independent_intervals(circ_rnas)
    circ_rnas = []
    for interval in intervals:
        fuzzy_circ_rnas = compute_fuzzy_circ_rnas(interval, cr_threshold,
                                                  tolerance)
        if fuzzy_circ_rnas:
            circ_rnas.extend(fuzzy_circ_rnas)
    return circ_rnas


def circrna_detection(circ_juntcions, cr_threshold, tolerance):
    """ Detecte circular RNAs
    :df_ccr: dataframe of circular chimeric reads
    :return the array of circular RNAs detected
    """
    # First pass, detect exact circular RNAs (no tolerance)
    circ_rnas = get_exact_circrnas(circ_juntcions, cr_threshold)
    eprint("nb of tol %d circRNAS %d" % (tolerance, len(circ_rnas)))#########
    # if tolerance > 0 compute fuzzy circrnas from previous list
    if tolerance > 0:
        circ_rnas = get_fuzzy_circrnas(circ_rnas, cr_threshold, tolerance)
    else:
        circ_rnas = [c for c in circ_rnas if c.nb_ccr >= cr_threshold]
    return circ_rnas


def write_tab_circ_rnas(circ_rnas, outputfile, fmt):
    """ Write the table of circ RNA retained
    :circ_rnas: array of circular RNAs object
    :outputfile: the output file path
    :return the table of circular RNAs retained
    :start Chimeric.out.junction file: last intron base in
    the reference genome order in the circular transcript
    :end Chimeric.out.junction file: first intron base in the
    reference genome order in the circular transcript
    :start GTF: first exon position
    :end GTF: last exon position
    """
    nb_line = 0 # name bed file
    circ_rnas = natsort.natsorted(circ_rnas, key=lambda e: (e.chrom, e.start))
    eprint(" "+"\n"+"Writing the table of circular RNAs..."+"\n")
    with open(outputfile, "w") as fout:
        eprint("\nConvertion the Chimeric.out.junction format to the %s format...\n" % fmt)
        for circ_rna in circ_rnas:
            # Convert Chimeric.out.junction format to out format:
            circ_rna.start += 1
            circ_rna.end -= 1
            # Convert out format to the chosen format:
            circ_rna.write_annot(nb_line, fout, fmt)
            nb_line += 1


def stats(min_cr, tol, nb_ccr_tot, nb_ccr, nb_circ_rna_detected, circ_rnas):
          nb_tot_cr_circ = sum([c.nb_ccr for c in circ_rnas])
          return "%d\t%d\t%d\t%d\t%d\t%d" % (min_cr, tol, nb_ccr_tot, nb_ccr,
                                             nb_circ_rna_detected, nb_tot_cr_circ)


def stats2(df):
    return df.groupby('nb_ccr')['nb_ccr'] \
           .count().reset_index(name='nb_circ_rnas')


def main(r1_input_file, r2_input_file, cr_threshold, tolerance,
         output_file):

    eprint("\nReading junction files...\n")
    df_junctions_a = []
    if r1_input_file:
        df_junctions_a.append(read_junction_file(r1_input_file, read_type="R1"))
    if r2_input_file:
        df_junctions_a.append(read_junction_file(r2_input_file, read_type="R2"))

    eprint("Merging junction files...\n")
    df = merge_junctions(df_junctions_a)
    eprint("%d chimeric reads \n" % len(df.index))

    # Filtering data:
    eprint("Selecting valid ccr...\n")
    circ_junctions = get_valid_circjunctions(df)
    eprint("%d circular chimeric reads" % len(circ_junctions))

    # Circular RNAs detection:
    circ_rnas = circrna_detection(circ_junctions, cr_threshold, tolerance) #### 'circRNA' object has no attribute 'ccr_array'
    eprint("%d circRNAs detected" % len(circ_rnas))

    #for cir in circ_rnas:
    #    print(cir.var_info_str())

    # Write the table of circular RNAs retained:
    circ.write_annotation(circ_rnas, output_file, fmt=args.output_file_format,
                          attributes=["nb_ccr", "genomic_size", "left",
                                      "right", "complete", "nb_distinct"])


def parse_arguments():
    parser = argparse.ArgumentParser(description='Identification of chimeric reads and circular RNAs')
    parser.add_argument('-r1', '--r1_input_file',
                        required=False, help='R1 input file name')
    parser.add_argument('-r2', '--r2_input_file',
                        required=False, help='R2 input file name')
    parser.add_argument('-min_cr', '--cr_threshold', type=int,
                        required=False, default=5,
                        help='Minimum number of chimeric reads')
    parser.add_argument('-tol', '--tolerance', type=int, required=False,
                        default=3,
                        help='Tolerance of different positions to start and end')
    parser.add_argument('-fmt', '--output_file_format',
                        required=False, default='gtf', choices=["bed","gtf"], # bed ou gtf
                        help='Output file format : 0-based (bed) or 1-based (gtf)')
    parser.add_argument('-o', '--output_file', required=False, default='circ_rnas_detected.out',
                        help='Output file path') # gft par défaut # attribut : nb ccr, min_cr_distinct
    parser.add_argument('--verbose', help='Print more info', action='store_true')

    args = parser.parse_args()

    dsp_control = circ.DisplayControl.instance()
    dsp_control.verbose = args.verbose

    if not (args.r1_input_file or args.r2_input_file):
        parser.error('At least one input file is requested')
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main(args.r1_input_file, args.r2_input_file,
         args.cr_threshold, args.tolerance,
         args.output_file)

Python Pandas numpy tqdm networkx Circr From line 3 of scripts/circRNA_detection.py

import os
import argparse
import pandas as pd
import numpy as np
import re
import sys
from collections import defaultdict


def read_file(file):
    """ Read the circ_rnas_annotation.out file and return a pandas dataframe"""
    header = ["chrom", "start", "end", "circ_name", "nb_ccr", 
              "strand", "source", "feature", "phase", "attributes"]
    df = pd.read_table(file, sep='\t', names=header)
    df.replace(np.nan, "", inplace=True)
    return df


def get_folder_to_create(df): 
    folders_name = []
    dict_spl_sp = dict(zip(df["sample"], df["species"]))
    split_keys = [x.split("_") for x in dict_spl_sp.keys()]   

    for key in split_keys:
        folders_name.append(key[0]+"_"+key[1])
    folders = list(set(folders_name))
    return folders


def create_folders(folders):
    for folder in folders:
        if not os.path.exists(folder):
            os.makedirs(folder)


def filter_samples(df):
    df = df.drop(df[(df["sample"].isin(["ssc_liver_3", "ssc_liver_4", 
                                        "ssc_muscle_1", "ssc_testis_1"]))].index)
    df = df.reset_index(drop=True)
    return df


def main():

    # Read samples.tsv file:
    df_samples = pd.read_csv(args.sample_file, sep='\t')

    # Remove ssc_liver_3/4, ssc_muscle_1 and ssc_testis_1 samples:
    df_filtered = filter_samples(df_samples)

    samples_folders = df_filtered["sample"].tolist()
    df_filtered = df_filtered.loc[df_filtered['sample'].isin(samples_folders)] 

    # Get the folder list to create:
    folders = get_folder_to_create(df_filtered)

    # Create folders:
    create_folders(folders)

    # Group folders by species and tissue:
    df_grouped = df_filtered.groupby(["species", "tissue"])

    for key,item in df_grouped:

        group = df_grouped.get_group(key) 
        # Create col with path of folder:
        group["path_file_bed"] = group["sample"]+"/auzeville.bed"

        samples = group["sample"].tolist()
        samples_split = [s.split("_") for s in samples][0]
        folder_name = samples_split[0]+"_"+samples_split[1]

        paths = group["path_file_bed"].tolist()

        df_concat = pd.concat(read_file(f) for f in paths)
        df_concat = df_concat.sort_values(by=['chrom', 'start', 'end'])
        df_final = df_concat.groupby(['chrom', 'start', 'end', 'strand'], as_index=False)['nb_ccr'].sum()
        df_final.to_csv(folder_name+"/"+"auzeville.bed", sep="\t", index=False)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Annotation')
    parser.add_argument('-sp', '--sample_file',
                        required=True, help='Sample file path')
    parser.add_argument('-circ', '--circ_rna_file',
                        required=False, help='Circular RNA file path')
    args = parser.parse_args()
    return args

if __name__ == '__main__':
    args = parse_arguments()
    main()

Python Pandas numpy From line 4 of scripts/cumul_bed.py

import os, re, sys, csv, argparse
import circRNA as circ
import pandas as pd
import numpy as np

# Utility functions:
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)

def get_sample(file):
    """ Get the sample name from the file path"""
    sample_name = os.path.split(os.path.dirname(file))[1]
    return sample_name

def read_file(file):
    """ Read the circ_rnas_annotation.out file and return a pandas dataframe"""
    df = pd.read_table(file, sep='\t')
    df.replace(np.nan, "", inplace=True)
    return df

def intersection(lst1, lst2):
    """ Return the intersection between two lists"""
    return list(set(lst1) & set(lst2))


def get_exonic_circrnas(df):
    """ Get statistics about exonic circRNAs from the annotation_circRNA.out file"""
    # Arrays, dicts:
    exonic_circ_names, monoexonic_circ_names, true_exonic, single_end_circ_names = [], [], [], []
    ccr_start_end_exonic, ccr_single_annot = dict(), dict()
    # Counter exonic circRNAs:
    nb_monoexonic, nb_start_end_exonic, nb_antisens_exonic, nb_single_annotated_junction = 0, 0, 0, 0

    for index, row in df.iterrows():

        exon_id_start = row.exons_id_start.split(",")
        exon_start = str(exon_id_start[0])
        exon_start_tmp = exon_start.split("_")
        exon_start = exon_start_tmp[0]
        exon_id_end = row.exons_id_end.split(",")
        exon_end = str(exon_id_end[0])
        exon_end_tmp = exon_end.split("_")
        exon_end = exon_end_tmp[0]

        if row.nb_ccr >= 5:
            # if ((len(row.exons_id_start) > 0 or len(row.exons_id_end) > 0)
            #     and len(row.intron_name) == 0):
            if len(row.exons_id_start) > 0 or len(row.exons_id_end) > 0:
                if row.strand == "+":
                    if ("5" and "+") in row.exons_id_start:
                        if len(row.exons_id_end)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                        if ("3" and "+") in row.exons_id_end:
                            nb_start_end_exonic += 1
                            true_exonic.append(row)
                            exonic_circ_names.append(row.circ_rna_name)
                            ccr_start_end_exonic[row.circ_rna_name] = row.nb_ccr
                            if ((len(exon_id_start)==1 and len(exon_id_end)==1) and (exon_start == exon_end)):
                                nb_monoexonic += 1
                                monoexonic_circ_names.append(row.circ_rna_name)
                    if ("3" and "+") in row.exons_id_end:
                        if len(row.exons_id_start)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                    if ("3" and "-") in row.exons_id_start:
                        if ("5" and "-") in row.exons_id_end:
                            nb_antisens_exonic += 1
                elif row.strand == "-":
                    if ("5" and "-") in row.exons_id_end:
                        if len(row.exons_id_start)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                        if ("3" and "-") in row.exons_id_start:
                            nb_start_end_exonic += 1
                            true_exonic.append(row)
                            exonic_circ_names.append(row.circ_rna_name)
                            ccr_start_end_exonic[row.circ_rna_name] = row.nb_ccr
                            if ((len(exon_id_start)==1 and len(exon_id_end)==1) and (exon_start == exon_end)):
                                nb_monoexonic += 1
                                monoexonic_circ_names.append(row.circ_rna_name)
                    if ("3" and "-") in row.exons_id_start:
                        if len(row.exons_id_end)==0:
                            nb_single_annotated_junction += 1
                            ccr_single_annot[row.circ_rna_name] = row.nb_ccr
                            single_end_circ_names.append(row.circ_rna_name)
                    if ("5" and "+") in row.exons_id_start:
                        if ("3" and "+") in row.exons_id_end:
                            nb_antisens_exonic += 1
    return true_exonic, nb_start_end_exonic, nb_single_annotated_junction, single_end_circ_names, ccr_single_annot, monoexonic_circ_names, exonic_circ_names


def get_intronic_circrnas(df, exonic_circrnas):
    # Arrays, dicts:
    intronic_circ_names, true_intronic = [], []
    ccr_true_intronic = dict()
    # Counter intronic circRNAs:
    nb_true_intronic = 0

    for index, row in df.iterrows():
        if row.circ_rna_name not in exonic_circrnas:
            if row.nb_ccr >= 5:
                if len(row.intron_name) > 0:
                    if row.strand == "+":
                        if (row.end_i - row.end) in range(-5,60):
                            if (row.start - row.start_i) in range(-5,5) or (row.start==row.start_i):
                                nb_true_intronic += 1
                                true_intronic.append(row)
                                ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                                intronic_circ_names.append(row.circ_rna_name)
                        elif (row.start == row.start_i and (row.end_i - row.end) > 32):
                            nb_true_intronic += 1
                            true_intronic.append(row)
                            ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                            intronic_circ_names.append(row.circ_rna_name)
                    elif row.strand == "-":
                        if (row.start - row.start_i) in range(-5,60):
                            if ((row.end - row.end_i) in range(-5,5) or (row.end == row.end_i)):
                                nb_true_intronic += 1
                                true_intronic.append(row)
                                ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                                intronic_circ_names.append(row.circ_rna_name)
                        elif (row.end == row.end_i and (row.start - row.start_i) > 60):
                            nb_true_intronic += 1
                            true_intronic.append(row)
                            ccr_true_intronic[row.circ_rna_name] = row.nb_ccr
                            intronic_circ_names.append(row.circ_rna_name)
    return true_intronic, nb_true_intronic, ccr_true_intronic, intronic_circ_names


def get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names):
    # Subexonic_meg : sens + antisens
    # Subexonic_pleg : sens
    # Arrays, dicts:
    infraexonic_tot_names, infraexonic_circ_names, subexonic, subexonic_antisens_names = [], [], [], []
    ccr_infraexonic = dict()
    # Counter subexonic circRNAs:
    nb_infraexonic, nb_infraexonic_tot, nb_infraexonic_sens, nb_infraexonic_antisens = 0, 0, 0, 0

    for index, row in df.iterrows():
        if row.nb_ccr >= 5:
            if ((len(row.gene_id_ife) > 0) and (row.circ_rna_name not in monoexonic_circ_names)):
                if (("_5_c_+" not in row.exons_id_start) and ("_5_lnc_+" not in row.exons_id_start) and
                    ("_3_c_-" not in row.exons_id_start) and ("_3_lnc_-" not in row.exons_id_start) and
                    ("_3_c_+" not in row.exons_id_end) and ("_3_lnc_+" not in row.exons_id_end) and
                    ("_5_c_-" not in row.exons_id_end) and ("_5_lnc_-" not in row.exons_id_end)):
                    if len(row.gene_id_start) == 0 and len(row.gene_id_end) == 0:
                        nb_infraexonic_tot += 1
                        infraexonic_tot_names.append(row.circ_rna_name)
                        if row.circ_rna_name not in intronic_circ_names:
                            if row.strand == "+":
                                if "+" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                elif "-" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
                            elif row.strand == "-":
                                if "-" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                elif "+" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
                    elif ((len(row.gene_id_start) > 0 and len(row.gene_id_end) == 0) or
                            (len(row.gene_id_start) == 0  and len(row.gene_id_end) > 0)):
                        nb_infraexonic_tot += 1
                        infraexonic_tot_names.append(row.circ_rna_name)
                        if row.circ_rna_name not in intronic_circ_names:
                            if row.strand == "+":
                                if "+" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                elif "-" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
                            elif row.strand == "-":
                                if "-" in row.gene_id_ife:
                                    nb_infraexonic_sens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    ccr_infraexonic[row.circ_rna_name] = row.nb_ccr
                                elif "+" in row.gene_id_ife:
                                    nb_infraexonic_antisens += 1
                                    subexonic.append(row)
                                    infraexonic_circ_names.append(row.circ_rna_name)
                                    subexonic_antisens_names.append(row.circ_rna_name)
    return subexonic, infraexonic_circ_names, nb_infraexonic_sens, ccr_infraexonic, subexonic_antisens_names


def get_circrnas(df):
    """ Read the annotation_circRNA.out dataframe and return all statistics to tabular
        format about circRNAs"""
    # Get exonic circRNAs:
    stats_exonic_circrnas = get_exonic_circrnas(df)
    exonic = stats_exonic_circrnas[0]
    monoexonic_names = stats_exonic_circrnas[5]
    exonic_names = stats_exonic_circrnas[6]
    # Get intronic circRNAs:
    stats_intronic_circrnas = get_intronic_circrnas(df, exonic_names)
    intronic = stats_intronic_circrnas[0]
    intronic_names = stats_intronic_circrnas[3]
    # Get subexonic circRNAs:
    stats_subexonic_circrnas = get_subexonic_circrnas(df, monoexonic_names, intronic_names)
    subexonic = stats_subexonic_circrnas[0]
    subexonic_antisens = stats_subexonic_circrnas[4]

    return exonic, subexonic, intronic, subexonic_antisens


def get_stats_circrnas(sample, df, output_file_name):
    """ Read the annotation_circRNA.out dataframe and return all statistics to tabular
        format about circRNAs"""
    nb_circ_tot = len(df)
    # Get circRNAs:
    exonic = get_circrnas(df)[0]
    subexonic = get_circrnas(df)[1]
    subexonic_antisens = get_circrnas(df)[3]
    intronic = get_circrnas(df)[2]

    # Get exonic circRNAs stats:
    stats_exonic = write_comparison_exonic_table(sample, exonic, output_file_name)

    # Get subexonic circRNAs stats:
    nb_subexonic = get_nb_type(sample, subexonic, subexonic_antisens)[0]
    nb_gene_subexonic = get_nb_type(sample, subexonic, subexonic_antisens)[1]
    nb_subexonic_pleg = get_nb_type(sample, subexonic, subexonic_antisens)[2]
    nb_subexonic_meg = get_nb_type(sample, subexonic, subexonic_antisens)[3]
    nb_ccr_pleg = get_nb_type(sample, subexonic, subexonic_antisens)[4]
    nb_ccr_meg = get_nb_type(sample, subexonic, subexonic_antisens)[5]


    if len(stats_exonic) > 5:
        nb_c = stats_exonic[0]
        nb_lnc = stats_exonic[1]
        nb_autres = stats_exonic[2]
        nb_start_end_exonic_selected = stats_exonic[3]
        nb_ccr_start_end_exonic = stats_exonic[4]
        nb_exonic = write_comparison_exonic_table(sample, exonic, args.output_comp_exonic_file)[5]
    else:
        nb_c = stats_exonic[0]
        nb_lnc = 0
        nb_autres = stats_exonic[1]
        nb_start_end_exonic_selected = stats_exonic[2]
        nb_ccr_start_end_exonic = stats_exonic[3]
        nb_exonic = write_comparison_exonic_table(sample, exonic, args.output_comp_exonic_file)[4]

    nb_start_end_exonic = get_exonic_circrnas(df)[1]
    nb_single_annotated_junction = get_exonic_circrnas(df)[2]
    monoexonic_circ_names =  get_exonic_circrnas(df)[5]
    exonic_names = get_exonic_circrnas(df)[6]
    intronic_circ_names = get_intronic_circrnas(df, exonic_names)[3]
    infraexonic_circ_names = get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names)[1]
    single_end_circ_names = get_exonic_circrnas(df)[3]
    nb_infraexonic_sens = get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names)[2]
    nb_true_intronic = get_intronic_circrnas(df, exonic_names)[1]
    ccr_single_annot = get_exonic_circrnas(df)[4]
    ccr_true_intronic = get_intronic_circrnas(df, exonic_names)[2]
    ccr_infraexonic = get_subexonic_circrnas(df, monoexonic_circ_names, intronic_circ_names)[3]

    nb_start_end_false_exonic = nb_start_end_exonic - nb_start_end_exonic_selected
    nb_tot_exonic = nb_start_end_exonic + nb_single_annotated_junction
    nb_common_infra_single = len(intersection(infraexonic_circ_names, single_end_circ_names))
    nb_infraexonic = nb_infraexonic_sens - nb_common_infra_single
    nb_circ_non_annotated = nb_circ_tot - (nb_tot_exonic + nb_infraexonic + nb_true_intronic - nb_start_end_false_exonic)
    nb_ccr_single_annot = sum(ccr_single_annot.values())
    nb_ccr_intronic = sum(ccr_true_intronic.values())
    nb_ccr_infraexonic = sum(ccr_infraexonic.values())

    return "\t".join(map(str,[sample, nb_circ_tot, nb_exonic, nb_lnc, nb_autres, nb_ccr_start_end_exonic,
                              nb_subexonic_pleg, nb_ccr_pleg, nb_subexonic_meg, nb_ccr_meg, nb_gene_subexonic,
                              nb_true_intronic, nb_ccr_intronic]))+"\n"


def write_stats_table(stats, output_file):
    with open(output_file, "w") as fout:
        fout.write(stats)


def write_circ_table(self, header, path, index=None, sep="\t", na_rep='', float_format=None,
                     index_label=None, mode='w', encoding=None, date_format=None, decimal='.'):
    """
    Write a circRNAs list to a tabular-separated file (tsv).
    """
    from pandas.core.frame import DataFrame
    df = DataFrame(self)
    # result is only a string if no path provided, otherwise None
    result = df.to_csv(path, index=index, sep=sep, na_rep=na_rep, float_format=float_format,
                       header=header, index_label=index_label, mode=mode, encoding=encoding,
                       date_format=date_format, decimal=decimal)
    if path is None:
        return result


def write_comparison_exonic_table(sample, circ_rnas, output_file_name):
    # Write exonic table for comparaison between tissues/species:
    df_circ_rnas = pd.DataFrame(circ_rnas, index=None)
    header = ["chrom:start-end,strand", "nb_ccr", "gene_id", "biotype", "genomic_size"]
    nb_start_end_annotated, nb_exonic = 0, 0
    nb_ccr_exonics, biotypes = [], []

    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        for index, row in df_circ_rnas.iterrows():
            exons_id_start_a, exons_id_end_a = [], []
            # Select well-annotated true exonic circRNAs bases on the gene_id:
            # Get key:
            chrom_start = ":".join(map(str,[row.chrom, row.start]))
            chrom_start_end = "-".join(map(str, [chrom_start , row.end]))
            chrom_start_end_strand = ",".join(map(str, [chrom_start_end , row.strand]))
            # Get the ccr number:
            nb_ccr = row.nb_ccr
            nb_ccr_exonics.append(nb_ccr)
            # Get gene_id:
            genes_id_start = list(set(list(row.gene_id_start.split(","))))
            genes_id_end = list(set(list(row.gene_id_end.split(","))))
            # Get biotype:
            exons_id_start = list(set(list(row.exons_id_start.split(","))))
            exons_id_end = list(set(list(row.exons_id_end.split(","))))
            exons_id_start = list(i.split("_") for i in exons_id_start)
            exons_id_end = list(i.split("_") for i in exons_id_end)
            intersect_gene_id = ",".join(list(set(genes_id_start).intersection(genes_id_end)))
            if len(intersect_gene_id) > 0:
                nb_start_end_annotated += 1
                # Get biotype:
                biotypes_start = list(set(list(i[2] for i in exons_id_start)))
                biotypes_end = list(set(list(i[2] for i in exons_id_end)))
                intersect_biotypes = "".join(list(set(biotypes_start).intersection(biotypes_end)))
                # Write line into the output file:
                nb_exonic += 1
                genomic_size = row.end - row.start
                s = [chrom_start_end_strand, nb_ccr, intersect_gene_id, intersect_biotypes, genomic_size]
                tsv_writer.writerow(s)
                biotypes.append(intersect_biotypes)
    # Dictionary with key = biotype and value = counter:
    d = {}
    d["sample"] = sample
    for biotype in biotypes:
        d[biotype] = d.get(biotype, 0) + 1
    values = []
    valid_keys = ["c", "lnc", "sample"]
    for key, value in d.items():
        if key not in valid_keys:
            values.append(value)
    d["autres"] = sum(values)
    nb_ccr_exonic = sum(nb_ccr_exonics)
    for key, value in d.items():
        if ("c" in d.keys()) and ("lnc" in d.keys()):
            return d["c"], d["lnc"], d["autres"], nb_start_end_annotated, nb_ccr_exonic, nb_exonic
        elif ("c" in d.keys() and "lnc" not in d.keys()):
            return d["c"], d["autres"], nb_start_end_annotated, nb_ccr_exonic, nb_exonic


def compute_size(exons_start_end):
    pos_start_end = list(exons_start_end.split("_"))
    pos_start = int(pos_start_end[0])
    pos_end = int(pos_start_end[1])
    size = pos_end - pos_start
    return size

def get_true_exons_gene_id(exons_start_end, exons_id, gene_id):
    if len(exons_start_end)==1:
        exon_start_end = ",".join(exons_start_end)
        exon_id = ",".join(exons_id)
        gene_id = gene_id
    else:
        annot = dict(zip(exons_start_end, exons_id))
        sizes = []
        for key in annot:
            size = compute_size(key)
            sizes.append(size)
        values = zip(exons_start_end, sizes)
        annot_f = dict(zip(exons_id, values))
        min_size = min([i[1] for i in list(annot_f.values())])
        for key, values in annot_f.items():
            if values[1] == min_size:
                exon_id = key
                exon_start_end = values[0]
    return exon_start_end, exon_id, gene_id


def get_nb_type(sample, circ_rnas, subexonic_antisens):
    # Write exonic table for comparaison between tissues/species:
    df_circ_rnas = pd.DataFrame(circ_rnas, index=None)
    nb_sub_exonic, nb_sub_exonic_pleg, nb_sub_exonic_meg = 0, 0, 0
    subexonic_genes, nb_ccr_pleg, nb_ccr_meg = [], [], []

    for index, row in df_circ_rnas.iterrows():
        # Select well-annotated subexonic circRNAs according to the gene_id:
        # Get gene_id:
        genes_id = list(set(list(row.gene_id_ife.split(","))))
        genes_id = list(i.split("_") for i in genes_id)
        exons_start_end = sorted(set(row.exons_start_end_ife.split(",")), key=row.exons_start_end_ife.split(',').index)
        exon_id = sorted(set(row.exon_id_ife.split(",")), key=row.exon_id_ife.split(',').index)
        gene_id = ",".join(list(set(list(row.gene_id_ife.split(",")))))
        biotypes = []

        exon_start_end = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[0]
        exon_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[1]
        gene_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[2]

        row.exons_start_end_ife = exon_start_end
        row.exon_id_ife = exon_id
        row.gene_id_ife = gene_id

        if len(genes_id)==1:
            for i in genes_id:
                # Subexonic pleg circRNAs:
                if ('c' in i or 'lnc' in i or 'pseudo' in i) and (row.circ_rna_name not in subexonic_antisens):
                    nb_sub_exonic += 1
                    nb_sub_exonic_pleg += 1
                    nb_ccr_pleg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                    # Subexonic meg circRNAs:
                elif 'c' not in i and 'lnc' not in i and 'pseudo' not in i and 'rRNA' not in i:
                    nb_sub_exonic += 1
                    nb_sub_exonic_meg += 1
                    nb_ccr_meg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                elif 'rRNA' in i:
                    pass

        if len(genes_id) > 1:
            for i in genes_id:
                if len(i) == 3:
                    biotypes.append(i[2])
                    biotypes = list(set(biotypes))
                elif len(i) > 3:
                    p_biotypes = [i[2],i[3]]
                    biotypes = [p for p in p_biotypes]
            if 'rRNA' in biotypes:
                pass
            # Subexonic pleg circRNAs:
            elif ((biotypes == ['c'] or biotypes == ['lnc'] or biotypes == ['pseudo']
                or ('c' and 'pseudo' in biotypes) or ('c' and 'lnc' in biotypes)
                or ('lnc' and 'pseudo' in biotypes)) and (row.circ_rna_name not in subexonic_antisens)):
                nb_sub_exonic += 1
                nb_sub_exonic_pleg += 1
                nb_ccr_pleg.append(row.nb_ccr)
                subexonic_genes.append(gene_id)
            # Subexonic meg circRNAs:
            else:
                nb_sub_exonic += 1
                nb_sub_exonic_meg += 1
                nb_ccr_meg.append(row.nb_ccr)
                subexonic_genes.append(gene_id)

    nb_subexonic_genes = len(list(set([val for sublist in subexonic_genes for val in sublist])))
    nb_ccr_p = sum(nb_ccr_pleg)
    nb_ccr_m = sum(nb_ccr_meg)
    return nb_sub_exonic, nb_subexonic_genes, nb_sub_exonic_pleg, nb_sub_exonic_meg, nb_ccr_p, nb_ccr_m


def write_subexonic_tables(sample, circ_rnas, subexonic_antisens, output_file_name_meg, output_file_name_pleg):
    # Write exonic table for comparaison between tissues/species:
    df_circ_rnas = pd.DataFrame(circ_rnas, index=None)
    nb_sub_exonic, nb_sub_exonic_pleg, nb_sub_exonic_meg = 0, 0, 0
    subexonic_genes, nb_ccr_pleg, nb_ccr_meg = [], [], []
    header = list(df_circ_rnas.columns.values.tolist())

    with open(output_file_name_meg, 'w') as fout_meg, open(output_file_name_pleg, 'w') as fout_pleg:
        # Subexonic pleg circRNAs:
        tsv_writer_pleg = csv.writer(fout_pleg, delimiter='\t')
        tsv_writer_pleg.writerow(header)
        # Subexonic meg circRNAs:
        tsv_writer_meg = csv.writer(fout_meg, delimiter='\t')
        tsv_writer_meg.writerow(header)

        for index, row in df_circ_rnas.iterrows():
            # Select well-annotated subexonic circRNAs according to the gene_id:
            # Get gene_id:
            genes_id = list(set(list(row.gene_id_ife.split(","))))
            genes_id = list(i.split("_") for i in genes_id)
            exons_start_end = sorted(set(row.exons_start_end_ife.split(",")), key=row.exons_start_end_ife.split(',').index)
            exon_id = sorted(set(row.exon_id_ife.split(",")), key=row.exon_id_ife.split(',').index)
            gene_id = ",".join(list(set(list(row.gene_id_ife.split(",")))))
            biotypes = []

            exon_start_end = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[0]
            exon_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[1]
            gene_id = get_true_exons_gene_id(exons_start_end, exon_id, gene_id)[2]

            row.exons_start_end_ife = exon_start_end
            row.exon_id_ife = exon_id
            row.gene_id_ife = gene_id

            if len(genes_id)==1:
                for i in genes_id:
                    # Subexonic pleg circRNAs:
                    if ('c' in i or 'lnc' in i or 'pseudo' in i) and (row.circ_rna_name not in subexonic_antisens):
                        nb_sub_exonic += 1
                        nb_sub_exonic_pleg += 1
                        nb_ccr_pleg.append(row.nb_ccr)
                        subexonic_genes.append(gene_id)
                        s = row
                        tsv_writer_pleg.writerow(s)
                    # Subexonic meg circRNAs:
                    elif 'c' not in i and 'lnc' not in i and 'pseudo' not in i and 'rRNA' not in i:
                        nb_sub_exonic += 1
                        nb_sub_exonic_meg += 1
                        nb_ccr_meg.append(row.nb_ccr)
                        subexonic_genes.append(gene_id)
                        s = row
                        tsv_writer_meg.writerow(s)
                    elif 'rRNA' in i:
                        pass

            if len(genes_id) > 1:
                for i in genes_id:
                    if len(i) == 3:
                        biotypes.append(i[2])
                        biotypes = list(set(biotypes))
                    elif len(i) > 3:
                        p_biotypes = [i[2],i[3]]
                        biotypes = [p for p in p_biotypes]
                if 'rRNA' in biotypes:
                    pass
                # Subexonic pleg circRNAs:
                elif ((biotypes == ['c'] or biotypes == ['lnc'] or biotypes == ['pseudo']
                    or ('c' and 'pseudo' in biotypes) or ('c' and 'lnc' in biotypes)
                    or ('lnc' and 'pseudo' in biotypes)) and (row.circ_rna_name not in subexonic_antisens)):
                    nb_sub_exonic += 1
                    nb_sub_exonic_pleg += 1
                    nb_ccr_pleg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                    s = row
                    tsv_writer_pleg.writerow(s)
                # Subexonic meg circRNAs:
                else:
                    nb_sub_exonic += 1
                    nb_sub_exonic_meg += 1
                    nb_ccr_meg.append(row.nb_ccr)
                    subexonic_genes.append(gene_id)
                    s = row
                    tsv_writer_meg.writerow(s)

    nb_subexonic_genes = len(list(set([val for sublist in subexonic_genes for val in sublist])))
    nb_ccr_p = sum(nb_ccr_pleg)
    nb_ccr_m = sum(nb_ccr_meg)
    return nb_sub_exonic, nb_subexonic_genes, nb_sub_exonic_pleg, nb_sub_exonic_meg, nb_ccr_p, nb_ccr_m


def write_circrnas_tables(sample, df, exonic_circrnas, intronic_circrnas, subexonic_circrnas, subexonic_antisens):
    header = list(df.columns.values.tolist())
    if len(exonic_circrnas)>0:
        # Write the exonic circRNAs table:
        write_circ_table(exonic_circrnas, header, args.output_exonic_file)
    else:
        with open(args.output_exonic_file, 'w') as fout:
            tsv_writer = csv.writer(fout, delimiter='\t')
            tsv_writer.writerow(header)
    if len(intronic_circrnas)>0:
        # Write the intronic circRNAs table:
        write_circ_table(intronic_circrnas, header, args.output_intronic_file)
    else:
        with open(args.output_intronic_file, 'w') as fout:
            tsv_writer = csv.writer(fout, delimiter='\t')
            tsv_writer.writerow(header)

    eprint("subexonic_circrnas", len(subexonic_circrnas))
    eprint("subexonic_antisens", len(subexonic_antisens))
    if len(subexonic_circrnas)>0 or len(subexonic_antisens)>0:
        # Write the subexonic circRNAs tables:
        write_subexonic_tables(sample, subexonic_circrnas, subexonic_antisens, args.output_subexonic_meg_file,
                               args.output_subexonic_pleg_file)
    else:
        eprint("ok")
        open(args.output_subexonic_meg_file, 'a').close()
        open(args.output_subexonic_pleg_file, 'a').close()

def main():
    # Get sample name:
    sample = get_sample(args.input_file)

    # Read the circRNAs annotation file:
    df_circ_annot = read_file(args.input_file)

    # Get the list of exonic, subexonic and intronic circRNAs:
    exonic_circrnas = get_circrnas(df_circ_annot)[0]
    subexonic_circrnas = get_circrnas(df_circ_annot)[1]
    intronic_circrnas = get_circrnas(df_circ_annot)[2]
    subexonic_antisens = get_circrnas(df_circ_annot)[3]

    # Write exonic, intronic and subexonic circRNAs tables:
    circrnas_tables = write_circrnas_tables(sample, df_circ_annot, exonic_circrnas,
                                            intronic_circrnas, subexonic_circrnas, subexonic_antisens)

    # Compute statistics about exonic, subexonic and intronic circRNAs:
    stats = get_stats_circrnas(sample, df_circ_annot, args.output_comp_exonic_file)

    # Write the circRNAs statistics table:
    write_stats_table(stats, args.output_stats_file)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Return annotation statistics tables')
    parser.add_argument('-i', '--input_file', required=True,
                        help='Annotation circRNAs file')
    parser.add_argument('-o_stats', '--output_stats_file', required=False,
                        default="stats_annotation.tsv",
                        help='Table containing statistics of annotation')
    parser.add_argument('-oi', '--output_intronic_file', required=False,
                        default="intronic_circRNAs.tsv",
                        help='Table containing intronic circRNAs')
    parser.add_argument('-oe', '--output_exonic_file', required=False,
                        default="exonic_circRNAs.tsv",
                        help='Table containing exonic circRNAs')
    parser.add_argument('-oce', '--output_comp_exonic_file', required=False,
                        default="exonic_summary.tsv",
                        help='Table containing exonics circRNAs for comparisons')
    parser.add_argument('-osepleg', '--output_subexonic_pleg_file', required=False,
                        default="subexonic_pleg_circRNAs.tsv",
                        help='Table containing subexonic circRNAs')
    parser.add_argument('-osemeg', '--output_subexonic_meg_file', required=False,
                        default="subexonic_meg_circRNAs.tsv",
                        help='Table containing subexonic circRNAs')
    args = parser.parse_args()
    return args

if __name__ == '__main__':
    args = parse_arguments()
    main()

Python Pandas numpy Circr From line 4 of scripts/stats_annotation.py

import os
import argparse
import pandas as pd
import numpy as np
import re
import sys


# Utility functions
def create_folder(path):
    if not os.path.exists(path):
        os.mkdir(path)

def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)

def read_file(file):
    """ Read the sample file containing path files and return a pandas dataframe"""
    sample = pd.read_csv(file, sep='\t', dtype=str)
    return sample

def check_mapdirs(mapdirs):
    """ Check if the mapdir exists or not"""
    for mapdir in mapdirs:
        if not os.path.exists(mapdir):
            raise Exception("WARNING following mapdir is missing %s" % mapdir)

def read_log_file(sample):
    """ Read and clean the Log.final.out files containing the summary mapping statistics
        after mapping job is complete and return an array of the pandas dataframe
        containing statistics"""
    stats = []
    d = {}
    reads = ["R1", "R2"]
    paths_out = []
    for index, row in sample.iterrows():
        sample_name = row["sample"]
        sample_unit_name = row["sample_unit"]
        path = row["mapdir"]
        for read in reads:
            path_file = path+"/se/"+read+"/Log.final.out"
            f = open(path_file, "r")
            for line in f:
                l = line.split("|")
                stats.append(l)
            for stat in stats:
                if len(stat) == 1:
                    stats.remove(stat)
            for stat in stats:
                stat[0] = ' '.join(stat[0].split("\t"))
                stat[0] = ' '.join(stat[0].split())
                stat[1] = ' '.join(stat[1].split())
                stat[1] = stat[1].replace("%", "")
                d['sample'] = sample_name
                d['sample_unit'] = sample_unit_name
                d['read'] = read
                d[stat[0]] = stat[1]
            f.close()
            create_folder("reports/"+sample_unit_name+"/")
            create_folder("reports/"+sample_unit_name+"/"+read+"/")
            path_out = "reports/"+sample_unit_name+"/"+read+"/"+args.output_file
            paths_out.append(path_out)
            write_sample_file(d, path_out)
    return paths_out


def write_sample_file(dict, output_file):
    """Get value from a statistic dictionnary and return a pandas object"""
    df = pd.DataFrame([dict])
    df.to_csv(output_file, sep="\t", index=False)


def write_final_stat_tab(paths_out, output_file):
    df_final = pd.DataFrame()
    for path in paths_out:
        df = pd.read_csv(path, sep="\t")
        df_final = pd.concat([df_final, df], ignore_index=True, sort =False)
    df_final.to_csv(output_file, sep="\t", index=False)


def main():

    # Read the sample file:
    sample = read_file(args.input_file)

    # Check the mapdir of each sample exists or not:
    check_mapdirs(sample["mapdir"])

    create_folder("reports")

    # Read log files and get the path of a table containing statistics:
    paths_out = read_log_file(sample)

    # Write the statistic table with all samples:
    write_final_stat_tab(paths_out, args.output_file)


def parse_arguments():
    parser = argparse.ArgumentParser(description='Sample file')
    parser.add_argument('-i', '--input_file',
                        required=True, help='Sample file')
    parser.add_argument('-o', '--output_file', required=False, 
                        default="mapping_stat.tsv", help='Sample file')
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main()

Python Pandas numpy From line 5 of scripts/stats_mapping.py

import os, re, sys, csv, argparse
import pandas as pd
import numpy as np

# Utility functions:
def eprint(*args, **kwargs):
    print(*args,  file=sys.stderr, **kwargs)


def read_file(file):
    """ Read the circRNAs annotation file and return a pandas dataframe"""
    df = pd.read_table(file, sep='\t')
    df.replace(np.nan, "", inplace=True)
    return df


def get_item(item):
    return ''.join(list(set(item)))


def write_summary_meg_table(df, output_file_name, min_size):
    header = ["gene_name", "biotype", "nb_circ_sens", "nb_circ_antisens", "nb_ccr", "exon_name", "start_exon", "end_exon"]
    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        if len(df) > 0:
            for key, item in df:
                exon_start = get_item(item.exons_start_end_ife).split("_")[0]
                exon_end = get_item(item.exons_start_end_ife).split("_")[1]
                if int(exon_end) - int(exon_start) > min_size:
                    strand = list(item.strand)
                    gene_id = list(key.split("_"))[0]
                    strand_gene_ife = list(key.split("_"))[1]
                    # Get biotype:
                    if len(list(key.split("_"))) == 3:
                        biotype = list(key.split("_"))[2]
                    else:
                        biotype = list(key.split("_"))[2]+"_"+list(key.split("_"))[3]
                    nb_circ_rna = len(df.get_group(key))
                    nb_circ_sens = strand.count(strand_gene_ife)
                    nb_circ_antisens = nb_circ_rna - nb_circ_sens
                    nb_ccr = sum(item.nb_ccr)
                    exon_name = get_item(item.exon_id_ife)
                    # print(df.get_group(key), "\n\n")
                    row = [gene_id, biotype, nb_circ_sens, nb_circ_antisens, nb_ccr, exon_name, exon_start, exon_end]
                    tsv_writer.writerow(row)


def write_summary_intronic_table(df, output_file_name, min_size):
    header = ["gene_name", "biotype", "intron_name", "nb_circ_rna", "nb_ccr", "start_intron", "end_intron"]
    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        if len(df) > 0:
            for key, item in df:
                start_i = ''.join([str(round(x)) for x in list(set(item.start_i))])
                end_i = ''.join([str(round(x)) for x in list(set(item.end_i))])
                if int(end_i) - int(start_i) > min_size:
                    intron_name = key
                    gene_id = get_item(item.gene_id_i).split("_")[0]
                    # Get biotype:
                    biotype = get_item(item.gene_id_i).split(",")
                    if len(biotype)>1:
                        biotype = biotype[0].split("_")[4]
                    else:
                        biotype = ''.join(biotype).split("_")[4]
                    nb_ccr = sum(item.nb_ccr)
                    nb_circ_rna = len(df.get_group(key))
                    row = [gene_id, biotype, intron_name, nb_circ_rna, nb_ccr, start_i, end_i]
                    tsv_writer.writerow(row)


def write_summary_pleg_table(df, output_file_name, min_size):
    header = ["gene_name", "nb_gene", "biotype", "nb_circ_sens", "nb_circ_antisens", "nb_ccr", "exon_name", "start_exon", "end_exon"]
    with open(output_file_name, 'w') as fout:
        tsv_writer = csv.writer(fout, delimiter='\t')
        tsv_writer.writerow(header)
        if len(df) > 0:
            for key, item in df:
                exon_start = get_item(item.exons_start_end_ife).split("_")[0]
                exon_end = get_item(item.exons_start_end_ife).split("_")[1]
                if int(exon_end) - int(exon_start) > min_size:
                    # Get gene_id and biotype:
                    gene_id_ife = get_item(item.gene_id_ife).split(",")
                    nb_gene = len(gene_id_ife)
                    if len(gene_id_ife) > 1:
                        gene_id = gene_id_ife[0].split("_")[0]+","+gene_id_ife[1].split("_")[0]
                        strand_gene_ife = get_item(gene_id_ife[0].split("_")[1]+gene_id_ife[1].split("_")[1])
                        biotype = get_item(gene_id_ife[0].split("_")[2]+gene_id_ife[1].split("_")[2])
                    else:
                        gene_id = ''.join(gene_id_ife).split("_")[0]
                        strand_gene_ife = ''.join(gene_id_ife).split("_")[1]
                        biotype = ''.join(gene_id_ife).split("_")[2]
                    strand = list(item.strand)
                    nb_circ_rna = len(df.get_group(key))
                    nb_circ_sens = strand.count(strand_gene_ife)
                    nb_circ_antisens = nb_circ_rna - nb_circ_sens
                    nb_ccr = sum(item.nb_ccr)
                    exon_name = key
                    row = [gene_id, nb_gene, biotype, nb_circ_sens, nb_circ_antisens, nb_ccr, exon_name, exon_start, exon_end]
                    tsv_writer.writerow(row)

def check_file(file):
    with open(file, 'r') as f:
        if len(f.read()) <= 1:
            result="Empty"
        else:
            result="Not empty"
    return result


def main():

    # Get min_size:
    min_size = int(args.min_size)

    # Read input annotation circRNAs files:
    if check_file(args.input_meg_file) == "Not empty":
        df_meg = read_file(args.input_meg_file)
        # Meg circRNAs:
        ## Group by gene:
        df_meg = df_meg.groupby('gene_id_ife')
        write_summary_meg_table(df_meg, args.output_meg_file, min_size)
    else:
        open(args.output_meg_file, 'a').close()

    if check_file(args.input_intronic_file) == "Not empty":
        df_intronic = read_file(args.input_intronic_file)
        # Intronic circRNAs:
        ## Group by intron name:
        df_intronic = df_intronic.groupby('intron_name')
        write_summary_intronic_table(df_intronic, args.output_intronic_file, min_size)
    else:
        open(args.output_intronic_file, 'a').close()

    if check_file(args.input_pleg_file) == "Not empty":
        df_pleg = read_file(args.input_pleg_file)
        # Pleg circRNAs:
        ## Group by exon:
        df_pleg = df_pleg.groupby('exon_id_ife')
        write_summary_pleg_table(df_pleg, args.output_pleg_file, min_size)
    else:
        open(args.output_pleg_file, 'a').close()


def parse_arguments():
    parser = argparse.ArgumentParser(description='Return summary circRNAs tables')
    parser.add_argument('-ip', '--input_pleg_file', required=True,
                        help='Annotation pleg circRNAs file')
    parser.add_argument('-im', '--input_meg_file', required=True,
                        help='Annotation meg circRNAs file')
    parser.add_argument('-ii', '--input_intronic_file', required=True,
                        help='Annotation intronic circRNAs file')
    parser.add_argument('-op', '--output_pleg_file', required=False,
                        default="pleg_summary.tsv",
                        help='Table containing summary of pleg circRNAs annotation')
    parser.add_argument('-om', '--output_meg_file', required=False,
                        default="meg_summary.tsv",
                        help='Table containing summary of meg circRNAs annotation')
    parser.add_argument('-oi', '--output_intronic_file', required=False,
                        default="intronic_summary.tsv",
                        help='Table containing summary of intronic circRNAs annotation')
    parser.add_argument('-ms', '--min_size', required=False, default=55,
                        help='Minimum size of circRNAs')
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = parse_arguments()
    main()

Python Pandas numpy From line 4 of scripts/summary_table.py

shell:
    "find {input} -type f -empty -delete"

SnakeMake From line 86 of master/Snakefile

shell:
    "cat {{bta,oar,ssc}}_*/*_exonic_summary.tsv | cut -f1,3,4 |tail -n+2|sort|uniq > {output}"

SnakeMake From line 95 of master/Snakefile

shell:
    "python3 ../scripts/summary_table.py -ip {input.pleg} -im {input.meg} -ii {input.intronic}"
    " -op {output.pleg} -om {output.meg} -oi {output.intronic} -ms {params.min_size}"
    " 1>{log.stdout} 2>{log.stderr}"

SnakeMake From line 113 of master/Snakefile

shell:
    "cat {input} >> {output}"

SnakeMake From line 124 of master/Snakefile

shell:
    "python3 ../scripts/stats_annotation.py -i {input} -o_stats {output.stats}"
    " -oi {output.intronic} -oe {output.exonic} -oce {output.comp_exonic} -osepleg {output.sub_exonic_pleg}"
    " -osemeg {output.sub_exonic_meg}"
    " 1>{log.stdout} 2>{log.stderr}"

SnakeMake From line 141 of master/Snakefile

shell:
    "python3 ../scripts/circRNA_annotation.py -circ {input.circ_detected}"
    " -annot {input.annot_exon} -fmt bed -o {output} 1>{log.stdout} 2>{log.stderr}"

SnakeMake From line 157 of master/Snakefile

shell:
    "python3 ../scripts/cumul_bed.py -sp {input} > {log}"

SnakeMake From line 167 of master/Snakefile

shell:
    "if grep -q 'Nreads' {input.R1}; then head -n -2 {input.R1} > {input.R1}2; mv {input.R1}2 {input.R1}; fi ;"
    " if grep -q 'Nreads' {input.R2}; then head -n -2 {input.R2} > {input.R2}2; mv {input.R2}2 {input.R2}; fi ;"
    " python3 ../scripts/circRNA_detection.py -r1 {input.R1} -r2 {input.R2}"
    " -min_cr {params.min_ccr} -tol 0 -fmt bed -o {output} 1>{log.stdout} 2>{log.stderr}"

SnakeMake From line 180 of master/Snakefile

shell:
    "python3 ../scripts/stats_mapping.py -i {input} -o {output}"

SnakeMake From line 191 of master/Snakefile

ShowHide 21 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/ccerutti88/circRNA

Name: circrna

Version: 2

Badge:

Insert copied code into your website to add a link to this workflow.

Other Versions:

License: GNU General Public License v3.0

Keywords:

Sequence annotation Circr Consensus Pandas Pybedtools Snakemake STAR natsort networkx numpy singleton tqdm RNA

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free

circDetector (CD): circular RNAs detection and annotation workflow

Help improve this workflow!

Description:

Table of contents:

Installation:

Compilation and configuration:

Required tools:

Setting the environment:

Preparing the config file:

Preparing the sample file:

Snakemake worflow:

Rule detection:

Rule mappingstat:

Rule annotation:

Example Commands:

Contact & Support

Copyright and License

Code Snippets

Comments

Support

Free

Related Workflows

public

public

public

public

public

public