Automatic identification, classification and annotation of plant lncRNAs in transcriptomic sequences

public public 1yr ago Version: 2 0 bookmarks

Table of contents

Introduction

Long non-coding RNAs (lncRNAs) are RNA molecules longer than 200 nucleotides that do not encode proteins. Experimental studies have shown the diversity and importance of lncRNA functions in plants. To expand knowledge about lncRNAs in other species, computational pipelines that allow for standardised data-processing steps in a mode that does not require user control up to the final result were actively developed recently. These advancements enable wider functionality for lncRNA data identification and analysis. In the present work, we propose the ICAnnoLncRNA pipeline for automatic identification, classification and annotation of plant lncRNAs in transcriptomic sequences assembled from high-throughput RNA sequencing (RNA-seq) data.

This pipeline is only applicable to the Linux operating system.

The pipeline includes the following steps:

1. Pre-processing.

  • Gmap index building.

  • LncFinder model building.

2. LncRNA filtering.

  • Length filtering

  • LncFinder predicting.

  • CPC2 predicting.

  • GMAP alignment on reference genome.

  • Filtering erros/noise.

  • Filtering possible transposable elements (TEs).

3. LncRNA annotation.

  • Classification gffcompare.

  • Blastn alignment.

  • Analysis of lncRNA expression.

The pipeline is implemented using the workflow management system Snakemake , which provides ability to platform-independent installation and execution of the software.

Schematic diagram

Test Image 1

Installation

Automatic

recommended for clusters/servers

Install only programs.yaml environment. Other environments will install automatically when ICAnnoLncRNA start work.

1. wget https://github.com/artempronozin95/ICAnnoLncRNA-identification-classification-and-annotation-of-LncRNA/archive/refs/heads/main.zip
2. unzip main.zip
3. cd ./ICAnnoLncRNA-identification-classification-and-annotation-of-LncRNA-main
4. conda env create --file env/programs.yaml
5. conda activate ICAnnoLncRNA

After these steps all necessary packages are installed. If you need update packages ( not recommended ), change the version of packages after “=” (example - snakemake=4.0.1 -> snakemake=6.0.0 ), then conda env update --file ./programs.yaml . All necessary packages will be updated. Recomended on clusters, requires a lot of processing power.

Step method

recommended for personal computer

1. conda update conda.
2. conda create -n ICAnnoLncRNA python=3.6
3. conda activate ICAnnoLncRNA
4. conda install -c bioconda emboss
5. install next packeges from file below

Install all packeges represented here

It is necessary to install and download (download program in pipeline folder):

1. cd ./ICAnnoLncRNA---identification-classification-and-annotation-of-LncRNA-main
2. wget https://github.com/biocoder/CPC2/archive/refs/heads/master.zip
3. unzip master.zip
4. mv CPC2-master CPC2

download LncAPDB.fasta into data/reference/data_index folder.

Input

Genome sequence

  1. Known LncRNA transcripts of the species in FASTA format. File with lncRNA from the public database of lncRNA. For example Ensembl , EVLncRNAs v2 .

  2. Known protein coding sequences of the species in FASTA format. File with protein coding sequences from a public database. For example Ensembl .

  3. RNA-seq transcripts of the species in FASTA format. This file contains transcripts obtained after assembly of transcriptome. For example file that we provide, 30k_zey.fasta , obtained by Trinity assembly. All three sets are necessary for build

Reference genome

  1. Reference genome of the species in FASTA format.

  2. Annotation of the species in GFF/GTF format.

Additional data

  1. TE coordinats on the reference genome, example Zea_N_merged.bed

  2. File with sequence ID from public lncRNA databases connected to IDs in library of known lncRNAs. example index_and_newindex.csv

  3. Annotation of the organism transcriptome libraries by tissue type, example SRX_all_org.tsv

Data

Additional data is presented here: https://data.mendeley.com/datasets/fnk8pmp2yz/3

It includes:

  • blast.outfmt6 - Blastn results. Contain homologs with known lncRNAs sequences from the LncAPDB library.

  • index_and_newindex.fasta - index of PNRD, CANTATAdb, GREENC, PlncDB, EVLncRNA databases compared with new index for LncAPDB library.

  • LncAPDB.fasta - lncRNA sequences of LncAPDB library in fasta format.

  • lncrna_class.tmap - novel lncRNAs divided into gffcompare classes.

  • lncrna_coordinats.bed - coordinates of novel lncRNAs on chromosomes.

  • lncrna_intron_size.tsv - intron size of novel lncRNAs and their coordinates on the genome.

  • new_lncrna.fasta - novel lncRNA sequences in fasta format.

  • new_lncrna_locus.loci - locus of lncRNA sequences on genome.

  • transcriptome_lib.txt - Maize transcriptome libraries.

Configuration file

Input all necessary files into configuration file “config.yaml”:

  • lnc: - known LncRNA sequences of studied organisms in FASTA format.

    • (Example: lnc: "data/input/lincrna.fasta" )
  • cds: - known CDS (coding) sequences of studied organisms in FASTA format.

    • (Example: cds: "data/input/7000_len_mRNA.fasta" )
  • model: - pipeline check if model (lncFinder) for this organism already exist and use it, if it's not, pipeline will create a new one (in output folder). Model that exist represented in Models .

    • (Example: model: "Zea_mays" )
  • sequence: - sequences that need to study in FASTA format.

    • (Example: sequence: "data/input/Zea.fasta" )
  • structure: - need to choose whether to use secondary structure in model building or not. Choose between DNA or SS (secondary structure).

    • (Example: structure: "DNA" )
  • gmap:

    • gff_reference: - reference genome annotation in GFF format.

      • (Example: gff_reference: "data/reference/Zea_mays.AGPv4.40.gff" )
    • reference: - reference genome in FASTA format.

      • (Example: reference: "data/reference/Zea_mays.AGPv4.dna.toplevel.fa" )
    • out: - output file.

      • (Example: out: "data/output/alignm.gff" )
  • gff: - gff file of reference genome.

    • (Example: gff: "data/reference/Zea_mays.AGPv4.40.gff" )
  • blast: - standard BLAST parameters

    • evalue: - evalue

      • (Example: evalue: 1e-50 )
    • max_target: - max_target_seqs

      • (Example: max_target: 1 )
    • identity: - perc_identity

      • (Example: identity: 30 )
    • threads: - num_threads

      • (Example: threads: 1 )
  • TE_coords: - if there is no TE coordinats information - recomended to turn off this step.

    • option: "on" - turn on or turn off this step (on/off)

    • coords: - TE coordinat

      • (Example: coords: "data/reference/N_coords/Zea_N_merged.bed" )
  • tissue:

    • organism: - choose species that you study form species list , or input your own species ID in same format

      • (Example: organism: "ZMAY" )
    • exp: - choose between organisms tissue experiments in Tissue analysis or input your organism.

      • (Example: exp: "ZM" )
    • LncAPDB: - File with sequence ID from public lncRNA databases connected to IDs in library of known lncRNAs.

      • (Example: LncAPDB: "tissue/index_and_newindex.csv" )
    • SRX_exp: - Annotation of the transcriptome libraries by tissue type and gene expression levels.

      • (Example: SRX_exp: "tissue/SRX_all_org.tsv" )
  • expression :

    • option : "on" - turn on or turn off this step (on/off)

    • path : - path to expression data.

      • (Example: path: "expressin/Kallisto" )
    • calculation_type : - type of expression data. There three types: 1.Kallisto 2.Htseq 3.Own type of data.

      • (Example: path: "Kallisto" )

Folders

data/input

Contain:

  1. Known LncRNA transcripts of the species in FASTA format.

  2. Known protein coding sequences of the species in FASTA format.

  3. RNA-seq transcripts of the species in FASTA format.

data/reference

Contain:

  • reference genome of studied organism in FASTA format.

  • annotation of studied organism in GFF/GTF format.

  • data_index folder with library of known lncRNA from databases.

  • intron_data folder with structure information of organisms.

  • models folder with model for lncFinder.

  • N_coords folder with TE coordinats

Work start

1. snakemake -j 2

-j or --cores - Use at most N CPU cores/jobs in parallel. If N is omitted or ‘all’, the limit is set to the number of available CPU cores.

2. snakemake -nr

-n - Do not execute anything, and display what would be done. If you have a very large workflow, use –dry-run –quiet to just print a summary of the DAG of jobs.

-r - Print the reason for each executed rule.

3. snakemake --use-conda

--use-conda - Use additional conda environment.

4. Recommended run:

snakemake -j 2 --use-conda

Models

  • Zea_mays

  • Arabidopsis_thaliana

LncRNA structure information

  • Zea_mays

  • Arabidopsis_thaliana

  • Oryza_sativa

Known LncRNA for database

Folder data/reference/data_index contain lncRNA library (LncAPDB.fasta) that build on base of 5 lncRNA databases (PNRD, CANTATAdb, GREENC, PlncDB, EVLncRNA). The library you can find here: https://data.mendeley.com/v1/datasets/fnk8pmp2yz/draft?a=13de1dfb-b631-42f3-8108-f3de2f50fd90

Species

  • Medicago truncatula - MTRU

  • Glycine max - GMAX

  • Populis trichocarpa - PTRI

  • Arabidopsis thaliana - ATHA

  • Vitis vivifera - VVIN

  • Oryza sativa japonica - OSAT

  • Brachipodiumdistachion - BDIS

  • Sorghum bicolor - SBIC

  • Zea mays - ZMAY

  • Selaginella moellendorfii - SMOE

  • Physcomitrella patens - PPAT

  • Ostreococcus tauri - OTAU

  • Volvox - VOLV

  • Amborell trichopoda - ATRI

Tissue analysis

File tissue/SRX_all_org.tsv , contains information about 1241 transcript experiment libraries for 5 organisms with respect to specific tissue. In config.yaml need to choose between organisms that presented in SRX file:

  • HV - Hordeum vulgare

  • OS - Oryza sativa

  • SL - Solanum lycopersicum

  • ST - Solanum tuberosum

  • ZM - Zea mays

If the libraries you are investigating (for other organism for exemple) are not in this file. You can add them to the data file in the same format

ST	SRX1478098	leaf
ZM	SRX339769	stomatal division zone
ZM	SRX339787	ear
ZM	SRX711129	leaf
OS	SRX2582231	unknown
ZM	SRX711024	ear
ZM	SRX711053	leaf

Where 1 column - id, 2 - organism that you study, 3 - SRX library, 4 - tissue.

Expression and entropy

For this step there is necessary to provide expression data for every transcript. Thus recommended turn off this step at first run of the pipeline. After receiving expression data turn on this step and pipeline will calculete entropy and espression statistic. In config.yaml need to choose between three types of table building:

  • Kallisto - result are received by means of Kallisto program. Results should be presented as on example:
├── SRR957466
│ ├── abundance.h5
│ ├── abundance.tsv
│ └── run_info.json
├── SRR957467
│ ├── abundance.h5
│ ├── abundance.tsv
│ └── run_info.json
└── SRR957468 ├── abundance.h5 ├── abundance.tsv └── run_info.json
  • Htseq - result are received by means of Htseq program. Results should be presented as on example:
├── SRR1586686.txt
└── SRR504468.txt
  • own - In this case user should build table by himself. Table should looks like as table presented in example folder. Where:

  • type - mRNA - is protein coding genes. noncons - nonconserved lncRNAs. consv - conserved lncRNAs.

  • all tissue that needed. In example table, value is a median of expression for libraries that belong to tissue in column name.

  • target_id - ID of the transcript. it is necessary that the transcripts correspond to the results of the pipeline.

Output

A typical structure of Output is follows:

├── gffcompare_first
 ├── gffcmp.annotated.gtf
 ├── gffcmp.filter_alignm.bed.refmap
 ├── gffcmp.filter_alignm.bed.tmap
 ├── gffcmp.loci
 ├── gffcmp.stats
 └── gffcmp.tracking
├── gffcompare_second
 ├── gff.annotated.gtf
 ├── gff.filter_alignm.bed.refmap
 ├── gff.filter_alignm.bed.tmap
 ├── gff.loci
 ├── gff.stats
 └── gff.tracking
├── GMAP
 ├── alignm.bed
 ├── alignm_filter.gff
 ├── filter_alignm.bed
 ├── gmap_build.err.log
 ├── gmap_build.out.log
 ├── gmap.out.log
 ├── reference.bed
 └── statistic.csv
├── lncRNA_prediction
 ├── Coding.fasta
 ├── cpc.txt
 ├── lncFinder.csv
 └── Noncoding.fasta
├── lncRNA_structure
 ├── anti.png
 ├── exon_size.png
 ├── intron_size.png
 ├── itron_coordin.tsv
 ├── long_transcripts.bed
 ├── number_of_exon.png
 ├── number_of_lncRNA.png
 └── statistic_bed.tsv
├── loci
 ├── lncRNA_before_loci.bed
 └── lncRNA_loci.bed
├── new_lncRNA
 ├── blast.outfmt6
 ├── classes.png
 ├── LncAPDB_vs_blast.csv
 ├── new_lncrna.fasta
 └── True_lncRNA.bed
├── TE_finder
 ├── Lnc_aling_with_TE.tsv
 └── lncRNA_after_loci.bed
├── tissue
 ├── all.txt
 ├── cod.txt
 ├── cons.txt
 ├── id.txt
 ├── non.txt
 ├── tissue_org.csv
 ├── tissue_org.png
 ├── transc_cod.csv
 ├── transc.csv
 └── transc_non.csv
├── expression
 ├── box_plot.png
 ├── entropy.png
 ├── expr_table.tsv
 ├── hist_expr.png
 └── table_type.txt
  • Folder "tissue", contains lncRNA distribution in tissue.

Groups of output files

1. Pre-processing.

Pre-processing performed in input directory:

  • filtered_seq.fasta - filter transcripts with length < 200 nt.

  • new_vs_old_id.tsv - new ID and old ID of transcripts.

  • test_train - folder of lncFinder model studie.

2. LncRNA filtering.

lncRNA_prediction

lncRNA_prediction

  • Coding.fasta - Transcripts predicted as protein coding by lncFinder, FASTA format.

  • cpc.txt - Transcripts predicted as lncRNA by CPC2, TXT format.

  • lncFinder.csv - Transcripts predicted as lncRNA by lncFinder, CSV format.

  • Noncoding.fasta - Transcripts predicted as lncRNA by lncFinder, FASTA format.

GMAP alignment on reference genome.

GMAP

  • alignm.bed - results of lncRNA transcripts alignment on reference genome by GMAP, 'BED' format.

  • alignm_filter.gff - results of lncRNA transcripts alignment on reference genome without long intron transcripts, GFF format.

  • filter_alignm.bed - results of lncRNA transcripts alignment on reference genome without long intron transcripts, BED format.

  • gmap_build.err.log

  • gmap_build.out.log

  • gmap.out.log

  • reference.bed - reference genome annotation, BED format.

  • statistic.csv - lncRNA structure statistic, CSV format.

Filtering erros/noise.

gffcompare_first - classification of the candidate lncRNA sequences by their location in the genome relative to a protein-coding genes, more information can be found on gffcompare

  • gffcmp.annotated.gtf - Gffcompare reports a GTF file containing the “union” of all transcript IDs in each sample.

  • gffcmp.filter_alignm.bed.refmap - This tab-delimited file lists, for each reference transcript, which query transcript either fully or partially matches that reference transcript.

  • gffcmp.filter_alignm.bed.tmap - This tab delimited file lists the most closely matching reference transcript for each query transcript.

  • gffcmp.loci - Represent loci of transcripts.

  • gffcmp.stats

  • gffcmp.tracking - This file matches transcripts up between samples.

loci

  • lncRNA_before_loci.bed - the candidate lncRNA sequences before merging into loci, BED format.

  • lncRNA_loci.bed - loci of the candidate lncRNA, BED format.

gffcompare_second - classification of the candidate lncRNA sequences by their location in the genome relative to lncRNA loci.

  • gff.annotated.gtf

  • gff.filter_alignm.bed.refmap

  • gff.filter_alignm.bed.tmap

  • gff.loci

  • gff.stats

  • gff.tracking

Filtering possible transposable elements (TEs).

TE_finder

  • Lnc_aling_with_TE.tsv - the candidate lncRNA sequences that mapped on TE, TSV format.

  • lncRNA_after_loci.bed - the candidate lncRNA sequences that are completely matched ( = - class of gff compare) with lncRNA loci, BED format.

3. LncRNA annotation.

lncRNA_structure

  • anti.png - allocation of antisense lncRNA alignment to target gene structure.

  • exon_size.png - lncRNA exon size distribution.

  • intron_size.png - lncRNA intron size distribution.

  • itron_coordin.tsv - coordinates lncRNA introns, TSV format.

  • long_transcripts.bed - transcripts with length > 5000 nt, BED format.

  • number_of_exon.png - the ratio of the number of exons per lncRNA.

  • number_of_lncRNA.png - number of lncRNA per chromosome.

  • statistic_bed.tsv - lncRNA intron statistic, TSV format.

new_lncRNA

  • blast.outfmt6 - BLASTn results in outfmt6 format.

  • classes.png - distribution of lncRNA classes.

  • LncAPDB_vs_blast.csv - known lncRNAs that were predicted.

  • new_lncrna.fasta - identified lncRNA transcripts, FASTA format. ( final result )

  • True_lncRNA.bed - identified lncRNA transcripts, BED format.

tissue

  • all.txt - lncRNA that have homologs with known lncRNAs.

  • cod.txt - ID of proteine coding transcripts.

  • cons.txt - ID of the lncRNAs similar to lncRNAs from other species (conserved sequences)

  • id.txt - ID of all input transcripts

  • non.txt - ID of the lncRNAs similar only to known studied organism sequences (non-conserved sequences)

  • tissue_org.csv - all in one (proteine coding, conserved lncRNAs, non-conserved lncRNAs)

  • tissue_org.png - heatmap of transcript libraries tissue type.

  • transc_cod.csv - proteine coding transcript libraries tissue type.

  • transc.csv - conserved lncRNAs transcript libraries tissue type.

  • transc_non.csv - non-conserved lncRNAs transcript libraries tissue type.

  • Blastn alignment.

  • Analysis of lncRNA expression.

expression

  • box_plot.png - box plot diagrams of gene expression.

  • entropy.png - density plot of Shannon entropy.

  • expr_table.tsv - table with expression data for every transcript in every tissue.

  • hist_expr.png - density plot of maximum expression levels of transcripts.

  • table_type.txt - type of expression results.

Errors

rule model_lncFinder: input: data/input/test_train/sets.txt output: data/input/test_train/dirs.txt jobid: 2
Activating conda environment /home/pronozin/ICAnnoLncRNA---identification-classification-and-annotation-of-LncRNA-main/.snakemake/conda/767d2b45.
/bin/bash: activate:

Solution : type conda install conda in ICAnnoLncRNA environment.

rule Gmap_build:
 input: data/reference/Zea_mays.AGPv4.dna.toplevel.fa, data/reference/Zea_mays.AGPv4.40.gff
 output: data/output/statistic.csv
 jobid: 5
...
Error in rule Gmap_build:
 jobid: 5
 output: data/output/statistic.csv
RuleException:
CalledProcessError in line 123 of ...:
Command ' set -euo pipefail; ...'

Solution : update gtfparse package. Example: conda install -c bioconda gtfparse

Code Snippets

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
import pandas as pd
import sys
import numpy as np
from pybedtools import BedTool
pd.set_option('display.max_columns', None)

def clean(x):
    split = x.str.split('.', 4, expand=True)
    split = split[0]
    return (split)

tmap = pd.read_csv(sys.argv[1], sep='\t')
gmap_align_prime = pd.read_csv(sys.argv[2], sep='\t', header=None)

# take only lncRNA classes from tmap file
tmap['group'] = np.nan
tmap['group'][tmap['class_code'].isin(['='])] = 'true lncrna'
tmap.dropna(subset = ["group"], inplace=True)
tmap = tmap[tmap['qry_gene_id'].str.contains("path1")]
print(tmap['group'].value_counts())
# filter duplicates
query = clean(tmap['qry_gene_id'])
gmap_align_prime['name'] = clean(gmap_align_prime[3])
query = query.drop_duplicates()
# take lncRNA from alignment file
if str(sys.argv[3]) == 'on':
   print('TE finder ON')
   gmap_align_prime = gmap_align_prime[gmap_align_prime['name'].isin(query)]
   gmap_align = gmap_align_prime[gmap_align_prime[7].str.contains("gene")]
   gmap_align.to_csv(sys.argv[4], sep='\t', index=False, header=None)
# BedTool intersect
   genes = BedTool(sys.argv[4])
   loci = genes.intersect(wa=True, wb=True, b=sys.argv[5])
   loci_df = pd.read_table(loci.fn, sep='\t', names=['chr','lncRNA_start','lncRNA_end','lncRNA',4,'lncRNA_strand',6,7,8,9,10,'chr_TE','TE_start','TE_end'])
   loci_df = loci_df[['chr','lncRNA_start','lncRNA_end','lncRNA','lncRNA_strand','chr_TE','TE_start','TE_end']]
   loci_df = loci_df.drop_duplicates(subset=['lncRNA'])
   loci_name = loci_df['lncRNA'].str.split('.',4, expand=True)
   true_lncrna = gmap_align_prime[~gmap_align_prime['name'].isin(loci_name[0])]
   loci_df.to_csv(sys.argv[6], sep='\t', index=False, header=None)
   true_lncrna.to_csv(sys.argv[7], sep='\t', index=False, header=None)
else:
   gmap_align_prime = gmap_align_prime[gmap_align_prime['name'].isin(query)]
   gmap_align_prime.to_csv(sys.argv[4], sep='\t', index=False, header=None)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
import os
import subprocess
import sys
from Bio import SeqIO
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt



def data_base(x):
    x_n = x.rsplit('/', 1)[1]
    x_n = x_n.rsplit('.', 1)[0]
    dir_check = os.path.isfile('data/reference/data_index/' + x_n + '.nin')
    if dir_check is True:
        print('index exist')
        index_path = os.path.join('data/reference/data_index/', x_n)
        return index_path
    else:
        print('build index')
        index_path = os.path.join('data/reference/data_index/', x_n)
        command = 'makeblastdb -in {db} -dbtype nucl -parse_seqids -out {index}'. format(db=x, index=index_path)
        exit_code = subprocess.call(command, shell=True)
        return index_path

def fasta(x):
    query = x['name']
    for w in query:
        try:
            old_id = old_new[old_new['new_id'].isin([w])]
            print(record_dict[str(old_id['old_id'].to_list()[0])].format('fasta'), end='', file=lncrna)
        except KeyError:
            continue

def alingment(x,y):

    path = x.rsplit('/', 1)[0]
    outfmt = os.path.join(path, 'blast' + '.outfmt6')
    aling = 'blastn -query {q} -db {dbw} -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen"  -evalue {evalue}  -max_target_seqs {max_target} -perc_identity {identity} -num_threads {threads} -out {outfmt}'. format(q=x, dbw=y, evalue=evalue, max_target=max_target, identity=identity, threads =threads, outfmt=outfmt)
    #| sort -k1,1 -k12,12nr -k11,11n | sort -u -k1,1 --merge > {outfmt}'. format(q=x, dbw=y, outfmt=outfmt)
    exit_code = subprocess.call(aling, shell=True)

# input data
query = pd.read_csv(sys.argv[1], sep='\t', header=None)
gff_tmap = sys.argv[2]
data = sys.argv[3]
record_dict = SeqIO.to_dict(SeqIO.parse(sys.argv[4], "fasta"))
lncrna = open(sys.argv[5], 'w', encoding='utf-8')
evalue = sys.argv[6]
max_target = sys.argv[7]
identity = sys.argv[8]
threads = sys.argv[9]
old_new = pd.read_csv(sys.argv[10], sep='\t')
tmap = pd.read_csv(gff_tmap, sep='\t')
# lncRNA classification graf
query = query[query[7].str.contains("gene")]
tmap = tmap[tmap['qry_gene_id'].str.contains("path1")]
name_tmap = tmap['qry_gene_id'].str.rsplit('.', expand=True)
tmap[['name','seq']] = name_tmap[[0,1]]
#tmap = tmap[tmap['seq'].str.contains("path1")]
tmap = tmap[tmap['name'].isin(query[10])]
tmap= tmap.drop_duplicates(subset=['name'])
tmap['group'] = np.nan
tmap['group'][tmap['class_code'].isin(['x'])] = "exon antisense"
tmap['group'][tmap['class_code'].isin(['i'])] = "intron"
tmap['group'][tmap['class_code'].isin(['u'])] = "intergenic"
tmap.dropna(subset = ["group"], inplace=True)
print(tmap['group'].value_counts())
f, ax = plt.subplots(figsize=(13, 13))
tmap['group'].value_counts().plot(kind='bar')
plt.xticks(size=15, rotation=30)
plt.yticks(size=15)
plt.ylabel('LncRNA transcripts number', size=15)
plt.xlabel('LncRNA classes', size=15)
plt.savefig("data/output/new_lncRNA/classes.png", dpi=500)
# BLAST
fasta(tmap)
indx = data_base(data)
alingment(sys.argv[5], indx)
1
2
3
4
5
6
7
8
9
test=$(ls -lrt $2 | cut -f 5 -d ' ')
minimumsize=4000000000
if  [ $test -gt $minimumsize ]; then
	mkdir ./data/input/chunk
	awk -v size=10000 -v pre=prefix -v pad=5 '/^>/{n++;if(n%size==1){close(f);f=sprintf("./data/input/chunk/%s%0"pad"d.fa",pre,n)}}{print>>f}' $2
	Rscript scripts/lncFind.r $1 ./data/input/chunk $3 $4
else
	Rscript scripts/lncFind.r $1 $2 $3 $4
fi
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
library(ggpubr)
library(dplyr)
library(purrr)
library(BioQC)
library(ggplot2)
library(data.table)

args <- commandArgs(TRUE)

entrop <- function(a) {
  ent <- entropySpecificity(a, norm=TRUE)
  return(ent)
}

trans <- function(a) {
  a$target_id <- NULL
  a$type <- NULL
  a$sum <- NULL
  entrop(a)
}

type <- function(a) {
  col_name <- colnames(a)
  col_name <- col_name[! col_name %in% c("type", "target_id")]
  a$sum <-  rowSums(a[,col_name], na.rm=TRUE)
  a <- filter(a, sum >0)
  ent <- trans(a)
  a$entropy <- ent
  return(a)
}

plot <- function(a,b,c) {
  ggplot() +
    geom_density(data = a, aes(x = entropy, fill="Non"), alpha=0.4)+
    geom_density(data = b, aes(x = entropy, fill="Cons"), alpha=0.4)+
    geom_density(data = c, aes(x = entropy, fill="mRNA"), alpha=0.4)+
    scale_colour_manual("", 
                        breaks = c("Non", "Cons", "mRNA")) +
    xlab('entropy') + ylab('density') + guides(fill = guide_legend(title = "Types")) + 
    theme(text = element_text(size = 20))
  ggsave("data/output/expression/entropy.png", width = 20, height = 20)
}

log <- function(a) {
       col_name <- colnames(a)
       col_name <- col_name[! col_name %in% c("type", "target_id")]  
       col <- c()
       for (x in col_name) {
            col <- append(col, paste("log_", noquote(x), sep = ""))
            a[, paste("log_", noquote(x), sep = "")] <- log2(a[, noquote(x)])
       } 
       new_col <<- col
       return(a)
}

# install dataframe
df <- read.table(args[1], header = TRUE)

# log of columns
df_log <- log(df) 

# exprassin analysis
df_log[df_log == -Inf] <- 0
my_comparisons = list(c("consv", "non_cons"), c("consv", "prot"),  c("prot", "non_cons"))

ggboxplot(df_log, x="type", y= new_col , merge = "flip", ylab = "log2(TPM)", color = "type", palette = "jco")+  
  font("subtitle", size = 20)+
  font("caption", size = 20)+
  font("xlab", size = 20)+
  font("ylab", size = 20)+
  font("xy.text", size = 20) +theme(legend.text=element_text(size=20))
ggsave("data/output/expression/box_plot.png", width = 20, height = 20)

ggdensity(df_log, x = new_col , y = "..density..", facet.by = "type", merge = TRUE, xlab = "TPM", rug = TRUE, add = "median", palette = "jco") +  font("subtitle", size = 20)+
  font("caption", size = 20)+
  font("xlab", size = 20)+
  font("ylab", size = 20)+
  font("xy.text", size = 20) + theme(legend.text=element_text(size=20))
ggsave("data/output/expression/hist_expr.png", width = 20, height = 20)

# entropy analysis
non <-df[grepl("noncons", df$type),]
cons <-df[grepl("consv", df$type),]
mRNA <-df[grepl("mRNA", df$type),]

non_ent <- type(non)
cons_ent <- type(cons)
mRNA_ent <- type(mRNA)

plot(non_ent, cons_ent, mRNA_ent)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
import pandas as pd
import os
import sys


strc = str(sys.argv[3])[1:-1]

if strc == 'own':
   print('Own table')
   with open(sys.argv[9], 'w') as f:
        f.write('Own table')
else:
  print('Build table')
  tissue = pd.read_csv(sys.argv[1], sep='\t', header=None)
  path = str(sys.argv[2])[1:-1]
  new_old =  pd.read_csv(sys.argv[4], sep='\t')
  cons =  pd.read_csv(sys.argv[5], sep=' ', header=None)
  noncons =  pd.read_csv(sys.argv[6], sep=' ', header=None)
  cod = pd.read_csv(sys.argv[7], sep=' ', header=None)
  cod = pd.merge(cod, new_old, how='inner', left_on=[0], right_on=['new_id']).drop(columns = [0])
  cod['type'] = 'mRNA'
  noncons['type'] = 'noncons'
  cons['type'] = 'consv'

  if strc == 'Kallisto':
     list_dir = os.listdir(path)
     lib = {}
     for w in list_dir:
         abudance = pd.read_csv(path + '/' + w +'/' + "abundance.tsv", sep='\t')
         lib[w] = abudance['tpm']
  if strc == 'Htseq':
     list_dir = os.listdir(path)
     lib = {}
     for w in list_dir:
         id = w.split('.')
         abudance = pd.read_csv(path + '/' + w, sep='\t', header=None)
         abudance = abudance.rename(columns={0: 'target_id'})
         lib[id[0]] = abudance[1]

  lib_data = pd.DataFrame.from_dict(lib)
  colnames = lib.keys()
  tissue = tissue[tissue[1].isin(colnames)]
  tissue_gr = tissue.groupby([2])
  median_tis = {}
  for k,v in tissue_gr:
     lib_data_gr = lib_data[v[1]]
     median_tis[k] = lib_data_gr.median(axis=1)

  median_tis_df = pd.DataFrame.from_dict(median_tis)
  median_tis_df['target_id'] = abudance['target_id']
  cod = pd.merge(cod, median_tis_df, how='inner', left_on=['new_id'], right_on=['target_id']).drop(columns = ['old_id', 'new_id'])
  noncons = pd.merge(noncons, median_tis_df, how='inner', left_on=[0], right_on=['target_id']).drop(columns = [0])
  cons = pd.merge(cons, median_tis_df, how='inner', left_on=[0], right_on=['target_id']).drop(columns = [0])
  full_table = pd.concat([cod, cons, noncons])
  full_table.to_csv(sys.argv[8], sep='\t', index=False)
  with open(sys.argv[9], 'w') as f:
        f.write(strc)
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
import pandas as pd
import sys
from collections import Counter
import matplotlib.pyplot as plt
import scipy.stats as stats
from matplotlib.ticker import FuncFormatter
from functools import partial
import math
import statistics
from scipy.stats import lognorm
import numpy as np

pd.options.mode.chained_assignment = None  # default='warn'

def to_percent(y, position, n):
    s = str(round(100 * y / n, 3))
    return s + '%'

def coord(x):
    Row_list_query = []
    for index, rows in x.iterrows():
        my_list = [rows[1], rows[2]]
        Row_list_query.append(my_list)
    return(Row_list_query)

def find_same_coords(x):
    x['exon_num'] = x['name'] +"_"+ exon[7].astype(str).str.cat(x['number'].astype(str), sep='_')
    uni3 = x.drop_duplicates(subset=[1])
    target_coord = coord(uni3)
    for start_stop1 in query_coord:
        start1, stop1 = map(int, start_stop1)
        for start_stop in target_coord:
            start, stop = map(int, start_stop)
            if start1 in range(start-100, start + 100) and stop1 in range(stop - 100, stop+100):
                same_query = uni[1] == start1
                same_target = uni3[1] == start
                index = uni3[same_target]
                lines = uni[same_query]
                chrom.append([' '.join('{0}'.format(i) for i in lines[3]),
                                  ' '.join('{0}'.format(i) for i in index['exon_num']), str(round(start1)), str(round(stop1)), str(start),
                                  str(stop), ' '.join('{0}'.format(i) for i in index['number']),
                                  ' '.join('{0}'.format(i) for i in lines['length']), ' '.join('{0}'.format(i) for i in index[5]), ' '.join('{0}'.format(i) for i in lines[5])])


query = pd.read_csv(sys.argv[1], sep='\t', header=None)
target = pd.read_csv(sys.argv[2], sep='\t', header=None)
tmap = pd.read_csv(sys.argv[3], sep='\t', header=None, skiprows=1)
small_intron = pd.read_csv(sys.argv[4], sep='\t', header=None)
query = query[pd.to_numeric(query[0], errors='coerce').notnull()]
target = target[pd.to_numeric(target[0], errors='coerce').notnull()]

statistic = open(sys.argv[5], 'w', encoding='utf-8')

query = query[~query[10].isin(small_intron[0])]

name_tmap = tmap[3].str.split('.',4, expand=True)
tmap['name'] = name_tmap[0]

tmap = tmap[tmap[2].isin(['x', 'i', 'u'])]
tmap = tmap[tmap['name'].isin(query[10])]


# exon structure of lncRNA
chart = query[query[7].str.contains("exon")]
spl = chart[3].str.split('.',2, expand=True)
size = spl.pivot_table(index = [0], aggfunc ='size').to_dict()
res = Counter(size.values())
dict={}
for k, v in res.items():
    dict[k] = v/len(size)*100
f, ax = plt.subplots(figsize=(15, 5))
ax.bar(list(dict.keys()), dict.values(), color='b')
#ax.set_xlim([0, 20])
plt.xticks(np.sort(range(max(list(dict.keys())) + 1)))
plt.ylabel('Proportion of lncRNAs (%)', size=15)
plt.xlabel('Number of exon', size=15)
plt.savefig("data/output/lncRNA_structure/number_of_exon.png", dpi=500)

#exon size chart
chart['exon_length'] = (chart[2] - chart[1])
step = int(math.ceil(chart['exon_length'].mean() / 100.0)) * 100
f, ax = plt.subplots(figsize=(10, 5))
density = stats.gaussian_kde(chart['exon_length'])
n, x, _ = ax.hist(chart['exon_length'],  density=True, bins = np.logspace(np.log10(min(chart['exon_length'])),np.log10(max(chart['exon_length'])), 50),  color='b')
ax.plot(x, density(x), color='r')
ax.set_xscale('log')
ax.set_ylabel('Probability', size=15)
ax.set_xlabel('Exon size (bp)', size=15)
plt.savefig("data/output/lncRNA_structure/exon_size.png", dpi=500)

print('Mean_exon_length','\t', statistics.mean(chart['exon_length']), file=statistic)
print('Max_exon_length','\t', max(chart['exon_length']), file=statistic)
print('Min_exon_length','\t', min(chart['exon_length']), file=statistic)
print('Median_exon_length','\t', statistics.median(chart['exon_length']), file=statistic)

# intron size chart
chart['name'] = spl[0]
gr = chart[[1,2,'name']].groupby(by='name')
intron_length = []
name = []
for key, item in gr:
    n=0
    if 0 < len(item['name']) - 2:
        intron = abs(item[2].iloc[n] - item[1].iloc[n + 1])
        if intron < 60:
           pass
        elif intron > 100000:
           pass
        else:
           n = n + 1
           intron_length.append(intron)

    else:
        intron_length.append(0)

intron_length = list(filter(lambda x: x != 0, intron_length))
step = int(math.ceil(statistics.mean(intron_length) / 100.0)) * 100
f, ax = plt.subplots(figsize=(10, 5))
density = stats.gaussian_kde(intron_length)
n ,x , _ = ax.hist(intron_length, density=True, bins=np.logspace(np.log10(min(intron_length)),np.log10(max(intron_length)), 50), color='b')
ax.plot(x, density(x), color="r")
ax.set_xscale('log')
ax.set_ylabel('Probability', size=15)
ax.set_xlabel('Intron size (bp)', size=15)
plt.savefig("data/output/lncRNA_structure/intron_size.png", dpi=500)

print('Mean_intron_length' ,'\t', statistics.mean(intron_length), file=statistic)
print('Max_intron_length' ,'\t', max(intron_length), file=statistic)
print('Min_intron_length' ,'\t', min(intron_length), file=statistic)
print('Median_intron_length' ,'\t', statistics.median(intron_length), file=statistic)


# choose only query transcripts
query = query[query[3].str.contains("path1")]
query['length'] = abs(query[1] - query[2])

print('Mean transcript length' ,'\t', statistics.mean(query['length']), file=statistic)

# chart lcnRNA distribution across chromosome
f, ax = plt.subplots(figsize=(10, 5))
labels, counts = np.unique(query[0].astype(int), return_counts=True)
ax.bar(labels, counts, align='center', color='b')
plt.gca().set_xticks(labels)
plt.ylabel('Number of lncRNAs', size=15)
plt.xlabel('Chromosome of organism', size=15)
plt.savefig("data/output/lncRNA_structure/number_of_lncRNA.png", dpi=500)

chromasom_all = pd.unique(target[0].astype(int))

# choose lncRNA type
name_target = target[9].str.split(';',4, expand=True)
target['name'] = name_target[0]
new_tmap = tmap[tmap[2].isin(['x'])]
# filter seq by lncRNA type
tmap_name = new_tmap[3].str.split('.',4, expand=True)
query_anti = query[query[10].isin(tmap_name[0].to_list())]
# choose only target exons
exon = target[target[7].str.contains("exon")]
exon = exon[exon['name'].groupby(exon['name']).transform('size')>1]
uni = query_anti.drop_duplicates(subset=[1])

chrom = []
for w in chromasom_all:
    print(w)
    query_new = uni[uni[0].astype(int).isin([w])]
    target_new = exon[exon[0].astype(int).isin([w])]
    query_coord = coord(query_new)

    strand = target_new.groupby([5])
    for key, item in strand:
        if key == "-":
            item['number'] = item.groupby(['name']).cumcount(ascending=False) + 1
            find_same_coords(item)
        elif key == "+":
            item['number'] = item.groupby(['name']).cumcount() + 1
            find_same_coords(item)

df = pd.DataFrame(chrom)
df['number'] = df.groupby([6]).cumcount() + 1
df[6] = df[6].astype(int)
df = df.sort_values(by=[6])
df.rename(columns={0: 'transcript_id', 1: 'protein_coding_id', 2:'protein_coding_id_start', 3:'protein_coding_id_end', 4:'transcript_id_start',
                   5:'transcript_id_end', 6:'protein_coding_exon', 7:'transcript_length', 8:'protein_coding_strand', 9:'transcript_strand', "number":'number_of_transcripts'}, inplace=True)
df.to_csv('data/output/lncRNA_structure/anti_vs_mrna.tsv', sep='\t', index=False)

f, ax = plt.subplots(figsize=(15, 5))
ax.bar(df['protein_coding_exon'], (df['number_of_transcripts']/(len(df['protein_coding_exon']))*100), color='b')
plt.ylabel('Number of aligned lnRNAs', size=15)
plt.xlabel('Exon of the gene', size=15)
plt.xticks(range(max(df['protein_coding_exon'])+1))
plt.savefig('data/output/lncRNA_structure/anti.png')
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
import os
import sys
import subprocess
from Bio import SeqIO
import pandas as pd
import numpy as np
from gtfparse import read_gtf

data_base = sys.argv[1]
gff = sys.argv[2]


def build(reference, kmer):
    args_alignment = []
    gmap_build = 'gmap_build'
    path, base = os.path.split(reference)
    ref_label = os.path.splitext(base)[0]
    gmap_build_logger_out_path = os.path.join('./data/output/GMAP', gmap_build + '.out.log')
    gmap_build_logger_err_path = os.path.join('./data/output/GMAP', gmap_build + '.err.log')


    command = '{gmap_build} -D {tmp_dir} -d {ref_index_name} -k {kmer_value} {reference} >> {log_out_1} 2>> {log_out_2}'.\
            format(gmap_build=gmap_build, tmp_dir=path, ref_index_name=ref_label, reference=reference, log_out_1=gmap_build_logger_out_path,
                   log_out_2=gmap_build_logger_err_path, kmer_value=kmer)
    exit_code = subprocess.call(command, shell=True)


gff_df = read_gtf(gff)
df = SeqIO.to_dict(SeqIO.parse(data_base, "fasta"))
size = round(os.path.getsize(data_base) / (1024*1024))
statistic = open(sys.argv[3], 'w', encoding='utf-8')

if size <= 64:
    k = 12
elif 64 < size <= 256:
    k = 13
elif 256 < size <= 1000:
    k = 14
elif 1000 < size <= 10000:
    k = 15
print("k" ,'\t', str(k), file=statistic)

total_len = []
for name,seq in df.items():
    total_len.append(len(seq))
print('Golden_Path_Length' ,'\t', str(np.sum(total_len)), file=statistic)

exon_only = gff_df[gff_df['feature'].str.contains('exon')]
exon_length = (exon_only['end'] - exon_only['start']).mean()
print('Mean_exon_length','\t', str(round(exon_length)), file=statistic)
transcript_only = gff_df[gff_df['feature'].str.contains('transcript')]
transcript_length = (transcript_only['end'] - transcript_only['start']).mean()
print('Mean_transcript_length','\t', str(round(transcript_length)), file=statistic)

gr = exon_only[['start','end','gene_id']].groupby(by='gene_id')
intron_length = []
for key, item in gr:
    n=0
    while n <= len(item['gene_id']) - 2:
        intron = abs(item['end'].iloc[n] - item['start'].iloc[n + 1]) - 1
        n = n + 1
        intron_length.append(intron)

mean_intron_length = np.mean(intron_length)
max_intron_length = np.max(intron_length)
min_intron_length = np.min(intron_length)
print('Mean_intron_length' ,'\t', str(round(mean_intron_length)), file=statistic)
print('Max_intron_length' ,'\t', str(round(max_intron_length)), file=statistic)
print('Min_intron_length' ,'\t', str(round(min_intron_length)), file=statistic)

build(data_base, k)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
import pandas as pd
import sys

def coord(x):
    uni = x.drop_duplicates(subset=[1])
    Row_list_query = []
    for index, rows in uni.iterrows():
        my_list = [rows[1], rows[2]]
        Row_list_query.append(my_list)
    return(Row_list_query)

gmap  = sys.argv[1]
known_seq = sys.argv[2]
out = sys.argv[3]

gmap_align = pd.read_csv(gmap, sep='\t', header=None)
target = pd.read_csv(known_seq, sep='\t', header=None)

target_coords = coord(target[target[7].str.contains("transcript")])
name = gmap_align[9].str.split(';',4, expand=True)
name = name[1].str.split('=',2, expand=True)
gmap_align['name'] = name[1]
gene = gmap_align[gmap_align[3].str.contains("path1")]
exon = gmap_align[gmap_align[3].str.contains("exon")]
gr = exon[[0,1,2,'name']].groupby(by='name')
intron_large = []
intron_small = []
intron_data = []
for key, item in gr:
    n = 0
    if 0 < len(item['name']) - 2:
        intron = abs(item[2].iloc[n] - item[1].iloc[n + 1])
        intron_data.append([item[0].iloc[n], item[2].iloc[n], item[1].iloc[n + 1], key + '_intron' + str(n+1),intron])
        n = n + 1
        if intron < 60:
           intron_small.append(key)
        if intron > 20000:
           intron_large.append(key)      
        else:
           pass
    else:
       pass
df = pd.DataFrame(intron_data)
df.to_csv('data/output/lncRNA_structure/itron_coordin.tsv', sep='\t', index=False, header=None)
its = pd.DataFrame(intron_small)
its.to_csv('data/output/lncRNA_structure/itron_small.txt', index=False, header=None)
gmap_align = gmap_align[~gmap_align['name'].isin(intron_large)]
gmap_align['length'] = gmap_align[2] - gmap_align[1]
gmap_align = gmap_align[gmap_align['length'] < 5000]
gmap_align_big = gmap_align[gmap_align['length'] > 5000]
gmap_align_big.to_csv('data/output/lncRNA_structure/long_transcripts.bed', index=False, sep='\t', header=None)
gmap_align = gmap_align[~gmap_align['name'].isin(gmap_align_big['name'])]
gmap_align = gmap_align[[0,1,2,3,4,5,6,7,8,9]]
gmap_align.to_csv(out, index=False, sep='\t', header=None)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
import os
import sys
import subprocess
import pandas as pd

data_base = sys.argv[1]
target = sys.argv[2]
parameters = sys.argv[3]
out_file = sys.argv[5]
opt = sys.argv[7:]


def aling(reference, transcripts, min, max, thr, lg):
    if lg[0] > 2**32:
       gmap_run = 'gmapl'
       print('large genome use gmapl')
    else:
       gmap_run = 'gmap'
    path, base = os.path.split(reference)
    ref_label = os.path.splitext(base)[0]
    out = os.path.join(out_file)
    gmap_run_logger_err_path = os.path.join('./data/output/GMAP', gmap_run + '.out.log')
    command = '{gmap} -D {tmp_dir} -d {ref_index_name} {transcripts} --min-intronlength={min_intron_length} --intronlength={intron_length} --cross-species ' \
                  '--format=gff3_gene --split-large-introns --npaths=1 -t {threads} {option} > {alignment_out} 2>> {log_out_2}'.\
            format(gmap=gmap_run, tmp_dir=path, ref_index_name=ref_label, transcripts=transcripts,
                   threads=thr, alignment_out=out, log_out_2=gmap_run_logger_err_path, min_intron_length=min[0], intron_length=max[0], option=' '.join('{0}'.format(w) for w in opt))
    exit_code = subprocess.call(command, shell=True)

dir_check = os.path.isdir('data/reference/intron_data/' + sys.argv[4])
if dir_check is True:
   print('use of existing statistics')
   gff_df = pd.read_csv('data/reference/intron_data/' + sys.argv[4] + '/' + sys.argv[4] + '.tsv', sep='\t', header=None)
   minn = gff_df[gff_df[0].str.contains("Min_intron_length")][1].to_list()
   maxx = gff_df[gff_df[0].str.contains("Max_intron_length")][1].to_list()
   length = gff_df[gff_df[0].str.contains("Golden_Path_Length")][1].to_list()
   thr = 40
   aling(data_base, target, minn, maxx, thr, length)
   print(target)
else:
   print('use of reference statistics')
   gff_df = pd.read_csv(parameters, sep='\t', header=None)
   minn = gff_df[gff_df[0].str.contains("Min_intron_length")][1].to_list()
   maxx = gff_df[gff_df[0].str.contains("Max_intron_length")][1].to_list()
   length = gff_df[gff_df[0].str.contains("Golden_Path_Length")][1].to_list()
   thr = 40
   aling(data_base, target, minn, maxx, thr, length)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
import pandas as pd
import sys
import numpy as np
from pybedtools import BedTool

def clean(x):
    split = x.str.rsplit('.', 2, expand=True)
    split = split[0]
    return (split)

tmap = pd.read_csv(sys.argv[1], sep='\t')
gmap_align = pd.read_csv(sys.argv[2], sep='\t', header=None)
# take only lncRNA classes from tmap file
tmap['group'] = np.nan
tmap['group'][tmap['class_code'].isin(['x'])] = "exon antisense"
tmap['group'][tmap['class_code'].isin(['i'])] = "intron"
tmap['group'][tmap['class_code'].isin(['u'])] = "intergenic"
tmap.dropna(subset = ["group"], inplace=True)
tmap = tmap[tmap['qry_gene_id'].str.contains("path1")]
print(tmap['group'].value_counts())
# filter duplicates
query = clean(tmap['qry_gene_id'])
gmap_align['name'] = clean(gmap_align[3])
query = query.drop_duplicates()
# take lncRNA from alignment file
gmap_align = gmap_align[gmap_align['name'].isin(query)]
gmap_align = gmap_align[gmap_align[7].str.contains("gene")]
gmap_align.to_csv(sys.argv[3], sep='\t', index=False, header=None)
# lncRNA into loci
genes = BedTool(sys.argv[3])
loci = genes.merge(s=True, c=[4,5,6], o=['collapse','mean','distinct'])
loci_df = pd.read_table(loci.fn, sep='\t', names=[0,1,2,3,4,5])
loci_df = loci_df[(loci_df[2] - loci_df[1])>200]
loci_df['numb'] = list(range(0, len(loci_df[3])))
loci_df['loc'] = 'LOC'
loci_df['loci'] = loci_df['loc'] + '_' + loci_df['numb'].astype(str)
loci_df = loci_df[[0,1,2,'loci',4,5,3]]
loci_df.to_csv(sys.argv[4], sep='\t', index=False, header=None)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
library(seqinr)
library(LncFinder)

args <- commandArgs(TRUE)
list_dir <- list.dirs(path = args[1], recursive = FALSE)
for (w in list_dir) {

lnc <- read.fasta(args[2])
mrna <- read.fasta(paste(w, '/train_mrna.fasta', sep=''))
strc <- args[3]
print(strc)

if (strc == "SS") {
print('secondary structure on')
RNAfold.path <- '/home/pronozinau/miniconda3/bin/RNAfold'

mrna1 <- run_RNAfold(mrna, RNAfold.path = RNAfold.path, parallel.cores = 40)
lnc1 <- run_RNAfold(lnc, RNAfold.path = RNAfold.path, parallel.cores = 40)

feach <- make_frequencies(cds.seq = mrna1,  mRNA.seq = mrna1, lncRNA.seq = lnc1,
SS.features = TRUE, cds.format = "SS",
lnc.format = "SS", check.cds = TRUE,
ignore.illegal = FALSE)
print('done')
saveRDS(feach, file=paste(w, '/frequencies.rds', sep=""))

my_model <- build_model(lncRNA.seq = lnc1, mRNA.seq = mrna1,
frequencies.file = feach, SS.features = TRUE,
lncRNA.format = "SS", mRNA.format = "SS",
parallel.cores = 2, folds.num = 2, seed = 1)
saveRDS(my_model, file=paste(w,'/model.rds',sep=""))

print(paste(w, '/test.fasta', sep=''))
seq <- read.fasta(paste(w, '/test.fasta', sep=''))
test <- run_RNAfold(seq, RNAfold.path = RNAfold.path, parallel.cores = 40)
result_2 <- lnc_finder(test, SS.features = TRUE, format = "SS",
frequencies.file = feach, svm.model = my_model,
parallel.cores = 2)
write.table(result_2, file=paste(w, '/results.csv', sep=""), col.names=F, sep = ',')
} else {
print('secondary structure off')

feach <- make_frequencies(cds.seq = mrna,  mRNA.seq = mrna, lncRNA.seq = lnc,
SS.features = FALSE, cds.format = "DNA",
lnc.format = "DNA", check.cds = TRUE,
ignore.illegal = FALSE)
print('done')
saveRDS(feach, file=paste(w, '/frequencies.rds', sep=""))

my_model <- build_model(lncRNA.seq = lnc, mRNA.seq = mrna,
frequencies.file = feach, SS.features = FALSE,
lncRNA.format = "DNA", mRNA.format = "DNA",
parallel.cores = 2, folds.num = 2, seed = 1)
saveRDS(my_model, file=paste(w,'/model.rds',sep=""))

print(paste(w, '/test.fasta', sep=''))
seq <- read.fasta(paste(w, '/test.fasta', sep=''))
result_2 <- lnc_finder(seq, SS.features = FALSE, format = "DNA",
frequencies.file = feach, svm.model = my_model,
parallel.cores = 2)
write.table(result_2, file=paste(w, '/results.csv', sep=""), col.names=F, sep = ',')
}
}
print(list_dir)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
from Bio import SeqIO
import sys
import pandas as pd

fasta = SeqIO.parse(sys.argv[1], "fasta")
corrected_file = sys.argv[2]
new_old_id = []
with open(corrected_file, 'w') as corrected:

 n = 1
 for rec in fasta:
    if len(rec.seq) >= 200:
        new_name =  "Transcript_" + str(n)
        new_old_id.append([rec.id, new_name])
        rec.id = new_name
        rec.description = new_name  # <- Add this line
        n = n + 1
        SeqIO.write(rec, corrected, 'fasta')
    else:
        pass

new_old_id_df = pd.DataFrame(new_old_id, columns=['old_id', 'new_id'])
new_old_id_df.to_csv(sys.argv[3], sep='\t', index=False)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import pandas as pd
import numpy as np
from Bio import SeqIO
import sys

lncfinder = pd.read_csv(sys.argv[1], sep=',', header=None)
record_dict = SeqIO.to_dict(SeqIO.parse(sys.argv[2], "fasta"))
lncrna = open(sys.argv[3], 'w', encoding='utf-8')
coding = open('data/output/lncRNA_prediction/Coding.fasta', 'w', encoding='utf-8')

lncfinder = lncfinder[[0,1]]
finder_cod = lncfinder[lncfinder[1].str.contains("NonCoding")==False]
finder_non = lncfinder[lncfinder[1].str.contains("NonCoding")]

for w in finder_non[0]:
    try:
      print(record_dict[w].format('fasta'), end='', file=lncrna)
    except KeyError:
      continue

for w in finder_cod[0]:
    try:
      print(record_dict[w].format('fasta'), end='', file=coding)
    except KeyError:
      continue
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
import numpy as np
from sklearn.model_selection import train_test_split
from Bio import SeqIO
import sys
import os

lnc = SeqIO.to_dict(SeqIO.parse(sys.argv[1], "fasta"))
cds = SeqIO.to_dict(SeqIO.parse(sys.argv[2], "fasta"))
train_lnc = open('./data/input/test_train/train_lnc.fasta', 'w', encoding='utf-8')
print(len(lnc.keys()))
print(len(cds.keys()))
test_lnc = len(lnc.keys())*0.25
test_mrna = test_lnc*2
n=0
keys = list(cds.keys())
pers_lnc=test_lnc/len((list(lnc.keys())))
Y_train, Y_test = train_test_split(list(lnc.keys()), test_size=pers_lnc, shuffle=True)
print(len(Y_test))
if len(Y_train)*2 <= len(cds.keys())/5:
    train_mrna = len(Y_train)*2
else:
    train_mrna = len(cds.keys())/5.2
print(train_mrna)
for w in Y_train:
    print(lnc[w].format('fasta'), end='', file=train_lnc)

while n <= 4:
    os.mkdir('./data/input/test_train/' + str(n))
    train_cds = open('./data/input/test_train/' + str(n) + '/train_mrna.fasta', 'w', encoding='utf-8')
    test_file = open('./data/input/test_train/' + str(n) + '/test.fasta', 'w', encoding='utf-8')
    compare = open('./data/input/test_train/' + str(n) + '/compare.csv', 'w', encoding='utf-8')
    pers_mrna=train_mrna/(len(keys))
    print(pers_mrna)
    X_train, X_test = train_test_split(keys, test_size=pers_mrna, shuffle=True)
    print('train_set',len(X_train))
    print('test_set',len(X_test))
    if len(X_test) > test_mrna:
    	pers_mrna_test = test_mrna/(len(X_test))
    else:
        true_per = len(X_test)*0.25
        pers_mrna_test = true_per/len(X_test)
    pers_mrna_test = test_mrna/(len(X_test))
    train, test = train_test_split(X_test, test_size=pers_mrna_test, shuffle=True)
    for w in test:
        print(cds[w].format('fasta'), end='', file=test_file)
        print(cds[w].id , 'mrna', sep='\t', file=compare)
    for w in train:
        print(cds[w].format('fasta'), end='', file=train_cds)
    for w in Y_test:
        print(lnc[w].format('fasta'), end='', file=test_file)
        print(lnc[w].id, 'lnc', sep='\t', file=compare)
    keys = [w for w in keys if w not in X_test]
    print('new_set',len(keys))
    n=n+1
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys

def value_prepare(x, y, z):
    name = x[0].str.rsplit('_', 1, expand=True)
    x['srx'] = name[1]
    same = pd.merge(df, x, how='inner', left_on=[1], right_on=['srx'])
    transcript = same[1].value_counts().reset_index()
    transcript = pd.merge(df, transcript, how='inner', left_on=[1], right_on=['index'])
    transcript.to_csv('data/output/tissue/' + y + '.csv')
    pr = same[2].value_counts().reset_index()
    pr = pr.rename(columns={2: z}, inplace=False)
    return (pr, same)


#LncAPDB_vs_blast table
blast = pd.read_csv(sys.argv[2], sep='\t', header=None)
LncAPDB = pd.read_csv(sys.argv[3], sep=' ', header=None)
old_new = pd.read_csv(sys.argv[4], sep='\t', header=None, skiprows=1)
LncAPDB = LncAPDB.drop_duplicates(subset=[1])
blast = blast[blast[2].isin(['100.000'])]
LncAPDB_vs_blast = pd.merge(blast, LncAPDB, how='inner', left_on=[1], right_on=[1])
LncAPDB_vs_blast = LncAPDB_vs_blast[['0_x',1,'2_x',10, '0_y','2_y']]
LncAPDB_vs_blast.rename(columns={'0_x': 'transcript_id', 1: 'library_id', '2_x':'percent_identity', 10:'e_value','0_y':'id_of_database', '2_y':'database'}, inplace=True)
LncAPDB_vs_blast.to_csv('data/output/new_lncRNA/LncAPDB_vs_blast.csv', sep='\t', index=False)

#tissue analysis
blast = pd.read_csv('data/output/tissue/cons.txt', header=None)
blast_non = pd.read_csv('data/output/tissue/non.txt', header=None)
blast_cod = pd.read_csv('data/output/tissue/cod.txt', header=None)
df = pd.read_csv(sys.argv[5], sep='\t', header=None)
df = df[df[0].isin([str(sys.argv[1])])]
df_tis = df[2].value_counts().reset_index()
blast_cod = old_new[old_new[1].isin(blast_cod[0])]
try:

    transc, same = value_prepare(blast, 'transc' , 'Conserved')
    transc_non, same_non = value_prepare(blast_non, 'transc_non', 'Nonconserved')
    transc_cod, same_cod = value_prepare(blast_cod, 'transc_cod', 'Coding')

    full_table = pd.merge(transc, transc_cod, how='outer', left_on=['index'], right_on=['index'])
    full_table = pd.merge(full_table, transc_non, how='outer', left_on=['index'], right_on=['index'])
    full_table = pd.merge(full_table, df_tis, how='inner', left_on=['index'], right_on=['index'])
    full_table = full_table.fillna(0)

    full_table['con'] = full_table['Conserved'] / full_table[2]
    full_table['noncon'] = full_table['Nonconserved'] / full_table[2]
    full_table['cod'] = full_table['Coding'] / full_table[2]

    full_table['lncRNA conserved'] = full_table['con'] / len(same[2])
    full_table['lncRNA nonconseved'] = full_table['noncon'] / len(same_non[2])
    full_table['mRNA'] = full_table['cod'] / len(same_cod[2])

    full_table = full_table[['index', 'lncRNA conserved', 'lncRNA nonconseved', 'mRNA']]

    full_table.to_csv('data/output/tissue/tissue_org.csv')
    full_table = full_table.set_index('index')

    fig,ax = plt.subplots(1,1,figsize=(8,10))
    sns.heatmap(full_table, annot=True, annot_kws={'size': 10}, xticklabels=True, cmap='Blues', ax=ax, cbar_kws={"shrink": .82}, yticklabels=True)
    plt.savefig('data/output/tissue/tissue_org.png', bbox_inches='tight')

except KeyError:
    print('Please check format of your sequence ID. It should be look like "MSTRG.1031.1_SRX123456" or "TRINITY_DN195_c0_g1_i1_SRX339783". The defining parameter is "_SRX#######" add it after your main ID.')
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
import pandas as pd
import sys
import glob

def Specificity(x,y):
    if x+y == 0:
       return float(0)
    else:
       return x/(x+y)
def Precision(x,y):
    if x+y ==0:
       return float(0)
    else:
       return x/(x+y)
def Sensitivity(x,y):
    if x+y == 0:
       return float(0)
    else:
       return x/(x+y)
def F1(x,y):
    if x+y == 0:
       return float(0)
    else:
       return 2*(x*y)/(x+y)
def TP_FP(q_c,q_n,t_c,t_n):
    TP = 0
    FP = 0
    FN = 0
    TN = 0
    for w in t_n.to_list():
        if w in q_n.to_list():
           TP = TP+1
        if w in q_c.to_list():
           FP = FP+1
    for w in t_c.to_list():
        if w in q_c.to_list():
           TN = TN+1
        if w in q_n.to_list():
           FN = FN+1
    print(TP , FN , FP, TN)
    Pre = Precision(TP,FP)
    Spe = Specificity(TN,FP)
    Sen = Sensitivity(TP,FN)
    F = F1(Pre,Sen)
    return Pre , Sen, Spe, F

dirr = glob.glob(sys.argv[1] + '/*/')
f1_comp = {}
for w in dirr:
    cds_lnc = pd.read_csv(str(w) + '/compare.csv', sep='\t', header=None)
    test = pd.read_csv(str(w) + '/results.csv', sep=',', header=None)
    test = test[[0,1]]
    results = open(str(w) + '/F1.csv', 'w', encoding='utf-8')
    cds_lnc_cod = cds_lnc[cds_lnc[1].str.contains("mrna")]
    cds_lnc_non = cds_lnc[cds_lnc[1].str.contains("lnc")]
    test_cod = test[test[1].str.contains("NonCoding")==False]
    test_non = test[test[1].str.contains("NonCoding")]
    print(len(cds_lnc_cod[0]), len(cds_lnc_non[0]), len(test_cod[0]), len(test_non[0]))
    pre, sen, spe, f = TP_FP(cds_lnc_cod[0], cds_lnc_non[0], test_cod[0], test_non[0])
    f1_comp[w] = f
    print(pre, sen, spe, f, sep='\t' , file=results)
best_model = open(sys.argv[1] + '/best_model.txt', 'w', encoding='utf-8')
print(max(f1_comp, key=f1_comp.get), file=best_model)
39
40
run:
   shell("python scripts/test_train.py {input} > {output}")
SnakeMake From line 39 of main/Snakefile
53
54
shell:
    "Rscript scripts/model_lncFind.r {params.path} {params.lnc} {params.str} >> {output}"
SnakeMake From line 53 of main/Snakefile
62
63
shell:
  "python scripts/TP_FN.py {params}"
SnakeMake From line 62 of main/Snakefile
72
73
shell:
  "python scripts/rename.py {input} {output} {params}"
SnakeMake From line 72 of main/Snakefile
86
87
shell:
    "scripts/Chunk_dataframe.sh {input} {params}"
SnakeMake From line 86 of main/Snakefile
98
99
shell:
    "CPC2/bin/CPC2.py -i {input} -r -o {params}"
SnakeMake From line 98 of main/Snakefile
107
108
shell:
    "python scripts/take_noncoding.py {input} {output}"
SnakeMake From line 107 of main/Snakefile
116
117
shell:
   "python scripts/GMAP_build.py {input} {output}"
SnakeMake From line 116 of main/Snakefile
126
127
shell:
    "python scripts/GMAP.py {input} {config[model]} {output}"
SnakeMake From line 126 of main/Snakefile
136
137
138
run:
    shell("gff2bed < {input.query} > {output.query_out}")
    shell("gff2bed < {input.reference} > {output.reference_out}")
148
149
shell:
   "python scripts/gmap_filter.py {input} {output}"
SnakeMake From line 148 of main/Snakefile
159
160
161
162
run:
    shell("gffcompare -r {input} -o {params}")
    shell("mv data/output/GMAP/gffcmp.filter_alignm.bed.tmap data/output/gffcompare_first/")
    shell("mv data/output/GMAP/gffcmp.filter_alignm.bed.refmap data/output/gffcompare_first/")
173
174
run:
    shell("python scripts/lncRNA_class.py {input} {params} {output}")
SnakeMake From line 173 of main/Snakefile
184
185
186
187
run:
    shell("gffcompare -r {input} -o {params}")
    shell("mv data/output/GMAP/gff.filter_alignm.bed.tmap data/output/gffcompare_second/")
    shell("mv data/output/GMAP/gff.filter_alignm.bed.refmap data/output/gffcompare_second/")
200
201
shell:
   "python scripts/bed_intersect.py {input.gffcomp} {input.aling} {params.option} {output.loci} {input.coords} {output.TE} {output.true_lnc}"
SnakeMake From line 200 of main/Snakefile
212
213
run:
    shell("python scripts/lncRNA_class.py {input} {params} {output}")
SnakeMake From line 212 of main/Snakefile
223
224
225
226
run:
    shell("gffcompare -r {input} -o {params}")
    shell("mv data/output/GMAP/gff.filter_alignm.bed.tmap data/output/gffcompare_second/")
    shell("mv data/output/GMAP/gff.filter_alignm.bed.refmap data/output/gffcompare_second/")
236
237
shell:
   "python scripts/bed_intersect.py {input} {params} {output}"
SnakeMake From line 236 of main/Snakefile
249
250
shell:
    "python scripts/gff.py {input} {params}"
SnakeMake From line 249 of main/Snakefile
269
270
shell:
   "python scripts/blast.py {input} {output.fasta} {params.evalue} {params.max_target} {params.identity} {params.threads} {params.old_new}"
SnakeMake From line 269 of main/Snakefile
285
286
287
288
289
290
291
292
run:
   shell("""grep -v NonCoding {input.lnc} | awk -F ',' '{{print $1}}' | tr -d '"' > data/output/tissue/cod.txt""")
   shell("grep -v  {params.org} {input.blast} | awk -F '\t' '{{print $1}}' | sort | uniq > data/output/tissue/cons.txt")
   shell("grep {params.org} {input.blast} | awk -F '\t' '{{print $1}}' | sort | uniq > data/output/tissue/non.txt")
   shell("grep '>' {input.new} | awk -F '>' '{{print $2}}' > data/output/tissue/id.txt")
   shell("awk -F '\t' '{{print $1}}' {input.blast} | sort | uniq > data/output/tissue/all.txt")
   shell("grep -v -w -f data/output/tissue/all.txt data/output/tissue/id.txt > data/output/tissue/non_aling.txt")
   shell("python scripts/tissue.py {params.exp} {input.blast} {params.LncAPDB} {params.old_new} {params.SRX}")
SnakeMake From line 285 of main/Snakefile
309
310
shell:
   "python scripts/expr_table_form.py {input.SRX} {params} {output}"
SnakeMake From line 309 of main/Snakefile
319
320
shell:
   "Rscript scripts/Expraession_entropy.R {params.table}"
SnakeMake From line 319 of main/Snakefile
ShowHide 29 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/artempronozin95/ICAnnoLncRNA---identification-classification-and-annotation-of-LncRNA
Name: icannolncrna-identification-classification-and-ann
Version: 2
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Other Versions:
Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...