A de novo prediction tool for chromatin accessible regions in plant genomes


Introduction

The result of an OCR (open chromatin region) assay such as DNase-seq or ATAC-seq reflects the open state of one tissue at one time point and does not cover all of the chromatin accessibility information of a species. Predicting the potential open regions of different tissues at different times across the whole genome sequence can help us understand gene transcription and regulation from a global perspective. CharPlant (Chromatin Accessible Regions for Plant) is a de novo OCR prediction tool based on a deep learning model. It takes complete DNA sequences or scaffolds as input and outputs the coordinates of predicted OCRs in a .bed format file, relying on sequence features. To our knowledge, this is the first tool to de novo predict potential open chromatin regions from DNA sequence and assay data.

If you use this tool, please cite the following article:

CharPlant: A De Novo Open Chromatin Region Prediction Tool for Plant Genomes

Steps to install and run CharPlant

CharPlant is currently available for Linux-based platforms. It learns OCR features from DNase-seq or ATAC-seq data and de novo predicts potential open chromatin regions for a plant genome. To train the parameters of the deep learning model and predict OCRs, CharPlant performs the following three steps. Steps 1 and 2 install the needed packages and set up the running environment of the software; they are optional for users who already have these packages installed and working. Step 3 downloads and runs CharPlant. We provide instructions in as much detail as possible, although the tool can be run with a few simple command lines. In the following, the prompt “$” starts a Linux shell command, while “#” starts a comment.

Step 1. Install python packages

CharPlant is developed in Python, and some fundamental packages for scientific computing and network construction are indispensable. To install and manage them efficiently, we strongly recommend using the Conda package manager. Conda is an open-source package and environment management system for installing multiple versions of packages and their dependencies and switching easily between them.

(i) Install Conda

$ wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
$ bash Anaconda3-4.2.0-Linux-x86_64.sh
#Add environment variable
$ echo "export PATH=\"${PWD}/anaconda3/bin:\$PATH\" " >>~/.bashrc
$ source ~/.bashrc

(ii) Install needed python packages

The following Python packages are required (a quick version check is sketched after the list):

  • numpy(1.16.1)

  • matplotlib(3.3.2)

  • pyfiglet(0.8.post1)

  • sklearn(0.22)

  • keras(2.0.5)

  • h5py(2.7.1)

  • tensorflow-gpu(1.3.0)
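
As a quick sanity check after installation (a minimal sketch, not part of CharPlant), the installed versions can be printed from Python and compared with the list above:

import numpy, matplotlib, pyfiglet, sklearn, keras, h5py, tensorflow

#Each package should report a version close to the one listed above
for module in (numpy, matplotlib, pyfiglet, sklearn, keras, h5py, tensorflow):
    print(module.__name__, getattr(module, "__version__", "unknown"))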

(iii) Create environments

Because the de novo prediction step runs on the CPU, we need to create a CPU environment:

$ conda create --name charplant-cpu python=3.6
$ source activate charplant-cpu
$ pip install sklearn
$ pip install matplotlib
$ pip install pyfiglet
$ conda install keras
$ source deactivate charplant-cpu

Step 2. Install the tools Bowtie2, MACS2 and Snakemake

(i) Bowtie2

We use Bowtie2 (version 2.2.6) for sequence alignment; you can find all versions HERE. The detailed manual is provided at this LINK.

$ wget https://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.2.6/bowtie2-2.2.6-linux-x86_64.zip
$ unzip bowtie2-2.2.6-linux-x86_64.zip
#Add environment variable
$ echo "export PATH=\"${PWD}/bowtie2-2.2.6:\$PATH\" " >>~/.bashrc
$ source ~/.bashrc

(ii) MACS2

We use MACS2 (version 2.1.1) for peak calling; you can find all versions and the user manual HERE.

$ wget https://files.pythonhosted.org/packages/0f/e9/60761f3df0634a50a96531fdb00f45dd3a6f3aa2509fb722eb2a665faf35/MACS2-2.1.1.20160226.tar.gz
$ tar -zxvf MACS2-2.1.1.20160226.tar.gz
#Set up a separate Python 2.7 environment for MACS2
$ conda create --name macs2 python=2.7
$ source activate macs2
$ cd MACS2-2.1.1.20160226
$ pip install numpy
$ python setup.py install 
#deactivate the environment 
$ source deactivate macs2

(iii) Snakemake

We adopt Snakemake, an easy-to-use workflow management system, to build our analysis pipeline, which combines a series of steps into a single workflow. Snakemake and its manual are available at this LINK.

$ pip install snakemake==5.5.2

Step 3. Download and run CharPlant

(i) Download and install CharPlant

Installing and running CharPlant is easy. You can download the CharPlant package in either of the following two ways. Then uncompress the package and make “CharPlant” the current directory.

  • Simply click HERE to download the CharPlant package.

  • You can also download the CharPlant package using git:

$ git clone https://github.com/Yin-Shen/CharPlant.git
$ echo "export PATH=\"${PWD}/CharPlant:\$PATH\" " >>~/.bashrc
$ source ~/.bashrc
$ cd CharPlant
$ chmod -R 744 CharPlant.sh

(ii) The directory structure of CharPlant

The directory structure is as follows, with two directories and three files. Directory “CharPlant/example” contains the reference genome and DNase-seq data of rice used as an example (files oryza_sativa.fa and oryza_sativa.bed, respectively). The predicted OCRs are also saved there. All Python and shell scripts are in directory “CharPlant/src”, but users generally do not need to modify them.

├──CharPlant
│ ├── example
│ │ ├──oryza_sativa.bed
│ │ ├──oryza_sativa.fa
│ ├── src
│ │ ├──data_preprocess
│ │ ├──de_novo_prediction
│ │ ├──get_positive_sample
│ │ ├──model
│ │ ├──motif
… … … …
│ │ ├──submit_lsf
│ ├── config.yaml
│ ├── Snakefile
│ ├── CharPlant.sh

For example, the following command prints the help information:

$ CharPlant.sh -h

Or print the help information of a Python script:

$ python model.py -h

(iii) Set the parameters of CharPlant

The configuration file “config.yaml” contains all parameters of the tool. To run CharPlant, you only need to revise the following three parameters according to your paths and file names, leaving the others as they are (a small check script is sketched after these lines). A complete “config.yaml” file is shown in a later section.

genome : Yourpath/CharPlant/example/oryza_sativa.fa #Genome file in .fasta format for input(Need full path)
bed: Yourpath/CharPlant/example/oryza_sativa.bed #Open chromatin regions file in .bed format for input(Need full path)
out: ory #Prefix of the output file
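
Before launching the pipeline, the three entries can be verified from Python (a minimal sketch; it assumes PyYAML is available, which is installed as a dependency of Snakemake):

import os
import yaml  #PyYAML, pulled in together with Snakemake

with open("config.yaml") as handle:
    config = yaml.safe_load(handle)

#The genome and bed entries must be full paths to existing files
for key in ("genome", "bed"):
    assert os.path.isfile(config[key]), "missing input file: %s" % config[key]
print("output prefix:", config["out"])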

(iv) Run CharPlant

The Snakefile defines the rules that perform each operation. We have created a rule for each target and intermediate file, so users do not need to rewrite it. A complete “Snakefile” is shown in a subsequent section.

$ CharPlant.sh

CharPlant performs the following four steps in turn and outputs the predicted OCRs to a .bed format file.

  • Data preprocessing

  • Model training

  • Motif visualization

  • De novo prediction

(v) Output files

If all steps are completed, the results are written to the following five directories (a small inspection example follows the list).

/data_preprocessing---Data preprocessing results for model training and motif visualization.
/model---.json file of model architecture, .h5 file of model parameters and .png file of the result figure.
/motif--- Positional weight matrix of motifs.
/de_novo_prediction---Results of de novo prediction.
/peak---Peaks of predicted OCRs.
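
As a quick way to inspect the final result (a minimal sketch; the file name below is hypothetical and assumes MACS2 wrote its standard output under /peak with the prefix set in config.yaml, e.g. ory for the rice example):

#Count predicted OCR peaks and their total length from the BED-like MACS2 output
peaks_file = "peak/ory_peaks.narrowPeak"  #hypothetical file name for the rice example
n_peaks = 0
total_bp = 0
with open(peaks_file) as handle:
    for line in handle:
        fields = line.split()
        n_peaks += 1
        total_bp += int(fields[2]) - int(fields[1])
print("predicted OCRs:", n_peaks, "covering", total_bp, "bp")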

Contact us

Yin Shen : [email protected]
Ling-Ling Chen : [email protected]
Junxiang Gao : [email protected]

Code Snippets

Snakefile rules (shell commands):
shell:
   "mkdir $PWD/data_preprocessing && cd $PWD/data_preprocessing && python ../src/data_preprocess.py -g {input.genome} -b {input.bed} -o {params.out} 2> /dev/null"    
shell:
   "mkdir $PWD/model && cd $PWD/model && python ../src/model.py -e {params.epochs} -p {params.patience} -lr {params.learningrate} -b {params.batch_size} -d {params.dropout} -n1 {params.nb_filter1} -n2 {params.nb_filter2} -fl1 {params.filter_len1} -fl2 {params.filter_len2} -hd {params.hidden} 2> /dev/null"  
shell:
      "mkdir $PWD/motif && cd $PWD/motif && python ../src/motif.py -n1 {params.nb_filter1} -fl1 {params.filter_len1} -o {params.motif_out} 2> /dev/null"
shell:
    "mkdir $PWD/de_novo_prediction && cd $PWD/de_novo_prediction && python ../src/de_novo_prediction.py -g {input.genome} -l {params.split_lines} -t {params.threshold} -o {params.prediction_out}"
shell:
      "cd src/ && bash batch_submit.sh {params.run_type} {params.prediction_out} {params.batch_submit_jobs_number}"
shell:
     "cd src/ && bash cat_predict.sh"
shell: 
    "python src/get_positive_sample.py -i {params.prediction_out} -n {params.speices_name}"
shell:
    " mkdir $PWD/peak && cd $PWD/peak && bowtie2-build {input.genome}  {params.index_name}"
shell:
     "cd $PWD/peak && bowtie2 -x {params.index_name}  -f ../de_novo_prediction/whole_pre.fa -S {params.sam_prefix}.sam "
shell:
      "cd $PWD/peak && source activate macs2 && macs2 callpeak -t {params.sam_prefix}.sam  -f SAM --shift -125 --extsize 250 --nomodel -B --SPMR -g hs -n {params.peak_prefix} "
src/data_preprocess.py:
import numpy as np
import re
import random
import pyfiglet
import argparse
import os
import sklearn
from sklearn.model_selection import train_test_split

#Making large letters out of ordinary text with pyfiglet
ascii_banner = pyfiglet.figlet_format("charplant")
print(ascii_banner)

#Argument parsing
parser = argparse.ArgumentParser(description='''Data Preprocess: 
                                                takes 3 required arguments''')
parser.add_argument("--genome", "-g", required=True,
                    help="genome file in fasta format")
parser.add_argument("--bed", "-b", required=True,
                    help="bed file of Chromatin Accessible Regions for input")
parser.add_argument("--out", "-o", required=True,
                    help="prefix of the output file")
args = parser.parse_args()

#Construct positive sample sequences from the genome and the OCR bed file
genome_fasta={}
for line in open(args.genome,'r'):
    if line[0]=='>':
        seq_name=line.lstrip('>').strip()
        genome_fasta[seq_name]=[]
    else:
        genome_fasta[seq_name].append(line.strip())
for keys,val in genome_fasta.items():
    genome_fasta[keys]=''.join(val)
nuc=open(args.bed,'r')
out=open(args.out+".fa",'w')
col_num=0
bed=nuc.readlines()
col=len(bed)
for lline in bed:
        col_num+=1
        chr_name=lline.rstrip().split()[0]
        start_end=lline.rstrip().split()[1:]
        leng=int(start_end[1])-int(start_end[0])
        out.write('>'+chr_name+':'+start_end[0]+'-'+start_end[1]+'\n'+genome_fasta[chr_name][int(start_end[0]):int(start_end[1])]+'\n')
out.close()

#Negative sequences are generated by randomly shuffling the positive sequences with random.shuffle
neg_fasta={}
out_neg=open(args.out+"_negtive.fa",'w')
for line in open(args.out+".fa",'r'):
    if line[0]=='>':
        seq_name=line.lstrip('>').strip()
        neg_fasta[seq_name]=[]
    else:
        neg_fasta[seq_name].append(line.strip())
for keys,val in neg_fasta.items():
    neg_fasta[keys]=''.join(val)
    fa=list(neg_fasta[keys])
    random.shuffle(fa)
    out_neg.write(">"+keys+"\n"+"".join(fa)+"\n")
out_neg.close()

#Construct the positive sample dataset
bed_dict_pos={}
pos_o=open(args.out+"pos.txt",'w')
for line in open(args.out+".fa",'r'):
    if line[0]==">":
       seq=line.strip()
       bed_dict_pos[seq]=[]
    else:
       bed_dict_pos[seq].append(line.strip())
for keys,val in bed_dict_pos.items():
    bed_dict_pos[keys]=''.join(val)
    pos_o.write('1'+'\t'+bed_dict_pos[keys]+'\n')
pos_o.close()

#Construct the negative sample dataset
bed_dict_neg={}
neg_o=open(args.out+"neg.txt",'w')
for line in open(args.out+"_negtive.fa",'r'):
    if line[0]==">":
       seq=line.strip()
       bed_dict_neg[seq]=[]
    else:
       bed_dict_neg[seq].append(line.strip())
for keys,val in bed_dict_neg.items():
    bed_dict_neg[keys]=''.join(val)
    neg_o.write('0'+'\t'+bed_dict_neg[keys]+'\n')
neg_o.close()


#Use one-hot encoding to convert each sequence into a two-dimensional matrix with 4 columns.
#Each base is vectorized: A|a, C|c, G|g, T|t, N|n become [1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1], [0,0,0,0]
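#Example: the 5-bp sequence "ACGTN" becomes the 5 x 4 matrix
#[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1], [0,0,0,0],
#and every sample is padded (or truncated) with zero rows to a fixed 1000 x 4 shape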

def load_data(filename):
    f=open(filename,'r')
    sequences=f.readlines()
    num=len(sequences)
    data=np.empty((num,1000,4),dtype='float32')
    label=np.empty((num,),dtype="int")
    for i in range(num):
        line=sequences[i].replace('\n','')
        list_line=re.split('\s+',line)
        one_sequence=list_line[1]
        for j in range(1000):
            if j<=len(one_sequence)-1:
                if re.findall(one_sequence[j],'A|a'):
                    data[i,j,:]=np.array([1.0,0.0,0.0,0.0],dtype='float32')
                if re.findall(one_sequence[j],'C|c'):
                    data[i,j,:]=np.array([0.0,1.0,0.0,0.0],dtype='float32')
                if re.findall(one_sequence[j],'G|g'):
                    data[i,j,:]=np.array([0.0,0.0,1.0,0.0],dtype='float32')
                if re.findall(one_sequence[j],'T|t'):
                    data[i,j,:]=np.array([0.0,0.0,0.0,1.0],dtype='float32')
                if re.findall(one_sequence[j],'N|n'):
                    data[i,j,:]=np.array([0.0,0.0,0.0,0.0],dtype='float32')
            else:
                data[i,j,:]=np.array([0.0,0.0,0.0,0.0],dtype='float32')
        label[i]=list_line[0]
    return data,label

pos_sample=args.out+"pos.txt"
neg_sample=args.out+"neg.txt"

data_pos,label_pos = load_data(pos_sample)
data_neg,label_neg = load_data(neg_sample)

#Divide the positive dataset into training, validation, and test sets (60%, 20%, 20%)

data_pos_train, data_pos_test, label_pos_train, label_pos_test = train_test_split(data_pos,label_pos, test_size=0.4, random_state=1)

data_pos_test, data_pos_val, label_pos_test, label_pos_val = train_test_split(data_pos_test, label_pos_test, test_size=0.5, random_state=1)


#Divide the negative dataset into training, validation, and test sets (60%, 20%, 20%)

data_neg_train, data_neg_test, label_neg_train, label_neg_test = train_test_split(data_neg,label_neg, test_size=0.4, random_state=1)

data_neg_test, data_neg_val, label_neg_test, label_neg_val = train_test_split(data_neg_test, label_neg_test, test_size=0.5, random_state=1)

#Combined training sets
data_train = np.concatenate((data_pos_train,data_neg_train),axis=0)
label_train = np.concatenate((label_pos_train,label_neg_train),axis=0)
#Combined validate sets
data_val = np.concatenate((data_pos_val,data_neg_val),axis=0)
label_val = np.concatenate((label_pos_val,label_neg_val),axis=0)
#Combined test sets
data_test = np.concatenate((data_pos_test,data_neg_test),axis=0)
label_test = np.concatenate((label_pos_test,label_neg_test),axis=0)


#save
np.save('data_train.npy',data_train)
np.save('label_train.npy',label_train)
np.save('data_val.npy',data_val)
np.save('label_val.npy',label_val)
np.save('data_test.npy',data_test)
np.save('label_test.npy',label_test)


#Recover the training set sequences from the one-hot matrices
sequence=[[] for i in range(data_train.shape[0])]
for i in range(data_train.shape[0]):
    for j in range(data_train.shape[1]):
        if (data_train[i,j,:] == np.array([1.0,0.0,0.0,0.0],dtype='float32')).all():
                 sequence[i]+='A'
        if (data_train[i,j,:] == np.array([0.0,1.0,0.0,0.0],dtype='float32')).all():
                 sequence[i]+='C'
        if (data_train[i,j,:] == np.array([0.0,0.0,1.0,0.0],dtype='float32')).all(): 
                 sequence[i]+='G'
        if (data_train[i,j,:] == np.array([0.0,0.0,0.0,1.0],dtype='float32')).all():
                 sequence[i]+='T'
        if (data_train[i,j,:] == np.array([0.0,0.0,0.0,0.0],dtype='float32')).all():
                 sequence[i]+='N'

o=open(args.out+"train.txt",'w')
for i in range(data_train.shape[0]):    
   o.write(str(label_train[i])+'\t'+''.join(sequence[i])+'\n')
o.close()
src/de_novo_prediction.py:
import pyfiglet
import argparse
import re
import os

#Making large letters out of ordinary text with pyfiglet
ascii_banner = pyfiglet.figlet_format("charplant")
print(ascii_banner)

#Argument parsing
parser = argparse.ArgumentParser(description='''De novo prediction''')
parser.add_argument("--genome", "-g", required=True,
                    help="genome file in fasta format")
parser.add_argument("--lines", "-l",default=20000, type=int, required=False,
                    help="The number of lines per split file when the sliding-window genome file is split into smaller files (default is 20,000)")
parser.add_argument("--threshold", "-t",default=0.5, type=float, required=False,
                    help="the threshold adopted to assign positive predictions (default is 0.5)")
parser.add_argument("--out", "-o", required=True,    
                    help="prefix of the output file")  
args = parser.parse_args()



print("\033[1;30;34m%s\033[0m" %"Sliding window cutting genome...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
fasta={}
for line in open(args.genome,'r'):
    if line[0]=='>':
        seq_name=line.lstrip('>').strip()
        fasta[seq_name]=[]
    else:
        fasta[seq_name].append(line.strip())
for keys,val in fasta.items():
    fasta[keys]=''.join(val)

o=open(args.out,'w')               
len_chrom={}
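#Slide a 36-bp window along each chromosome with a step of 1 bp and write one
#"chrom  start  end  sequence" line per window; the final, shorter window at the
#chromosome end is written once before moving to the next chromosome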
for key,val in fasta.items():
        len_chrom[key]=len(fasta[key])
        for i in range(0,len_chrom[key],1):
          if i+36<=int(len_chrom[key]):
             o.write(key+'\t'+str(i)+'\t'+str(i+36)+'\t'+fasta[key][i:i+36]+'\n')
          else:
             o.write(key+'\t'+str(i)+'\t'+str(len_chrom[key])+'\t'+fasta[key][i:len_chrom[key]+1]+'\n')
             break
o.close()
print("\033[1;30;34m%s\033[0m" %"well done...")

print("\033[1;30;34m%s\033[0m" %"Split genome file according to multi-line...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
splitLen = int(args.lines) 
output_prefix = args.out        # example: split_fasta_36_1_

input = open(args.out, 'r').read().split('\n')

counts = 1
for lines in range(0, len(input), splitLen):
    output_data = input[lines:lines+splitLen]
    output = open(output_prefix + str(counts), 'w')
    output.write('\n'.join(output_data))
    counts += 1
    output.close()
print("\033[1;30;34m%s\033[0m" %"well done...")

print("\033[1;30;34m%s\033[0m" %"batch write into python files...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
get_model_files_path = os.chdir("../model")
model_files_path = os.getcwd()
model_files = os.listdir(model_files_path)
for file in model_files:
     if file[-4: ] == 'json':
          model_architecture_path = model_files_path+"/"+file
     elif file[-2: ] =='h5':
          model_weights_path = model_files_path+"/"+file

get_file_path = os.chdir("../de_novo_prediction")
file_path = os.getcwd()
files = os.listdir(file_path)
file_list=[]
for file in files:
    if file[-2: ] == 'py':
        continue 
    elif re.match(args.out+'\d+',file):
         file_list.append(file)

def tryint(s):            
    try:
        return int(s)
    except ValueError:
        return s
def str2int(v_str):    
    return [tryint(sub_str) for sub_str in re.split('([0-9]+)', v_str)]
def sort_humanly(v_list):
    return sorted(v_list, key=str2int)

sort_file_list = sort_humanly(file_list)


python_file_1='''

def loading_data1(filename):
    f=open(filename,'r')
    sequences=f.readlines()
    num=len(sequences)
    data=np.empty((num,1000,4),dtype='float32')
    for i in range(num):
        line=sequences[i].replace("\\n",'')
        list_line=re.split('\s+',line)
        one_sequence=list_line[3]
        for j in range(1000):
            if j<=len(one_sequence)-1:
                if re.match(one_sequence[j],'A|a'):
                    data[i,j,:]=np.array([1.0,0.0,0.0,0.0],dtype='float32')
                if re.match(one_sequence[j],'C|c'):
                    data[i,j,:]=np.array([0.0,1.0,0.0,0.0],dtype='float32')
                if re.match(one_sequence[j],'G|g'):
                    data[i,j,:]=np.array([0.0,0.0,1.0,0.0],dtype='float32')
                if re.match(one_sequence[j],'T|t'):
                    data[i,j,:]=np.array([0.0,0.0,0.0,1.0],dtype='float32')
                if re.match(one_sequence[j],'N|n'):
                    data[i,j,:]=np.array([0.0,0.0,0.0,0.0],dtype='float32')
            else:
                data[i,j,:]=np.array([0.0,0.0,0.0,0.0],dtype='float32')
    return data
'''

python_file_2 = '''
result_new=np.empty((results.shape[0],1),dtype='int32')
for i in range(result_new.shape[0]):
         if results[i] >= %(threshold)s:
            result_new[i] = np.array([1], dtype='int32')
         else:
            result_new[i] = np.array([0], dtype = 'int32')
for i in result_new.tolist():
   o.write(str(i)+"\\n")    
o.close()
'''
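#For every split chunk of the sliding-window file, write a small stand-alone
#prediction script: it loads the trained model (architecture .json plus weights .h5),
#one-hot encodes the chunk, predicts a label for each 36-bp window, and writes the
#0/1 calls to whole_predict_fasta<chunk>.txt. The generated scripts are run in
#batches by batch_submit.sh (see the Snakefile rules above).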
for file in sort_file_list:
    file_read = open(file_path+"/"+file+'.py', 'w')
    file_read.write('import numpy as np'+'\n'+'import re'+'\n'+'from keras.models import model_from_json'+'\n'+"model = model_from_json(open('"+model_architecture_path+"').read())"+'\n'+"model.load_weights('"+model_weights_path+"')"+'\n'+python_file_1+'\n'+"data=loading_data1('"+file+"')"+'\n'+'results=model.predict_classes(data)'+'\n'+"o=open('whole_predict_fasta"+file+".txt','w')"+'\n'+python_file_2%dict(threshold=args.threshold)+'\n')    
    file_read.close()
print("\033[1;30;34m%s\033[0m" %"well done...")
src/get_positive_sample.py:
import pyfiglet
import argparse
import os
#Making large letters out of ordinary text with pyfiglet
ascii_banner = pyfiglet.figlet_format("charplant")
print(ascii_banner)
#Argument parsing
parser = argparse.ArgumentParser(description="motif Visualization ")
parser.add_argument("--input", "-i", required=True,
                    help="The prediction output file generated by de_novo_prediction (example: split_fasta_36_1)")
parser.add_argument("--name", "-n", required=True,
                    help="The name of the species")
args = parser.parse_args()


get_input_file_path = os.chdir("de_novo_prediction/")
input_file_path = os.getcwd()
input_file = os.listdir(input_file_path)

for file in input_file:
    if file[-17: ] == args.input:
         input_files = input_file_path+"/"+file
#Get positive sample
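#Keep only the windows whose predicted label is 1: they are written to whole_pre.txt
#and then wrapped into whole_pre.fa, which bowtie2 aligns back to the genome in the
#peak-calling step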
genome=[]
genome_line=0
for l in open(input_files,'r'):
    ll=l.strip()
    genome.append(ll)
    genome_line +=1 

predict=[]
for line in open('whole_predict','r'):
    lines=line.strip()
    liness=lines.lstrip('[').rstrip(']')
    predict.append(liness)


genome_pre={}
lines=0
while lines < genome_line:
    k_name=genome[lines]
    genome_pre[k_name]=predict[lines]
    lines+=1

o=open('whole_pre.txt','w')
for k,v in genome_pre.items():
    if genome_pre[k] == '1':
        kk=k.strip().split('\t')
        o.write(kk[3]+'\n')
o.close()

#construct fasta file
i=0
o=open('whole_pre.fa','w')
for line in open('whole_pre.txt','r'):
   lines=line.strip()
   o.write('>'+args.name+'-'+'predict_region'+str(i)+'\n'+lines+'\n')
   i+=1
o.close()
print("\033[1;30;34m%s\033[0m" %"Done...")
src/model.py:
import sys
import argparse
import time
import numpy as np
import re
import pyfiglet
from keras.models import Sequential,model_from_json
from keras.callbacks import ModelCheckpoint,EarlyStopping 
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.pooling import GlobalMaxPooling1D
from keras.layers.convolutional import Convolution1D, MaxPooling1D
from keras import regularizers
from keras.optimizers import SGD
from sklearn.metrics import roc_curve, auc
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

#specify which GPU(s) to be used
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

np.random.seed(666)

#Making large letters out of ordinary text with pyfiglet
ascii_banner = pyfiglet.figlet_format("charplant")
print(ascii_banner)

#Argument parsing
parser = argparse.ArgumentParser(description="Training Charplant cnn model")
parser.add_argument("--epochs", "-e", default=150, type=int, required=False,
                    help="Number of epochs.(default is 150)")
parser.add_argument("--patience", "-p", default=20, type=int, required=False,
                    help='Number of epochs  for early stopping.(default is 20)')
parser.add_argument("--learningrate", "-lr", default=0.001, type=float, required=False,
                   help='Learning rate.(default is 0.001)')
parser.add_argument("--batch_size","-b",  default=128, type=int, required=False,	
                    help="Batch Size.(default is 128)")
parser.add_argument("--dropout","-d",  default=0.6, type=float,required=False,
                    help="Dropout rate.(default is 0.6)")
parser.add_argument("--nb_filter1","-n1",  default=200, type=int, required=False,
                    help="Number of filters in first layer of convolution.(default is 200)")
parser.add_argument("--nb_filter2","-n2",  default=100, type=int, required=False,
                    help="Number of filters in second layer of convolution.(default is 100)")
parser.add_argument("--filter_len1","-fl1",  default=19, type=int, required=False,
                    help="length of filters in first layer of convolution.(default is 19)")
parser.add_argument("--filter_len2","-fl2",  default=11, type=int, required=False,
                    help="length of filters in second layer of convolution.(default is 11)")
parser.add_argument("--hidden","-hd", default=200, type=int, required=False,
                    help="units in the fully connected layer.(default is 200)")
args = parser.parse_args()



print("\033[1;30;34m%s\033[0m" %"loading data...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
sys.stdout.flush()

data_train=np.load('../data_preprocessing/data_train.npy')
label_train=np.load('../data_preprocessing/label_train.npy')
data_val=np.load('../data_preprocessing/data_val.npy')
label_val=np.load('../data_preprocessing/label_val.npy')
data_test=np.load('../data_preprocessing/data_test.npy')
label_test=np.load('../data_preprocessing/label_test.npy')
print("\033[1;30;34m%s\033[0m" %"well done...")
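
#Network architecture: two 1D convolution layers (each followed by ReLU and dropout),
#a flatten layer, one fully connected ReLU layer, and a single sigmoid output unit;
#the input is a 1000 x 4 one-hot encoded sequence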

model=Sequential()
model.add(Convolution1D(int(args.nb_filter1),
                        int(args.filter_len1),
                        border_mode='same',
                        input_shape=(1000,4)))
model.add(Activation('relu'))
model.add(Dropout(float(args.dropout)))

model.add(Convolution1D(int(args.nb_filter2),
                        int(args.filter_len2),
                        border_mode='same'))
model.add(Activation('relu'))
model.add(Dropout(float(args.dropout)))


model.add(Flatten())
model.add(Dense(int(args.hidden),init='normal',activation='relu'))
model.add(Dropout(float(args.dropout)))

model.add(Dense(1,init='normal',activation='linear'))
model.add(Activation('sigmoid'))

sgd=SGD(lr=float(args.learningrate),decay=1e-5,momentum=0.9,nesterov=True)

print("\033[1;30;34m%s\033[0m" %"model compiling...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
sys.stdout.flush()
model.compile(loss='binary_crossentropy',optimizer=sgd,metrics=['accuracy'])  
early_stopping =EarlyStopping(monitor='val_loss', patience=args.patience,verbose=1)
checkpointer = ModelCheckpoint(filepath='model_weights'+str(args.nb_filter1)+'-'+str(args.nb_filter2)+'-'+str(args.filter_len1)+'-'+str(args.filter_len2)+'-'+str(args.hidden)+'-'+str(args.dropout)+'-'+str(args.learningrate)+'-'+str(args.batch_size)+'-'+str(args.epochs)+'-'+str(args.patience)+'.h5', verbose=1, save_best_only=True)
print("\033[1;30;34m%s\033[0m" %"well done...")

print("\033[1;30;34m%s\033[0m" %"training...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
time_start = time.time()
result=model.fit(data_train,label_train,batch_size=args.batch_size,nb_epoch=args.epochs,shuffle=True,validation_data=(data_val,label_val),callbacks=[checkpointer, early_stopping])
time_end = time.time()
print("\033[1;30;34m%s\033[0m" %"well done...")

json_string=model.to_json()
open('model_architecture'+str(args.nb_filter1)+'-'+str(args.nb_filter2)+'-'+str(args.filter_len1)+'-'+str(args.filter_len2)+'-'+str(args.hidden)+'-'+str(args.dropout)+'-'+str(args.learningrate)+'-'+str(args.batch_size)+'-'+str(args.epochs)+'-'+str(args.patience)+'.json','w').write(json_string)

model.load_weights('model_weights'+str(args.nb_filter1)+'-'+str(args.nb_filter2)+'-'+str(args.filter_len1)+'-'+str(args.filter_len2)+'-'+str(args.hidden)+'-'+str(args.dropout)+'-'+str(args.learningrate)+'-'+str(args.batch_size)+'-'+str(args.epochs)+'-'+str(args.patience)+'.h5')
score = model.evaluate(data_val,label_val, verbose=0)
print('accuracy_validate :',score[1])

score1 = model.evaluate(data_test,label_test, verbose=0)
print('Test loss:', score1[0])
print('Test accuracy:', score1[1])


print('training time : %d sec' % (time_end-time_start))


pred=model.predict_proba(data_test)

fpr, tpr, thresholds = roc_curve(label_test, pred)
roc_auc = auc(fpr, tpr)
print('the auc is ',roc_auc)


print("\033[1;30;34m%s\033[0m" %"plot ROC curve...")

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('pictureauc'+str(args.nb_filter1)+'-'+str(args.nb_filter2)+'-'+str(args.filter_len1)+'-'+str(args.filter_len2)+'-'+str(args.hidden)+'-'+str(args.dropout)+'-'+str(args.learningrate)+'-'+str(args.batch_size)+'-'+str(args.epochs)+'-'+str(args.patience)+'.png')
plt.close()


print("\033[1;30;34m%s\033[0m" %"plot figure...")


plt.plot(result.history['loss'])
plt.plot(result.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.savefig('pictureloss'+str(args.nb_filter1)+'-'+str(args.nb_filter2)+'-'+str(args.filter_len1)+'-'+str(args.filter_len2)+'-'+str(args.hidden)+'-'+str(args.dropout)+'-'+str(args.learningrate)+'-'+str(args.batch_size)+'-'+str(args.epochs)+'-'+str(args.patience)+'.png')
plt.close()

plt.figure()
plt.plot(result.epoch,result.history['acc'],label="acc")
plt.plot(result.epoch,result.history['val_acc'],label="val_acc")
plt.scatter(result.epoch,result.history['acc'],marker='*')
plt.scatter(result.epoch,result.history['val_acc'])
plt.legend(loc='lower right')
plt.savefig('picture'+str(args.nb_filter1)+'-'+str(args.nb_filter2)+'-'+str(args.filter_len1)+'-'+str(args.filter_len2)+'-'+str(args.hidden)+'-'+str(args.dropout)+'-'+str(args.learningrate)+'-'+str(args.batch_size)+'-'+str(args.epochs)+'-'+str(args.patience)+'.png')
plt.close()
src/motif.py:
from __future__ import print_function
import os
import numpy as np
import pyfiglet
import keras
from keras.models import model_from_json
from keras import backend as K
import argparse
#specify which GPU(s) to be used
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"


#Making large letters out of ordinary text with pyfiglet
ascii_banner = pyfiglet.figlet_format("charplant")
print(ascii_banner)

#Argument parsing
parser = argparse.ArgumentParser(description="motif Visualization ")
parser.add_argument("--nb_filter1","-n1",  default=200, type=int, required=False,
                    help="Number of filters in first layer of convolution.(default is 200)")
parser.add_argument("--filter_len1","-fl1",  default=19, type=int, required=False,
                    help="length of filters in first layer of convolution.(default is 19)")
parser.add_argument("--out", "-o", required=True,
                    help="prefix of the output file")
args = parser.parse_args()

get_input_file_path = os.chdir("../data_preprocessing")
input_file_path = os.getcwd()
input_file = os.listdir(input_file_path)
for file in input_file:
    if file[-9: ] == 'train.txt':
         motif_input_file = input_file_path+"/"+file

get_model_files_path = os.chdir("../model")
model_files_path = os.getcwd()
model_files = os.listdir(model_files_path)
for file in model_files:
     if file[-4: ] == 'json':
          model_architecture_path = model_files_path+"/"+file
     elif file[-2: ] =='h5':
          model_weights_path = model_files_path+"/"+file


get_current_path = os.chdir("../motif")
os.makedirs(args.out+"_motif")
filter_size=int(args.filter_len1)
model = model_from_json(open(model_architecture_path).read())
model.load_weights(model_weights_path)
f = K.function([model.layers[0].input,K.learning_phase()], [model.layers[1].output])
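#f maps a batch of one-hot sequences to the activations of the first convolutional
#layer (after ReLU), an array of shape (batch, 1000, nb_filter1) that is scanned
#below to find the k-mers each filter responds to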


data=np.load("../data_preprocessing/data_train.npy")


def print_pwm(f, filter_idx, filter_pwm, nsites):
    if nsites < 10:
        return

    print('MOTIF filter%d' % filter_idx, end = "\n", file = f)
    print('letter-probability matrix: alength= 4 w= %d nsites= %d' % (filter_pwm.shape[0], nsites), end = "\n", file = f) 

    for i in range(0, filter_pwm.shape[0]):
        print('%.4f %.4f %.4f %.4f' % tuple(filter_pwm[i]), end = "\n", file = f)
    print('', end = '\n', file = f)
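
#The helpers below extract motifs from the first-layer filters: logo_kmers writes the
#k-mers whose activation exceeds a threshold set 70% of the way from the mean to the
#maximum activation, make_filter_pwm counts them into a position weight matrix with
#pseudocounts, and meme_header writes the MEME-format header with background letter
#frequencies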



def logo_kmers(filter_outs,filter_size, seqs, filename, maxpct_t = 0.7):
    all_outs = np.ravel(filter_outs)
    all_outs_mean = all_outs.mean()
    all_outs_norm = all_outs - all_outs_mean
    raw_t = maxpct_t * all_outs_norm.max() + all_outs_mean

    with open(filename, 'w') as f:
        for i in range(filter_outs.shape[0]):
            for j in range(filter_outs.shape[1]):
                if filter_outs[i,j] > raw_t:
                    kmer = seqs[i][j-9:j+10]
                    if len(kmer) <filter_size:
                        continue
                    print('>%d_%d' % (i,j), end = '\n', file = f)
                    print(kmer, end = '\n', file = f)


def make_filter_pwm(filter_fasta):
    nts = {'A':0, 'C':1, 'G':2, 'T':3}
    pwm_counts = []
    nsites = 4
    for line in open(filter_fasta):
        if line[0] == '>':
            continue

        seq = line.rstrip()
        nsites += 1
        if len(pwm_counts) == 0:
            for i in range(len(seq)):
                pwm_counts.append(np.array([1.0]*4))

        for i in range(len(seq)):
            try:
                pwm_counts[i][nts[seq[i]]] += 1
            except KeyError:
                pwm_counts[i] += np.array([0.25]*4)

    pwm_freqs = []
    for i in range(len(pwm_counts)):
        pwm_freqs.append([pwm_counts[i][j]/float(nsites) for j in range(4)])

    return np.array(pwm_freqs), nsites - 4

def meme_header(f, seqs):
    nts = {'A':0, 'C':1, 'G':2, 'T':3}

    nt_counts = [1]*4
    for i in range(len(seqs)):
        for nt in seqs[i]:
            try:
                nt_counts[nts[nt]] += 1
            except KeyError:
                pass

    nt_sum = float(sum(nt_counts))
    nt_freqs = [nt_counts[i]/nt_sum for i in range(4)]

    print('MEME version 4', end = '\n', file = f)
    print('', end = '\n', file = f)
    print('ALPHABET= ACGT', end = '\n', file = f)
    print('', end = '\n', file = f)
    print('Background letter frequencies:', end = '\n', file = f)
    print('A %.4f C %.4f G %.4f T %.4f' % tuple(nt_freqs), end = '\n', file = f)
    print('', end = '\n', file = f)




sequence=[]
for line in open(motif_input_file,'r'):
    lines=line.strip().split('\t')
    one_sequence=lines[1]
    sequence.append(one_sequence)


y=0
final_output=np.empty([0,1000,int(args.nb_filter1)])
while y+128 <int(data.shape[0]):
   x=data[y:y+128]
   cnn_output=f([x])[0]
   y+=128
   final_output=np.concatenate([final_output,cnn_output],axis=0)
   print('%s/%s data points processed...' % (y, data.shape[0]))
x1=data[y:int(data.shape[0])]
cnn_output1=f([x1])[0]
final_output=np.concatenate([final_output,cnn_output1],axis=0)
print(final_output.shape)
cnn_layer_idx = 0
with open('%s/filter_meme.txt' % (args.out+"_motif"), 'w') as meme:
        meme_header(meme, sequence)
        for i in range(int(args.nb_filter1)):
            logo_kmers(final_output[:, :, i], filter_size,sequence, '%s/filter_%s.fa' % ((args.out+"_motif"), i))
            filter_pwm, nsites = make_filter_pwm('%s/filter_%s.fa' % ((args.out+"_motif"), i))
            print_pwm(meme, i, filter_pwm, nsites)