A de novo prediction tool of chromatin accessible regions for plant genomes
Introduction
The results of OCR (Open Chromatin Region) assay technologies such as DNase-seq and ATAC-seq represent the open state of one tissue at a given time, and do not cover all of the chromatin accessibility information of a species. Predicting the potential open regions of different tissues at different times across the whole sequence can help us understand gene transcription and regulation from a global perspective.
CharPlant (Chromatin Accessible Regions for Plant) is a de novo OCR prediction tool based on a deep learning model. It can take complete DNA sequences or scaffolds as input and outputs the predicted OCRs in a .bed format file, relying on sequence features alone. To our knowledge, this is the first tool to de novo predict potential open chromatin regions from DNA sequence and assay data.
If you use this tool, please cite the following article:
CharPlant: A De Novo Open Chromatin Region Prediction Tool for Plant Genomes
Steps to install and run CharPlant
CharPlant is currently available for Linux-based platforms. It can learn OCR features from DNase-seq or ATAC-seq data and de novo predict potential open chromatin regions for a plant genome. To train the parameters of the deep learning model and predict OCRs, CharPlant performs the following three steps. Steps 1 and 2 install the needed packages and set up the running environment of the software; they are optional for users who already have these packages working. Step 3 downloads and runs CharPlant. Although CharPlant can be run with a simple command line, we provide instructions in as much detail as possible. In the following, the prompt “$” starts a Linux shell command line, while “#” starts a comment.
Step 1. Install python packages
CharPlant is developed in Python, and some fundamental packages for scientific computing and network construction are indispensable. To install and manage them efficiently, we strongly recommend the Conda package manager. Conda is an open-source package and environment management system for installing multiple versions of packages and their dependencies and switching easily between them.
(i) Install Conda
$ wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
$ bash Anaconda3-4.2.0-Linux-x86_64.sh
#Add environment variable
$ echo "export PATH=\"${PWD}/anaconda3/bin:\$PATH\" " >>~/.bashrc
$ source ~/.bashrc
(ii) Install needed python packages
The following Python packages are required (a quick version check is sketched after this list):
- numpy (1.16.1)
- matplotlib (3.3.2)
- pyfiglet (0.8.post1)
- sklearn (0.22)
- keras (2.0.5)
- h5py (2.7.1)
- tensorflow-gpu (1.3.0)
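After installation, the versions that are actually importable can be checked with a short Python snippet. This is a minimal sketch: the version numbers above are those CharPlant was developed against, and tensorflow is omitted here because it is only installed in the GPU environment.

#Minimal sketch: print the versions of the required packages
import numpy, matplotlib, pyfiglet, sklearn, keras, h5py

for pkg in (numpy, matplotlib, pyfiglet, sklearn, keras, h5py):
    print(pkg.__name__, getattr(pkg, "__version__", "unknown"))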
(iii) Create environments
Because the de novo prediction step runs on the CPU, we need to create a CPU environment:
$ conda create --name charplant-cpu python=3.6
$ source activate charplant-cpu
$ pip install sklearn
$ pip install matplotlib
$ pip install pyfiglet
$ conda install keras
$ source deactivate charplant-cpu
Step 2. Install the tools Bowtie2, MACS2 and Snakemake
(i) Bowtie2
We use Bowtie2 (version 2.2.6) for sequence alignment; you can find all versions HERE. The detailed manual is provided at this LINK.
$ wget https://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.2.6/bowtie2-2.2.6-linux-x86_64.zip
$ unzip bowtie2-2.2.6-linux-x86_64.zip
#Add environment variable
$ echo "export PATH=\"${PWD}/bowtie2-2.2.6:\$PATH\" " >>~/.bashrc
$ source ~/.bashrc
(ii) MACS2
We use MACS2 (version 2.1.1) for peak calling; you can find all versions and the user manual HERE.
$ wget https://files.pythonhosted.org/packages/0f/e9/60761f3df0634a50a96531fdb00f45dd3a6f3aa2509fb722eb2a665faf35/MACS2-2.1.1.20160226.tar.gz
$ tar -zxvf MACS2-2.1.1.20160226.tar.gz
#setup environment variable
$ conda create --name macs2 python=2.7
$ source activate macs2
$ cd MACS2-2.1.1.20160226
$ pip install numpy
$ python setup.py install
#deactivate the environment
$ source deactivate macs2
(iii) Snakemake
We adopt the easy-to-use workflow engine Snakemake to build our analysis process, which combines a series of steps into a single pipeline. Snakemake and its manual are provided at this LINK.
$ pip install snakemake==5.5.2
Step 3. Download and run CharPlant
(i) Download and install CharPlant
Installing and running CharPlant is easy. You can download the CharPlant package in either of the following two ways, then uncompress the package and make “CharPlant” the current directory.
- Simply click to download the CharPlant package HERE.
- Or download the CharPlant package using git:
$ git clone https://github.com/Yin-Shen/CharPlant.git
$ echo "export PATH=\"${PWD}/CharPlant:\$PATH\" " >>~/.bashrc
$ source ~/.bashrc
$ cd CharPlant
$ chmod -R 744 CharPlant.sh
(ii) The directory structure of CharPlant
The directory structure is as follows, with two directories and three files. Directory “CharPlant/example” contains the rice reference genome and DNase-seq data used as an example (files oryza_sativa.fa and oryza_sativa.bed, respectively); the results of predicted OCRs are also saved there. All Python and shell scripts are in directory “CharPlant/src”, but users generally don't need to care about them.
├──CharPlant
│ ├── example
│ │ ├──oryza_sativa.bed
│ │ ├──oryza_sativa.fa
│ ├── src
│ │ ├──data_preprocess
│ │ ├──de_novo_prediction
│ │ ├──get_positive_sample
│ │ ├──model
│ │ ├──motif
… … … …
│ │ ├──submit_lsf
│ ├── config.yaml
│ ├── Snakefile
│ ├── CharPlant.sh
For example, the following command prints help information.
$ CharPlant.sh -h
Or print the help information of the Python scripts.
$ python model.py -h
(iii) Set the parameters of CharPlant
Configuration file “config.yaml” contains all parameters of the tool. To run CharPlant, you only need to revise the following three parameters according to your paths and file names, leaving the others as they are. A complete “config.yaml” file is shown in the next section.
genome: Yourpath/CharPlant/example/oryza_sativa.fa #Genome file in .fasta format for input (full path needed)
bed: Yourpath/CharPlant/example/oryza_sativa.bed #Open chromatin regions file in .bed format for input (full path needed)
out: ory #Prefix of the output file
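For example, if CharPlant were cloned under /home/user (a hypothetical path), these three lines would read:

genome: /home/user/CharPlant/example/oryza_sativa.fa
bed: /home/user/CharPlant/example/oryza_sativa.bed
out: ory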
(iv) Run CharPlant
The Snakemake file defines rules that perform the operations; we have created a rule for each target and intermediate file, so users do not need to rewrite it. A complete “Snakefile” is shown in a subsequent section, and a minimal sketch of one rule follows below.
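For orientation, this sketch shows how one rule wires the “config.yaml” values to a script. It is reconstructed from the shell command shown in the Code Snippets section; the rule name is illustrative and not necessarily the one used in CharPlant's Snakefile.

#Illustrative Snakemake rule: read config.yaml and run the preprocessing script
configfile: "config.yaml"

rule data_preprocessing:
    input:
        genome=config["genome"],
        bed=config["bed"]
    params:
        out=config["out"]
    shell:
        "mkdir $PWD/data_preprocessing && cd $PWD/data_preprocessing && "
        "python ../src/data_preprocess.py -g {input.genome} -b {input.bed} -o {params.out}"

To run the whole pipeline, simply execute: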
$ CharPlant.sh
CharPlant will perform four steps in turn and output the predicted OCRs to a .bed format file.
- Data preprocessing
- Model training
- Motif visualization
- De novo prediction
(v) Output files
If all steps are completed, the results will be output to the following five directories (a usage example for the saved model follows the list).
/data_preprocessing---Data preprocessing results for model training and motif visualization.
/model---.json file of the model architecture, .h5 file of the model parameters and .png files of the result figures.
/motif---Positional weight matrices of motifs.
/de_novo_prediction---Results of de novo prediction.
/peak---Peaks of predicted OCRs.
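As a usage example, the trained model saved under /model can be reloaded in Python for further prediction. This minimal sketch relies on the file-name patterns used by model.py below (model_architecture*.json and model_weights*.h5, with the hyperparameters embedded in the names), which is also how de_novo_prediction.py locates them; it assumes training has already completed.

#Minimal sketch: reload the trained CharPlant model from the model/ directory
import glob
from keras.models import model_from_json

arch_file = glob.glob('model/model_architecture*.json')[0]
weights_file = glob.glob('model/model_weights*.h5')[0]

model = model_from_json(open(arch_file).read())
model.load_weights(weights_file)
model.summary()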
Contact us
Yin Shen: [email protected]
Ling-Ling Chen: [email protected]
Junxiang Gao: [email protected]
Code Snippets
Shell commands of the Snakefile rules:

shell: "mkdir $PWD/data_preprocessing && cd $PWD/data_preprocessing && python ../src/data_preprocess.py -g {input.genome} -b {input.bed} -o {params.out} 2> /dev/null"

shell: "mkdir $PWD/model && cd $PWD/model && python ../src/model.py -e {params.epochs} -p {params.patience} -lr {params.learningrate} -b {params.batch_size} -d {params.dropout} -n1 {params.nb_filter1} -n2 {params.nb_filter2} -fl1 {params.filter_len1} -fl2 {params.filter_len2} -hd {params.hidden} 2> /dev/null"

shell: "mkdir $PWD/motif && cd $PWD/motif && python ../src/motif.py -n1 {params.nb_filter1} -fl1 {params.filter_len1} -o {params.motif_out} 2> /dev/null"

shell: "mkdir $PWD/de_novo_prediction && cd $PWD/de_novo_prediction && python ../src/de_novo_prediction.py -g {input.genome} -l {params.split_lines} -t {params.threshold} -o {params.prediction_out}"

shell: "cd src/ && bash batch_submit.sh {params.run_type} {params.prediction_out} {params.batch_submit_jobs_number}"

shell: "cd src/ && bash cat_predict.sh"

shell: "python src/get_positive_sample.py -i {params.prediction_out} -n {params.speices_name}"

shell: "mkdir $PWD/peak && cd $PWD/peak && bowtie2-build {input.genome} {params.index_name}"

shell: "cd $PWD/peak && bowtie2 -x {params.index_name} -f ../de_novo_prediction/whole_pre.fa -S {params.sam_prefix}.sam"

shell: "cd $PWD/peak && source activate macs2 && macs2 callpeak -t {params.sam_prefix}.sam -f SAM --shift -125 --extsize 250 --nomodel -B --SPMR -g hs -n {params.peak_prefix}"
#src/data_preprocess.py
import numpy as np
import re
import random
import pyfiglet
import argparse
import os
import sklearn
from sklearn.model_selection import train_test_split

#Making large letters out of ordinary text with pyfiglet
ascii_banner = pyfiglet.figlet_format("charplant")
print(ascii_banner)

#Argument parsing
parser = argparse.ArgumentParser(description='''Data Preprocess: takes 3 required arguments''')
parser.add_argument("--genome", "-g", required=True, help="genome file in fasta format")
parser.add_argument("--bed", "-b", required=True, help="bed file of Chromatin Accessible Regions for input")
parser.add_argument("--out", "-o", required=True, help="prefix of the output file")
args = parser.parse_args()

#Construct positive sample sequences
genome_fasta={}
for line in open(args.genome,'r'):
    if line[0]=='>':
        seq_name=line.lstrip('>').strip()
        genome_fasta[seq_name]=[]
    else:
        genome_fasta[seq_name].append(line.strip())
for keys,val in genome_fasta.items():
    genome_fasta[keys]=''.join(val)

nuc=open(args.bed,'r')
out=open(args.out+".fa",'w')
col_num=0
bed=nuc.readlines()
col=len(bed)
for lline in bed:
    col_num+=1
    chr_name=lline.rstrip().split()[0]
    start_end=lline.rstrip().split()[1:]
    leng=int(start_end[1])-int(start_end[0])
    out.write('>'+chr_name+':'+start_end[0]+'-'+start_end[1]+'\n'+genome_fasta[chr_name][int(start_end[0]):int(start_end[1])]+'\n')
out.close()

#Negative sequences consist of positive sequences shuffled with random.shuffle
neg_fasta={}
out_neg=open(args.out+"_negtive.fa",'w')
for line in open(args.out+".fa",'r'):
    if line[0]=='>':
        seq_name=line.lstrip('>').strip()
        neg_fasta[seq_name]=[]
    else:
        neg_fasta[seq_name].append(line.strip())
for keys,val in neg_fasta.items():
    neg_fasta[keys]=''.join(val)
    fa=list(neg_fasta[keys])
    random.shuffle(fa)
    out_neg.write(">"+keys+"\n"+"".join(fa)+"\n")
out_neg.close()

#Construct the positive sample dataset
bed_dict_pos={}
pos_o=open(args.out+"pos.txt",'w')
for line in open(args.out+".fa",'r'):
    if line[0]==">":
        seq=line.strip()
        bed_dict_pos[seq]=[]
    else:
        bed_dict_pos[seq].append(line.strip())
for keys,val in bed_dict_pos.items():
    bed_dict_pos[keys]=''.join(val)
    pos_o.write('1'+'\t'+bed_dict_pos[keys]+'\n')
pos_o.close()

#Construct the negative sample dataset
bed_dict_neg={}
neg_o=open(args.out+"neg.txt",'w')
for line in open(args.out+"_negtive.fa",'r'):
    if line[0]==">":
        seq=line.strip()
        bed_dict_neg[seq]=[]
    else:
        bed_dict_neg[seq].append(line.strip())
for keys,val in bed_dict_neg.items():
    bed_dict_neg[keys]=''.join(val)
    neg_o.write('0'+'\t'+bed_dict_neg[keys]+'\n')
neg_o.close()

#Use one-hot encoding to preprocess each sequence into a two-dimensional matrix with 4 columns.
#Each base is vectorized: A|a, C|c, G|g, T|t, N|n are converted to
#[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1], [0,0,0,0]
def load_data(filename):
    f=open(filename,'r')
    sequences=f.readlines()
    num=len(sequences)
    data=np.empty((num,1000,4),dtype='float32')
    label=np.empty((num,),dtype="int")
    for i in range(num):
        line=sequences[i].replace('\n','')
        list_line=re.split('\s+',line)
        one_sequence=list_line[1]
        for j in range(1000):
            if j<=len(one_sequence)-1:
                if re.findall(one_sequence[j],'A|a'):
                    data[i,j,:]=np.array([1.0,0.0,0.0,0.0],dtype='float32')
                if re.findall(one_sequence[j],'C|c'):
                    data[i,j,:]=np.array([0.0,1.0,0.0,0.0],dtype='float32')
                if re.findall(one_sequence[j],'G|g'):
                    data[i,j,:]=np.array([0.0,0.0,1.0,0.0],dtype='float32')
                if re.findall(one_sequence[j],'T|t'):
                    data[i,j,:]=np.array([0.0,0.0,0.0,1.0],dtype='float32')
                if re.findall(one_sequence[j],'N|n'):
                    data[i,j,:]=np.array([0.0,0.0,0.0,0.0],dtype='float32')
            else:
                data[i,j,:]=np.array([0.0,0.0,0.0,0.0],dtype='float32')
        label[i]=list_line[0]
    return data,label

pos_sample=args.out+"pos.txt"
neg_sample=args.out+"neg.txt"
data_pos,label_pos = load_data(pos_sample)
data_neg,label_neg = load_data(neg_sample)

#Divide the positive dataset into training, validation and test sets (60%, 20% and 20%)
data_pos_train, data_pos_test, label_pos_train, label_pos_test = train_test_split(data_pos, label_pos, test_size=0.4, random_state=1)
data_pos_test, data_pos_val, label_pos_test, label_pos_val = train_test_split(data_pos_test, label_pos_test, test_size=0.5, random_state=1)

#Divide the negative dataset into training, validation and test sets (60%, 20% and 20%)
data_neg_train, data_neg_test, label_neg_train, label_neg_test = train_test_split(data_neg, label_neg, test_size=0.4, random_state=1)
data_neg_test, data_neg_val, label_neg_test, label_neg_val = train_test_split(data_neg_test, label_neg_test, test_size=0.5, random_state=1)

#Combined training sets
data_train = np.concatenate((data_pos_train,data_neg_train),axis=0)
label_train = np.concatenate((label_pos_train,label_neg_train),axis=0)
#Combined validation sets
data_val = np.concatenate((data_pos_val,data_neg_val),axis=0)
label_val = np.concatenate((label_pos_val,label_neg_val),axis=0)
#Combined test sets
data_test = np.concatenate((data_pos_test,data_neg_test),axis=0)
label_test = np.concatenate((label_pos_test,label_neg_test),axis=0)

#Save
np.save('data_train.npy',data_train)
np.save('label_train.npy',label_train)
np.save('data_val.npy',data_val)
np.save('label_val.npy',label_val)
np.save('data_test.npy',data_test)
np.save('label_test.npy',label_test)

#Recover the training set sample sequences from the one-hot matrices
sequence=[[] for i in range(data_train.shape[0])]
for i in range(data_train.shape[0]):
    for j in range(data_train.shape[1]):
        if (data_train[i,j,:] == np.array([1.0,0.0,0.0,0.0],dtype='float32')).all():
            sequence[i]+='A'
        if (data_train[i,j,:] == np.array([0.0,1.0,0.0,0.0],dtype='float32')).all():
            sequence[i]+='C'
        if (data_train[i,j,:] == np.array([0.0,0.0,1.0,0.0],dtype='float32')).all():
            sequence[i]+='G'
        if (data_train[i,j,:] == np.array([0.0,0.0,0.0,1.0],dtype='float32')).all():
            sequence[i]+='T'
        if (data_train[i,j,:] == np.array([0.0,0.0,0.0,0.0],dtype='float32')).all():
            sequence[i]+='N'
o=open(args.out+"train.txt",'w')
for i in range(data_train.shape[0]):
    o.write(str(label_train[i])+'\t'+''.join(sequence[i])+'\n')
o.close()
#src/de_novo_prediction.py
import pyfiglet
import argparse
import re
import os

#Making large letters out of ordinary text with pyfiglet
ascii_banner = pyfiglet.figlet_format("charplant")
print(ascii_banner)

#Argument parsing
parser = argparse.ArgumentParser(description='''De novo prediction''')
parser.add_argument("--genome", "-g", required=True, help="genome file in fasta format")
parser.add_argument("--lines", "-l", default=20000, type=int, required=False, help="number of lines per smaller file when the sliding-window genome file is split (default is 20,000)")
parser.add_argument("--threshold", "-t", default=0.5, type=float, required=False, help="threshold adopted to assign positive predictions (default is 0.5)")
parser.add_argument("--out", "-o", required=True, help="prefix of the output file")
args = parser.parse_args()

print("\033[1;30;34m%s\033[0m" %"Sliding window cutting genome...")
print("\033[1;30;34m%s\033[0m" %"please wait...")

fasta={}
for line in open(args.genome,'r'):
    if line[0]=='>':
        seq_name=line.lstrip('>').strip()
        fasta[seq_name]=[]
    else:
        fasta[seq_name].append(line.strip())
for keys,val in fasta.items():
    fasta[keys]=''.join(val)

#Cut each chromosome into 36-bp windows with a step of 1 bp
o=open(args.out,'w')
len_chrom={}
for key,val in fasta.items():
    len_chrom[key]=len(fasta[key])
    for i in range(0,len_chrom[key],1):
        if i+36<=int(len_chrom[key]):
            o.write(key+'\t'+str(i)+'\t'+str(i+36)+'\t'+fasta[key][i:i+36]+'\n')
        else:
            o.write(key+'\t'+str(i)+'\t'+str(len_chrom[key])+'\t'+fasta[key][i:len_chrom[key]+1]+'\n')
            break
o.close()
print("\033[1;30;34m%s\033[0m" %"well done...")

print("\033[1;30;34m%s\033[0m" %"Split genome file according to multi-line...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
splitLen = int(args.lines)
output_prefix = args.out  # example: split_fasta_36_1_
input = open(args.out, 'r').read().split('\n')
counts = 1
for lines in range(0, len(input), splitLen):
    output_data = input[lines:lines+splitLen]
    output = open(output_prefix + str(counts), 'w')
    output.write('\n'.join(output_data))
    counts += 1
    output.close()
print("\033[1;30;34m%s\033[0m" %"well done...")

print("\033[1;30;34m%s\033[0m" %"batch write into python files...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
get_model_files_path = os.chdir("../model")
model_files_path = os.getcwd()
model_files = os.listdir(model_files_path)
for file in model_files:
    if file[-4:] == 'json':
        model_architecture_path = model_files_path+"/"+file
    elif file[-2:] == 'h5':
        model_weights_path = model_files_path+"/"+file

get_file_path = os.chdir("../de_novo_prediction")
file_path = os.getcwd()
files = os.listdir(file_path)
file_list=[]
for file in files:
    if file[-2:] == 'py':
        continue
    elif re.match(args.out+'\d+',file):
        file_list.append(file)

def tryint(s):
    try:
        return int(s)
    except ValueError:
        return s

def str2int(v_str):
    return [tryint(sub_str) for sub_str in re.split('([0-9]+)', v_str)]

def sort_humanly(v_list):
    return sorted(v_list, key=str2int)

sort_file_list = sort_humanly(file_list)

python_file_1='''
def loading_data1(filename):
    f=open(filename,'r')
    sequences=f.readlines()
    num=len(sequences)
    data=np.empty((num,1000,4),dtype='float32')
    for i in range(num):
        line=sequences[i].replace("\\n",'')
        list_line=re.split('\s+',line)
        one_sequence=list_line[3]
        for j in range(1000):
            if j<=len(one_sequence)-1:
                #re.findall is used, as in data_preprocess.py, so that lowercase bases are encoded too
                if re.findall(one_sequence[j],'A|a'):
                    data[i,j,:]=np.array([1.0,0.0,0.0,0.0],dtype='float32')
                if re.findall(one_sequence[j],'C|c'):
                    data[i,j,:]=np.array([0.0,1.0,0.0,0.0],dtype='float32')
                if re.findall(one_sequence[j],'G|g'):
                    data[i,j,:]=np.array([0.0,0.0,1.0,0.0],dtype='float32')
                if re.findall(one_sequence[j],'T|t'):
                    data[i,j,:]=np.array([0.0,0.0,0.0,1.0],dtype='float32')
                if re.findall(one_sequence[j],'N|n'):
                    data[i,j,:]=np.array([0.0,0.0,0.0,0.0],dtype='float32')
            else:
                data[i,j,:]=np.array([0.0,0.0,0.0,0.0],dtype='float32')
    return data
'''

python_file_2 = '''
result_new=np.empty((results.shape[0],1),dtype='int32')
for i in range(result_new.shape[0]):
    if results[i] >= %(threshold)s:
        result_new[i] = np.array([1], dtype='int32')
    else:
        result_new[i] = np.array([0], dtype='int32')
for i in result_new.tolist():
    o.write(str(i)+"\\n")
o.close()
'''

#Write one prediction script per split file
for file in sort_file_list:
    file_read = open(file_path+"/"+file+'.py', 'w')
    file_read.write(
        'import numpy as np'+'\n'
        +'import re'+'\n'
        +'from keras.models import model_from_json'+'\n'
        +"model = model_from_json(open('"+model_architecture_path+"').read())"+'\n'
        +"model.load_weights('"+model_weights_path+"')"+'\n'
        +python_file_1+'\n'
        +"data=loading_data1('"+file+"')"+'\n'
        +'results=model.predict_classes(data)'+'\n'
        +"o=open('whole_predict_fasta"+file+".txt','w')"+'\n'
        +python_file_2%dict(threshold=args.threshold)+'\n')
    file_read.close()
print("\033[1;30;34m%s\033[0m" %"well done...")
#src/get_positive_sample.py
import pyfiglet
import argparse
import os

#Making large letters out of ordinary text with pyfiglet
ascii_banner = pyfiglet.figlet_format("charplant")
print(ascii_banner)

#Argument parsing
parser = argparse.ArgumentParser(description="Get positive samples")
parser.add_argument("--input", "-i", required=True, help="the prediction output file generated by de_novo_prediction (example: split_fasta_36_1)")
parser.add_argument("--name", "-n", required=True, help="the name of the species")
args = parser.parse_args()

get_input_file_path = os.chdir("de_novo_prediction/")
input_file_path = os.getcwd()
input_file = os.listdir(input_file_path)
for file in input_file:
    if file[-17:] == args.input:
        input_files = input_file_path+"/"+file

#Get positive samples
genome=[]
genome_line=0
for l in open(input_files,'r'):
    ll=l.strip()
    genome.append(ll)
    genome_line += 1

predict=[]
for line in open('whole_predict','r'):
    lines=line.strip()
    liness=lines.lstrip('[').rstrip(']')
    predict.append(liness)

genome_pre={}
lines=0
while lines < genome_line:
    k_name=genome[lines]
    genome_pre[k_name]=predict[lines]
    lines+=1

o=open('whole_pre.txt','w')
for k,v in genome_pre.items():
    if genome_pre[k] == '1':
        kk=k.strip().split('\t')
        o.write(kk[3]+'\n')
o.close()

#Construct a fasta file from the positive windows
i=0
o=open('whole_pre.fa','w')
for line in open('whole_pre.txt','r'):
    lines=line.strip()
    o.write('>'+args.name+'-'+'predict_region'+str(i)+'\n'+lines+'\n')
    i+=1
o.close()
print("\033[1;30;34m%s\033[0m" %"Done...")
#src/model.py
import sys
import argparse
import time
import numpy as np
import re
import pyfiglet
from keras.models import Sequential, model_from_json
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.pooling import GlobalMaxPooling1D
from keras.layers.convolutional import Convolution1D, MaxPooling1D
from keras import regularizers
from keras.optimizers import SGD
from sklearn.metrics import roc_curve, auc
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

#specify which GPU(s) to be used
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

np.random.seed(666)

#Making large letters out of ordinary text with pyfiglet
ascii_banner = pyfiglet.figlet_format("charplant")
print(ascii_banner)

#Argument parsing
parser = argparse.ArgumentParser(description="Training Charplant cnn model")
parser.add_argument("--epochs", "-e", default=150, type=int, required=False, help="Number of epochs (default is 150)")
parser.add_argument("--patience", "-p", default=20, type=int, required=False, help="Number of epochs for early stopping (default is 20)")
parser.add_argument("--learningrate", "-lr", default=0.001, type=float, required=False, help="Learning rate (default is 0.001)")
parser.add_argument("--batch_size", "-b", default=128, type=int, required=False, help="Batch size (default is 128)")
parser.add_argument("--dropout", "-d", default=0.6, type=float, required=False, help="Dropout rate (default is 0.6)")
parser.add_argument("--nb_filter1", "-n1", default=200, type=int, required=False, help="Number of filters in the first convolution layer (default is 200)")
parser.add_argument("--nb_filter2", "-n2", default=100, type=int, required=False, help="Number of filters in the second convolution layer (default is 100)")
parser.add_argument("--filter_len1", "-fl1", default=19, type=int, required=False, help="Length of filters in the first convolution layer (default is 19)")
parser.add_argument("--filter_len2", "-fl2", default=11, type=int, required=False, help="Length of filters in the second convolution layer (default is 11)")
parser.add_argument("--hidden", "-hd", default=200, type=int, required=False, help="Units in the fully connected layer (default is 200)")
args = parser.parse_args()

print("\033[1;30;34m%s\033[0m" %"loading data...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
sys.stdout.flush()
data_train=np.load('../data_preprocessing/data_train.npy')
label_train=np.load('../data_preprocessing/label_train.npy')
data_val=np.load('../data_preprocessing/data_val.npy')
label_val=np.load('../data_preprocessing/label_val.npy')
data_test=np.load('../data_preprocessing/data_test.npy')
label_test=np.load('../data_preprocessing/label_test.npy')
print("\033[1;30;34m%s\033[0m" %"well done...")

#Two convolution layers followed by a fully connected layer and a sigmoid output
model=Sequential()
model.add(Convolution1D(int(args.nb_filter1), int(args.filter_len1), border_mode='same', input_shape=(1000,4)))
model.add(Activation('relu'))
model.add(Dropout(float(args.dropout)))
model.add(Convolution1D(int(args.nb_filter2), int(args.filter_len2), border_mode='same'))
model.add(Activation('relu'))
model.add(Dropout(float(args.dropout)))
model.add(Flatten())
model.add(Dense(int(args.hidden), init='normal', activation='relu'))
model.add(Dropout(float(args.dropout)))
model.add(Dense(1, init='normal', activation='linear'))
model.add(Activation('sigmoid'))

sgd=SGD(lr=float(args.learningrate), decay=1e-5, momentum=0.9, nesterov=True)

print("\033[1;30;34m%s\033[0m" %"model compiling...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
sys.stdout.flush()
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])

#Output file names embed the hyperparameter settings; build the suffix once
param_tag = (str(args.nb_filter1)+'-'+str(args.nb_filter2)+'-'+str(args.filter_len1)+'-'
             +str(args.filter_len2)+'-'+str(args.hidden)+'-'+str(args.dropout)+'-'
             +str(args.learningrate)+'-'+str(args.batch_size)+'-'
             +str(args.epochs)+'-'+str(args.patience))

early_stopping = EarlyStopping(monitor='val_loss', patience=args.patience, verbose=1)
checkpointer = ModelCheckpoint(filepath='model_weights'+param_tag+'.h5', verbose=1, save_best_only=True)

print("\033[1;30;34m%s\033[0m" %"well done...")
print("\033[1;30;34m%s\033[0m" %"training...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
time_start = time.time()
result=model.fit(data_train, label_train, batch_size=args.batch_size, nb_epoch=args.epochs, shuffle=True, validation_data=(data_val,label_val), callbacks=[checkpointer, early_stopping])
time_end = time.time()
print("\033[1;30;34m%s\033[0m" %"well done...")

json_string=model.to_json()
open('model_architecture'+param_tag+'.json','w').write(json_string)
model.load_weights('model_weights'+param_tag+'.h5')

score = model.evaluate(data_val, label_val, verbose=0)
print('accuracy_validate :', score[1])
score1 = model.evaluate(data_test, label_test, verbose=0)
print('Test loss:', score1[0])
print('Test accuracy:', score1[1])
print('training time : %d sec' % (time_end-time_start))

pred=model.predict_proba(data_test)
fpr, tpr, thresholds = roc_curve(label_test, pred)
roc_auc = auc(fpr, tpr)
print('the auc is ', roc_auc)

print("\033[1;30;34m%s\033[0m" %"plot ROC curve...")
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('pictureauc'+param_tag+'.png')
plt.close()

print("\033[1;30;34m%s\033[0m" %"plot figure...")
plt.plot(result.history['loss'])
plt.plot(result.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.savefig('pictureloss'+param_tag+'.png')
plt.close()

plt.figure()
plt.plot(result.epoch, result.history['acc'], label="acc")
plt.plot(result.epoch, result.history['val_acc'], label="val_acc")
plt.scatter(result.epoch, result.history['acc'], marker='*')
plt.scatter(result.epoch, result.history['val_acc'])
plt.legend(loc='lower right')
plt.savefig('picture'+param_tag+'.png')
plt.close()
#src/motif.py
from __future__ import print_function
import os
import numpy as np
import pyfiglet
import keras
from keras.models import model_from_json
from keras import backend as K
import argparse

#specify which GPU(s) to be used
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

#Making large letters out of ordinary text with pyfiglet
ascii_banner = pyfiglet.figlet_format("charplant")
print(ascii_banner)

#Argument parsing
parser = argparse.ArgumentParser(description="motif Visualization")
parser.add_argument("--nb_filter1", "-n1", default=200, type=int, required=False, help="Number of filters in the first convolution layer (default is 200)")
parser.add_argument("--filter_len1", "-fl1", default=19, type=int, required=False, help="Length of filters in the first convolution layer (default is 19)")
parser.add_argument("--out", "-o", required=True, help="prefix of the output file")
args = parser.parse_args()

get_input_file_path = os.chdir("../data_preprocessing")
input_file_path = os.getcwd()
input_file = os.listdir(input_file_path)
for file in input_file:
    if file[-9:] == 'train.txt':
        motif_input_file = input_file_path+"/"+file

get_model_files_path = os.chdir("../model")
model_files_path = os.getcwd()
model_files = os.listdir(model_files_path)
for file in model_files:
    if file[-4:] == 'json':
        model_architecture_path = model_files_path+"/"+file
    elif file[-2:] == 'h5':
        model_weights_path = model_files_path+"/"+file

get_current_path = os.chdir("../motif")
os.makedirs(args.out+"_motif")
filter_size=int(args.filter_len1)

model = model_from_json(open(model_architecture_path).read())
model.load_weights(model_weights_path)
f = K.function([model.layers[0].input, K.learning_phase()], [model.layers[1].output])
data=np.load("../data_preprocessing/data_train.npy")

def print_pwm(f, filter_idx, filter_pwm, nsites):
    if nsites < 10:
        return
    print('MOTIF filter%d' % filter_idx, end="\n", file=f)
    print('letter-probability matrix: alength= 4 w= %d nsites= %d' % (filter_pwm.shape[0], nsites), end="\n", file=f)
    for i in range(0, filter_pwm.shape[0]):
        print('%.4f %.4f %.4f %.4f' % tuple(filter_pwm[i]), end="\n", file=f)
    print('', end='\n', file=f)

def logo_kmers(filter_outs, filter_size, seqs, filename, maxpct_t=0.7):
    all_outs = np.ravel(filter_outs)
    all_outs_mean = all_outs.mean()
    all_outs_norm = all_outs - all_outs_mean
    raw_t = maxpct_t * all_outs_norm.max() + all_outs_mean
    with open(filename, 'w') as f:
        for i in range(filter_outs.shape[0]):
            for j in range(filter_outs.shape[1]):
                if filter_outs[i,j] > raw_t:
                    kmer = seqs[i][j-9:j+10]
                    if len(kmer) < filter_size:
                        continue
                    print('>%d_%d' % (i,j), end='\n', file=f)
                    print(kmer, end='\n', file=f)

def make_filter_pwm(filter_fasta):
    nts = {'A':0, 'C':1, 'G':2, 'T':3}
    pwm_counts = []
    nsites = 4
    for line in open(filter_fasta):
        if line[0] == '>':
            continue
        seq = line.rstrip()
        nsites += 1
        if len(pwm_counts) == 0:
            for i in range(len(seq)):
                pwm_counts.append(np.array([1.0]*4))
        for i in range(len(seq)):
            try:
                pwm_counts[i][nts[seq[i]]] += 1
            except KeyError:
                pwm_counts[i] += np.array([0.25]*4)
    pwm_freqs = []
    for i in range(len(pwm_counts)):
        pwm_freqs.append([pwm_counts[i][j]/float(nsites) for j in range(4)])
    return np.array(pwm_freqs), nsites - 4

def meme_header(f, seqs):
    nts = {'A':0, 'C':1, 'G':2, 'T':3}
    nt_counts = [1]*4
    for i in range(len(seqs)):
        for nt in seqs[i]:
            try:
                nt_counts[nts[nt]] += 1
            except KeyError:
                pass
    nt_sum = float(sum(nt_counts))
    nt_freqs = [nt_counts[i]/nt_sum for i in range(4)]
    print('MEME version 4', end='\n', file=f)
    print('', end='\n', file=f)
    print('ALPHABET= ACGT', end='\n', file=f)
    print('', end='\n', file=f)
    print('Background letter frequencies:', end='\n', file=f)
    print('A %.4f C %.4f G %.4f T %.4f' % tuple(nt_freqs), end='\n', file=f)
    print('', end='\n', file=f)

sequence=[]
for line in open(motif_input_file,'r'):
    lines=line.strip().split('\t')
    one_sequence=lines[1]
    sequence.append(one_sequence)

#Run the first convolution layer over the training data in batches of 128
y=0
final_output=np.empty([0,1000,int(args.nb_filter1)])
while y+128 < int(data.shape[0]):
    x=data[y:y+128]
    cnn_output=f([x])[0]
    y+=128
    final_output=np.concatenate([final_output,cnn_output],axis=0)
    print('%s/%s data points processed...' % (y, data.shape[0]))
x1=data[y:int(data.shape[0])]
cnn_output1=f([x1])[0]
final_output=np.concatenate([final_output,cnn_output1],axis=0)
print(final_output.shape)

cnn_layer_idx = 0
with open('%s/filter_meme.txt' % (args.out+"_motif"), 'w') as meme:
    meme_header(meme, sequence)
    for i in range(int(args.nb_filter1)):
        logo_kmers(final_output[:, :, i], filter_size, sequence, '%s/filter_%s.fa' % ((args.out+"_motif"), i))
        filter_pwm, nsites = make_filter_pwm('%s/filter_%s.fa' % ((args.out+"_motif"), i))
        print_pwm(meme, i, filter_pwm, nsites)