A de novo prediction tool for chromatin accessible regions in plant genomes


Introduction

The result of an OCR (open chromatin region) assay such as DNase-seq or ATAC-seq reflects the open state of one tissue at one time point and does not cover all of the chromatin accessibility information of a species. Predicting the potential open regions of different tissues at different times across the whole genome sequence can help us understand gene transcription and regulation from a global perspective. CharPlant (Chromatin Accessible Regions for Plant) is a de novo OCR prediction tool based on a deep learning model. It takes complete DNA sequences or scaffolds as input and outputs the coordinates of predicted OCRs in a .bed format file, relying on sequence features. To our knowledge, this is the first tool to de novo predict potential open chromatin regions from DNA sequence and assay data.

If you use this tool, please cite the following article:

CharPlant: A De Novo Open Chromatin Region Prediction Tool for Plant Genomes

Steps to install and run CharPlant

CharPlant is currently available for Linux-based platforms. It learns OCR features from DNase-seq or ATAC-seq data and de novo predicts potential open chromatin regions for a plant genome. To train the parameters of the deep learning model and predict OCRs, CharPlant performs the following three steps. Steps 1 and 2 install the needed packages and set up the running environment of the software; they are optional for users who already have these packages installed and working. Step 3 downloads and runs CharPlant. We provide instructions in as much detail as possible, although the tool can be run with a few simple command lines. In the following, the prompt “$” starts a Linux shell command, while “#” starts a comment.

Step 1. Install python packages

CharPlant is developed in Python, and some fundamental packages for scientific computing and network construction are indispensable. To install and manage them efficiently, we strongly recommend using the Conda package manager. Conda is an open-source package and environment management system for installing multiple versions of packages and their dependencies and switching easily between them.

(i) Install Conda

$ wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
$ bash Anaconda3-4.2.0-Linux-x86_64.sh
#Add environment variable
$ echo "export PATH=\"${PWD}/anaconda3/bin:\$PATH\" " >>~/.bashrc
$ source ~/.bashrc

(ii) Install needed python packages

The following Python packages are required (a quick version check is sketched after the list):

  • numpy(1.16.1)

  • matplotlib(3.3.2)

  • pyfiglet(0.8.post1)

  • sklearn(0.22)

  • keras(2.0.5)

  • h5py(2.7.1)

  • tensorflow-gpu(1.3.0)
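
As a quick sanity check after installation (a minimal sketch, not part of CharPlant), the installed versions can be printed from Python and compared with the list above:

import numpy, matplotlib, pyfiglet, sklearn, keras, h5py, tensorflow

#Each package should report a version close to the one listed above
for module in (numpy, matplotlib, pyfiglet, sklearn, keras, h5py, tensorflow):
    print(module.__name__, getattr(module, "__version__", "unknown"))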

(iii) Create environments

Because the de novo prediction step runs on the CPU, we need to create a CPU environment:

$ conda create --name charplant-cpu python=3.6
$ source activate charplant-cpu
$ pip install sklearn
$ pip install matplotlib
$ pip install pyfiglet
$ conda install keras
$ source deactivate charplant-cpu

Step 2. Install the tools Bowtie2, MACS2 and Snakemake

(i) Bowtie2

We use Bowtie2 (version 2.2.6) for sequence alignment; you can find all versions HERE. The detailed manual is provided at this LINK.

$ wget https://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.2.6/bowtie2-2.2.6-linux-x86_64.zip
$ unzip bowtie2-2.2.6-linux-x86_64.zip
#Add environment variable
$ echo "export PATH=\"${PWD}/bowtie2-2.2.6:\$PATH\" " >>~/.bashrc
$ source ~/.bashrc

(ii) MACS2

We use MACS2 (version 2.1.1) for peak calling; you can find all versions and the user manual HERE.

$ wget https://files.pythonhosted.org/packages/0f/e9/60761f3df0634a50a96531fdb00f45dd3a6f3aa2509fb722eb2a665faf35/MACS2-2.1.1.20160226.tar.gz
$ tar -zxvf MACS2-2.1.1.20160226.tar.gz
#Set up a separate Python 2.7 environment for MACS2
$ conda create --name macs2 python=2.7
$ source activate macs2
$ cd MACS2-2.1.1.20160226
$ pip install numpy
$ python setup.py install 
#deactivate the environment 
$ source deactivate macs2

(iii) Snakemake

We adopt Snakemake, an easy-to-use workflow management system, to build our analysis pipeline, which combines a series of steps into a single workflow. Snakemake and its manual are available at this LINK.

$ pip install snakemake==5.5.2

Step 3. Download and run CharPlant

(i) Download and install CharPlant

Installing and running CharPlant is easy. You can download the CharPlant package in either of the following two ways. Then uncompress the package and make “CharPlant” the current directory.

  • Simply click HERE to download the CharPlant package.

  • You can also download the CharPlant package using git:

$ git clone https://github.com/Yin-Shen/CharPlant.git
$ echo "export PATH=\"${PWD}/CharPlant:\$PATH\" " >>~/.bashrc
$ source ~/.bashrc
$ cd CharPlant
$ chmod -R 744 CharPlant.sh

(ii) The directory structure of CharPlant

The directory structure is as follows, with two directories and three files. Directory “CharPlant/example” contains the reference genome and DNase-seq data of rice used as an example (files oryza_sativa.fa and oryza_sativa.bed, respectively). The predicted OCRs are also saved there. All Python and shell scripts are in directory “CharPlant/src”, but users generally do not need to modify them.

├──CharPlant
│ ├── example
│ │ ├──oryza_sativa.bed
│ │ ├──oryza_sativa.fa
│ ├── src
│ │ ├──data_preprocess
│ │ ├──de_novo_prediction
│ │ ├──get_positive_sample
│ │ ├──model
│ │ ├──motif
… … … …
│ │ ├──submit_lsf
│ ├── config.yaml
│ ├── Snakefile
│ ├── CharPlant.sh

For example, the following command prints the help information:

$ CharPlant.sh -h

Or print the help information of a Python script:

$ python model.py -h

(iii) Set the parameters of CharPlant

The configuration file “config.yaml” contains all parameters of the tool. To run CharPlant, you only need to revise the following three parameters according to your paths and file names, leaving the others as they are (a small check script is sketched after these lines). A complete “config.yaml” file is shown in a later section.

genome : Yourpath/CharPlant/example/oryza_sativa.fa #Genome file in .fasta format for input(Need full path)
bed: Yourpath/CharPlant/example/oryza_sativa.bed #Open chromatin regions file in .bed format for input(Need full path)
out: ory #Prefix of the output file
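
Before launching the pipeline, the three entries can be verified from Python (a minimal sketch; it assumes PyYAML is available, which is installed as a dependency of Snakemake):

import os
import yaml  #PyYAML, pulled in together with Snakemake

with open("config.yaml") as handle:
    config = yaml.safe_load(handle)

#The genome and bed entries must be full paths to existing files
for key in ("genome", "bed"):
    assert os.path.isfile(config[key]), "missing input file: %s" % config[key]
print("output prefix:", config["out"])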

(iv) Run CharPlant

The Snakefile defines the rules that perform each operation. We have created a rule for each target and intermediate file, so users do not need to rewrite it. A complete “Snakefile” is shown in a subsequent section.

$ CharPlant.sh

CharPlant performs the following four steps in turn and outputs the predicted OCRs to a .bed format file.

  • Data preprocessing

  • Model training

  • Motif visualization

  • De novo prediction

(v) Output files

If all steps are completed, the results are written to the following five directories (a small inspection example follows the list).

/data_preprocessing---Data preprocessing results for model training and motif visualization.
/model---.json file of model architecture, .h5 file of model parameters and .png file of the result figure.
/motif--- Positional weight matrix of motifs.
/de_novo_prediction---Results of de novo prediction.
/peak---Peaks of predicted OCRs.
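
As a quick way to inspect the final result (a minimal sketch; the file name below is hypothetical and assumes MACS2 wrote its standard output under /peak with the prefix set in config.yaml, e.g. ory for the rice example):

#Count predicted OCR peaks and their total length from the BED-like MACS2 output
peaks_file = "peak/ory_peaks.narrowPeak"  #hypothetical file name for the rice example
n_peaks = 0
total_bp = 0
with open(peaks_file) as handle:
    for line in handle:
        fields = line.split()
        n_peaks += 1
        total_bp += int(fields[2]) - int(fields[1])
print("predicted OCRs:", n_peaks, "covering", total_bp, "bp")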

Contact us

Yin Shen : [email protected]
Ling-Ling Chen : [email protected]
Junxiang Gao : [email protected]

Code Snippets

Snakefile rules (shell commands):
shell:
   "mkdir $PWD/data_preprocessing && cd $PWD/data_preprocessing && python ../src/data_preprocess.py -g {input.genome} -b {input.bed} -o {params.out} 2> /dev/null"    
shell:
   "mkdir $PWD/model && cd $PWD/model && python ../src/model.py -e {params.epochs} -p {params.patience} -lr {params.learningrate} -b {params.batch_size} -d {params.dropout} -n1 {params.nb_filter1} -n2 {params.nb_filter2} -fl1 {params.filter_len1} -fl2 {params.filter_len2} -hd {params.hidden} 2> /dev/null"  
shell:
      "mkdir $PWD/motif && cd $PWD/motif && python ../src/motif.py -n1 {params.nb_filter1} -fl1 {params.filter_len1} -o {params.motif_out} 2> /dev/null"
shell:
    "mkdir $PWD/de_novo_prediction && cd $PWD/de_novo_prediction && python ../src/de_novo_prediction.py -g {input.genome} -l {params.split_lines} -t {params.threshold} -o {params.prediction_out}"
shell:
      "cd src/ && bash batch_submit.sh {params.run_type} {params.prediction_out} {params.batch_submit_jobs_number}"
shell:
     "cd src/ && bash cat_predict.sh"
shell: 
    "python src/get_positive_sample.py -i {params.prediction_out} -n {params.speices_name}"
shell:
    " mkdir $PWD/peak && cd $PWD/peak && bowtie2-build {input.genome}  {params.index_name}"
shell:
     "cd $PWD/peak && bowtie2 -x {params.index_name}  -f ../de_novo_prediction/whole_pre.fa -S {params.sam_prefix}.sam "
shell:
      "cd $PWD/peak && source activate macs2 && macs2 callpeak -t {params.sam_prefix}.sam  -f SAM --shift -125 --extsize 250 --nomodel -B --SPMR -g hs -n {params.peak_prefix} "
src/data_preprocess.py:
import numpy as np
import re
import random
import pyfiglet
import argparse
import os
import sklearn
from sklearn.model_selection import train_test_split

#Making large letters out of ordinary text with pyfiglet
ascii_banner = pyfiglet.figlet_format("charplant")
print(ascii_banner)

#Argument parsing
parser = argparse.ArgumentParser(description='''Data Preprocess: 
                                                takes 3 required arguments''')
parser.add_argument("--genome", "-g", required=True,
                    help="genome file in fasta format")
parser.add_argument("--bed", "-b", required=True,
                    help="bed file of Chromatin Accessible Regions for input")
parser.add_argument("--out", "-o", required=True,
                    help="prefix of the output file")
args = parser.parse_args()

#Construct positive sample sequences from the genome and the OCR bed file
genome_fasta={}
for line in open(args.genome,'r'):
    if line[0]=='>':
        seq_name=line.lstrip('>').strip()
        genome_fasta[seq_name]=[]
    else:
        genome_fasta[seq_name].append(line.strip())
for keys,val in genome_fasta.items():
    genome_fasta[keys]=''.join(val)
nuc=open(args.bed,'r')
out=open(args.out+".fa",'w')
col_num=0
bed=nuc.readlines()
col=len(bed)
for lline in bed:
        col_num+=1
        chr_name=lline.rstrip().split()[0]
        start_end=lline.rstrip().split()[1:]
        leng=int(start_end[1])-int(start_end[0])
        out.write('>'+chr_name+':'+start_end[0]+'-'+start_end[1]+'\n'+genome_fasta[chr_name][int(start_end[0]):int(start_end[1])]+'\n')
out.close()

#Negative sequences are generated by randomly shuffling the positive sequences with random.shuffle
neg_fasta={}
out_neg=open(args.out+"_negtive.fa",'w')
for line in open(args.out+".fa",'r'):
    if line[0]=='>':
        seq_name=line.lstrip('>').strip()
        neg_fasta[seq_name]=[]
    else:
        neg_fasta[seq_name].append(line.strip())
for keys,val in neg_fasta.items():
    neg_fasta[keys]=''.join(val)
    fa=list(neg_fasta[keys])
    random.shuffle(fa)
    out_neg.write(">"+keys+"\n"+"".join(fa)+"\n")
out_neg.close()

#Construct the positive sample dataset
bed_dict_pos={}
pos_o=open(args.out+"pos.txt",'w')
for line in open(args.out+".fa",'r'):
    if line[0]==">":
       seq=line.strip()
       bed_dict_pos[seq]=[]
    else:
       bed_dict_pos[seq].append(line.strip())
for keys,val in bed_dict_pos.items():
    bed_dict_pos[keys]=''.join(val)
    pos_o.write('1'+'\t'+bed_dict_pos[keys]+'\n')
pos_o.close()

#Construct the negative sample dataset
bed_dict_neg={}
neg_o=open(args.out+"neg.txt",'w')
for line in open(args.out+"_negtive.fa",'r'):
    if line[0]==">":
       seq=line.strip()
       bed_dict_neg[seq]=[]
    else:
       bed_dict_neg[seq].append(line.strip())
for keys,val in bed_dict_neg.items():
    bed_dict_neg[keys]=''.join(val)
    neg_o.write('0'+'\t'+bed_dict_neg[keys]+'\n')
neg_o.close()


#Use one-hot encoding to convert each sequence into a two-dimensional matrix with 4 columns.
#Each base is vectorized: A|a, C|c, G|g, T|t, N|n become [1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1], [0,0,0,0]
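#Example: the 5-bp sequence "ACGTN" becomes the 5 x 4 matrix
#[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1], [0,0,0,0],
#and every sample is padded (or truncated) with zero rows to a fixed 1000 x 4 shape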

def load_data(filename):
    f=open(filename,'r')
    sequences=f.readlines()
    num=len(sequences)
    data=np.empty((num,1000,4),dtype='float32')
    label=np.empty((num,),dtype="int")
    for i in range(num):
        line=sequences[i].replace('\n','')
        list_line=re.split('\s+',line)
        one_sequence=list_line[1]
        for j in range(1000):
            if j<=len(one_sequence)-1:
                if re.findall(one_sequence[j],'A|a'):
                    data[i,j,:]=np.array([1.0,0.0,0.0,0.0],dtype='float32')
                if re.findall(one_sequence[j],'C|c'):
                    data[i,j,:]=np.array([0.0,1.0,0.0,0.0],dtype='float32')
                if re.findall(one_sequence[j],'G|g'):
                    data[i,j,:]=np.array([0.0,0.0,1.0,0.0],dtype='float32')
                if re.findall(one_sequence[j],'T|t'):
                    data[i,j,:]=np.array([0.0,0.0,0.0,1.0],dtype='float32')
                if re.findall(one_sequence[j],'N|n'):
                    data[i,j,:]=np.array([0.0,0.0,0.0,0.0],dtype='float32')
            else:
                data[i,j,:]=np.array([0.0,0.0,0.0,0.0],dtype='float32')
        label[i]=list_line[0]
    return data,label

pos_sample=args.out+"pos.txt"
neg_sample=args.out+"neg.txt"

data_pos,label_pos = load_data(pos_sample)
data_neg,label_neg = load_data(neg_sample)

#Divide the positive dataset into training, validation, and test sets (60%, 20%, 20%)

data_pos_train, data_pos_test, label_pos_train, label_pos_test = train_test_split(data_pos,label_pos, test_size=0.4, random_state=1)

data_pos_test, data_pos_val, label_pos_test, label_pos_val = train_test_split(data_pos_test, label_pos_test, test_size=0.5, random_state=1)


#Divide the negative dataset into training, validation, and test sets (60%, 20%, 20%)

data_neg_train, data_neg_test, label_neg_train, label_neg_test = train_test_split(data_neg,label_neg, test_size=0.4, random_state=1)

data_neg_test, data_neg_val, label_neg_test, label_neg_val = train_test_split(data_neg_test, label_neg_test, test_size=0.5, random_state=1)

#Combined training sets
data_train = np.concatenate((data_pos_train,data_neg_train),axis=0)
label_train = np.concatenate((label_pos_train,label_neg_train),axis=0)
#Combined validate sets
data_val = np.concatenate((data_pos_val,data_neg_val),axis=0)
label_val = np.concatenate((label_pos_val,label_neg_val),axis=0)
#Combined test sets
data_test = np.concatenate((data_pos_test,data_neg_test),axis=0)
label_test = np.concatenate((label_pos_test,label_neg_test),axis=0)


#save
np.save('data_train.npy',data_train)
np.save('label_train.npy',label_train)
np.save('data_val.npy',data_val)
np.save('label_val.npy',label_val)
np.save('data_test.npy',data_test)
np.save('label_test.npy',label_test)


#Recover the training set sequences from the one-hot matrices
sequence=[[] for i in range(data_train.shape[0])]
for i in range(data_train.shape[0]):
    for j in range(data_train.shape[1]):
        if (data_train[i,j,:] == np.array([1.0,0.0,0.0,0.0],dtype='float32')).all():
                 sequence[i]+='A'
        if (data_train[i,j,:] == np.array([0.0,1.0,0.0,0.0],dtype='float32')).all():
                 sequence[i]+='C'
        if (data_train[i,j,:] == np.array([0.0,0.0,1.0,0.0],dtype='float32')).all(): 
                 sequence[i]+='G'
        if (data_train[i,j,:] == np.array([0.0,0.0,0.0,1.0],dtype='float32')).all():
                 sequence[i]+='T'
        if (data_train[i,j,:] == np.array([0.0,0.0,0.0,0.0],dtype='float32')).all():
                 sequence[i]+='N'

o=open(args.out+"train.txt",'w')
for i in range(data_train.shape[0]):    
   o.write(str(label_train[i])+'\t'+''.join(sequence[i])+'\n')
o.close()
src/de_novo_prediction.py:
import pyfiglet
import argparse
import re
import os

#Making large letters out of ordinary text with pyfiglet
ascii_banner = pyfiglet.figlet_format("charplant")
print(ascii_banner)

#Argument parsing
parser = argparse.ArgumentParser(description='''De novo prediction''')
parser.add_argument("--genome", "-g", required=True,
                    help="genome file in fasta format")
parser.add_argument("--lines", "-l",default=20000, type=int, required=False,
                    help="The number of lines per split file when the sliding-window genome file is split into smaller files (default is 20,000)")
parser.add_argument("--threshold", "-t",default=0.5, type=float, required=False,
                    help="the threshold adopted to assign positive predictions (default is 0.5)")
parser.add_argument("--out", "-o", required=True,    
                    help="prefix of the output file")  
args = parser.parse_args()



print("\033[1;30;34m%s\033[0m" %"Sliding window cutting genome...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
fasta={}
for line in open(args.genome,'r'):
    if line[0]=='>':
        seq_name=line.lstrip('>').strip()
        fasta[seq_name]=[]
    else:
        fasta[seq_name].append(line.strip())
for keys,val in fasta.items():
    fasta[keys]=''.join(val)

o=open(args.out,'w')               
len_chrom={}
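#Slide a 36-bp window along each chromosome with a step of 1 bp and write one
#"chrom  start  end  sequence" line per window; the final, shorter window at the
#chromosome end is written once before moving to the next chromosome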
for key,val in fasta.items():
        len_chrom[key]=len(fasta[key])
        for i in range(0,len_chrom[key],1):
          if i+36<=int(len_chrom[key]):
             o.write(key+'\t'+str(i)+'\t'+str(i+36)+'\t'+fasta[key][i:i+36]+'\n')
          else:
             o.write(key+'\t'+str(i)+'\t'+str(len_chrom[key])+'\t'+fasta[key][i:len_chrom[key]+1]+'\n')
             break
o.close()
print("\033[1;30;34m%s\033[0m" %"well done...")

print("\033[1;30;34m%s\033[0m" %"Split genome file according to multi-line...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
splitLen = int(args.lines) 
output_prefix = args.out        # example: split_fasta_36_1_

input = open(args.out, 'r').read().split('\n')

counts = 1
for lines in range(0, len(input), splitLen):
    output_data = input[lines:lines+splitLen]
    output = open(output_prefix + str(counts), 'w')
    output.write('\n'.join(output_data))
    counts += 1
    output.close()
print("\033[1;30;34m%s\033[0m" %"well done...")

print("\033[1;30;34m%s\033[0m" %"batch write into python files...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
get_model_files_path = os.chdir("../model")
model_files_path = os.getcwd()
model_files = os.listdir(model_files_path)
for file in model_files:
     if file[-4: ] == 'json':
          model_architecture_path = model_files_path+"/"+file
     elif file[-2: ] =='h5':
          model_weights_path = model_files_path+"/"+file

get_file_path = os.chdir("../de_novo_prediction")
file_path = os.getcwd()
files = os.listdir(file_path)
file_list=[]
for file in files:
    if file[-2: ] == 'py':
        continue 
    elif re.match(args.out+'\d+',file):
         file_list.append(file)

def tryint(s):            
    try:
        return int(s)
    except ValueError:
        return s
def str2int(v_str):    
    return [tryint(sub_str) for sub_str in re.split('([0-9]+)', v_str)]
def sort_humanly(v_list):
    return sorted(v_list, key=str2int)

sort_file_list = sort_humanly(file_list)


python_file_1='''

def loading_data1(filename):
    f=open(filename,'r')
    sequences=f.readlines()
    num=len(sequences)
    data=np.empty((num,1000,4),dtype='float32')
    for i in range(num):
        line=sequences[i].replace("\\n",'')
        list_line=re.split('\s+',line)
        one_sequence=list_line[3]
        for j in range(1000):
            if j<=len(one_sequence)-1:
                if re.match(one_sequence[j],'A|a'):
                    data[i,j,:]=np.array([1.0,0.0,0.0,0.0],dtype='float32')
                if re.match(one_sequence[j],'C|c'):
                    data[i,j,:]=np.array([0.0,1.0,0.0,0.0],dtype='float32')
                if re.match(one_sequence[j],'G|g'):
                    data[i,j,:]=np.array([0.0,0.0,1.0,0.0],dtype='float32')
                if re.match(one_sequence[j],'T|t'):
                    data[i,j,:]=np.array([0.0,0.0,0.0,1.0],dtype='float32')
                if re.match(one_sequence[j],'N|n'):
                    data[i,j,:]=np.array([0.0,0.0,0.0,0.0],dtype='float32')
            else:
                data[i,j,:]=np.array([0.0,0.0,0.0,0.0],dtype='float32')
    return data
'''

python_file_2 = '''
result_new=np.empty((results.shape[0],1),dtype='int32')
for i in range(result_new.shape[0]):
         if results[i] >= %(threshold)s:
            result_new[i] = np.array([1], dtype='int32')
         else:
            result_new[i] = np.array([0], dtype = 'int32')
for i in result_new.tolist():
   o.write(str(i)+"\\n")    
o.close()
'''
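#For every split chunk of the sliding-window file, write a small stand-alone
#prediction script: it loads the trained model (architecture .json plus weights .h5),
#one-hot encodes the chunk, predicts a label for each 36-bp window, and writes the
#0/1 calls to whole_predict_fasta<chunk>.txt. The generated scripts are run in
#batches by batch_submit.sh (see the Snakefile rules above).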
for file in sort_file_list:
    file_read = open(file_path+"/"+file+'.py', 'w')
    file_read.write('import numpy as np'+'\n'+'import re'+'\n'+'from keras.models import model_from_json'+'\n'+"model = model_from_json(open('"+model_architecture_path+"').read())"+'\n'+"model.load_weights('"+model_weights_path+"')"+'\n'+python_file_1+'\n'+"data=loading_data1('"+file+"')"+'\n'+'results=model.predict_classes(data)'+'\n'+"o=open('whole_predict_fasta"+file+".txt','w')"+'\n'+python_file_2%dict(threshold=args.threshold)+'\n')    
    file_read.close()
print("\033[1;30;34m%s\033[0m" %"well done...")
src/get_positive_sample.py:
import pyfiglet
import argparse
import os
#Making large letters out of ordinary text with pyfiglet
ascii_banner = pyfiglet.figlet_format("charplant")
print(ascii_banner)
#Argument parsing
parser = argparse.ArgumentParser(description="motif Visualization ")
parser.add_argument("--input", "-i", required=True,
                    help="The prediction output file generated by de_novo_prediction (example: split_fasta_36_1)")
parser.add_argument("--name", "-n", required=True,
                    help="The name of the species")
args = parser.parse_args()


get_input_file_path = os.chdir("de_novo_prediction/")
input_file_path = os.getcwd()
input_file = os.listdir(input_file_path)

for file in input_file:
    if file[-17: ] == args.input:
         input_files = input_file_path+"/"+file
#Get positive sample
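#Keep only the windows whose predicted label is 1: they are written to whole_pre.txt
#and then wrapped into whole_pre.fa, which bowtie2 aligns back to the genome in the
#peak-calling step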
genome=[]
genome_line=0
for l in open(input_files,'r'):
    ll=l.strip()
    genome.append(ll)
    genome_line +=1 

predict=[]
for line in open('whole_predict','r'):
    lines=line.strip()
    liness=lines.lstrip('[').rstrip(']')
    predict.append(liness)


genome_pre={}
lines=0
while lines < genome_line:
    k_name=genome[lines]
    genome_pre[k_name]=predict[lines]
    lines+=1

o=open('whole_pre.txt','w')
for k,v in genome_pre.items():
    if genome_pre[k] == '1':
        kk=k.strip().split('\t')
        o.write(kk[3]+'\n')
o.close()

#construct fasta file
i=0
o=open('whole_pre.fa','w')
for line in open('whole_pre.txt','r'):
   lines=line.strip()
   o.write('>'+args.name+'-'+'predict_region'+str(i)+'\n'+lines+'\n')
   i+=1
o.close()
print("\033[1;30;34m%s\033[0m" %"Done...")
src/model.py:
import sys
import argparse
import time
import numpy as np
import re
import pyfiglet
from keras.models import Sequential,model_from_json
from keras.callbacks import ModelCheckpoint,EarlyStopping 
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.pooling import GlobalMaxPooling1D
from keras.layers.convolutional import Convolution1D, MaxPooling1D
from keras import regularizers
from keras.optimizers import SGD
from sklearn.metrics import roc_curve, auc
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

#specify which GPU(s) to be used
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

np.random.seed(666)

#Making large letters out of ordinary text with pyfiglet
ascii_banner = pyfiglet.figlet_format("charplant")
print(ascii_banner)

#Argument parsing
parser = argparse.ArgumentParser(description="Training Charplant cnn model")
parser.add_argument("--epochs", "-e", default=150, type=int, required=False,
                    help="Number of epochs.(default is 150)")
parser.add_argument("--patience", "-p", default=20, type=int, required=False,
                    help='Number of epochs  for early stopping.(default is 20)')
parser.add_argument("--learningrate", "-lr", default=0.001, type=float, required=False,
                   help='Learning rate.(default is 0.001)')
parser.add_argument("--batch_size","-b",  default=128, type=int, required=False,	
                    help="Batch Size.(default is 128)")
parser.add_argument("--dropout","-d",  default=0.6, type=float,required=False,
                    help="Dropout rate.(default is 0.6)")
parser.add_argument("--nb_filter1","-n1",  default=200, type=int, required=False,
                    help="Number of filters in first layer of convolution.(default is 200)")
parser.add_argument("--nb_filter2","-n2",  default=100, type=int, required=False,
                    help="Number of filters in second layer of convolution.(default is 100)")
parser.add_argument("--filter_len1","-fl1",  default=19, type=int, required=False,
                    help="length of filters in first layer of convolution.(default is 19)")
parser.add_argument("--filter_len2","-fl2",  default=11, type=int, required=False,
                    help="length of filters in second layer of convolution.(default is 11)")
parser.add_argument("--hidden","-hd", default=200, type=int, required=False,
                    help="units in the fully connected layer.(default is 200)")
args = parser.parse_args()



print("\033[1;30;34m%s\033[0m" %"loading data...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
sys.stdout.flush()

data_train=np.load('../data_preprocessing/data_train.npy')
label_train=np.load('../data_preprocessing/label_train.npy')
data_val=np.load('../data_preprocessing/data_val.npy')
label_val=np.load('../data_preprocessing/label_val.npy')
data_test=np.load('../data_preprocessing/data_test.npy')
label_test=np.load('../data_preprocessing/label_test.npy')
print("\033[1;30;34m%s\033[0m" %"well done...")
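
#Network architecture: two 1D convolution layers (each followed by ReLU and dropout),
#a flatten layer, one fully connected ReLU layer, and a single sigmoid output unit;
#the input is a 1000 x 4 one-hot encoded sequence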

model=Sequential()
model.add(Convolution1D(int(args.nb_filter1),
                        int(args.filter_len1),
                        border_mode='same',
                        input_shape=(1000,4)))
model.add(Activation('relu'))
model.add(Dropout(float(args.dropout)))

model.add(Convolution1D(int(args.nb_filter2),
                        int(args.filter_len2),
                        border_mode='same'))
model.add(Activation('relu'))
model.add(Dropout(float(args.dropout)))


model.add(Flatten())
model.add(Dense(int(args.hidden),init='normal',activation='relu'))
model.add(Dropout(float(args.dropout)))

model.add(Dense(1,init='normal',activation='linear'))
model.add(Activation('sigmoid'))

sgd=SGD(lr=float(args.learningrate),decay=1e-5,momentum=0.9,nesterov=True)

print("\033[1;30;34m%s\033[0m" %"model compiling...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
sys.stdout.flush()
model.compile(loss='binary_crossentropy',optimizer=sgd,metrics=['accuracy'])  
early_stopping =EarlyStopping(monitor='val_loss', patience=args.patience,verbose=1)
checkpointer = ModelCheckpoint(filepath='model_weights'+str(args.nb_filter1)+'-'+str(args.nb_filter2)+'-'+str(args.filter_len1)+'-'+str(args.filter_len2)+'-'+str(args.hidden)+'-'+str(args.dropout)+'-'+str(args.learningrate)+'-'+str(args.batch_size)+'-'+str(args.epochs)+'-'+str(args.patience)+'.h5', verbose=1, save_best_only=True)
print("\033[1;30;34m%s\033[0m" %"well done...")

print("\033[1;30;34m%s\033[0m" %"training...")
print("\033[1;30;34m%s\033[0m" %"please wait...")
time_start = time.time()
result=model.fit(data_train,label_train,batch_size=args.batch_size,nb_epoch=args.epochs,shuffle=True,validation_data=(data_val,label_val),callbacks=[checkpointer, early_stopping])
time_end = time.time()
print("\033[1;30;34m%s\033[0m" %"well done...")

json_string=model.to_json()
open('model_architecture'+str(args.nb_filter1)+'-'+str(args.nb_filter2)+'-'+str(args.filter_len1)+'-'+str(args.filter_len2)+'-'+str(args.hidden)+'-'+str(args.dropout)+'-'+str(args.learningrate)+'-'+str(args.batch_size)+'-'+str(args.epochs)+'-'+str(args.patience)+'.json','w').write(json_string)

model.load_weights('model_weights'+str(args.nb_filter1)+'-'+str(args.nb_filter2)+'-'+str(args.filter_len1)+'-'+str(args.filter_len2)+'-'+str(args.hidden)+'-'+str(args.dropout)+'-'+str(args.learningrate)+'-'+str(args.batch_size)+'-'+str(args.epochs)+'-'+str(args.patience)+'.h5')
score = model.evaluate(data_val,label_val, verbose=0)
print('accuracy_validate :',score[1])

score1 = model.evaluate(data_test,label_test, verbose=0)
print('Test loss:', score1[0])
print('Test accuracy:', score1[1])


print('training time : %d sec' % (time_end-time_start))


pred=model.predict_proba(data_test)

fpr, tpr, thresholds = roc_curve(label_test, pred)
roc_auc = auc(fpr, tpr)
print('the auc is ',roc_auc)


print("\033[1;30;34m%s\033[0m" %"plot ROC curve...")

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('pictureauc'+str(args.nb_filter1)+'-'+str(args.nb_filter2)+'-'+str(args.filter_len1)+'-'+str(args.filter_len2)+'-'+str(args.hidden)+'-'+str(args.dropout)+'-'+str(args.learningrate)+'-'+str(args.batch_size)+'-'+str(args.epochs)+'-'+str(args.patience)+'.png')
plt.close()


print("\033[1;30;34m%s\033[0m" %"plot figure...")


plt.plot(result.history['loss'])
plt.plot(result.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.savefig('pictureloss'+str(args.nb_filter1)+'-'+str(args.nb_filter2)+'-'+str(args.filter_len1)+'-'+str(args.filter_len2)+'-'+str(args.hidden)+'-'+str(args.dropout)+'-'+str(args.learningrate)+'-'+str(args.batch_size)+'-'+str(args.epochs)+'-'+str(args.patience)+'.png')
plt.close()

plt.figure()
plt.plot(result.epoch,result.history['acc'],label="acc")
plt.plot(result.epoch,result.history['val_acc'],label="val_acc")
plt.scatter(result.epoch,result.history['acc'],marker='*')
plt.scatter(result.epoch,result.history['val_acc'])
plt.legend(loc='lower right')
plt.savefig('picture'+str(args.nb_filter1)+'-'+str(args.nb_filter2)+'-'+str(args.filter_len1)+'-'+str(args.filter_len2)+'-'+str(args.hidden)+'-'+str(args.dropout)+'-'+str(args.learningrate)+'-'+str(args.batch_size)+'-'+str(args.epochs)+'-'+str(args.patience)+'.png')
plt.close()
src/motif.py:
from __future__ import print_function
import os
import numpy as np
import pyfiglet
import keras
from keras.models import model_from_json
from keras import backend as K
import argparse
#specify which GPU(s) to be used
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"


#Making large letters out of ordinary text with pyfiglet
ascii_banner = pyfiglet.figlet_format("charplant")
print(ascii_banner)

#Argument parsing
parser = argparse.ArgumentParser(description="motif Visualization ")
parser.add_argument("--nb_filter1","-n1",  default=200, type=int, required=False,
                    help="Number of filters in first layer of convolution.(default is 200)")
parser.add_argument("--filter_len1","-fl1",  default=19, type=int, required=False,
                    help="length of filters in first layer of convolution.(default is 19)")
parser.add_argument("--out", "-o", required=True,
                    help="prefix of the output file")
args = parser.parse_args()

get_input_file_path = os.chdir("../data_preprocessing")
input_file_path = os.getcwd()
input_file = os.listdir(input_file_path)
for file in input_file:
    if file[-9: ] == 'train.txt':
         motif_input_file = input_file_path+"/"+file

get_model_files_path = os.chdir("../model")
model_files_path = os.getcwd()
model_files = os.listdir(model_files_path)
for file in model_files:
     if file[-4: ] == 'json':
          model_architecture_path = model_files_path+"/"+file
     elif file[-2: ] =='h5':
          model_weights_path = model_files_path+"/"+file


get_current_path = os.chdir("../motif")
os.makedirs(args.out+"_motif")
filter_size=int(args.filter_len1)
model = model_from_json(open(model_architecture_path).read())
model.load_weights(model_weights_path)
f = K.function([model.layers[0].input,K.learning_phase()], [model.layers[1].output])
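#f maps a batch of one-hot sequences to the activations of the first convolutional
#layer (after ReLU), an array of shape (batch, 1000, nb_filter1) that is scanned
#below to find the k-mers each filter responds to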


data=np.load("../data_preprocessing/data_train.npy")


def print_pwm(f, filter_idx, filter_pwm, nsites):
    if nsites < 10:
        return

    print('MOTIF filter%d' % filter_idx, end = "\n", file = f)
    print('letter-probability matrix: alength= 4 w= %d nsites= %d' % (filter_pwm.shape[0], nsites), end = "\n", file = f) 

    for i in range(0, filter_pwm.shape[0]):
        print('%.4f %.4f %.4f %.4f' % tuple(filter_pwm[i]), end = "\n", file = f)
    print('', end = '\n', file = f)
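
#The helpers below extract motifs from the first-layer filters: logo_kmers writes the
#k-mers whose activation exceeds a threshold set 70% of the way from the mean to the
#maximum activation, make_filter_pwm counts them into a position weight matrix with
#pseudocounts, and meme_header writes the MEME-format header with background letter
#frequencies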



def logo_kmers(filter_outs,filter_size, seqs, filename, maxpct_t = 0.7):
    all_outs = np.ravel(filter_outs)
    all_outs_mean = all_outs.mean()
    all_outs_norm = all_outs - all_outs_mean
    raw_t = maxpct_t * all_outs_norm.max() + all_outs_mean

    with open(filename, 'w') as f:
        for i in range(filter_outs.shape[0]):
            for j in range(filter_outs.shape[1]):
                if filter_outs[i,j] > raw_t:
                    kmer = seqs[i][j-9:j+10]
                    if len(kmer) <filter_size:
                        continue
                    print('>%d_%d' % (i,j), end = '\n', file = f)
                    print(kmer, end = '\n', file = f)


def make_filter_pwm(filter_fasta):
    nts = {'A':0, 'C':1, 'G':2, 'T':3}
    pwm_counts = []
    nsites = 4
    for line in open(filter_fasta):
        if line[0] == '>':
            continue

        seq = line.rstrip()
        nsites += 1
        if len(pwm_counts) == 0:
            for i in range(len(seq)):
                pwm_counts.append(np.array([1.0]*4))

        for i in range(len(seq)):
            try:
                pwm_counts[i][nts[seq[i]]] += 1
            except KeyError:
                pwm_counts[i] += np.array([0.25]*4)

    pwm_freqs = []
    for i in range(len(pwm_counts)):
        pwm_freqs.append([pwm_counts[i][j]/float(nsites) for j in range(4)])

    return np.array(pwm_freqs), nsites - 4

def meme_header(f, seqs):
    nts = {'A':0, 'C':1, 'G':2, 'T':3}

    nt_counts = [1]*4
    for i in range(len(seqs)):
        for nt in seqs[i]:
            try:
                nt_counts[nts[nt]] += 1
            except KeyError:
                pass

    nt_sum = float(sum(nt_counts))
    nt_freqs = [nt_counts[i]/nt_sum for i in range(4)]

    print('MEME version 4', end = '\n', file = f)
    print('', end = '\n', file = f)
    print('ALPHABET= ACGT', end = '\n', file = f)
    print('', end = '\n', file = f)
    print('Background letter frequencies:', end = '\n', file = f)
    print('A %.4f C %.4f G %.4f T %.4f' % tuple(nt_freqs), end = '\n', file = f)
    print('', end = '\n', file = f)




sequence=[]
for line in open(motif_input_file,'r'):
    lines=line.strip().split('\t')
    one_sequence=lines[1]
    sequence.append(one_sequence)


y=0
final_output=np.empty([0,1000,int(args.nb_filter1)])
while y+128 <int(data.shape[0]):
   x=data[y:y+128]
   cnn_output=f([x])[0]
   y+=128
   final_output=np.concatenate([final_output,cnn_output],axis=0)
   print('%s/%s data points processed...' % (y, data.shape[0]))
x1=data[y:int(data.shape[0])]
cnn_output1=f([x1])[0]
final_output=np.concatenate([final_output,cnn_output1],axis=0)
print(final_output.shape)
cnn_layer_idx = 0
with open('%s/filter_meme.txt' % (args.out+"_motif"), 'w') as meme:
        meme_header(meme, sequence)
        for i in range(int(args.nb_filter1)):
            logo_kmers(final_output[:, :, i], filter_size,sequence, '%s/filter_%s.fa' % ((args.out+"_motif"), i))
            filter_pwm, nsites = make_filter_pwm('%s/filter_%s.fa' % ((args.out+"_motif"), i))
            print_pwm(meme, i, filter_pwm, nsites)