Large comparative genomic analysis tool for the prediction of lifestyle-associated genes


microLife is a streamlined computational workflow that annotates bacterial genomes and performs large-scale comparative genomics to predict bacterial lifestyles and to pinpoint candidate genes, termed lifestyle-associated genes (LAGs), and biosynthetic gene clusters associated with each detected lifestyle. The workflow is divided into three modules:

  • Clustering module: predicts, clusters and annotates the genes of every input genome

  • Lifestyle prediction module: employs a machine learning model to predict bacterial lifestyles or other user-specified metadata

  • Analytical module (Shiny app): embeds the results from the previous modules in a user-friendly interface for comprehensive and interactive comparative genomics. An example of the app with a demo dataset (the genomes present in data/) is available at http://178.128.251.24:3838/microLife_linux

[Figure: overview of the microLife workflow]

How to start using microLife

Download microLife and install dependencies

git clone https://github.com/Carrion-lab/microLife.git microLife
cd microLife/

Install the conda environment necessary to download files from NCBI using the file ENVS/MicroLife_download.yml and activate it:

mamba env create -f ENVS/MicroLife_download.yml
conda activate MicroLife_download

Download genomes and reduce redundancy

Download the accession names

Retrieve a list of accession IDs you want to process with microLife (e.g., all Mycobacterium RefSeq genomes) by extracting the search query (see the screenshot below) from an NCBI website search and integrating it into the following command:

esearch -db assembly -query '("Mycobacterium"[Organism] OR Mycobacterium[All Fields]) AND ("latest refseq"[filter] AND all[filter] NOT anomalous[filter])' | esummary | xtract -pattern DocumentSummary -element AssemblyAccession > NCBI_accs.txt

[Screenshot: NCBI assembly search showing the query string]

Download and uncompress the FASTA files

mkdir genomes/
cd genomes/
bit-dl-ncbi-assemblies -w ../NCBI_accs.txt -f fasta -j 10
gzip -d ./*.gz

Download metadata and remove redundancy at >99% Average Nucleotide Identity (ANI)

cd ../
python main_script_download.py

Making sure everything is ready

Before starting microLife, you should have the following in microLife/download/:

  • "genomes_renamed/": Directory where the non-redundant genomes with the filenames ready for microLife are stored.

  • "METADATA_MERGED.txt": File with the following columns: 'scientific_name', 'NCBIid', 'cluster_membership', 'representative' and 'MicroLife_name'

Remember to deactivate the MicroLife_download conda environment if you want to jump to the next section ("Executing clustering module") right after this one!
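
For example:

conda deactivate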

Executing clustering module

Install dependencies necessary for MicroLife

mamba env create -f ENVS/MicroLife_environment.yml
mamba env create -f ENVS/antismash.yml
mamba env create -f ENVS/bigscape.yml
conda activate MicroLife_environment

Prepare input files

The input must be FASTA assemblies stored in the data/ directory, all with filenames following the format "Genus_species_strain_O.fna".* If genomes were downloaded as described in the previous section ("Download genomes and reduce redundancy"), the directory download/genomes_renamed/ already contains the genome files in the correct name format, and they only have to be moved to the data/ folder:

mv download/genomes_renamed/* data/

To run microLife with your own genomes, they should follow the format "Genus_species_strain_O.fna"* and be stored in data/. Then use the script src/rename_genomes.R to change the input genome strain names into a MicroLife-friendly format:

Rscript src/rename_genomes.R data/ names_equivalence.txt

This script renames the input genome files by replacing the strain name with a barcode, to avoid microLife crashes,
e.g., Pseudomonas_fluorescens_FDAARGOS-1088.fna --> Pseudomonas_fluorescens_X0001.fna
The file "names_equivalence.txt" contains the corresponding name mapping and needs to be present in the main directory before running microLife. If you downloaded your genomes as described above, names_equivalence.txt is already present in the main folder.

*In the genome file name, the user must avoid any special character, i.e., anything that is not a letter, number, dash or dot. This applies especially to the strain name, which should also be fewer than 10 characters long.
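
A quick way to catch non-conforming filenames before launching the workflow (a minimal sketch; the pattern mirrors the naming rule above, including the <10-character strain):

ls data/ | grep -vE '^[A-Za-z0-9.-]+_[A-Za-z0-9.-]+_[A-Za-z0-9.-]{1,9}_O\.fna$'

Any filename printed by this command should be fixed before running the workflow.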

Executing Snakemake

microLife is written with the Snakemake workflow manager and can be executed using the following command from the main directory:

snakemake -j 24

-j specifies the number of CPU cores to use for the whole process. If this flag is omitted, Snakemake will use all available CPU cores on the machine.
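
Before launching the full run, Snakemake's dry-run mode can be used to preview the scheduled jobs without executing anything:

snakemake -n -j 24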

Generate metadata

To run the Lifestyle prediction module and the Shiny app (the next modules of the microLife pipeline), you have to edit the mapping_file.txt file, which is generated at the end of the clustering module. It is a two-column tab-separated file where the first column specifies the genome name (as established in the data/ directory) and the second column specifies the lifestyle of that genome. Every genome lifestyle is annotated as 'Unknown' when the file is generated; users have to manually annotate the genomes with their information of interest, and genomes with no lifestyle information must remain annotated as "Unknown". Note that additional columns with other metadata can be added to the file.
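
For illustration, an edited mapping_file.txt might look like this (the 'Sample' and 'Lifestyle' headers match the generated file; the genome names and lifestyle labels below are hypothetical):

Sample	Lifestyle
Pseudomonas_fluorescens_FDAARGOS-1088	Plant_associated
Mycobacterium_tuberculosis_H37Rv	Pathogen
Mycobacterium_smegmatis_MC2155	Unknown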

microLife outputs

Two primary outputs, namely MEGAMATRIX_renamed.txt and big_scape_binary_table_renamed.txt, are generated in the main folder. MEGAMATRIX_renamed.txt is a matrix where each row represents a gene cluster and each column represents the presence of that cluster in the different genomes, along with annotations from different databases. Similarly, big_scape_binary_table_renamed.txt is a matrix where each row represents a Gene Cluster Family (GCF), i.e., a cluster of Biosynthetic Gene Clusters (BGCs), and each column represents the presence of these clusters in the different genomes.

Intermediate outputs are stored in the intermediate_files/ directory. Within this folder, the annot/ subdirectory contains the prokka outputs for each genome. These outputs include annotation files such as .gbk , .faa , .gff , and .ffn , among others. Similarly, the antismash/ folder contains the antiSMASH results for each genome.

Lifestyle prediction module

The user can test the predictability of their metadata with a random forest machine learning model using the script src/classifier.R, specifying the locations of mapping_file.txt and MEGAMATRIX_renamed.txt in addition to the column name of the metadata variable to predict:

Rscript classifier_src/classifier.R mapping_file.txt MEGAMATRIX_renamed.txt Lifestyle

ROC plots showing the model evaluation for the different classes present in the metadata are generated and stored in classifier/.

If good accuracy is achieved, the user can use the model to predict the metadata of genomes labeled as 'Unknown' in the mapping file. The resulting augmented metadata is stored in the new file mapping_file_augmented.txt.
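
To check how many genomes remain unlabeled after augmentation (a simple sketch, assuming 'Unknown' appears only in the lifestyle column):

grep -c 'Unknown' mapping_file_augmented.txt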

microLife App (Analytical module)

In order to initiate the Shiny app, the following Snakemake output files need to be dropped in the Shiny_app/input/ directory (a copy sketch is shown after the list):

  • MEGAMATRIX_renamed.txt

  • mapping_file.txt or mapping_file_augmented.txt

  • intermediate_files/BiG-SCAPE/big_scape_binary_table_renamed.txt

  • intermediate_files/BiG-SCAPE/annotation.txt

  • intermediate_files/combined_proteins/combined_proteins.fasta

  • names_equivalence.txt
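
A minimal sketch of that copy step, assuming the files sit at the paths listed above and that you want the augmented mapping file:

cp MEGAMATRIX_renamed.txt mapping_file_augmented.txt names_equivalence.txt Shiny_app/input/
cp intermediate_files/BiG-SCAPE/big_scape_binary_table_renamed.txt Shiny_app/input/
cp intermediate_files/BiG-SCAPE/annotation.txt Shiny_app/input/
cp intermediate_files/combined_proteins/combined_proteins.fasta Shiny_app/input/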

After having prepared these input files, users should move the working directory to Shiny_app/, open the script app.R in RStudio and click the 'Run App' button in the upper right to initiate the microLife app and enjoy visualizing all their results! Note that users may move the Shiny_app/ folder and all its contents to a local computer and launch RStudio locally.
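
Alternatively, the app can be launched without RStudio's button (a sketch using the standard shiny runner; the port is an arbitrary choice):

R -e "shiny::runApp('Shiny_app', port = 3838)"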

Code Snippets

run:
    shell("mkdir -p {params}")
    shell("touch .mkdir.chkpnt")
From line 64 of main/Snakefile
run:
    shell('prokka --force --outdir {params.outdir} --prefix {params.prefix} --locustag {params.str} --addgenes --increment 5 --centre NIOO-KNAW --genus {params.genus} --species {params.species} --str {params.str} --gcode 11 --cpus 20 --evalue 1e-03 --rfam {input.file}')
shell:
    'python ./src/genbank_to_protein_fasta.py --genbank {input} --fasta {output.faa}'
From line 95 of main/Snakefile
run:
    shell('cat {input.faa} >  {output.orig}')
    shell('cat {input.tags} >  {output.tags}')
From line 108 of main/Snakefile
run:
    shell('python ./src/rename_proteins.py -f {input.orig} -t {input.tags} -o {output.combined_proteins} -n {output.newtagfile}')
From line 119 of main/Snakefile
run:
    #Extract gene lengths
    shell("bioawk -c fastx '{{ print $name, length($seq) }}' < intermediate_files/combined_proteins/combined_proteins.fasta >intermediate_files/combined_proteins/length_genes.txt")

    #Run mmseq2 clustering to 0.95
    shell('mmseqs createdb {input} {params.mmseq_db}')
    shell('mmseqs cluster {params.mmseq_db} {params.mmseq_db_clu} {params.mmseq_temp} --min-seq-id 0.95 --cov-mode 0 -c 0.8')
    shell('mmseqs createtsv {params.mmseq_db} {params.mmseq_db} {params.mmseq_db_clu} {params.mmseq_tsv}')

    #Extract mmseq2 representatives
    shell('mmseqs createsubdb {params.mmseq_db_clu} {params.mmseq_db} {params.mmseq_rep}')
    shell('mmseqs convert2fasta {params.mmseq_rep} {params.mmseq_fasta}')


    #Run diamond
    shell('diamond makedb --in {params.mmseq_fasta} -d {params.diamond_db}')
    shell('diamond blastp -b 10 -p {THREADS} -c 1 -e 0.00001 --un {params.diamond_unaligned} -k 5000 -d {params.diamond_db} -q {params.mmseq_fasta} -o {params.diamond_results} --outfmt 6 qseqid sseqid pident bitscore')
    shell("awk -F'\t' '$3>20' {params.diamond_results} > {params.diamond_results_filtered}")
    shell("cut -f1,2,4 {params.diamond_results_filtered} > {params.diamond_results_clean}")

    ##Get unaligned fasta headers
    shell('grep "^>" {params.diamond_unaligned}  > {params.diamond_unaligned_headers}')

    ### Run MCL
    shell('mcxload -abc {params.diamond_results_clean} --stream-mirror -o {params.mcl_data} -write-tab {params.mcl_tab}')
    shell('mcl {params.mcl_data} -I 3.0 -use-tab {params.mcl_tab}')
    shell('mv out.mcl_data.mci.I30 intermediate_files/clustering/')

    ##Generate matrix

    shell('Rscript src/MCL_merge.R {params.mmseq_tsv} {params.mcl_clusters} {params.diamond_unaligned_headers} {output.binary_table} {input} {output.fasta}')
run:
    shell("grep '^>' {input.faa} >  {params.headers}")
    shell("Rscript ./src/extract_prokka.R {input.id2tag} {params.headers} {output} ")
From line 189 of main/Snakefile
run:
    shell('wget -P ./intermediate_files/PFAM/ ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam33.1/Pfam-A.hmm.gz')
    shell('gunzip intermediate_files/PFAM/Pfam-A.hmm.gz' )
    shell('hmmpress intermediate_files/PFAM/Pfam-A.hmm')
    shell('hmmsearch --tblout {output.pfam} --cpu 30 -E 1e-5 ./intermediate_files/PFAM/Pfam-A.hmm {input}')
run:
    shell('download_eggnog_data.py -y --data_dir {params.data}')
    shell('emapper.py -i {input} --cpu 25 -o intermediate_files/eggnog_annotation/eggnog_annotation --data_dir {params.data} --pident 30 --query_cover 50 --subject_cover 50 --report_orthologs')
run:
    shell("sed '/^#/d' {input} > {output.clean_annotation}")
    shell('Rscript src/COG_KEGG_annotations.R {output.clean_annotation} {params.cog_descriptions} {params.kegg_descriptions} {output.cog} {output.kegg}')
    shell("sed '/^ko/d' {output.kegg}  > {output.kegg_clean}")
From line 231 of main/Snakefile
run:
    shell('wget -P ./intermediate_files/DBCAN/ http://bcb.unl.edu/dbCAN2/download/dbCAN-HMMdb-V9.txt')
    shell('hmmsearch --tblout {output} -E 1e-5 ./intermediate_files/DBCAN/dbCAN-HMMdb-V9.txt {input}')
run:
    shell("Rscript ./src/Process_hmm_annotation.R {input.dbcan} {input.pfam} {params.dbcan_family} {params.pfam_family} {output}")
From line 255 of main/Snakefile
run:
    shell("set -euo pipefail; Rscript ./src/Process_annotations.R {input.matrix} {input.cog} {input.kegg} {input.hmm_annotation} {input.prokka} {output}")
From line 268 of main/Snakefile
shell:
    'set +u; source /opt/miniconda3/etc/profile.d/conda.sh; conda activate antismash_microlife; set -u; antismash --cb-general --cb-knownclusters --cb-subclusters --output-dir {params.out_dir} -c {params.threads} --asf --pfam2go --genefinding-tool prodigal --smcog-trees {input}'
run:
    shell('git clone https://github.com/medema-group/BiG-SCAPE.git')
    shell('mv BiG-SCAPE intermediate_files/')

    shell('rm intermediate_files/BiG-SCAPE/bigscape.py')
    shell('cp src/bigscape.py intermediate_files/BiG-SCAPE/')
From line 290 of main/Snakefile
shell:
    "set +u; source /opt/miniconda3/etc/profile.d/conda.sh; conda activate bigscape_microlife; set -u; python ./intermediate_files/BiG-SCAPE/bigscape.py -i {params.indir} -o {params.outdir} --pfam_dir intermediate_files/PFAM/ --mode glocal --mibig --cutoffs 0.3 0.7 --include_singletons --cores {params.threads} --mix"
From line 311 of main/Snakefile
run:
    shell("Rscript src/I-BIGSCAPE_revision.R {input.clustering} {input.network} {input.annotations} intermediate_files/BiG-SCAPE/mix_filtered.network intermediate_files/BiG-SCAPE/GCF_annotation.txt {output.annotations}")
    shell("python src/II_Absence_Presence.py intermediate_files/BiG-SCAPE/GCF_annotation.txt intermediate_files/BiG-SCAPE/abs_pres_table.csv ")
    shell("Rscript src/III_Absence_Presence_GCF.R intermediate_files/BiG-SCAPE/abs_pres_table.csv {output.binary_matrix}")
From line 329 of main/Snakefile
run:
    shell('Rscript src/rename_MEGAMATRIX.R {input.genes} {input.BGCs} {params} {output.genes} {output.BGCs} {output.mapping_file}')
From line 346 of main/Snakefile
args = commandArgs(trailingOnly=TRUE)
##arg[1] is the annotation file from emapper
##arg[2] is the cog descriptions
##arg[3] is the kegg descriptions
##arg[4] is the cog output
##arg[5] is the kegg output

annotation <- read.delim2(args[1], header = F)
annotation <-annotation[,c(1,5,12)]
colnames(annotation) <- c('id', 'cogid', 'keggid')



#Extract COGs
cog_annotation <- annotation[,c(1,2)]
keep <- grep("COG", cog_annotation$cogid)
cog_annotation <-cog_annotation[keep,]
cog_annotation$cogid <- sapply(strsplit(cog_annotation$cogid,"@"), `[`, 1)
keep <- grep("ar", cog_annotation$cogid, invert = T)
cog_annotation <-cog_annotation[keep,]


##Load COG descriptions
cog_extended_annotations <- read.csv(args[2], header = F)
cog_extended_annotations <- cog_extended_annotations[,c(1,3)]
colnames(cog_extended_annotations)[1] <- 'id'
colnames(cog_extended_annotations)[2] <- 'cog_description'

cog_annotation<- merge(cog_annotation, cog_extended_annotations, by.x = 'cogid', by.y = 'id', all.x = T)
cog_annotation<- cog_annotation[,c(2,1,3)] ##TO WRITE
colnames(cog_annotation)[1]<- 'gene'

write.table(cog_annotation, args[4], row.names = F)


#Extract KEGGs
kegg_annotation <- annotation[,c(1,3)]
keep <- grep("ko", kegg_annotation$keggid)
kegg_annotation <-kegg_annotation[keep,]
kegg_annotation$keggid <- sapply(strsplit(kegg_annotation$keggid,","), `[`, 1)
kegg_annotation$keggid <- sapply(strsplit(kegg_annotation$keggid,":"), `[`, 2)

##KO descriptions
ko_descriptions <- read.delim2(args[3], header = F)
ko_descriptions$V1 <- sapply(strsplit(ko_descriptions$V1,":"), `[`, 2)
colnames(ko_descriptions)<- c('keggid', 'kegg_description')

kegg_annotation <- merge(kegg_annotation, ko_descriptions, by= 'keggid', all.x = T)
kegg_annotation <-kegg_annotation[,c(2,1,3)]
colnames(kegg_annotation)[1]<- 'gene'

write.table(kegg_annotation, args[5], row.names = F)
library(stringr)
library(dplyr)
args = commandArgs(trailingOnly=TRUE)
#args[1] is the id2tags input
#args[2] is the prokka_headers.txt input
#args[3] is the output




id2tags<- read.delim2(args[1], header = F)
id2tags<- id2tags[,1:2]
colnames(id2tags)<- c('geneid', 'gene')

prokka_header <- read.delim(args[2], header = F)
prokka_header$prokkadescription <- prokka_header$V1
prokka_header$remove <- sapply(strsplit(prokka_header$V1," "), `[`, 1)

prokka_header$prokkadescription <- str_remove(prokka_header$prokkadescription, prokka_header$remove)
prokka_header$prokkadescription <- sub('.', '', prokka_header$prokkadescription)

prokka_header$remove <- sub('.', '', prokka_header$remove)

prokka_header <- prokka_header[,c(3,2)]
colnames(prokka_header)[1] <- 'gene'

##Merge id2tags with prokka headers
prokka_annotation <- merge(id2tags, prokka_header, by= 'gene', all.x = T)
prokka_annotation <- prokka_annotation[,2:3]
prokka_annotation<- prokka_annotation %>% distinct(geneid, .keep_all = TRUE)

write.table(prokka_annotation, args[3], row.names = F)
import argparse, os
import Bio
from Bio import SeqIO

def genbank2faa(infile, outfile):
    seq = SeqIO.parse(infile,'genbank')
    fastaseq = []
    if not os.path.exists(outfile):
        print( "Converting :%s"%(infile) )
        #partial CDS_counter
        partial = 0

        id_2_tags = "%s.tags"%(outfile)

        tagfile = open(id_2_tags,'w')

        for s in seq:

            features = s.features
            for f in features:
                if f.type == "source":
                    sourcefeature = f

            if 'organism' in sourcefeature.qualifiers:
                organism = sourcefeature.qualifiers['organism'][0]
                strain = ''

                if 'strain' in sourcefeature.qualifiers:
                    strain = sourcefeature.qualifiers['strain'][0]
                    strain = strain.replace(' ','').replace(':','').replace('|','')
                organism = organism + '_' + strain

            #else:
            #    organism = s.annotations['organism'].replace(' ','_').replace(':','_').replace('|','_')

            organism = organism.replace(' ','_').replace(':','_').replace('|','_')

            ltprefix = estimate_locustag_prefix(features)
            ltprefixnew = "%s_"%(ltprefix)


            moltype="C"
            for f in features:
                if f.type == "source":
                    if "plasmid" in f.qualifiers:
                        moltype="P"

            cdscount=0
            for f in features:
                if f.type == "CDS":
                    cdscount = cdscount + 1
                    if 'locus_tag' in f.qualifiers:
                        locus_tags = f.qualifiers['locus_tag']
                        locus_tag = locus_tags[-1]
                        if not '_' in locus_tag:
                           locus_tag = locus_tag.replace(ltprefix, ltprefixnew) 

                        if not isinstance(f.location.start, Bio.SeqFeature.ExactPosition):
                            # location is not exact.
                            partial = partial +1
                            locus_tag = "%s_%02d"%(locus_tag, partial)

                        prot = f.extract(s)
                        prot.seq = prot.seq.translate(table=11, to_stop=True)
                        prot.id = locus_tag

                        tagfile.write(locus_tag)
                        tagfile.write("\t")
                        tagfile.write(moltype)
                        tagfile.write("\t")
                        tagfile.write(organism)
                        tagfile.write("\t")
                        tagfile.write(prot.description)
                        tagfile.write("\n")

                        if 'product' in f.qualifiers:
                            prot.description = f.qualifiers['product'][0]
                        else:
                            prot.description = 'unknown'

                        fastaseq.append(prot)
            print( "Written :%s sequences from %s"%(len(fastaseq), s.description) )

        if partial>0:
            print("Warning sequence %s contains %d partial CDS sequences. This might inflate the number of clusters."%(infile, partial))

        tagfile.close()
        SeqIO.write(fastaseq, outfile,'fasta')
    else:
        print( "Not Converting :%s"%(infile) )
        print( "%s exists"%(outfile) )


def genbank2ffn(infile, outfile):
    seq = SeqIO.parse(infile,'genbank')
    fastaseq = []
    print( "Converting :%s"%(infile) )
    #partial CDS_counter
    partial = 0


    for s in seq:
        features = s.features
        for f in features:
            if f.type == "gene":
                if 'locus_tag' in f.qualifiers:
                    locus_tags = f.qualifiers['locus_tag']
                    locus_tag = locus_tags[-1]
                    if not isinstance(f.location.start, Bio.SeqFeature.ExactPosition):
                        # location is not exact.
                        partial = partial +1
                        locus_tag = "%s_%02d"%(locus_tag, partial)

                    gene = f.extract(s)
                    gene.id = locus_tag
                    if 'gene' in f.qualifiers:
                        gene.description = f.qualifiers['gene'][0]
                    else:
                        gene.description = gene.id

                    fastaseq.append(gene)
        print( "Written :%s sequences from %s"%(len(fastaseq), s.description) )

    if partial>0:
        print("Warning sequence %s contains %d partial CDS sequences. This might inflate the number of clusters."%(infile, partial))

    SeqIO.write(fastaseq, outfile,'fasta')

def estimate_locustag_prefix(features):
    """ Find the longest string that is a prefix of all the locus tags. """

    strings = [f.qualifiers['locus_tag'][-1] for f in features if 'locus_tag' in f.qualifiers]

    if not strings:
        return ''
    prefix = strings[0]
    for s in strings:
        if len(s) < len(prefix):
            prefix = prefix[:len(s)]
        if not prefix:
            return ''
        for i in range(len(prefix)):
            if prefix[i] != s[i]:
                prefix = prefix[:i]
                break
    return prefix

if __name__=='__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument('-g', '--genbank', dest = 'genbank', help = 'genbank input file', required = True)
    parser.add_argument('-f', '--fasta', dest = 'fasta', help = 'fasta CDS output file', required = True)

    args = parser.parse_args()
    genbank2faa(args.genbank, args.fasta)
'R Script for Data Mining and Manipulation of Output Tables from BiG-SCAPE'

'Libraries to be Imported'
library(plyr)
library(dplyr)
library(stats)
library(tidyverse)
library(ggplot2)

#Set Working Directory
#setwd('/Results_I')
#ex:setwd("E:/NACTAR")

args = commandArgs(trailingOnly=TRUE)
"Import Data Tables from BiG-SCAPE"
#Clustering Tables
clustering_df_mix <- read.delim(args[1], header = TRUE) #clustering dataframe

#Network Tables
network_df <- read.delim(args[2], header = TRUE) #general network dataframe
network_df <- network_df [-c(4:11)] #removal of unwanted columns

#Annotation Tables
annotations_df_full<- read.delim(args[3], header = TRUE) #annotations dataframe

'Merge Files for Absence Presence Table'
annotations_merged_df_mix <- merge(annotations_df_full , clustering_df_mix , by.x= "BGC", by.y = "X.BGC.Name") #merge clustering and annotations
annotations_merged_df_mix  <- annotations_merged_df_mix [, c(1,8,6,5)]
#annotations_merged_df_mix  <- annotations_merged_df_mix [, c(1,8,4,5,6)] #removal of unwanted columns

colnames(annotations_merged_df_mix) <- c('BGC Name', 'GCF No', 'Organism', 'BGC Class')

'Extract Rows based on Logical Criteria (User Based Criteria) (Optional by User)'
#sum(network_df_mix$Raw.distance > 0.1)
network_df__distance_filtered <- network_df %>% filter(Raw.distance > 0.10)
network_df_bgc_filtered <- network_df[!grepl("BGC", network_df$Clustername.1),]



#Export Tables to CSV Format
# Write data to txt file: tab separated values
# sep = "\t"
write.table(network_df__distance_filtered, file = args[4], sep = "\t",
            row.names = FALSE, quote = FALSE)
# Write data to csv files:
# decimal point = "." and value separators = comma (",")

write.table(annotations_merged_df_mix, file = args[5], sep = '\t',
            row.names = FALSE, quote = FALSE)




###Obtain GCF annotations list
merged_annotations <- annotations_merged_df_mix
merged_annotations$`GCF No` <- paste0('GCF', merged_annotations$`GCF No`) #column name contains a space, so backticks are needed
merged_annotations<- merged_annotations[,c(2,4)]
merged_annotations <- distinct(merged_annotations)

write.csv(merged_annotations, args[6], row.names = F)
'Python Script used to create an absence presence table of GCFs per Strain from BiG-SCAPE file'

#Import Libraries
import pandas as pd
import sys
import os
from string import digits

#Importing DataFrame
df1 = pd.read_csv(sys.argv[1], sep="\t")
df1.columns = ['BGC Name', 'GCF No', 'Organism', 'BGC Class']
#print(df1.isnull())
#print(df1.head())

'Method for Creating Absence Presence Table'
#Dataframe Copies
df2 = df1.copy() #Copy Original DataFrame
df2.columns = ['BGC Name', 'GCF No', 'Organism', 'BGC Class']
#DataFrame Manipulation
df2['GCF No'] = df1['BGC Class'].astype(str) + 'GCF' + df2['GCF No'].astype(str)
df1[['BGC','BGC2']] = df1['BGC Name'].str.split('_',n=1,expand=True)
df2['Genome'] = df1['Organism'].astype(str) + df1['BGC Name'].astype(str)
df2 = df2[['GCF No','Genome']]
#Literal replacements (regex=False); with the regex default, '.' would match any character
df2['Genome'] = (df2['Genome'].str.replace(' ', '_', regex=False)
                              .str.replace('Unclassified', '_', regex=False)
                              .str.replace('__', '_', regex=False)
                              .str.replace('.', '', regex=False))
path = os.path.join(os.path.abspath(os.getcwd()), 'intermediate_files/BiG-SCAPE')
df2.to_csv(os.path.join(path,r'abs_pres_table.csv'))
args = commandArgs(trailingOnly=TRUE)
'R Script for Creation of BGC absence presence per genome from Absence_Presence Table from BiG-SCAPE'
# -----------------------------------------------------------------------------
# Absence Presence Matrix Creation
# Abs_Pres Table Created with Python Script ()
# -----------------------------------------------------------------------------

#Creation of Co_Ocurrence Matrix
'Clustering Matrix With BGC from MIBIG'
abs_pres_matrix <- read.delim(args[1], header= TRUE, sep=",")
abs_pres_matrix <-abs_pres_matrix[, c(2,3)] #removal of first column

#Modify GCF.No and Genome Columns
abs_pres_matrix <- abs_pres_matrix[!grepl("BGC", abs_pres_matrix$Genome),]
abs_pres_matrix$Genome<- gsub("BGC", "_BGC", abs_pres_matrix$Genome)
abs_pres_matrix$GCF.No<- gsub("[a-zA-Z ]", "", abs_pres_matrix$GCF.No) #Removes BiG-SCAPE Class from GCF Column
abs_pres_matrix$GCF.No<- gsub("-_", "", abs_pres_matrix$GCF.No)
abs_pres_matrix$GCF.No <- sub("^", "GCF", abs_pres_matrix$GCF.No) #Appends GCF to the Front of the Number

#Separating Columns for Renaming
abs_pres_matrix$genome2 <- sapply(strsplit(as.character(abs_pres_matrix$Genome),"_"), `[`, 1)
abs_pres_matrix$genome3 <- sapply(strsplit(as.character(abs_pres_matrix$Genome),"_"), `[`, 2)
abs_pres_matrix$genome4 <- sapply(strsplit(as.character(abs_pres_matrix$Genome),"_"), `[`, 3)
abs_pres_matrix$BGC.Region <- sapply(strsplit(as.character(abs_pres_matrix$Genome),"_"), `[`, 4)

#Merging Previous Columns
abs_pres_matrix$Genome <- paste0(abs_pres_matrix$genome2, "_",
                                 abs_pres_matrix$genome3, "_",
                                 abs_pres_matrix$genome4)

#Removal of Final Columns
abs_pres_matrix<- abs_pres_matrix[, -c(3,4,5,6)]

binary_table <- table(abs_pres_matrix)
binary_table[binary_table>0] <- 1
binary_table<- as.data.frame.matrix(binary_table)

columns <- colnames(binary_table)
new_columns <- paste0(sapply(strsplit(columns,"_"), `[`, 1),'_', sapply(strsplit(columns,"_"), `[`, 2), '_', sapply(strsplit(columns,"_"), `[`, 3))
colnames(binary_table) <- new_columns


write.table(binary_table,args[2])
library(reshape2)
library(tidyverse)
library(stringr)
library(dplyr)
library(readr)
library(Biostrings)
library(data.table)
library(parallel)

cl <- makeCluster(30)
args = commandArgs(trailingOnly=TRUE)

tsv = data.table(read_table(args[1], col_names = F))
colnames(tsv) <- c('V1', 'V2')

n_clusters_mmseq = length(unique(tsv$V1))


tsv2 <- tsv[, .( V2 = str_c(V2, collapse = ',') ), by = V1]

#write.table(tsv2,'diamond/mmseq2_clusters.txt', row.names = F, quote = F)


'Open MCL output, collapse it and clean it\n'

mcl <- data.table(read_csv( args[2], col_names = F))

#Load unaligned headers 
unalin <- read.delim2(args[3], header = F)
unalin <- data.frame(X1 = str_remove_all(unalin$V1, '>'))

mcl <- dplyr::bind_rows(mcl, unalin)


#Name clusters

mcl$clusters <- seq(nrow(mcl))
mcl$clusters <- sprintf("cluster_%06d", mcl$clusters)


mcl_melt <- mcl[, c(X1 = strsplit(X1, "\t")), by = clusters]

'Join mmseq2 clusters with MCL clusters\n'

M <- merge(mcl_melt, tsv2, by.x= 'X1', by.y = 'V1')
M <- M[,c(2,3)]
colnames(M)[2] <- 'descriptions'
M <- data.table(M)
M <- M[, .( descriptions = str_c(descriptions, collapse = ',') ), by = clusters]
M <- M[order(clusters)]
M$gene <- sapply(strsplit(M$descriptions,","), `[`, 1)

#write.table(M, 'clusters_info.txt', row.names = F, sep = '\t', quote = F)

'Create 1/0 matrix\n' 

bacteria_name <- function(string){
  b <- stringr::str_split(string, ',')[[1]]
  c <- sapply(stringr::str_split(b,"[|]"), `[`, 1)
  d <- stringr::str_c(c, collapse = ",")
  return(d)
}

MEGAMATRIX <- M
MEGAMATRIX$descriptions <- parSapply(cl, MEGAMATRIX$descriptions , FUN = bacteria_name)
MEGAMATRIX$gene <- NULL



tbl <- table(splitstackshape:::cSplit(MEGAMATRIX, "descriptions", sep = ",", direction = "long"))
tbl <- data.table(as.data.frame(tbl))
tbl <-dcast.data.table(tbl, clusters ~ descriptions, value.var = 'Freq')

#lista <- as.list(MEGAMATRIX$descriptions)
#binary_matrix <- as.data.frame((splitstackshape:::charMat(listOfValues = lista, fill = 0L)))
#binary_matrix <- cbind(clusters = MEGAMATRIX$clusters, binary_matrix)
binary_matrix <- merge(tbl, M, by = 'clusters')



####Extract sequence of representative proteins
#system("bioawk -c fastx '{ print $name, length($seq) }' < intermediate_files/combined_proteins/combined_proteins.fasta >intermediate_files/combined_proteins/length_genes.txt")
gene_lengths <- read.table('intermediate_files/combined_proteins/length_genes.txt', header = F)

##Function to extract longest gene

extract_long_genes <- function(character, gene_length){
  a <- as.character(stringr::str_split(character, ',')[[1]])
  sub <- data.table::data.table(gene_length[gene_length$V1 %in% a,])
  longest_gene <- sub[which.max(sub$V2)]$V1
  return(longest_gene)
}

'Obtain representative genes\n'

genes_description <- binary_matrix$descriptions

new_rep <- parSapply(cl, genes_description, extract_long_genes, gene_length = gene_lengths)

binary_matrix$gene <- new_rep

write.table(binary_matrix, args[4], row.names = F)

'Obtain representative sequences\n'

representative_genes <- binary_matrix$gene
fasta <-  Biostrings::readAAStringSet(args[5])
subset <- fasta[representative_genes]
writeXStringSet(subset, args[6])

stopCluster(cl)
args = commandArgs(trailingOnly=TRUE)
##arg[1] is the binary cluster matrix (from the clustering rule)
##arg[2] is the cog annotation
##arg[3] is the kegg annotation
##arg[4] is the hmm annotation
##arg[5] is the prokka annotations
##arg[6] is the output name of the matrix with all annotations





'Reading input files'
matrix <- read.table(args[1], header = T)
clustervsgene <- matrix[,c('clusters', 'gene')]

###COG annotations-
cog <- read.table(args[2], header = T)

####KEGG annotations
kegg <- read.table(args[3], header = T)

###Prokka annotations
prokka <- read.table(args[5], header = T)

'Merge annotations COG/KEGG'
###Merge annotations
all_annotations <- merge(kegg, cog, by= 'gene', all = T)

'Merge with HMM annotations'
##Merge with hmm annotations
hmm_annotations <- read.table(args[4], header = T)
all_annotations <- merge(all_annotations, hmm_annotations, by= 'gene', all = T)

'Merge with prokka annotation'
##Merge with prokka
all_annotations <- merge(all_annotations, prokka, by.x= 'gene', by.y = 'geneid', all = T)



###Merge with absence_presence_matrix

'Merge with absence_presence_matrix'
all_annotations <- merge(clustervsgene, all_annotations, by = 'gene', all=T)
all_annotations$gene<- NULL
column_number= ncol(matrix)
matrix_info = matrix[,c(column_number -1 , column_number )]
matrix <- matrix[1:(column_number-2)]
matrix$completeness <- 'True'
matrix <- cbind(matrix, matrix_info)


MEGAMATRIX <- merge(matrix, all_annotations, by= 'clusters', all.x=T)
'Write MEGAMATRIX'
write.table(MEGAMATRIX, args[6], row.names = F)
library(dplyr)
library(stringr)
source('src/rhmmer_functions.R')
#arg[1] is the dbcan annotation
#arg[2] is the pfam annotation
#arg[3] is the dbcan family info
#arg[4] is the pfam family info
#arg[5] is the output table with both annotations


args = commandArgs(trailingOnly=TRUE)


###Extract top hits function
top_hits <- function(table){
  genes <- unique(table$domain_name)
  top_hits <- table[0,]
  for (i in 1:length(genes)){
    subtable <- table[table$domain_name == genes[i],]
    min_evalue <- subtable[which.min(subtable$sequence_evalue),]
    top_hits <- rbind(top_hits,min_evalue)
  }

  return(top_hits)
}


##DBCAN annotations
dbcan <- read_tblout(args[1])
dbcan <- top_hits(dbcan)
dbcan$query_name <- sapply(strsplit(dbcan$query_name,"_"), `[`, 1) 
dbcan$query_name <- str_remove(dbcan$query_name, '.hmm')

dbcan_description <- read.delim(args[3], header = F)
colnames(dbcan_description)[1]<- 'dbcan_family'
colnames(dbcan_description)[2]<- 'dbcan_description'


dbcan_final <- merge(dbcan[,c(1,3)], dbcan_description, by.x = 'query_name', by.y = 'dbcan_family', all.x=T)
colnames(dbcan_final)[1]<- 'dbcanid'
colnames(dbcan_final)[2]<- 'gene'
##Group different annotations of same gene
dbcan_final <- dbcan_final %>% group_by_at(vars(gene)) %>%
  summarize_all(paste, collapse=",")
###Check problems of family names with a _

##PFAM annotations
pfam <- read_tblout(args[2])
pfam <- top_hits(pfam)
pfam <- pfam[,c(1,4)]
pfam$query_accession <- sapply(strsplit(pfam$query_accession,"[.]"), `[`, 1)

pfam_description <- read.delim2(args[4], header = F)
pfam_description <-pfam_description[,c(1,5)]
colnames(pfam_description)[1] <- 'query_accession'
colnames(pfam_description)[2] <- 'pfam_description'

pfam_final <- merge(pfam, pfam_description, by = 'query_accession', all.x = T)
colnames(pfam_final)[1]<- 'pfamid'
colnames(pfam_final)[2]<- 'gene'
##Group different annotations of same gene
pfam_final <- pfam_final %>% group_by_at(vars(gene)) %>%
  summarize_all(paste, collapse=",")


###Merge both annotations in one table
hmm_annotations <- merge(pfam_final, dbcan_final, by= 'gene', all = T)
write.table(hmm_annotations, args[5], row.names=FALSE)
library(stringr)
library(readr)
args = commandArgs(trailingOnly=TRUE)


matrix <- read_delim(args[1], col_names = T, quote = "\"")

matrix <- as.data.frame(matrix)

names_equivalence <- read.table(args[3], header = T)
names_equivalence$Full_name <- str_remove(names_equivalence$Full_name, '_O.fna')
names_equivalence$MicroLife_name <- str_remove(names_equivalence$MicroLife_name, '_O.fna')

n_samples <- grep("completeness", colnames(matrix)) - 1

old_colnames <- data.frame(MicroLife_name = colnames(matrix)[2:n_samples] ) 

M <- merge(old_colnames, names_equivalence, by = 'MicroLife_name')

# reorder the merged table to follow the original column order of the matrix
M <- M[match(old_colnames$MicroLife_name, M$MicroLife_name),]
colnames(matrix)[2:n_samples] <- M$Full_name

write.table(matrix, args[4], row.names = F)


###Rename bigscape table
big_scape_matrix <- read.table(args[2], header = T)


old_colnames <- data.frame(MicroLife_name = colnames(big_scape_matrix) ) 

M <- merge(old_colnames, names_equivalence, by = 'MicroLife_name')

M <- M[match(old_colnames$MicroLife_name, M$MicroLife_name),]
colnames(big_scape_matrix) <- M$Full_name

write.table(big_scape_matrix, args[5], row.names = T)

###Create mapping_file
mapping_file = data.frame(Sample = names_equivalence$Full_name, Lifestyle = 'Unknown')
write.table(mapping_file, args[6], row.names = F)
import fnmatch
import os, glob
from Bio import SeqIO
import sys, traceback
import argparse

def rename_fasta(infile, tagfile, outfile, newtagfile):
    counters = {}
    tags = {}
    newseqs=[]

    t = open(tagfile,'r')
    tn = open(newtagfile,'w')

    for line in t:
        data = line.strip().split('\t')
        tags[data[0]] = data
    t.close()

    seqs = SeqIO.parse(infile, 'fasta')
    for s in seqs:
        data = tags[s.id]
        counter = counters.get(data[2], 0)
        counter = counter + 1
        idtag = "%s|%06d"%(data[2], counter)
        counters[data[2]]=counter
        s.id = idtag
        s.description = ""
        tn.write(idtag)
        tn.write('\t')
        tn.write('\t'.join(data))
        tn.write('\n')
        newseqs.append(s)
    tn.close()

    try:
        SeqIO.write((newseqs), outfile, 'fasta')
    except:
        print("Exception in user code:")
        print("-"*60)
        traceback.print_exc(file=sys.stdout)
        print("-"*60)


if __name__=='__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument('-f', '--fasta', dest = 'fasta', help = 'fasta input file', required = True)
    parser.add_argument('-t', '--tags', dest = 'tags', help = 'tags', required = True)
    parser.add_argument('-o', '--outfile', dest = 'outfile', help = 'outfile', required = True)
    parser.add_argument('-n', '--newtags', dest = 'newtags', help = 'tags', required = True)

    args = parser.parse_args()
    rename_fasta(args.fasta, args.tags, args.outfile, args.newtags)