Large comparative genomic analysis tool for the prediction of lifestyle-associated genes


microLife is a streamlined computational workflow that annotates bacterial genomes and performs large-scale comparative genomics to predict bacterial lifestyles and to pinpoint candidate genes, termed lifestyle-associated genes (LAGs), and biosynthetic gene clusters associated with each detected lifestyle. The workflow is divided into three modules:

  • Clustering module: predicts, clusters and annotates the genes of every input genome

  • Lifestyle prediction module: employs a machine learning model to predict bacterial lifestyles or other user-specified metadata

  • Analytical module (Shiny app): embeds the results from the previous modules in a user-friendly interface for comprehensive and interactive comparative genomics. An example of the app with a demo dataset (the genomes present in data/) is available at http://178.128.251.24:3838/microLife_linux

[Figure: overview of the microLife workflow]

How to start using microLife

Download microLife and install dependencies

git clone https://github.com/Carrion-lab/microLife.git microLife
cd microLife/

Install the conda environment necessary to download files from NCBI using the file ENVS/MicroLife_download.yml and activate it:

mamba env create -f ENVS/MicroLife_download.yml
conda activate MicroLife_download

Download genomes and reduce redundancy

Download the accession names

Retrieve a list of accession IDs you want to process with microLife (e.g., all Mycobacterium RefSeq genomes) by extracting the search query (see the screenshot below) from an NCBI website search and integrating it into the following command:

esearch -db assembly -query '("Mycobacterium"[Organism] OR Mycobacterium[All Fields]) AND ("latest refseq"[filter] AND all[filter] NOT anomalous[filter])' | esummary | xtract -pattern DocumentSummary -element AssemblyAccession > NCBI_accs.txt

[Screenshot: NCBI assembly search showing the query string]

Download and uncompress the FASTA files

mkdir genomes/
cd genomes/
bit-dl-ncbi-assemblies -w ../NCBI_accs.txt -f fasta -j 10
gzip -d ./*.gz

Download metadata and remove redundancy at >99% Average Nucleotide Identity (ANI)

cd ../
python main_script_download.py

Making sure everything is ready

Before starting microLife, you should have the following in microLife/download/:

  • "genomes_renamed/": Directory where the non-redundant genomes with the filenames ready for microLife are stored.

  • "METADATA_MERGED.txt": File with the following columns: 'scientific_name', 'NCBIid', 'cluster_membership', 'representative' and 'MicroLife_name'

Remember to deactivate the MicroLife_download conda environment if you want to jump to the next section ("Executing clustering module") right after this one!
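
For example:

conda deactivate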

Executing clustering module

Install dependencies necessary for MicroLife

mamba env create -f ENVS/MicroLife_environment.yml
mamba env create -f ENVS/antismash.yml
mamba env create -f ENVS/bigscape.yml
conda activate MicroLife_environment

Prepare input files

The input must be FASTA assemblies stored in the data/ directory, all with filenames following the format "Genus_species_strain_O.fna".* If genomes were downloaded as described in the previous section ("Download genomes and reduce redundancy"), the directory download/genomes_renamed/ already contains the genome files in the correct name format, and they only have to be moved to the data/ folder:

mv download/genomes_renamed/* data/

To run microLife with your own genomes, they should follow the format "Genus_species_strain_O.fna"* and be stored in data/. Then use the script src/rename_genomes.R to change the input genome strain names into a MicroLife-friendly format:

Rscript src/rename_genomes.R data/ names_equivalence.txt

This script renames the input genome files by replacing the strain name with a barcode, to avoid microLife crashes,
e.g., Pseudomonas_fluorescens_FDAARGOS-1088.fna --> Pseudomonas_fluorescens_X0001.fna
The file "names_equivalence.txt" contains the corresponding name mapping and needs to be present in the main directory before running microLife. If you downloaded your genomes as described above, names_equivalence.txt is already present in the main folder.

*In the genome file name, the user must avoid any special character, i.e., anything that is not a letter, number, dash or dot. This applies especially to the strain name, which should also be fewer than 10 characters long.
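
A quick way to catch non-conforming filenames before launching the workflow (a minimal sketch; the pattern mirrors the naming rule above, including the <10-character strain):

ls data/ | grep -vE '^[A-Za-z0-9.-]+_[A-Za-z0-9.-]+_[A-Za-z0-9.-]{1,9}_O\.fna$'

Any filename printed by this command should be fixed before running the workflow.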

Executing Snakemake

microLife is written with the Snakemake workflow manager and can be executed using the following command from the main directory:

snakemake -j 24

-j specifies the number of CPU cores to use for the whole process. If this flag is omitted, Snakemake will use all available CPU cores on the machine.
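
Before launching the full run, Snakemake's dry-run mode can be used to preview the scheduled jobs without executing anything:

snakemake -n -j 24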

Generate metadata

To run the Lifestyle prediction module and the Shiny app (the next modules of the microLife pipeline), you have to edit the mapping_file.txt file, which is generated at the end of the clustering module. It is a two-column tab-separated file where the first column specifies the genome name (as established in the data/ directory) and the second column specifies the lifestyle of that genome. Every genome lifestyle is annotated as 'Unknown' when the file is generated; users have to manually annotate the genomes with their information of interest, and genomes with no lifestyle information must remain annotated as "Unknown". Note that additional columns with other metadata can be added to the file.
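
For illustration, an edited mapping_file.txt might look like this (the 'Sample' and 'Lifestyle' headers match the generated file; the genome names and lifestyle labels below are hypothetical):

Sample	Lifestyle
Pseudomonas_fluorescens_FDAARGOS-1088	Plant_associated
Mycobacterium_tuberculosis_H37Rv	Pathogen
Mycobacterium_smegmatis_MC2155	Unknown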

microLife outputs

Two primary outputs, namely MEGAMATRIX_renamed.txt and big_scape_binary_table_renamed.txt, are generated in the main folder. MEGAMATRIX_renamed.txt is a matrix where each row represents a gene cluster and each column represents the presence of that cluster in the different genomes, along with annotations from different databases. Similarly, big_scape_binary_table_renamed.txt is a matrix where each row represents a Gene Cluster Family (GCF), i.e., a cluster of Biosynthetic Gene Clusters (BGCs), and each column represents the presence of these clusters in the different genomes.

Intermediate outputs are stored in the intermediate_files/ directory. Within this folder, the annot/ subdirectory contains the prokka outputs for each genome. These outputs include annotation files such as .gbk , .faa , .gff , and .ffn , among others. Similarly, the antismash/ folder contains the antiSMASH results for each genome.

Lifestyle prediction module

The user can test the predictability of their metadata with a random forest machine learning model using the script src/classifier.R, specifying the locations of mapping_file.txt and MEGAMATRIX_renamed.txt in addition to the column name of the metadata variable to predict:

Rscript classifier_src/classifier.R mapping_file.txt MEGAMATRIX_renamed.txt Lifestyle

ROC plots showing the model evaluation for the different classes present in the metadata are generated and stored in classifier/.

If good accuracy is achieved, the user can use the model to predict the metadata of genomes labeled as 'Unknown' in the mapping file. The resulting augmented metadata is stored in the new file mapping_file_augmented.txt.
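
To check how many genomes remain unlabeled after augmentation (a simple sketch, assuming 'Unknown' appears only in the lifestyle column):

grep -c 'Unknown' mapping_file_augmented.txt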

microLife App (Analytical module)

In order to initiate the Shiny app, the following Snakemake output files need to be dropped in the Shiny_app/input/ directory (a copy sketch is shown after the list):

  • MEGAMATRIX_renamed.txt

  • mapping_file.txt or mapping_file_augmented.txt

  • intermediate_files/BiG-SCAPE/big_scape_binary_table_renamed.txt

  • intermediate_files/BiG-SCAPE/annotation.txt

  • intermediate_files/combined_proteins/combined_proteins.fasta

  • names_equivalence.txt
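
A minimal sketch of that copy step, assuming the files sit at the paths listed above and that you want the augmented mapping file:

cp MEGAMATRIX_renamed.txt mapping_file_augmented.txt names_equivalence.txt Shiny_app/input/
cp intermediate_files/BiG-SCAPE/big_scape_binary_table_renamed.txt Shiny_app/input/
cp intermediate_files/BiG-SCAPE/annotation.txt Shiny_app/input/
cp intermediate_files/combined_proteins/combined_proteins.fasta Shiny_app/input/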

After having prepared these input files, users should move the working directory to Shiny_app/, open the script app.R in RStudio and click the 'Run App' button in the upper right to initiate the microLife app and enjoy visualizing all their results! Note that users may move the Shiny_app/ folder and all its contents to a local computer and launch RStudio locally.
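
Alternatively, the app can be launched without RStudio's button (a sketch using the standard shiny runner; the port is an arbitrary choice):

R -e "shiny::runApp('Shiny_app', port = 3838)"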

Code Snippets

run:
    shell("mkdir -p {params}")
    shell("touch .mkdir.chkpnt")
From line 64 of main/Snakefile
run:
    shell('prokka --force --outdir {params.outdir} --prefix {params.prefix} --locustag {params.str} --addgenes --increment 5 --centre NIOO-KNAW --genus {params.genus} --species {params.species} --str {params.str} --gcode 11 --cpus 20 --evalue 1e-03 --rfam {input.file}')
shell:
    'python ./src/genbank_to_protein_fasta.py --genbank {input} --fasta {output.faa}'
From line 95 of main/Snakefile
run:
    shell('cat {input.faa} >  {output.orig}')
    shell('cat {input.tags} >  {output.tags}')
From line 108 of main/Snakefile
run:
    shell('python ./src/rename_proteins.py -f {input.orig} -t {input.tags} -o {output.combined_proteins} -n {output.newtagfile}')
From line 119 of main/Snakefile
run:
    #Extract gene lengths
    shell("bioawk -c fastx '{{ print $name, length($seq) }}' < intermediate_files/combined_proteins/combined_proteins.fasta >intermediate_files/combined_proteins/length_genes.txt")

    #Run mmseq2 clustering to 0.95
    shell('mmseqs createdb {input} {params.mmseq_db}')
    shell('mmseqs cluster {params.mmseq_db} {params.mmseq_db_clu} {params.mmseq_temp} --min-seq-id 0.95 --cov-mode 0 -c 0.8')
    shell('mmseqs createtsv {params.mmseq_db} {params.mmseq_db} {params.mmseq_db_clu} {params.mmseq_tsv}')

    #Extract mmseq2 representatives
    shell('mmseqs createsubdb {params.mmseq_db_clu} {params.mmseq_db} {params.mmseq_rep}')
    shell('mmseqs convert2fasta {params.mmseq_rep} {params.mmseq_fasta}')


    #Run diamond
    shell('diamond makedb --in {params.mmseq_fasta} -d {params.diamond_db}')
    shell('diamond blastp -b 10 -p {THREADS} -c 1 -e 0.00001 --un {params.diamond_unaligned} -k 5000 -d {params.diamond_db} -q {params.mmseq_fasta} -o {params.diamond_results} --outfmt 6 qseqid sseqid pident bitscore')
    shell("awk -F'\t' '$3>20' {params.diamond_results} > {params.diamond_results_filtered}")
    shell("cut -f1,2,4 {params.diamond_results_filtered} > {params.diamond_results_clean}")

    ##Get unaligned fasta headers
    shell('grep "^>" {params.diamond_unaligned}  > {params.diamond_unaligned_headers}')

    ### Run MCL
    shell('mcxload -abc {params.diamond_results_clean} --stream-mirror -o {params.mcl_data} -write-tab {params.mcl_tab}')
    shell('mcl {params.mcl_data} -I 3.0 -use-tab {params.mcl_tab}')
    shell('mv out.mcl_data.mci.I30 intermediate_files/clustering/')

    ##Generate matrix

    shell('Rscript src/MCL_merge.R {params.mmseq_tsv} {params.mcl_clusters} {params.diamond_unaligned_headers} {output.binary_table} {input} {output.fasta}')
run:
    shell("grep '^>' {input.faa} >  {params.headers}")
    shell("Rscript ./src/extract_prokka.R {input.id2tag} {params.headers} {output} ")
From line 189 of main/Snakefile
run:
    shell('wget -P ./intermediate_files/PFAM/ ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam33.1/Pfam-A.hmm.gz')
    shell('gunzip intermediate_files/PFAM/Pfam-A.hmm.gz' )
    shell('hmmpress intermediate_files/PFAM/Pfam-A.hmm')
    shell('hmmsearch --tblout {output.pfam} --cpu 30 -E 1e-5 ./intermediate_files/PFAM/Pfam-A.hmm {input}')
run:
    shell('download_eggnog_data.py -y --data_dir {params.data}')
    shell('emapper.py -i {input} --cpu 25 -o intermediate_files/eggnog_annotation/eggnog_annotation --data_dir {params.data} --pident 30 --query_cover 50 --subject_cover 50 --report_orthologs')
run:
    shell("sed '/^#/d' {input} > {output.clean_annotation}")
    shell('Rscript src/COG_KEGG_annotations.R {output.clean_annotation} {params.cog_descriptions} {params.kegg_descriptions} {output.cog} {output.kegg}')
    shell("sed '/^ko/d' {output.kegg}  > {output.kegg_clean}")
From line 231 of main/Snakefile
run:
    shell('wget -P ./intermediate_files/DBCAN/ http://bcb.unl.edu/dbCAN2/download/dbCAN-HMMdb-V9.txt')
    shell('hmmsearch --tblout {output} -E 1e-5 ./intermediate_files/DBCAN/dbCAN-HMMdb-V9.txt {input}')
run:
    shell("Rscript ./src/Process_hmm_annotation.R {input.dbcan} {input.pfam} {params.dbcan_family} {params.pfam_family} {output}")
From line 255 of main/Snakefile
run:
    shell("set -euo pipefail; Rscript ./src/Process_annotations.R {input.matrix} {input.cog} {input.kegg} {input.hmm_annotation} {input.prokka} {output}")
From line 268 of main/Snakefile
shell:
    'set +u; source /opt/miniconda3/etc/profile.d/conda.sh; conda activate antismash_microlife; set -u; antismash --cb-general --cb-knownclusters --cb-subclusters --output-dir {params.out_dir} -c {params.threads} --asf --pfam2go --genefinding-tool prodigal --smcog-trees {input}'
run:
    shell('git clone https://github.com/medema-group/BiG-SCAPE.git')
    shell('mv BiG-SCAPE intermediate_files/')

    shell('rm intermediate_files/BiG-SCAPE/bigscape.py')
    shell('cp src/bigscape.py intermediate_files/BiG-SCAPE/')
From line 290 of main/Snakefile
shell:
    "set +u; source /opt/miniconda3/etc/profile.d/conda.sh; conda activate bigscape_microlife; set -u; python ./intermediate_files/BiG-SCAPE/bigscape.py -i {params.indir} -o {params.outdir} --pfam_dir intermediate_files/PFAM/ --mode glocal --mibig --cutoffs 0.3 0.7 --include_singletons --cores {params.threads} --mix"
From line 311 of main/Snakefile
run:
    shell("Rscript src/I-BIGSCAPE_revision.R {input.clustering} {input.network} {input.annotations} intermediate_files/BiG-SCAPE/mix_filtered.network intermediate_files/BiG-SCAPE/GCF_annotation.txt {output.annotations}")
    shell("python src/II_Absence_Presence.py intermediate_files/BiG-SCAPE/GCF_annotation.txt intermediate_files/BiG-SCAPE/abs_pres_table.csv ")
    shell("Rscript src/III_Absence_Presence_GCF.R intermediate_files/BiG-SCAPE/abs_pres_table.csv {output.binary_matrix}")
From line 329 of main/Snakefile
run:
    shell('Rscript src/rename_MEGAMATRIX.R {input.genes} {input.BGCs} {params} {output.genes} {output.BGCs} {output.mapping_file}')
From line 346 of main/Snakefile
args = commandArgs(trailingOnly=TRUE)
##arg[1] is the annotation file from emapper
##arg[2] is the cog descriptions
##arg[3] is the kegg descriptions
##arg[4] is the cog output
##arg[5] is the kegg output

annotation <- read.delim2(args[1], header = F)
annotation <-annotation[,c(1,5,12)]
colnames(annotation) <- c('id', 'cogid', 'keggid')



#Extract COGs
cog_annotation <- annotation[,c(1,2)]
keep <- grep("COG", cog_annotation$cogid)
cog_annotation <-cog_annotation[keep,]
cog_annotation$cogid <- sapply(strsplit(cog_annotation$cogid,"@"), `[`, 1)
keep <- grep("ar", cog_annotation$cogid, invert = T)
cog_annotation <-cog_annotation[keep,]


##Load COG descriptions
cog_extended_annotations <- read.csv(args[2], header = F)
cog_extended_annotations <- cog_extended_annotations[,c(1,3)]
colnames(cog_extended_annotations)[1] <- 'id'
colnames(cog_extended_annotations)[2] <- 'cog_description'

cog_annotation<- merge(cog_annotation, cog_extended_annotations, by.x = 'cogid', by.y = 'id', all.x = T)
cog_annotation<- cog_annotation[,c(2,1,3)] ##TO WRITE
colnames(cog_annotation)[1]<- 'gene'

write.table(cog_annotation, args[4], row.names = F)


#Extract KEGGs
kegg_annotation <- annotation[,c(1,3)]
keep <- grep("ko", kegg_annotation$keggid)
kegg_annotation <-kegg_annotation[keep,]
kegg_annotation$keggid <- sapply(strsplit(kegg_annotation$keggid,","), `[`, 1)
kegg_annotation$keggid <- sapply(strsplit(kegg_annotation$keggid,":"), `[`, 2)

##KO descriptions
ko_descriptions <- read.delim2(args[3], header = F)
ko_descriptions$V1 <- sapply(strsplit(ko_descriptions$V1,":"), `[`, 2)
colnames(ko_descriptions)<- c('keggid', 'kegg_description')

kegg_annotation <- merge(kegg_annotation, ko_descriptions, by= 'keggid', all.x = T)
kegg_annotation <-kegg_annotation[,c(2,1,3)]
colnames(kegg_annotation)[1]<- 'gene'

write.table(kegg_annotation, args[5], row.names = F)
library(stringr)
library(dplyr)
args = commandArgs(trailingOnly=TRUE)
#args[1] is the id2tags input
#args[2] is the prokka_headers.txt input
#args[3] is the output




id2tags<- read.delim2(args[1], header = F)
id2tags<- id2tags[,1:2]
colnames(id2tags)<- c('geneid', 'gene')

prokka_header <- read.delim(args[2], header = F)
prokka_header$prokkadescription <- prokka_header$V1
prokka_header$remove <- sapply(strsplit(prokka_header$V1," "), `[`, 1)

prokka_header$prokkadescription <- str_remove(prokka_header$prokkadescription, prokka_header$remove)
prokka_header$prokkadescription <- sub('.', '', prokka_header$prokkadescription)

prokka_header$remove <- sub('.', '', prokka_header$remove)

prokka_header <- prokka_header[,c(3,2)]
colnames(prokka_header)[1] <- 'gene'

##Merge id2tags with prokka headers
prokka_annotation <- merge(id2tags, prokka_header, by= 'gene', all.x = T)
prokka_annotation <- prokka_annotation[,2:3]
prokka_annotation<- prokka_annotation %>% distinct(geneid, .keep_all = TRUE)

write.table(prokka_annotation, args[3], row.names = F)
import argparse, os
import Bio
from Bio import SeqIO

def genbank2faa(infile, outfile):
    seq = SeqIO.parse(infile,'genbank')
    fastaseq = []
    if not os.path.exists(outfile):
        print( "Converting :%s"%(infile) )
        #partial CDS_counter
        partial = 0

        id_2_tags = "%s.tags"%(outfile)

        tagfile = open(id_2_tags,'w')

        for s in seq:

            features = s.features
            for f in features:
                if f.type == "source":
                    sourcefeature = f

            if 'organism' in sourcefeature.qualifiers:
                organism = sourcefeature.qualifiers['organism'][0]
                strain = ''

                if 'strain' in sourcefeature.qualifiers:
                    strain = sourcefeature.qualifiers['strain'][0]
                    strain = strain.replace(' ','').replace(':','').replace('|','')
                organism = organism + '_' + strain

            #else:
            #    organism = s.annotations['organism'].replace(' ','_').replace(':','_').replace('|','_')

            organism = organism.replace(' ','_').replace(':','_').replace('|','_')

            ltprefix = estimate_locustag_prefix(features)
            ltprefixnew = "%s_"%(ltprefix)


            moltype="C"
            for f in features:
                if f.type == "source":
                    if "plasmid" in f.qualifiers:
                        moltype="P"

            cdscount=0
            for f in features:
                if f.type == "CDS":
                    cdscount = cdscount + 1
                    if 'locus_tag' in f.qualifiers:
                        locus_tags = f.qualifiers['locus_tag']
                        locus_tag = locus_tags[-1]
                        if not '_' in locus_tag:
                           locus_tag = locus_tag.replace(ltprefix, ltprefixnew) 

                        if not isinstance(f.location.start, Bio.SeqFeature.ExactPosition):
                            # location is not exact.
                            partial = partial +1
                            locus_tag = "%s_%02d"%(locus_tag, partial)

                        prot = f.extract(s)
                        prot.seq = prot.seq.translate(table=11, to_stop=True)
                        prot.id = locus_tag

                        tagfile.write(locus_tag)
                        tagfile.write("\t")
                        tagfile.write(moltype)
                        tagfile.write("\t")
                        tagfile.write(organism)
                        tagfile.write("\t")
                        tagfile.write(prot.description)
                        tagfile.write("\n")

                        if 'product' in f.qualifiers:
                            prot.description = f.qualifiers['product'][0]
                        else:
                            prot.description = 'unknown'

                        fastaseq.append(prot)
            print( "Written :%s sequences from %s"%(len(fastaseq), s.description) )

        if partial>0:
            print("Warning sequence %s contains %d partial CDS sequences. This might inflate the number of clusters."%(infile, partial))

        tagfile.close()
        SeqIO.write(fastaseq, outfile,'fasta')
    else:
        print( "Not Converting :%s"%(infile) )
        print( "%s exists"%(outfile) )


def genbank2ffn(infile, outfile):
    seq = SeqIO.parse(infile,'genbank')
    fastaseq = []
    print( "Converting :%s"%(infile) )
    #partial CDS_counter
    partial = 0


    for s in seq:
        features = s.features
        for f in features:
            if f.type == "gene":
                if 'locus_tag' in f.qualifiers:
                    locus_tags = f.qualifiers['locus_tag']
                    locus_tag = locus_tags[-1]
                    if not isinstance(f.location.start, Bio.SeqFeature.ExactPosition):
                        # location is not exact.
                        partial = partial +1
                        locus_tag = "%s_%02d"%(locus_tag, partial)

                    gene = f.extract(s)
                    gene.id = locus_tag
                    if 'gene' in f.qualifiers:
                        gene.description = f.qualifiers['gene'][0]
                    else:
                        gene.description = gene.id

                    fastaseq.append(gene)
        print( "Written :%s sequences from %s"%(len(fastaseq), s.description) )

    if partial>0:
        print("Warning sequence %s contains %d partial CDS sequences. This might inflate the number of clusters."%(infile, partial))

    SeqIO.write(fastaseq, outfile,'fasta')

def estimate_locustag_prefix(features):
    """ Find the longest string that is a prefix of all the locus tags. """

    strings = [f.qualifiers['locus_tag'][-1] for f in features if 'locus_tag' in f.qualifiers]

    if not strings:
        return ''
    prefix = strings[0]
    for s in strings:
        if len(s) < len(prefix):
            prefix = prefix[:len(s)]
        if not prefix:
            return ''
        for i in range(len(prefix)):
            if prefix[i] != s[i]:
                prefix = prefix[:i]
                break
    return prefix

if __name__=='__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument('-g', '--genbank', dest = 'genbank', help = 'genbank input file', required = True)
    parser.add_argument('-f', '--fasta', dest = 'fasta', help = 'fasta CDS output file', required = True)

    args = parser.parse_args()
    genbank2faa(args.genbank, args.fasta)
'R Script for Data Mining and Manipulation of Output Tables from BiG-SCAPE'

'Libraries to be Imported'
library(plyr)
library(dplyr)
library(stats)
library(tidyverse)
library(ggplot2)

#Set Working Directory
#setwd('/Results_I')
#ex:setwd("E:/NACTAR")

args = commandArgs(trailingOnly=TRUE)
"Import Data Tables from BiG-SCAPE"
#Clustering Tables
clustering_df_mix <- read.delim(args[1], header = TRUE) #clustering dataframe

#Network Tables
network_df <- read.delim(args[2], header = TRUE) #general network dataframe
network_df <- network_df [-c(4:11)] #removal of unwanted columns

#Annotation Tables
annotations_df_full<- read.delim(args[3], header = TRUE) #annotations dataframe

'Merge Files for Absence Presence Table'
annotations_merged_df_mix <- merge(annotations_df_full , clustering_df_mix , by.x= "BGC", by.y = "X.BGC.Name") #merge clustering and annotations
annotations_merged_df_mix  <- annotations_merged_df_mix [, c(1,8,6,5)]
#annotations_merged_df_mix  <- annotations_merged_df_mix [, c(1,8,4,5,6)] #removal of unwanted columns

colnames(annotations_merged_df_mix) <- c('BGC Name', 'GCF No', 'Organism', 'BGC Class')

'Extract Rows based on Logical Criteria (User Based Criteria) (Optional by User)'
#sum(network_df_mix$Raw.distance > 0.1)
network_df__distance_filtered <- network_df %>% filter(Raw.distance > 0.10)
network_df_bgc_filtered <- network_df[!grepl("BGC", network_df$Clustername.1),]



#Export Tables to CSV Format
# Write data to txt file: tab separated values
# sep = "\t"
write.table(network_df__distance_filtered, file = args[4], sep = "\t",
            row.names = FALSE, quote = FALSE)
# Write data to csv files:
# decimal point = "." and value separators = comma (",")

write.table(annotations_merged_df_mix, file = args[5], sep = '\t',
            row.names = FALSE, quote = FALSE)




###Obtain GCF annotations list
merged_annotations <- annotations_merged_df_mix
merged_annotations$`GCF No` <- paste0('GCF', merged_annotations$`GCF No`) #column name contains a space, so backticks are needed
merged_annotations<- merged_annotations[,c(2,4)]
merged_annotations <- distinct(merged_annotations)

write.csv(merged_annotations, args[6], row.names = F)
'Python Script used to create an absence presence table of GCFs per Strain from BiG-SCAPE file'

#Import Libraries
import pandas as pd
import sys
import os
from string import digits

#Importing DataFrame
df1 = pd.read_csv(sys.argv[1], sep="\t")
df1.columns = ['BGC Name', 'GCF No', 'Organism', 'BGC Class']
#print(df1.isnull())
#print(df1.head())

'Method for Creating Absence Presence Table'
#Dataframe Copies
df2 = df1.copy() #Copy Original DataFrame
df2.columns = ['BGC Name', 'GCF No', 'Organism', 'BGC Class']
#DataFrame Manipulation
df2['GCF No'] = df1['BGC Class'].astype(str) + 'GCF' + df2['GCF No'].astype(str)
df1[['BGC','BGC2']] = df1['BGC Name'].str.split('_',n=1,expand=True)
df2['Genome'] = df1['Organism'].astype(str) + df1['BGC Name'].astype(str)
df2 = df2[['GCF No','Genome']]
#Literal replacements (regex=False); with the regex default, '.' would match any character
df2['Genome'] = (df2['Genome'].str.replace(' ', '_', regex=False)
                              .str.replace('Unclassified', '_', regex=False)
                              .str.replace('__', '_', regex=False)
                              .str.replace('.', '', regex=False))
path = os.path.join(os.path.abspath(os.getcwd()), 'intermediate_files/BiG-SCAPE')
df2.to_csv(os.path.join(path,r'abs_pres_table.csv'))
args = commandArgs(trailingOnly=TRUE)
'R Script for Creation of BGC absence presence per genome from Absence_Presence Table from BiG-SCAPE'
# -----------------------------------------------------------------------------
# Absence Presence Matrix Creation
# Abs_Pres Table Created with Python Script ()
# -----------------------------------------------------------------------------

#Creation of Co_Ocurrence Matrix
'Clustering Matrix With BGC from MIBIG'
abs_pres_matrix <- read.delim(args[1], header= TRUE, sep=",")
abs_pres_matrix <-abs_pres_matrix[, c(2,3)] #removal of first column

#Modify GCF.No and Genome Columns
abs_pres_matrix <- abs_pres_matrix[!grepl("BGC", abs_pres_matrix$Genome),]
abs_pres_matrix$Genome<- gsub("BGC", "_BGC", abs_pres_matrix$Genome)
abs_pres_matrix$GCF.No<- gsub("[a-zA-Z ]", "", abs_pres_matrix$GCF.No) #Removes BiG-SCAPE Class from GCF Column
abs_pres_matrix$GCF.No<- gsub("-_", "", abs_pres_matrix$GCF.No)
abs_pres_matrix$GCF.No <- sub("^", "GCF", abs_pres_matrix$GCF.No) #Appends GCF to the Front of the Number

#Separating Columns for Renaming
abs_pres_matrix$genome2 <- sapply(strsplit(as.character(abs_pres_matrix$Genome),"_"), `[`, 1)
abs_pres_matrix$genome3 <- sapply(strsplit(as.character(abs_pres_matrix$Genome),"_"), `[`, 2)
abs_pres_matrix$genome4 <- sapply(strsplit(as.character(abs_pres_matrix$Genome),"_"), `[`, 3)
abs_pres_matrix$BGC.Region <- sapply(strsplit(as.character(abs_pres_matrix$Genome),"_"), `[`, 4)

#Merging Previous Columns
abs_pres_matrix$Genome <- paste0(abs_pres_matrix$genome2, "_",
                                 abs_pres_matrix$genome3, "_",
                                 abs_pres_matrix$genome4)

#Removal of Final Columns
abs_pres_matrix<- abs_pres_matrix[, -c(3,4,5,6)]

binary_table <- table(abs_pres_matrix)
binary_table[binary_table>0] <- 1
binary_table<- as.data.frame.matrix(binary_table)

columns <- colnames(binary_table)
new_columns <- paste0(sapply(strsplit(columns,"_"), `[`, 1),'_', sapply(strsplit(columns,"_"), `[`, 2), '_', sapply(strsplit(columns,"_"), `[`, 3))
colnames(binary_table) <- new_columns


write.table(binary_table,args[2])
library(reshape2)
library(tidyverse)
library(stringr)
library(dplyr)
library(readr)
library(Biostrings)
library(data.table)
library(parallel)

cl <- makeCluster(30)
args = commandArgs(trailingOnly=TRUE)

tsv = data.table(read_table(args[1], col_names = F))
colnames(tsv) <- c('V1', 'V2')

n_clusters_mmseq = length(unique(tsv$V1))


tsv2 <- tsv[, .( V2 = str_c(V2, collapse = ',') ), by = V1]

#write.table(tsv2,'diamond/mmseq2_clusters.txt', row.names = F, quote = F)


'Open MCL output, collapse it and clean it\n'

mcl <- data.table(read_csv( args[2], col_names = F))

#Load unaligned headers 
unalin <- read.delim2(args[3], header = F)
unalin <- data.frame(X1 = str_remove_all(unalin$V1, '>'))

mcl <- dplyr::bind_rows(mcl, unalin)


#Name clusters

mcl$clusters <- seq(nrow(mcl))
mcl$clusters <- sprintf("cluster_%06d", mcl$clusters)


mcl_melt <- mcl[, c(X1 = strsplit(X1, "\t")), by = clusters]

'Join mmseq2 clusters with MCL clusters\n'

M <- merge(mcl_melt, tsv2, by.x= 'X1', by.y = 'V1')
M <- M[,c(2,3)]
colnames(M)[2] <- 'descriptions'
M <- data.table(M)
M <- M[, .( descriptions = str_c(descriptions, collapse = ',') ), by = clusters]
M <- M[order(clusters)]
M$gene <- sapply(strsplit(M$descriptions,","), `[`, 1)

#write.table(M, 'clusters_info.txt', row.names = F, sep = '\t', quote = F)

'Create 1/0 matrix\n' 

bacteria_name <- function(string){
  b <- stringr::str_split(string, ',')[[1]]
  c <- sapply(stringr::str_split(b,"[|]"), `[`, 1)
  d <- stringr::str_c(c, collapse = ",")
  return(d)
}

MEGAMATRIX <- M
MEGAMATRIX$descriptions <- parSapply(cl, MEGAMATRIX$descriptions , FUN = bacteria_name)
MEGAMATRIX$gene <- NULL



tbl <- table(splitstackshape:::cSplit(MEGAMATRIX, "descriptions", sep = ",", direction = "long"))
tbl <- data.table(as.data.frame(tbl))
tbl <-dcast.data.table(tbl, clusters ~ descriptions, value.var = 'Freq')

#lista <- as.list(MEGAMATRIX$descriptions)
#binary_matrix <- as.data.frame((splitstackshape:::charMat(listOfValues = lista, fill = 0L)))
#binary_matrix <- cbind(clusters = MEGAMATRIX$clusters, binary_matrix)
binary_matrix <- merge(tbl, M, by = 'clusters')



####Extract sequence of representative proteins
#system("bioawk -c fastx '{ print $name, length($seq) }' < intermediate_files/combined_proteins/combined_proteins.fasta >intermediate_files/combined_proteins/length_genes.txt")
gene_lengths <- read.table('intermediate_files/combined_proteins/length_genes.txt', header = F)

##Function to extract longest gene

extract_long_genes <- function(character, gene_length){
  a <- as.character(stringr::str_split(character, ',')[[1]])
  sub <- data.table::data.table(gene_length[gene_length$V1 %in% a,])
  longest_gene <- sub[which.max(sub$V2)]$V1
  return(longest_gene)
}

'Obtain representative genes\n'

genes_description <- binary_matrix$descriptions

new_rep <- parSapply(cl, genes_description, extract_long_genes, gene_length = gene_lengths)

binary_matrix$gene <- new_rep

write.table(binary_matrix, args[4], row.names = F)

'Obtain representative sequences\n'

representative_genes <- binary_matrix$gene
fasta <-  Biostrings::readAAStringSet(args[5])
subset <- fasta[representative_genes]
writeXStringSet(subset, args[6])

stopCluster(cl)
args = commandArgs(trailingOnly=TRUE)
##arg[1] is the binary cluster matrix (from the clustering rule)
##arg[2] is the cog annotation
##arg[3] is the kegg annotation
##arg[4] is the hmm annotation
##arg[5] is the prokka annotations
##arg[6] is the output name of the matrix with all annotations





'Reading input files'
matrix <- read.table(args[1], header = T)
clustervsgene <- matrix[,c('clusters', 'gene')]

###COG annotations-
cog <- read.table(args[2], header = T)

####KEGG annotations
kegg <- read.table(args[3], header = T)

###Prokka annotations
prokka <- read.table(args[5], header = T)

'Merge annotations COG/KEGG'
###Merge annotations
all_annotations <- merge(kegg, cog, by= 'gene', all = T)

'Merge with HMM annotations'
##Merge with hmm annotations
hmm_annotations <- read.table(args[4], header = T)
all_annotations <- merge(all_annotations, hmm_annotations, by= 'gene', all = T)

'Merge with prokka annotation'
##Merge with prokka
all_annotations <- merge(all_annotations, prokka, by.x= 'gene', by.y = 'geneid', all = T)



###Merge with absence_presence_matrix

'Merge with absence_presence_matrix'
all_annotations <- merge(clustervsgene, all_annotations, by = 'gene', all=T)
all_annotations$gene<- NULL
column_number= ncol(matrix)
matrix_info = matrix[,c(column_number -1 , column_number )]
matrix <- matrix[1:(column_number-2)]
matrix$completeness <- 'True'
matrix <- cbind(matrix, matrix_info)


MEGAMATRIX <- merge(matrix, all_annotations, by= 'clusters', all.x=T)
'Write MEGAMATRIX'
write.table(MEGAMATRIX, args[6], row.names = F)
library(dplyr)
library(stringr)
source('src/rhmmer_functions.R')
#arg[1] is the dbcan annotation
#arg[2] is the pfam annotation
#arg[3] is the dbcan family info
#arg[4] is the pfam family info
#arg[5] is the output table with both annotations


args = commandArgs(trailingOnly=TRUE)


###Extract top hits function
top_hits <- function(table){
  genes <- unique(table$domain_name)
  top_hits <- table[0,]
  for (i in 1:length(genes)){
    subtable <- table[table$domain_name == genes[i],]
    min_evalue <- subtable[which.min(subtable$sequence_evalue),]
    top_hits <- rbind(top_hits,min_evalue)
  }

  return(top_hits)
}


##DBCAN annotations
dbcan <- read_tblout(args[1])
dbcan <- top_hits(dbcan)
dbcan$query_name <- sapply(strsplit(dbcan$query_name,"_"), `[`, 1) 
dbcan$query_name <- str_remove(dbcan$query_name, '.hmm')

dbcan_description <- read.delim(args[3], header = F)
colnames(dbcan_description)[1]<- 'dbcan_family'
colnames(dbcan_description)[2]<- 'dbcan_description'


dbcan_final <- merge(dbcan[,c(1,3)], dbcan_description, by.x = 'query_name', by.y = 'dbcan_family', all.x=T)
colnames(dbcan_final)[1]<- 'dbcanid'
colnames(dbcan_final)[2]<- 'gene'
##Group different annotations of same gene
dbcan_final <- dbcan_final %>% group_by_at(vars(gene)) %>%
  summarize_all(paste, collapse=",")
###Check problems of family names with a _

##PFAM annotations
pfam <- read_tblout(args[2])
pfam <- top_hits(pfam)
pfam <- pfam[,c(1,4)]
pfam$query_accession <- sapply(strsplit(pfam$query_accession,"[.]"), `[`, 1)

pfam_description <- read.delim2(args[4], header = F)
pfam_description <-pfam_description[,c(1,5)]
colnames(pfam_description)[1] <- 'query_accession'
colnames(pfam_description)[2] <- 'pfam_description'

pfam_final <- merge(pfam, pfam_description, by = 'query_accession', all.x = T)
colnames(pfam_final)[1]<- 'pfamid'
colnames(pfam_final)[2]<- 'gene'
##Group different annotations of same gene
pfam_final <- pfam_final %>% group_by_at(vars(gene)) %>%
  summarize_all(paste, collapse=",")


###Merge both annotations in one table
hmm_annotations <- merge(pfam_final, dbcan_final, by= 'gene', all = T)
write.table(hmm_annotations, args[5], row.names=FALSE)
library(stringr)
library(readr)
args = commandArgs(trailingOnly=TRUE)


matrix <- read_delim(args[1], col_names = T, quote = "\"")

matrix <- as.data.frame(matrix)

names_equivalence <- read.table(args[3], header = T)
names_equivalence$Full_name <- str_remove(names_equivalence$Full_name, '_O.fna')
names_equivalence$MicroLife_name <- str_remove(names_equivalence$MicroLife_name, '_O.fna')

n_samples <- grep("completeness", colnames(matrix)) - 1

old_colnames <- data.frame(MicroLife_name = colnames(matrix)[2:n_samples] ) 

M <- merge(old_colnames, names_equivalence, by = 'MicroLife_name')

# reorder the merged table to follow the original column order of the matrix
M <- M[match(old_colnames$MicroLife_name, M$MicroLife_name),]
colnames(matrix)[2:n_samples] <- M$Full_name

write.table(matrix, args[4], row.names = F)


###Rename bigscape table
big_scape_matrix <- read.table(args[2], header = T)


old_colnames <- data.frame(MicroLife_name = colnames(big_scape_matrix) ) 

M <- merge(old_colnames, names_equivalence, by = 'MicroLife_name')

M <- M[match(old_colnames$MicroLife_name, M$MicroLife_name),]
colnames(big_scape_matrix) <- M$Full_name

write.table(big_scape_matrix, args[5], row.names = T)

###Create mapping_file
mapping_file = data.frame(Sample = names_equivalence$Full_name, Lifestyle = 'Unknown')
write.table(mapping_file, args[6], row.names = F)
import fnmatch
import os, glob
from Bio import SeqIO
import sys, traceback
import argparse

def rename_fasta(infile, tagfile, outfile, newtagfile):
    counters = {}
    tags = {}
    newseqs=[]

    t = open(tagfile,'r')
    tn = open(newtagfile,'w')

    for line in t:
        data = line.strip().split('\t')
        tags[data[0]] = data
    t.close()

    seqs = SeqIO.parse(infile, 'fasta')
    for s in seqs:
        data = tags[s.id]
        counter = counters.get(data[2], 0)
        counter = counter + 1
        idtag = "%s|%06d"%(data[2], counter)
        counters[data[2]]=counter
        s.id = idtag
        s.description = ""
        tn.write(idtag)
        tn.write('\t')
        tn.write('\t'.join(data))
        tn.write('\n')
        newseqs.append(s)
    tn.close()

    try:
        SeqIO.write((newseqs), outfile, 'fasta')
    except:
        print("Exception in user code:")
        print("-"*60)
        traceback.print_exc(file=sys.stdout)
        print("-"*60)


if __name__=='__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument('-f', '--fasta', dest = 'fasta', help = 'fasta input file', required = True)
    parser.add_argument('-t', '--tags', dest = 'tags', help = 'tags', required = True)
    parser.add_argument('-o', '--outfile', dest = 'outfile', help = 'outfile', required = True)
    parser.add_argument('-n', '--newtags', dest = 'newtags', help = 'tags', required = True)

    args = parser.parse_args()
    rename_fasta(args.fasta, args.tags, args.outfile, args.newtags)