Snakemake workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes)


PEP compatible

BGCFlow is a systematic workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes) from internal & public datasets.

At present, BGCFlow has only been tested and confirmed to work on Linux systems with the conda / mamba package manager.

Publication

Matin Nuhamunada, Omkar S. Mohite, Patrick V. Phaneuf, Bernhard O. Palsson, and Tilmann Weber. (2023). BGCFlow: Systematic pangenome workflow for the analysis of biosynthetic gene clusters across large genomic datasets. bioRxiv 2023.06.14.545018; doi: https://doi.org/10.1101/2023.06.14.545018

Pre-requisites

BGCFlow requires gcc and the conda / mamba package manager. See the installation instructions for details.

Please use the latest version of BGCFlow available.

Quick Start

A quick and easy way to use BGCFlow is via bgcflow_wrapper.

  1. Create a conda environment and install the BGCFlow Python wrapper:
# create and activate a new conda environment
conda create -n bgcflow pip -y
conda activate bgcflow
# install `BGCFlow` wrapper
pip install git+https://github.com/NBChub/bgcflow_wrapper.git
# make sure to use bgcflow_wrapper version >= 0.2.7
bgcflow --version
  2. Additional pre-requisites: with the environment activated, install or set up the following configurations:
  • Set the conda channel priority to disabled
conda config --set channel_priority disabled
conda config --describe channel_priority
  • Java (required to run Metabase)
conda install openjdk 
  3. Deploy and run BGCFlow, changing the <your_bgcflow_directory> value accordingly:
# Deploy and run BGCFlow
bgcflow clone <your_bgcflow_directory> # clone `BGCFlow` to your_bgcflow_directory
cd <your_bgcflow_directory> # move to BGCFLOW_PATH
bgcflow init # initiate `BGCFlow` config and examples from template
bgcflow run -n # do a dry run, remove the flag "-n" to run the example dataset
  4. Build and serve the interactive report (after bgcflow run has finished). The report will be served at http://localhost:8001/ :
# build a report
bgcflow build report
# show available projects
bgcflow serve
# serve interactive report
bgcflow serve --project Lactobacillus_delbrueckii
  • For detailed usage and configuration, have a look at the BGCFlow WIKI (under development) :warning:

  • Read more about bgcflow_wrapper for a detailed overview of the command line interface.

asciicast

Workflow overview

The main Snakefile workflow comprises various pipelines for data selection, functional annotation, phylogenetic analysis, genome mining, and comparative genomics of prokaryotic datasets.

dag

Available pipelines in the main Snakefile can be checked using the following command:

bgcflow pipelines

List of Available Pipelines

Here you can find the pipeline keywords that you can run using the main Snakefile of BGCFlow.

Keyword | Description | Links
eggnog | Annotate samples with eggNOG database (http://eggnog5.embl.de) | eggnog-mapper
mash | Calculate distance estimation for all samples using MinHash. | Mash
fastani | Do pairwise Average Nucleotide Identity (ANI) calculation across all samples. | FastANI
automlst-wrapper | Simplified Tree building using autoMLST | automlst-simplified-wrapper
roary | Build pangenome using Roary. | Roary
eggnog-roary | Annotate Roary output using eggNOG mapper | eggnog-mapper
seqfu | Calculate sequence statistics using SeqFu. | seqfu2
bigslice | Cluster BGCs using BiG-SLiCE (https://github.com/medema-group/bigslice) | bigslice
query-bigslice | Map BGCs to BiG-FAM database (https://bigfam.bioinformatics.nl/) | bigfam.bioinformatics.nl
checkm | Assess genome quality with CheckM. | CheckM
gtdbtk | Taxonomic placement with GTDB-Tk | GTDBTk
prokka-gbk | Copy annotated genbank results. | prokka
antismash | Summarizes antiSMASH result. | antismash
arts | Run Antibiotic Resistant Target Seeker (ARTS) on samples. | arts
deeptfactor | Use deep learning to find Transcription Factors. | deeptfactor
deeptfactor-roary | Use DeepTFactor on Roary outputs. | Roary
cblaster-genome | Build diamond database of genomes for cblaster search. | cblaster
cblaster-bgc | Build diamond database of BGCs for cblaster search. | cblaster
bigscape | Cluster BGCs using BiG-SCAPE | BiG-SCAPE

References

  • eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Carlos P. Cantalapiedra, Ana Hernandez-Plaza, Ivica Letunic, Peer Bork, Jaime Huerta-Cepas. 2021. Molecular Biology and Evolution, msab293

  • eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Jaime Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars J Jensen, Christian von Mering, Peer Bork Nucleic Acids Res. 2019 Jan 8; 47(Database issue): D309–D314. doi: 10.1093/nar/gky1085

  • Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x.

  • Mash Screen: high-throughput sequence containment estimation for genome discovery. Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM. Genome Biol. 2019 Nov 5;20(1):232. doi: 10.1186/s13059-019-1841-x.

  • Jain, C., Rodriguez-R, L.M., Phillippy, A.M. et al. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 9, 5114 (2018). https://doi.org/10.1038/s41467-018-07641-9

  • Mohammad Alanjary, Katharina Steinke, Nadine Ziemert, AutoMLST: an automated web server for generating multi-locus species trees highlighting natural product potential, Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W276–W282

  • Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill, 'Roary: Rapid large-scale prokaryote pan genome analysis', Bioinformatics, 2015;31(22):3691-3693 doi:10.1093/bioinformatics/btv421

  • Telatin, A., Birolo, G., & Fariselli, P. SeqFu [Computer software]. GITHUB: https://github.com/telatin/seqfu2

  • Satria A Kautsar, Justin J J van der Hooft, Dick de Ridder, Marnix H Medema, BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters, GigaScience, Volume 10, Issue 1, January 2021, giaa154

  • Satria A Kautsar, Kai Blin, Simon Shaw, Tilmann Weber, Marnix H Medema, BiG-FAM: the biosynthetic gene cluster families database, Nucleic Acids Research, gkaa812, https://doi.org/10.1093/nar/gkaa812

  • Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2014. Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25: 1043-1055.

  • Chaumeil PA, et al. 2019. GTDB-Tk: A toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics, btz848.

  • Parks DH, et al. 2020. A complete domain-to-species taxonomy for Bacteria and Archaea. Nature Biotechnology, https://doi.org/10.1038/s41587-020-0501-8 .

  • Parks DH, et al. 2018. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature Biotechnology, http://dx.doi.org/10.1038/nbt.4229 .

  • Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014 Jul 15;30(14):2068-9. PMID:24642063

  • antiSMASH 6.0: improving cluster detection and comparison capabilities. Kai Blin, Simon Shaw, Alexander M Kloosterman, Zach Charlop-Powers, Gilles P van Weezel, Marnix H Medema, & Tilmann Weber. Nucleic Acids Research (2021) doi: 10.1093/nar/gkab335.

  • Mungan,M.D., Alanjary,M., Blin,K., Weber,T., Medema,M.H. and Ziemert,N. (2020) ARTS 2.0: feature updates and expansion of the Antibiotic Resistant Target Seeker for comparative genome mining. Nucleic Acids Res.,10.1093/nar/gkaa374

  • Alanjary,M., Kronmiller,B., Adamek,M., Blin,K., Weber,T., Huson,D., Philmus,B. and Ziemert,N. (2017) The Antibiotic Resistant Target Seeker (ARTS), an exploration engine for antibiotic cluster prioritization and novel drug target discovery. Nucleic Acids Res.,10.1093/nar/gkx360

  • Kim G.B., Gao Y., Palsson B.O., Lee S.Y. 2020. DeepTFactor: A deep learning-based tool for the prediction of transcription factors. PNAS. doi: 10.1073/pnas.2021171118

  • Gilchrist, C., Booth, T. J., van Wersch, B., van Grieken, L., Medema, M. H., & Chooi, Y. (2021). cblaster: a remote search tool for rapid identification and visualisation of homologous gene clusters (Version 1.3.9) [Computer software]. https://doi.org/10.1101/2020.11.08.370601

  • Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015) .

  • Navarro-Muñoz, J.C., Selem-Mojica, N., Mullowney, M.W. et al. A computational framework to explore large-scale biosynthetic diversity. Nat Chem Biol 16, 60–68 (2020)

Code Snippets

import json
import logging
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def generate_arts_dict(df_arts):
    """
    Convert arts table to dictionary
    """
    arts_dict = {}
    arts_scaffold_count = df_arts.Source.value_counts().to_dict()
    for k in arts_scaffold_count.keys():
        arts_dict[k] = {"counts": arts_scaffold_count[k]}
        arts_dict[k]["regions"] = []
        for i in df_arts[df_arts.loc[:, "Source"] == k].index:
            regions = {
                "cluster_id": df_arts.loc[i, "#Cluster"],
                "products": df_arts.loc[i, "Type"],
                "location": df_arts.loc[i, "Location"],
            }
            arts_dict[k]["regions"].append(regions)
    return arts_dict


def generate_arts_mapping(df_arts, as_json):
    arts_dict = generate_arts_dict(df_arts)
    # container for final result
    hit_mapper = {}
    contig_mapper = {}

    # iterate antismash json records
    for num, r in enumerate(as_json["records"]):
        # count number of detected regions per record
        region_count = len(r["areas"])
        # if antismash detects region
        if region_count > 0:
            contig_id = r["id"]
            logging.info(
                f"Contig {contig_id} has {region_count} regions detected. Finding corresponding scaffold in ARTS2..."
            )
            # find arts scaffold with the same region number detected
            arts_match = [
                k
                for k in arts_dict.keys()
                if int(arts_dict[k]["counts"]) == int(region_count)
            ]
            logging.debug(f"Finding matches from: {arts_match}")
            for n, a in enumerate(r["areas"]):
                bgc_id = f"{contig_id}.region{str(n+1).zfill(3)}"
                location = f"{a['start']} - {a['end']}"
                products = ",".join(a["products"])
                logging.debug(f"Finding match for: {bgc_id} | {location} | {products}")
                for m in arts_match:
                    bgc_match = arts_dict[m]["regions"][n]
                    if (bgc_match["location"] == location) and (
                        bgc_match["products"] == products
                    ):
                        logging.debug(
                            f"Found match! {bgc_match['cluster_id']} | {bgc_match['location']} | {bgc_match['products']}"
                        )
                        # logging.debug(f"Found match! {contig_id} == ARTS2 {m}")
                        hit_mapper[bgc_match["cluster_id"]] = bgc_id
                        contig_mapper[bgc_match["cluster_id"]] = contig_id
    return hit_mapper, arts_dict, contig_mapper


def extract_arts_table(input_table, input_as_json, outfile, genome_id=None):
    """
    Given an ARTS2 bgc table, map the corresponding cluster hits to antismash regions
    """
    df_arts = pd.read_csv(input_table, sep="\t")

    with open(input_as_json, "r") as f:
        as_json = json.load(f)

    if genome_id is None:
        logging.info("Assuming genome_id from ARTS2 input")
        genome_id = Path(input_table).parents[1].stem

    logging.info(f"Extracting ARTS2 BGC hits from: {genome_id}")

    logging.info("Generating ARTS2 to antiSMASH region mapper...")
    hit_mapper, arts_dict, contig_mapper = generate_arts_mapping(df_arts, as_json)

    logging.info("Mapping ARTS2 cluster to antiSMASH regions...")
    df = pd.DataFrame(columns=df_arts.columns)
    for i in df_arts.index:
        # only extract hits
        if len(df_arts.loc[i, "Genelist"]) > 2:
            cluster_id = df_arts.loc[i, "#Cluster"]
            bgc_id = hit_mapper[cluster_id]
            contig_id = contig_mapper[cluster_id]
            df.loc[i, :] = df_arts.loc[i, :]
            df.loc[i, "#Cluster"] = bgc_id
            df.loc[i, "Source"] = contig_id
            df.loc[i, "genome_id"] = genome_id
    df = df.rename(columns={"#Cluster": "bgc_id"}).set_index("bgc_id")

    logging.info(f"Writing results to {outfile}")
    df.T.to_json(outfile)
    return


if __name__ == "__main__":
    extract_arts_table(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
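
The script above maps ARTS2 cluster hits onto antiSMASH region ids for one genome. Below is a minimal usage sketch mirroring the __main__ block; all file paths are hypothetical placeholders (in the workflow, Snakemake supplies the real paths).

# Hypothetical example call; equivalent to:
#   python <this_script>.py <arts_bgctable.tsv> <antismash.json> <out.json> <genome_id>
extract_arts_table(
    "arts/GCF_000000000.1/tables/bgctable.tsv",        # ARTS2 BGC table (placeholder path)
    "antismash/GCF_000000000.1/GCF_000000000.1.json",  # antiSMASH JSON (placeholder path)
    "arts/GCF_000000000.1_arts.json",                  # output JSON
    genome_id="GCF_000000000.1",                       # optional; otherwise deduced from the input path
)
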
import json
import logging
import sys
from pathlib import Path

import pandas as pd
from alive_progress import alive_bar

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def generate_change_dict(change_log):
    logging.info("Generate BGC ids mapping dict...")
    change_log = Path(change_log)
    change_log = [i for i in change_log.glob("*/*.json")]
    change_dict = {}
    for i in change_log:
        with open(i, "r") as f:
            data = json.load(f)
            for k in data.keys():
                change_dict[k] = data[k]
    return change_dict


def correct_arts_bgc_ids(arts_json, change_dict, genome_id):
    logging.debug(f"Correcting BGC ids for {genome_id}")
    output = {}
    with open(arts_json, "r") as f:
        query = json.load(f)
        for k in query.keys():
            correct_bgc_id = Path(
                change_dict[genome_id][f"{k}.gbk"]["symlink_path"]
            ).stem
            output[correct_bgc_id] = query[k]
    return output


def combine_arts_json(input_json, change_log_path, table):
    logging.info("Combining and correcting ARTS output...")

    input_json = Path(input_json)
    logging.info(input_json)
    if input_json.is_file() and input_json.suffix == ".json":
        logging.info(f"Getting ARTS overview from a single file: {input_json}")
        input_json_files = [input_json]  # wrap in a list so the merge loop below can iterate over it

    elif input_json.is_file() and input_json.suffix == ".txt":
        logging.info(f"Getting ARTS overview  from a text file: {input_json}")
        with open(input_json, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            input_json_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        input_json_files = [
            Path(file)
            for file in str(input_json).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Getting ARTS overview from the given list of {len(input_json_files)} files..."
        )

    container = {}
    change_dict = generate_change_dict(change_log_path)

    with alive_bar(len(input_json_files), title="Merging json:") as bar:
        for j in input_json_files:
            arts_json = Path(j)
            genome_id = arts_json.stem
            value = correct_arts_bgc_ids(arts_json, change_dict, genome_id)
            container.update(value)
            bar()

    df = pd.DataFrame.from_dict(container).T
    df.index.name = "bgc_id"

    logging.debug(f"Writing file to: {table}")

    # Save dataframes to csv tables
    df_table = Path(table)
    df_table.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(table)
    logging.info("Job done")
    return None


if __name__ == "__main__":
    combine_arts_json(sys.argv[1], sys.argv[2], sys.argv[3])
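
As the branching above shows, the first argument can be a single .json file, a .txt file listing JSON paths (one per line or space-separated), or a space-separated string of paths. A sketch with hypothetical arguments:

# Hypothetical example call: merge per-genome ARTS JSONs into one table,
# using the antiSMASH change-log directory to restore the original BGC ids.
combine_arts_json(
    "arts/json_paths.txt",         # text file listing the per-genome ARTS JSONs (placeholder)
    "antismash/change_logs/",      # directory with <genome_id>/<genome_id>-change_log.json files
    "tables/df_arts_allhits.csv",  # output csv (placeholder name)
)
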
import logging
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.INFO)


def csv_to_parquet(project_folder, output_folder="."):
    """
    Given a list of csv files, convert into parquet files
    """
    project_folder = Path(project_folder)
    if output_folder == ".":
        output_folder = project_folder / "data_warehouse"
    else:
        pass

    logging.info(
        f"Grabbing all csv from folder {project_folder} and saving parquets in {output_folder}"
    )
    for i in project_folder.rglob("*.csv"):
        if "docs" in str(i):
            pass
        elif "dbt" in str(i):
            pass
        elif "assets" in str(i):
            pass
        elif "ipynb_checkpoints" in str(i):
            pass
        else:
            category = str(i).split("/")[3]
            output_subfolder = output_folder / category
            df = pd.read_csv(i)
            # df = df.fillna("")
            # df.columns = [c.replace(".", "_") for c in df.columns]
            output_parquet = output_subfolder / f"{i.stem}.parquet"
            logging.info(f"Converting {i} to {output_parquet}")
            output_subfolder.mkdir(parents=True, exist_ok=True)
            df.to_parquet(output_parquet)

    return


if __name__ == "__main__":
    csv_to_parquet(sys.argv[1])
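
Note that the category for each parquet file is taken from the fourth path segment (str(i).split("/")[3]), so the script assumes it is run from a location where the csv files sit at that path depth. A hypothetical call:

# Hypothetical example call: mirror every csv under the project folder as parquet
# files inside <project_folder>/data_warehouse/<category>/.
csv_to_parquet("data/processed/Lactobacillus_delbrueckii")  # placeholder project folder
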
import json
import logging
import sys
from pathlib import Path

from Bio import SeqIO

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def bgc_downstream_prep(input_dir, output_dir, selected_bgcs=False):
    """
    Given an antiSMASH directory, check for changed name
    """
    logging.info(f"Reading input directory: {input_dir}")
    path = Path(input_dir)

    if not path.is_dir():
        raise FileNotFoundError(f"No such file or directory: {path}")

    genome_id = path.name
    logging.debug(f"Deducting genome id as {genome_id}")

    change_log = {genome_id: {}}

    for gbk in path.glob("*.gbk"):
        logging.info(f"Parsing file: {selected_bgcs}")
        logging.info(f"Parsing file: {gbk.name}")
        region = SeqIO.parse(str(gbk), "genbank")
        for record in region:
            record_log = {}
            if "comment" in record.annotations:
                filename = gbk.name
                try:
                    original_id = record.annotations["structured_comment"][
                        "antiSMASH-Data"
                    ]["Original ID"].split()[0]
                except KeyError:
                    original_id = record.id
                    logging.warning(
                        f"Found shortened record.id: {record.id} <- {original_id}."
                    )

                # generate symlink
                outpath = Path(output_dir) / genome_id
                outpath.mkdir(parents=True, exist_ok=True)

                new_filename = filename.replace(record.id, original_id)
                target_path = Path.cwd() / gbk  # target for symlink

                link = outpath / new_filename

                logging.info(f"Generating symlink: {link}")
                try:
                    link.symlink_to(target_path)
                except FileExistsError:
                    logging.warning(
                        f"Previous symlink exist, updating target: {link} -> {target_path}"
                    )
                    link.unlink()
                    link.symlink_to(target_path)

                record_log["record_id"] = record.id
                record_log["original_id"] = original_id
                record_log["target_path"] = str(gbk)
                record_log["symlink_path"] = str(link)

            change_log[genome_id][filename] = record_log

    with open(
        outpath / f"{genome_id}-change_log.json", "w", encoding="utf8"
    ) as json_file:
        json.dump(change_log, json_file, indent=4)

    logging.info(f"{genome_id}: Job done!\n")
    return


if __name__ == "__main__":
    bgc_downstream_prep(sys.argv[1], sys.argv[2])
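
The input is a per-genome antiSMASH result directory (the directory name is used as the genome id); the output directory receives symlinks to the region GenBank files under their original record ids plus a change-log JSON. A sketch with placeholder paths:

# Hypothetical example call:
bgc_downstream_prep(
    "antismash/GCF_000000000.1",  # antiSMASH output folder for one genome (placeholder)
    "bgcs/",                      # destination for renamed symlinks and the change-log JSON
)
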
import logging
import sys
from pathlib import Path

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def bigscape_copy(input_index, output_main):
    input = Path(input_index)
    output = Path(output_main)
    output.mkdir(parents=True, exist_ok=True)
    for item in input.parent.glob("*"):
        target = output / item.name
        logging.debug(f"Generating symlink for: {target} --> {item.resolve()}")
        try:
            target.symlink_to(item.resolve(), target_is_directory=True)
        except FileExistsError as e:
            logging.debug(f"Got error:\n{e}")
    return


if __name__ == "__main__":
    bigscape_copy(sys.argv[1], sys.argv[2])
import sys

import pandas as pd


def prep_bigslice(df_gtdb_output, df_bigslice_output):
    df = pd.read_csv(df_gtdb_output)
    # Format to BiG-Slice
    df_bigslice = df.rename(columns={"genome_id": "#Genome Folder"})
    df_bigslice["#Genome Folder"] = df_bigslice["#Genome Folder"].apply(
        lambda x: f"{x}/"
    )
    df_bigslice.to_csv(df_bigslice_output, index=False, sep="\t")
    return None


if __name__ == "__main__":
    prep_bigslice(sys.argv[1], sys.argv[2])
import logging
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)

# create intermediate folder


def prepare_bigslice(input_folder, taxonomy_file, project_name, output_folder):
    """
    Preparing bigslice input folder for a single project
    """
    logging.info(f"Preparing bigslice input folder for {project_name}")
    input_folder = Path(input_folder)
    output_folder = Path(output_folder)
    assert input_folder.is_dir()
    output_folder.mkdir(parents=True, exist_ok=True)

    # create symlink to antismash
    logging.info("Creating symlink to AntiSMASH BGCs...")
    antismash_dir = output_folder / project_name
    try:
        antismash_dir.symlink_to(input_folder.resolve(), target_is_directory=True)
    except FileExistsError:
        pass

    # create tax info
    logging.info("Getting taxonomic information...")
    taxonomy_dir = output_folder / "taxonomy"
    taxonomy_dir.mkdir(parents=True, exist_ok=True)
    df_tax = pd.read_csv(taxonomy_file, sep="\t")
    df_tax = df_tax.loc[
        :,
        [
            "#Genome Folder",
            "Domain",
            "Phylum",
            "Class",
            "Order",
            "Family",
            "Genus",
            "Species",
            "Organism",
        ],
    ]
    df_tax.to_csv(taxonomy_dir / f"{project_name}.tsv", sep="\t", index=False)

    # build metadata
    logging.info("Preparing metadata...")
    metadata = {
        "# Dataset name": project_name,
        "Path to folder": f"{output_folder.name}/",
        "Path to taxonomy": f"taxonomy/{project_name}.tsv",
        "Description": project_name,
    }
    df_metadata = pd.DataFrame.from_dict(metadata, orient="index").T
    df_metadata.to_csv(output_folder / "datasets.tsv", sep="\t", index=False)
    logging.info("Job done!")
    return


if __name__ == "__main__":
    prepare_bigslice(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
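
This prepares the folder layout BiG-SLiCE expects: a symlink to the antiSMASH results, a taxonomy/ table, and a datasets.tsv metadata file. The taxonomy file is the tab-separated table with a "#Genome Folder" column produced by the small reformatting script above. A sketch with placeholder arguments:

# Hypothetical example call:
prepare_bigslice(
    "antismash/",                   # folder with antiSMASH results (placeholder)
    "tables/df_gtdb_bigslice.tsv",  # taxonomy table with a "#Genome Folder" column (placeholder)
    "Lactobacillus_delbrueckii",    # dataset / project name
    "bigslice/input/",              # output folder to be fed to BiG-SLiCE
)
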
import logging
import sys

import numpy as np
import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def triangle_convert(matrix, outfile):
    df_raw = pd.read_csv(matrix, index_col=0)
    genome_id_list = []
    for idx in df_raw.index:
        genome_id = idx.split("\t")[0].split("/")[-1].split(".fna")[0]
        genome_id_list.append(genome_id)

    df = pd.DataFrame(0, index=genome_id_list, columns=genome_id_list)
    for idx in df_raw.index:
        genome_id = idx.split("\t")[0].split("/")[-1].split(".fna")[0]
        for cntr in range(len(idx.split("\t"))):
            if cntr > 0:
                value = idx.split("\t")[cntr]

                # if value is NA, replace with numpy null
                if value == "NA":
                    value = np.nan
                else:
                    value = float(value)

                df.loc[genome_id, genome_id_list[cntr - 1]] = value
                df.loc[genome_id_list[cntr - 1], genome_id] = value
    df.index.name = "genome_id"
    df.to_csv(outfile)
    return


if __name__ == "__main__":
    triangle_convert(sys.argv[1], sys.argv[2])
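
The parsing above (splitting each index entry on tabs and stripping a .fna suffix) suggests the input is the lower-triangular distance matrix written by mash triangle; the script mirrors it into a full square matrix indexed by genome_id. A sketch, assuming that input format and placeholder paths:

# Hypothetical example call: square-up a mash triangle output.
triangle_convert(
    "mash/triangle_distance_matrix.tsv",  # lower-triangular matrix (placeholder path)
    "tables/df_mash.csv",                 # symmetric genome_id x genome_id matrix (placeholder)
)
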
import json
import logging
import sys
from pathlib import Path

import pandas as pd
from alive_progress import alive_bar

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def combine_deeptfactor_prediction(input_json, filter_str="_deeptfactor"):
    container = {}
    logging.info("Reading json files...")
    with alive_bar(len(input_json), title="Merging DeepTFactor prediction:") as bar:
        for item in input_json:
            item = Path(item)
            genome_id = item.stem
            if filter_str in genome_id:
                genome_id = genome_id.replace(filter_str, "")
            logging.debug(f"Processing {genome_id}")
            with open(item, "r") as f:
                data = json.load(f)
                container.update(data)
            bar()
    return container


def write_deeptf_table(input_json, deeptf_table):
    """
    Write df_deeptfactor.csv table in processed data
    """
    # Handle multiple json
    input_json = Path(input_json)
    logging.info(input_json)
    if input_json.is_file() and input_json.suffix == ".json":
        logging.info(f"Getting deepTFactor overview from a single file: {input_json}")
        input_json_files = [input_json]  # wrap in a list so it can be iterated below

    elif input_json.is_file() and input_json.suffix == ".txt":
        logging.info(f"Getting deepTFactor overview from a text file: {input_json}")
        with open(input_json, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            input_json_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        input_json_files = [
            Path(file)
            for file in str(input_json).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Getting deepTFactor overview from the given list of {len(input_json_files)} files..."
        )

    df = combine_deeptfactor_prediction(input_json_files)
    df = pd.DataFrame.from_dict(df).T
    df.index.name = "locus_tag"

    logging.debug(f"Writing file to: {deeptf_table}")

    # Save dataframes to csv tables
    df_table = Path(deeptf_table)
    df_table.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(deeptf_table, index=True)
    logging.info("Job done")
    return None


if __name__ == "__main__":
    write_deeptf_table(sys.argv[1], sys.argv[2])
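
Input handling mirrors the other merge scripts: a single .json, a .txt listing JSON paths, or a space-separated string of paths. A sketch with hypothetical paths:

# Hypothetical example call: merge per-genome DeepTFactor JSONs into one csv.
write_deeptf_table(
    "deeptfactor/json_paths.txt",  # text file listing per-genome prediction JSONs (placeholder)
    "tables/df_deeptfactor.csv",   # merged table indexed by locus_tag
)
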
import sys
from pathlib import Path

import pandas as pd


def deeptfactor_to_json(prediction, outfile, keep_false=True):
    genome_id = Path(prediction).parent.stem
    df = pd.read_csv(prediction, sep="\t")
    df.loc[:, "genome_id"] = genome_id
    # rename columns
    df = df.rename(
        columns={"prediction": "deeptfactor_prediction", "score": "deeptfactor_score"}
    )
    df = df.set_index("sequence_ID", drop=True)
    if keep_false:
        pass
    else:
        df = df[df["deeptfactor_prediction"]]  # keep only rows predicted to be transcription factors
    df.T.to_json(outfile)
    return


if __name__ == "__main__":
    deeptfactor_to_json(sys.argv[1], sys.argv[2])
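
The prediction argument is the tab-separated DeepTFactor result table (with sequence_ID, prediction, and score columns); the genome id is taken from the parent directory name. A sketch with placeholder paths:

# Hypothetical example call: convert one genome's DeepTFactor table to JSON,
# keeping only positive predictions.
deeptfactor_to_json(
    "deeptfactor/GCF_000000000.1/prediction_result.txt",  # placeholder path; parent folder = genome id
    "deeptfactor/GCF_000000000.1.json",
    keep_false=False,
)
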
import logging
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def extract_ncbi_information(assembly_report_path, outfile):
    """
    Extract NCBI assembly json reports generated by get_assembly_information.py in a given directory into a csv file

    Parameters
    ----------
    1. assembly_report_path : str / path
        Four different input types can be used:
        - A directory containing assembly information json files generated by get_assembly_information.py
        - A single json file generated by get_assembly_information.py
        - A list of files as string separated by spaces (put the string inside '' in bash expression)
        - A text file (.txt) containing the paths of the json files (one per line, or space-separated on a single line)
    2. outfile : str / path
        Location of the output csv file

    Returns
    -------
    1. outfile : .csv file
        A comma-separated table summarizing the NCBI assembly reports metadata
    """

    assembly_report_path = Path(assembly_report_path)
    outfile = Path(outfile)

    if assembly_report_path.is_dir():
        logging.info(
            f"Summarizing NCBI assembly information from {assembly_report_path}..."
        )
        ncbi_json_files = [
            file
            for file in assembly_report_path.iterdir()
            if file.is_file() and file.suffix == ".json"
        ]
    elif assembly_report_path.is_file() and assembly_report_path.suffix == ".json":
        logging.info(
            f"Getting NCBI assembly information from a single file: {assembly_report_path}"
        )
        ncbi_json_files = [assembly_report_path]
    elif assembly_report_path.is_file() and assembly_report_path.suffix == ".txt":
        logging.info(
            f"Reading NCBI assembly information from a text file: {assembly_report_path}"
        )
        with open(assembly_report_path, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            ncbi_json_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        ncbi_json_files = [
            Path(file)
            for file in str(assembly_report_path).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Summarizing NCBI assembly information from the given list of {len(ncbi_json_files)} files..."
        )

    dfs = []  # an empty list to store the data frames
    for file in ncbi_json_files:
        data = pd.read_json(file).T  # read data frame from json file
        dfs.append(data)  # append the data frame to the list

    df_ncbi = pd.concat(dfs)  # concatenate all the data frames in the list.
    df_ncbi.reset_index(inplace=True)
    df_ncbi = df_ncbi.rename(columns={"index": "genome_id"})

    logging.info(f"Summarized {len(df_ncbi.index)} NCBI assembly information.")
    df_ncbi.to_csv(outfile, index=False)
    return None


if __name__ == "__main__":
    extract_ncbi_information(sys.argv[1], sys.argv[2])
import sys

import pandas as pd


def extract_ncbi_information(
    ncbi_meta_path, patric_genome_summary, patric_genome_metadata, patric_meta_path
):
    """
    Extract PATRIC assembly json reports generated by get_assembly_information.py in a given directory into a csv file

    Parameters
    ----------
    1. assembly_report_path : str / path
        Three different input types can be used:
        - A directory containing assembly information json files generated by get_assembly_information.py
        - A single json file generated by get_assembly_information.py
        - A list of files as string separated by spaces (put the string inside '' in bash expression)
    2. outfile : str / path
        Location of the output csv file

    Returns
    -------
    1. outfile : .csv file
        A comma separated table summarizing the NCBI assembly reports metadata
    '''
    """

    df_ncbi = pd.read_csv(ncbi_meta_path, index_col="genome_id")
    df_patric_genome_summary = pd.read_csv(
        patric_genome_summary,
        sep="\t",
        dtype={"genome_id": str, "genome_name": str, "taxon_id": str, "strain": str},
    )
    df_patric_genome_summary.set_index("genome_id", inplace=True)

    df_patric_genome_metadata = pd.read_csv(
        patric_genome_metadata,
        sep="\t",
        dtype={"genome_id": str, "genome_name": str, "taxon_id": str},
    )
    df_patric_genome_metadata.set_index("genome_id", inplace=True)

    df_patric_genome_metadata = df_patric_genome_metadata[
        df_patric_genome_metadata["genome_status"] != "Plasmid"
    ]

    df_patric_meta = pd.DataFrame()

    for genome_id in df_ncbi.index:
        refseq = df_ncbi.loc[genome_id, "refseq"]
        genbank = df_ncbi.loc[genome_id, "genbank"]
        if refseq in df_patric_genome_metadata["assembly_accession"].tolist():
            patric_genome_ids_list = df_patric_genome_metadata[
                df_patric_genome_metadata["assembly_accession"] == refseq
            ].index.tolist()
        elif genbank in df_patric_genome_metadata["assembly_accession"].tolist():
            patric_genome_ids_list = df_patric_genome_metadata[
                df_patric_genome_metadata["assembly_accession"] == genbank
            ].index.tolist()

        for patric_genome_id in patric_genome_ids_list:
            df_patric_meta.loc[patric_genome_id, "bgcflow_genome_id"] = genome_id
            df_patric_meta.loc[
                patric_genome_id, df_patric_genome_metadata.columns
            ] = df_patric_genome_metadata.loc[
                patric_genome_id, df_patric_genome_metadata.columns
            ]
            if patric_genome_id in df_patric_genome_summary.index:
                for col in df_patric_genome_summary.columns:
                    if col not in df_patric_genome_metadata.columns:
                        df_patric_meta.loc[
                            patric_genome_id, col
                        ] = df_patric_genome_summary.loc[patric_genome_id, col]

    df_patric_meta.index.name = "patric_genome_id"
    df_patric_meta.to_csv(patric_meta_path)

    return None


if __name__ == "__main__":
    extract_ncbi_information(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
import json
import logging
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def summarize_gtdb_json(accession_list, df_gtdb_output):
    """
    Given a string of json filepaths generated by gtdb_prep.py,
    merge them together into one csv table
    """

    # Reading input
    logging.info("Reading GTDB metadata .json files...")

    input_json = Path(accession_list)

    if input_json.is_file() and input_json.suffix == ".json":
        logging.info(f"Getting GTDB taxonomy from a single file: {input_json}")
        input_json_files = [input_json]  # wrap in a list so the loop below can iterate over it

    elif input_json.is_file() and input_json.suffix == ".txt":
        logging.info(f"Getting GTDB taxonomy from a text file: {input_json}")
        with open(input_json, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            input_json_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        input_json_files = [
            Path(file)
            for file in str(input_json).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Getting GTDB_taxonomy from the given list of {len(input_json_files)} files..."
        )

    accession = input_json_files
    out = []
    for a in accession:
        logging.info(f"Reading {a}...")
        with open(a, "r") as f:
            out.append(json.load(f))
    df = pd.DataFrame(out).set_index("genome_id", drop=False)

    # Getting taxonomic information
    logging.info("Getting taxonomic information...")
    df_taxonomy = pd.DataFrame.from_dict(
        {i: df.loc[i, "gtdb_taxonomy"] for i in df.index}
    ).T
    # Split between species (specific epithet) and organism
    df_taxonomy["organism"] = df_taxonomy["species"]
    for idx in df_taxonomy.index:
        try:
            df_taxonomy.loc[idx, "species"] = df_taxonomy.loc[idx, "species"].split(
                " "
            )[1]
        except IndexError:  # leave blank for empty taxonomy
            pass
    # Tidying up
    df_taxonomy.columns = [c.title() for c in df_taxonomy.columns]

    # Getting other metadata
    try:
        logging.info("Getting metadata into table...")
        if "metadata" not in df.columns:
            logging.warning(
                "metadata is not in the column information. Adding default values..."
            )
            df["metadata"] = [{"genome_id": genome_id} for genome_id in df.index]
        if "gtdb_release" not in df.columns:
            logging.warning(
                "gtdb_release is not in the column information. Adding default values..."
            )
            df["gtdb_release"] = "unknown"
        metadata = pd.DataFrame.from_dict(
            {i: df.loc[i, "metadata"] for i in df.index}
        ).T

        metadata_container = []
        for c in metadata.columns:
            value = {i: metadata.loc[i, c] for i in metadata.index}
            try:
                metadata_container.append(pd.DataFrame.from_dict(value).T)
            except ValueError:
                metadata_container.append(
                    pd.DataFrame([value]).T.rename(columns={0: c})
                )
        df_metadata = pd.concat(metadata_container, axis=1)

        logging.info("Finalizing table...")
        df_final = pd.concat(
            [df.loc[:, ["genome_id", "gtdb_release"]], df_taxonomy, df_metadata], axis=1
        )

    except KeyError:
        logging.info("No additional metadata found.")
        logging.info("Finalizing table...")
        df_final = pd.concat(
            [df.loc[:, ["genome_id", "gtdb_release"]], df_taxonomy], axis=1
        )

    # save to file
    logging.info(f"Writing to file: {df_gtdb_output}")
    df_final.to_csv(df_gtdb_output, index=False)
    return None


if __name__ == "__main__":
    summarize_gtdb_json(sys.argv[1], sys.argv[2])
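
The accession_list argument follows the same convention as the other merge scripts (a single .json, a .txt of paths, or a space-separated string of paths), where each JSON is the per-genome metadata written by gtdb_prep.py. A sketch with placeholder paths:

# Hypothetical example call: merge per-genome GTDB JSONs into one taxonomy table.
summarize_gtdb_json(
    "gtdb/json_paths.txt",      # text file listing the gtdb_prep.py outputs (placeholder)
    "tables/df_gtdb_meta.csv",  # merged taxonomy / metadata table (placeholder name)
)
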
import json
import logging
import re
import subprocess
import sys
from datetime import datetime
from pathlib import Path

import pandas as pd
from Bio import SeqIO

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def get_git_version():
    """
    Get the sha1 of the current git version
    """

    git_version = ""
    try:
        version_cmd = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            universal_newlines=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        status_cmd = subprocess.run(
            ["git", "status", "--porcelain"],
            universal_newlines=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )

        git_version = str(version_cmd.stdout.strip())
        changes = str(status_cmd.stdout).strip()
        if changes:
            git_version += "(changed)"
    except OSError:
        pass

    return git_version


def get_version(version):
    """
    Get the current version string
    """

    git_version = get_git_version()
    if git_version:
        version += "-%s" % git_version
    return version


def add_bgcflow_comments(gbk_in_path, version, json_path, genome_id, gbk_out_path):
    """
    Add bgcflow meta-annotation to genbank output
    """

    version = get_version(version)

    logging.info(f"Formatting genbank file: {gbk_in_path}")

    temp_gbk = Path(gbk_in_path).parent / f"{genome_id}-change_log.gbk"
    try:
        [record for record in SeqIO.parse(gbk_in_path, "genbank")]
    except ValueError as e:
        logging.warning(f"Parsing fail: {e}")
        logging.info("Attempting to fix genbank file...")
        change_log_path = Path(gbk_in_path).parent / f"{genome_id}-change_log.json"
        change_log = correct_collided_headers(
            gbk_in_path, genome_id, temp_gbk, json_dump=change_log_path
        )
        logging.info("Retry parsing with Bio.SeqIO...")

    if temp_gbk.is_file():
        records = SeqIO.parse(temp_gbk, "genbank")
        temp_gbk.unlink()
    else:
        records = SeqIO.parse(gbk_in_path, "genbank")

    df_gtdb_meta = pd.read_json(json_path, orient="index").T
    df_gtdb_meta.fillna("Unclassified", inplace=True)
    df_tax = pd.DataFrame([i for i in df_gtdb_meta.gtdb_taxonomy])
    for col in df_tax.columns:
        df_tax[col] = df_tax[col].apply(lambda x: x.split("__")[1])
    df_tax.columns = [c.title() for c in df_tax.columns]
    tax_levels = ["Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"]
    taxonomy_str = df_tax.loc[0, tax_levels].tolist()
    logging.debug(f"Taxonomy found: {taxonomy_str}")

    bgcflow_comment = (
        "##BGCflow-Data-START##\n"
        "Version      :: {version}\n"
        "Run date     :: {date}\n"
    ).format(
        version=version,
        date=str(datetime.now().strftime("%Y-%m-%d %H:%M:%S")),
    )

    new_records = []
    ctr = 0
    for record in records:
        logging.info(f"Formatting record {ctr}...")
        comment = bgcflow_comment
        if "comment" in record.annotations:
            record.annotations["comment"] += "\n" + comment
        else:
            record.annotations["comment"] = comment

        try:
            if not change_log[ctr]["accept_format"]:
                record.annotations[
                    "comment"
                ] += f"Original ID  :: {change_log[ctr]['original_id']}\n"
        except UnboundLocalError:
            pass

        record.annotations["comment"] += "##BGCflow-Data-END##"

        if "organism" in record.annotations:
            organism = record.annotations["organism"]
            if "Unclassified" in organism:
                record.annotations["organism"] = organism.split(" Unclassified")[
                    0
                ].strip()

        record.annotations["taxonomy"] = taxonomy_str

        new_records.append(record)

        ctr = ctr + 1

    logging.info(f"Writing final output: {gbk_out_path}")
    with open(gbk_out_path, "w") as output_handle:
        SeqIO.write(new_records, output_handle, "genbank")


# -- Correcting IDs --#
# 1. Traverse text, make a library of record header, length, and ids
# 2. Check the library for collided locus name and length
# 3. Propose changes
# 4. Make sure changes is unique
# 5. Write a new genbank file and recorded changes as json file


def correct_collided_headers(
    gbk_in_path, accession_id, outfile, json_dump="header.json"
):
    record_headers = get_record_headers(gbk_in_path)
    record_headers = shorten_record_headers(record_headers, accession_id)

    with open(json_dump, "w", encoding="utf8") as json_file:
        json.dump(record_headers, json_file, indent=4)

    logging.info(f"Writing result to: {outfile}")
    with open(gbk_in_path) as f:
        s = f.read()

    with open(outfile, "w") as f:
        for k in record_headers.keys():
            if not record_headers[k]["accept_format"]:
                s = s.replace(
                    record_headers[k]["original_header"],
                    record_headers[k]["new_header"],
                )
        f.write(s)
    return record_headers


def get_record_headers(gbk_in_path):
    """
    Get the header information for each records in a genbank file.
    Returns a json-like python dictionary.
    """
    record_headers = {}

    # Find all headers
    with open(gbk_in_path, "r") as file:
        logging.info(f"Reading file as text: {gbk_in_path}")
        ctr = 0
        for line in file:
            if line.startswith("LOCUS"):
                record_headers[ctr] = {"original_header": line}
            if "source" in line:
                length = line.split()[-1].split("..")[-1]
                record_headers[ctr]["length"] = length
                ctr = ctr + 1
    logging.debug(f"Found {len(record_headers)} records.")

    # Check for collided headers
    logging.info("Checking header format...")
    for k in record_headers.keys():
        query = record_headers[k]["original_header"].split()
        # a fused locus name and length shifts the "bp" token out of its usual position
        if query[3] != "bp":
            logging.warning("Record name and length collide in the LOCUS line")
            query_length = record_headers[k]["length"]
            if query_length in query[1]:
                original_id = query[1].replace(query_length, "")
                record_headers[k]["original_id"] = original_id
                record_headers[k]["accept_format"] = False
        else:
            record_headers[k]["original_id"] = query[1]
            record_headers[k]["accept_format"] = True
    return record_headers


def modify_id(accession_id, original_id, locus_index, record_counts, id_collection):
    """
    Attempt to fix collided name and length in the locus name by shortening id
    """
    logging.info(f"{original_id}: Attempting to change locus name...")

    # For refseq record, remove the accession prefix (first two digits) from string
    if accession_id.startswith("GCF"):
        logging.debug(
            f"{original_id}: Assuming locus came from Refseq. Removing refseq accession prefix."
        )
        refseq_type, refseq_number, refseq_version = re.split("_|[.]", original_id)
        new_id = f"{refseq_number}.{refseq_version}"

    elif accession_id.startswith("GCA"):
        logging.info(
            f"{original_id}: Assuming locus came from Genbank. Removing version from locus name..."
        )
        new_id, genbank_version = re.split("[.]", original_id)

    # For unknown source remove last 4 digit, add the contig index in front.
    # example: NZAJABAQG010000001.1 --> c1|NZAJABAQG010000
    else:
        logging.info(
            f"{original_id}: Cannot determine source. Shortening locus name..."
        )
        digit = len(str(record_counts))
        contig_number = str(locus_index + 1)

        new_id = f"c{contig_number.zfill(digit)}_{original_id[:-(digit + 4)]}"

    # make sure new_id is unique
    logging.info(f"{original_id}: Making sure id is unique...")
    if new_id in id_collection:
        raise ValueError(f"{original_id}: shortened id {new_id} is not unique")
    else:
        logging.debug(f"{original_id}: shortened to {new_id}")
    return new_id


def shorten_record_headers(record_headers, accession_id):
    """
    Shorten record headers
    """
    record_counts = len(record_headers.keys())
    id_collection = [record_headers[k]["original_id"] for k in record_headers.keys()]

    for k in record_headers.keys():
        if record_headers[k]["accept_format"]:
            pass
        else:
            original_id = record_headers[k]["original_id"]
            old_header = record_headers[k]["original_header"]

            # Correct header name
            new_id = modify_id(
                accession_id, original_id, k, record_counts, id_collection
            )
            new_id_replacement_value = (
                f"{new_id}{(len(original_id) - len(new_id)) * ' '}"
            )
            new_header = old_header.replace(original_id, new_id_replacement_value)

            # Add value to dict
            record_headers[k]["new_id"] = new_id
            record_headers[k]["new_header"] = new_header

    return record_headers


if __name__ == "__main__":
    add_bgcflow_comments(
        sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5]
    )
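
The positional arguments mirror the __main__ block: the input GenBank file to annotate, a version string, the GTDB metadata JSON for the genome (used to fill the taxonomy), the genome id, and the output GenBank path. A sketch with placeholder values:

# Hypothetical example call: add BGCflow comments and GTDB taxonomy to a GenBank file.
add_bgcflow_comments(
    "genbank/GCF_000000000.1.gbk",  # input GenBank (placeholder path)
    "0.7.4",                        # workflow version string (placeholder)
    "gtdb/GCF_000000000.1.json",    # GTDB metadata JSON with gtdb_taxonomy (placeholder)
    "GCF_000000000.1",              # genome id
    "processed/GCF_000000000.1.gbk",  # output GenBank (placeholder)
)
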
import json
import logging
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def correct_bgc_id_overview(overview_file, mapping_file, genome_id=False):
    """
    Use a mapping file to correct bgc_ids

    Parameters
    ----------
    overview_file : str / path
        Path to the BGC overview file in JSON format

    mapping_file : str / path
        Path to the mapping file in JSON format

    genome_id : str, optional
        Genome ID to correct BGC IDs for, if not provided, it is extracted from the overview file name

    Returns
    -------
    new_dict : dict
        Corrected BGC overview dictionary with updated BGC IDs
    """
    logging.info(f"Correcting shortened bgc ids for {genome_id}...")
    overview_path = Path(overview_file)
    mapping_path = Path(mapping_file)

    with open(overview_path, "r") as f:
        overview = json.load(f)

    with open(mapping_path, "r") as f:
        mapping = json.load(f)

    if not genome_id:
        # remove only the "_bgc_overview" suffix (replace, not strip)
        genome_id = overview_path.stem.replace("_bgc_overview", "")

    new_dict = {}

    for bgc_id in overview.keys():
        for m in mapping[genome_id].keys():
            if bgc_id in m:
                log_change = mapping[genome_id][m]
                correct_bgc_id = Path(log_change["symlink_path"]).stem
                if log_change["record_id"] != log_change["original_id"]:
                    logging.debug(f"Replacing {bgc_id} to {correct_bgc_id}")
                    new_dict[correct_bgc_id] = overview[bgc_id]
                    new_dict[correct_bgc_id]["accession"] = log_change["original_id"]
                else:
                    new_dict[correct_bgc_id] = overview[bgc_id]
                new_dict[correct_bgc_id]["source"] = "bgcflow"
                new_dict[correct_bgc_id]["gbk_path"] = log_change["target_path"]
    logging.info("Done!")
    return new_dict


def gather_bgc_overview(input_json, mapping_dir, table):
    """
    Gather BGC overview data from multiple JSON files and create a merged table

    Parameters
    ----------
    input_json : str
        Two different input types can be used:
        - Space-separated paths to BGC overview JSON files (put the string inside '' in bash expression)
        - A text file (.txt) containing the paths of the json files (one per line, or space-separated on a single line)
    mapping_dir : str / path
        Directory containing mapping files

    table : str / path
        Path to the output merged table

    Returns
    -------
    None
    """
    input_json = Path(input_json)
    logging.info(input_json)
    if input_json.is_file() and input_json.suffix == ".json":
        logging.info(f"Getting BGC overview from a single file: {input_json}")
        input_json_files = input_json

    elif input_json.is_file() and input_json.suffix == ".txt":
        logging.info(f"Getting BGC overview  from a text file: {input_json}")
        with open(input_json, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            input_json_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        input_json_files = [
            Path(file)
            for file in str(input_json).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Getting BGC overview from the given list of {len(input_json_files)} files..."
        )

    merged_dict = {}
    for j in input_json_files:
        mapping_file = Path(j)
        genome_id = mapping_file.name.replace("_bgc_overview.json", "")
        mapping_path = Path(mapping_dir) / f"{genome_id}/{genome_id}-change_log.json"
        corrected = correct_bgc_id_overview(mapping_file, mapping_path, genome_id)
        merged_dict.update(corrected)

    df = pd.DataFrame.from_dict(merged_dict).T
    df.index.name = "bgc_id"

    logging.debug(f"Writing file to: {table}")

    # Save dataframes to csv tables
    df_table = Path(table)
    df_table.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(table)
    logging.info("Job done")
    return None


if __name__ == "__main__":
    gather_bgc_overview(sys.argv[1], sys.argv[2], sys.argv[3])
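A minimal invocation sketch, assuming the functions above are importable; every path below is a placeholder that mirrors the expected layout (`<genome_id>_bgc_overview.json` inputs and `<mapping_dir>/<genome_id>/<genome_id>-change_log.json` mapping files):

```python
# Merge two per-genome BGC overviews into a single CSV table.
gather_bgc_overview(
    input_json="results/GCF_000001_bgc_overview.json results/GCF_000002_bgc_overview.json",
    mapping_dir="results/antismash_changelogs",
    table="tables/df_bgc_overview.csv",
)
```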
import json
import logging
import sys
from pathlib import Path

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def region_table_builder(f, accession):
    """
    Given a feature of a record, return value to build database table
    """
    # grab region values
    region_number = f["qualifiers"]["region_number"][0]
    region_id = f"{accession}.region{str(region_number).zfill(3)}"
    location = f["location"].strip("[").strip("]")
    start_pos, end_pos = location.split(":")
    contig_edge = f["qualifiers"]["contig_edge"][0]
    # fill values
    value = {
        region_id: {
            "accession": accession,
            "region_number": region_number,
            "location": location,
            "start_pos": start_pos,
            "end_pos": end_pos,
            "contig_edge": contig_edge,
            "product": f["qualifiers"]["product"],
            "rules": f["qualifiers"]["rules"],
        }
    }
    return value


def get_antismash_overview(json_path, outfile, genome_id=False, n_hits=1):
    """
    Read antismash json output and get the summary page into a json format
    """
    path = Path(json_path)
    with open(path, "r") as f:
        data = json.load(f)

    logging.info(f"Processing: {json_path}, custom genome_id: {genome_id}")

    if not genome_id:
        # remove only the ".gbk" extension (replace, not strip, to avoid trimming id characters)
        genome_id = data["input_file"].replace(".gbk", "")

    logging.debug(f"Genome id: {genome_id}")

    # iterating over record
    output = {}
    for r, record in enumerate(data["records"]):
        logging.info(f"Getting antismash regions from record: {record['id']}")
        region_db = {}

        region_feat = [i for i in record["features"] if i["type"] == "region"]
        for f in region_feat:
            region_db.update(region_table_builder(f, record["id"]))

        for c, area in enumerate(record["areas"]):
            cluster_id = f"{r+1}.{c+1}"
            output_cluster = {}
            logging.info(f"Grabbing information from region {cluster_id}")
            # _from = area['start']
            # _to = area['end']
            # products = area['products']

            knownclusterblast = record["modules"]["antismash.modules.clusterblast"][
                "knowncluster"
            ]["results"][c]

            assert n_hits > 0

            output_hits = []

            for n, hits in enumerate(knownclusterblast["ranking"]):
                if n + 1 <= (n_hits):
                    most_similar_mibig_id = hits[0]["accession"]
                    most_similar_mibig_description = hits[0]["description"]
                    most_similar_mibig_clustertype = hits[0]["cluster_type"]
                    n_genes_in_target = len(hits[0]["tags"])
                    n_genes_hits = hits[1]["hits"]
                    hit_similarity = n_genes_hits / n_genes_in_target
                    output_hits.append(
                        {
                            "most_similar_known_cluster_id": most_similar_mibig_id,
                            "most_similar_known_cluster_description": most_similar_mibig_description,
                            "most_similar_known_cluster_type": most_similar_mibig_clustertype,
                            "similarity": hit_similarity,
                        }
                    )
                else:
                    pass

            bgc_id = f"{record['id']}.region{str(c+1).zfill(3)}"
            output_cluster = {
                "genome_id": genome_id,
                "region": cluster_id,
            }

            for column in [
                "accession",
                "start_pos",
                "end_pos",
                "contig_edge",
                "product",
            ]:
                output_cluster[column] = region_db[bgc_id][column]
            try:
                output_cluster["region_length"] = int(output_cluster["end_pos"]) - int(
                    output_cluster["start_pos"]
                )
            except ValueError:
                logging.warning(
                    f'Error calculating region length. Region might be incomplete: {output_cluster["start_pos"]}:{output_cluster["end_pos"]}'
                )
                start_pos = "".join(
                    [s for s in output_cluster["start_pos"] if s.isdigit()]
                )
                logging.warning(
                    f'Correcting start position from {output_cluster["start_pos"]} to {start_pos}'
                )
                output_cluster["start_pos"] = start_pos
                output_cluster["region_length"] = int(output_cluster["end_pos"]) - int(
                    output_cluster["start_pos"]
                )

            if len(output_hits) == 1:
                for k in output_hits[0].keys():
                    output_cluster[k] = output_hits[0][k]

            output[bgc_id] = output_cluster

    with open(outfile, "w") as f:
        logging.info(f"Writing results to {outfile}")
        json.dump(output, f, indent=2)

    return output


if __name__ == "__main__":
    get_antismash_overview(sys.argv[1], sys.argv[2], sys.argv[3])
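A minimal usage sketch with placeholder paths, assuming the functions above are importable; `n_hits` keeps only the top knownclusterblast hit by default:

```python
# Summarize the antiSMASH regions of one genome into a compact JSON overview.
get_antismash_overview(
    json_path="results/antismash/GCF_000001/GCF_000001.json",
    outfile="results/antismash/GCF_000001_bgc_overview.json",
    genome_id="GCF_000001",
)
```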
import os
import sys

import pandas as pd


def get_ncbi_meta(assembly_report_path, outfile=None, genome_id=None):
    """
    Read an NCBI assembly report downloaded with ncbi-genome-download (https://github.com/kblin/ncbi-genome-download) and convert it to a json format.
    The assembly report can be downloaded by passing assembly_report to the --formats option of ncbi-genome-download.

    Parameters
    ----------
    1. assembly_report_path : str / path
        Location of the NCBI assembly report file downloaded with ncbi-genome-download
    2. outfile : str / path
        Location of the output json file
    3. genome_id : str
        Genome accession id

    Returns
    -------
    1. outfile : .json file
        A json file summarizing the NCBI assembly reports metadata.
        If genome_id is not provided, the script will determine the genome id from the file basename.
        If outfile is not defined, it will generate a json file in the current directory with this format: {genome_id}.json.
    """
    if genome_id is None:
        genome_id = os.path.splitext(os.path.basename(assembly_report_path))[0]

    if outfile is None:
        outfile = f"{genome_id}.json"

    # List of columns in df_ncbi_meta
    ncbi_meta_columns = [
        "assembly",
        "organism",
        "genus",
        "species",
        "strain",
        "tax_id",
        "refseq_category",
        "refseq",
        "genbank",
        "assembly_type",
        "release_type",
        "assembly_level",
        "genome_representation",
        "refseq_genbank_identity",
        "biosample",
        "submitter",
        "date",
    ]

    df_ncbi_meta = pd.DataFrame(index=[genome_id], columns=ncbi_meta_columns)
    df_ncbi_meta.index.name = "genome_id"

    with open(assembly_report_path, "r") as report_file:
        lines = report_file.readlines()
        strain_found = False

        for line in lines:
            if line.startswith("# Assembly name:"):
                assembly_accn = line.split("Assembly name:")[1].strip()
                df_ncbi_meta.loc[genome_id, "assembly"] = assembly_accn
            elif line.startswith("# Organism name"):
                organism = line.split("Organism name:")[1].strip()
                df_ncbi_meta.loc[genome_id, "organism"] = organism
                df_ncbi_meta.loc[genome_id, "genus"] = organism.strip().split(" ")[0]
                df_ncbi_meta.loc[genome_id, "species"] = organism.strip().split(" ")[1]
            elif line.startswith("# Infraspecific name:"):
                strain = line.split("Infraspecific name:")[1].strip()
                if "strain=" in strain:
                    df_ncbi_meta.loc[genome_id, "strain"] = strain.split("strain=")[1]
                    strain_found = True
            elif line.startswith("# Isolate:"):
                if strain_found is False:
                    strain = line.split("Isolate:")[1].strip()
                    df_ncbi_meta.loc[genome_id, "strain"] = strain
                    strain_found = True
            elif line.startswith("# Taxid"):
                tax_id = line.split("Taxid:")[1].strip()
                df_ncbi_meta.loc[genome_id, "tax_id"] = tax_id
            elif line.startswith("# BioSample"):
                biosample = line.split("BioSample:")[1].strip()
                df_ncbi_meta.loc[genome_id, "biosample"] = biosample
            elif line.startswith("# BioProject"):
                bioproject = line.split("# BioProject:")[1].strip()
                df_ncbi_meta.loc[genome_id, "BioProject"] = bioproject
            elif line.startswith("# Submitter"):
                submitter = line.split("Submitter:")[1].strip()
                df_ncbi_meta.loc[genome_id, "submitter"] = submitter
            elif line.startswith("# Date"):
                date = line.split("Date:")[1].strip()
                df_ncbi_meta.loc[genome_id, "date"] = date
            elif line.startswith("# Assembly type:"):
                assembly_type = line.split("Assembly type:")[1].strip()
                df_ncbi_meta.loc[genome_id, "assembly_type"] = assembly_type
            elif line.startswith("# Release type:"):
                release_type = line.split("Release type:")[1].strip()
                df_ncbi_meta.loc[genome_id, "release_type"] = release_type
            elif line.startswith("# Assembly level:"):
                assembly_level = line.split("Assembly level:")[1].strip()
                df_ncbi_meta.loc[genome_id, "assembly_level"] = assembly_level
            elif line.startswith("# Genome representation:"):
                genome_representation = line.split("Genome representation:")[1].strip()
                df_ncbi_meta.loc[
                    genome_id, "genome_representation"
                ] = genome_representation
            elif line.startswith("# RefSeq category"):
                refseq_category = line.split("RefSeq category:")[1].strip()
                df_ncbi_meta.loc[genome_id, "refseq_category"] = refseq_category
            elif line.startswith("# RefSeq assembly accession"):
                refseq = line.split("RefSeq assembly accession:")[1].strip()
                df_ncbi_meta.loc[genome_id, "refseq"] = refseq
            elif line.startswith("# GenBank assembly accession"):
                genbank = line.split("GenBank assembly accession:")[1].strip()
                df_ncbi_meta.loc[genome_id, "genbank"] = genbank
            elif line.startswith("# RefSeq assembly and GenBank assemblies identical"):
                refseq_genbank_identity = line.split(
                    "RefSeq assembly and GenBank assemblies identical:"
                )[1].strip()
                df_ncbi_meta.loc[
                    genome_id, "refseq_genbank_identity"
                ] = refseq_genbank_identity

        if strain_found is False:
            df_ncbi_meta.loc[genome_id, "strain"] = genome_id

    df_ncbi_meta.to_json(outfile, orient="index", indent=4)
    return None


if __name__ == "__main__":
    get_ncbi_meta(sys.argv[1], sys.argv[2], sys.argv[3])
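A minimal usage sketch, assuming the function above is importable; both paths are placeholders for an assembly report downloaded as described in the docstring and the desired output location:

```python
# Parse an NCBI assembly report into a per-genome JSON metadata file.
get_ncbi_meta(
    assembly_report_path="resources/ncbi/GCF_000001.txt",
    outfile="results/ncbi_meta/GCF_000001.json",
    genome_id="GCF_000001",
)
```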
import json
import logging
import sys
from pathlib import Path

from Bio import SeqIO

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def count_bgcs(gbk_file_path, genome_id=False, outfile=False):
    gbk_file_path = Path(gbk_file_path)

    if not isinstance(genome_id, str):
        genome_id = gbk_file_path.stem

    logging.debug(f"Summarize BGCs info for {gbk_file_path}")
    logging.debug(f"Genome id is: {genome_id}")
    records_list = SeqIO.parse(gbk_file_path, "genbank")

    bgc_stats = {}

    # statistics counter
    bgc_cntr = 0
    protoclusters_cntr = 0
    cand_clusters_cntr = 0
    contig_edge_cntr = 0

    # product capture
    bgc_type_dict = {}

    for rec in records_list:
        # Information on the number of BGCs, protoclusters and candidate clusters
        for feat in rec.features:
            if feat.type == "region":
                # get region counts
                bgc_cntr = bgc_cntr + 1
                if feat.qualifiers["contig_edge"][0] == "True":
                    contig_edge_cntr = contig_edge_cntr + 1

                # get product counts
                bgc_type = ".".join(sorted(feat.qualifiers["product"]))
                if bgc_type in bgc_type_dict.keys():
                    bgc_type_dict[bgc_type] = bgc_type_dict[bgc_type] + 1
                else:
                    bgc_type_dict[bgc_type] = 1
            # protocluster and cand_cluster are separate feature types,
            # so they are counted outside the "region" branch
            elif feat.type == "protocluster":
                protoclusters_cntr = protoclusters_cntr + 1
            elif feat.type == "cand_cluster":
                cand_clusters_cntr = cand_clusters_cntr + 1

    bgc_stats["bgcs_count"] = bgc_cntr
    bgc_stats["bgcs_on_contig_edge"] = contig_edge_cntr
    bgc_stats["protoclusters_count"] = protoclusters_cntr
    bgc_stats["cand_clusters_count"] = cand_clusters_cntr
    result = {genome_id: bgc_stats | {"products": bgc_type_dict}}

    if not isinstance(outfile, str):
        return result
    else:
        logging.debug(f"Writing output to: {outfile}")
        with open(outfile, "w") as f:
            json.dump(result, f, indent=2)
        return


if __name__ == "__main__":
    count_bgcs(sys.argv[1], sys.argv[2], sys.argv[3])
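A minimal usage sketch on a single antiSMASH GenBank file (the path is a placeholder); when `outfile` is not a string, the summary dictionary is returned instead of being written:

```python
stats = count_bgcs("results/antismash/GCF_000001/GCF_000001.gbk", genome_id="GCF_000001")
# Expected shape (values are illustrative):
# {"GCF_000001": {"bgcs_count": 12, "bgcs_on_contig_edge": 3,
#                 "protoclusters_count": 14, "cand_clusters_count": 13,
#                 "products": {"NRPS": 4, "terpene": 2}}}
```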
import sys
from pathlib import Path

import pandas as pd


def get_bigscape_mapping(path, outfile):
    """
    Given a path containing directories of antiSMASH result, return a table
    which maps BGC regions with its parent folder (aka accession ids)
    """
    container = {}
    for x in Path(path).iterdir():
        if x.is_dir():
            genome_id = x.name
            bgcs = {item.stem: genome_id for item in list(x.glob("*.region*.gbk"))}
            container.update(bgcs)
    df = pd.DataFrame.from_dict(container, orient="index", columns=["genome_id"])
    df.index.name = "bgc_id"
    df.to_csv(outfile)
    return None


if __name__ == "__main__":
    get_bigscape_mapping(sys.argv[1], sys.argv[2])
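A minimal usage sketch; the input directory is a placeholder expected to contain one antiSMASH result folder per genome, each holding `*.region*.gbk` files:

```python
# Build a bgc_id -> genome_id mapping table for BiG-SCAPE.
get_bigscape_mapping("results/antismash", "tables/df_bigscape_mapping.csv")
```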
import logging
import shutil
import sqlite3
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def get_bigslice_query(
    project_name, output_folder, result_folder="resources/bigslice/full_run_result/"
):
    bigslice_full_run = Path(result_folder)
    query_id = find_query_result(project_name, bigslice_full_run)
    fetch_query_tables(query_id, output_folder, bigslice_full_run)
    logging.info("Copying SQL database of the run...")
    shutil.copy(
        bigslice_full_run / f"reports/{query_id}/data.db",
        Path(output_folder) / f"{query_id}.db",
    )
    logging.info("Job done!")
    return


def find_query_result(project_name, bigslice_full_run):
    logging.info("Connecting to BiG-SLICE SQL Reports...")
    conn = sqlite3.connect(bigslice_full_run / "reports/reports.db")
    df = pd.read_sql_query("select * from reports;", conn)

    # select only run result of the project
    match = []
    for i in df.index:
        if project_name in df.loc[i, "name"]:
            match.append(i)
    df = df.loc[match, :]
    logging.debug(f"Found resulting match to project {project_name}:\n{df}")

    # select the latest run for the project
    df["creation_date"] = pd.to_datetime(df["creation_date"])
    filter_time = df.creation_date == df.creation_date.max()
    df = df.where(filter_time).dropna()
    logging.debug(f"Extracting information from the latest run:{df.name.values}")
    return int(df.id.values)


def fetch_query_tables(query_id, output_folder, bigslice_full_run):
    # Connect to database
    conn = sqlite3.connect(bigslice_full_run / f"reports/{query_id}/data.db")

    # get all table name from sql
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    sql_table_names = [i[0] for i in cursor.fetchall()]

    output_folder = Path(output_folder)

    try:
        output_folder.mkdir(parents=True, exist_ok=False)
    except FileExistsError:
        logging.debug(f"Directory {output_folder} already exist")
    else:
        logging.debug(f"Creating directory: {output_folder}...")

    logging.debug(f"Extracting tables to {output_folder}...")

    # convert table to pandas df
    df_tables = {}
    for t in sql_table_names:
        df = pd.read_sql_query(f"select * from {t};", conn)
        outfile = output_folder / f"{t}.csv"
        df.to_csv(outfile, index=False)
        df_tables.update({t: df})
    return


if __name__ == "__main__":
    get_bigslice_query(sys.argv[1], sys.argv[2], sys.argv[3])
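A minimal usage sketch with placeholder names; `project_name` must match (as a substring) a run name recorded in `reports/reports.db` inside the BiG-SLiCE result folder:

```python
# Export the query tables of the latest BiG-SLiCE run for a project as CSV files.
get_bigslice_query(
    project_name="Lactobacillus_delbrueckii",
    output_folder="results/bigslice/query_tables",
    result_folder="resources/bigslice/full_run_result/",
)
```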
import json
import os
import sys

import pandas as pd


def get_checkm_data(checkm_input_stats, checkm_json_folder, checkm_output_stats):
    """
    Read checkm output stats.tsv file, return json output for individual genomes and output stats table

    Parameters
    ----------
    1. checkm_input_stats : str / path
        Location of the output from checkm software 'storage/bin_stats_ext.tsv'
    2. checkm_json_folder : str / path
        Location of the json folder to store output for each genome
    3. checkm_output_stats : str / path
        Location of the output csv file in the processed/tables directory

    Returns
    -------
    1. json : .json file
        A json file for each genome summarizing the checkm output
    2. checkm_output_stats : str / path
        Output csv file in the processed/tables directory with checkm information for all genomes
    """

    # List of columns in df_checkm_stats.csv
    checkm_columns = [
        "Completeness",
        "Contamination",
        "GC",
        "GC std",
        "Genome size",
        "# ambiguous bases",
        "# scaffolds",
        "# contigs",
        "Longest scaffold",
        "Longest contig",
        "N50 (scaffolds)",
        "N50 (contigs)",
        "Mean scaffold length",
        "Mean contig length",
        "Coding density",
        "Translation table",
        "# predicted genes",
    ]

    df_checkm_input = pd.read_csv(
        checkm_input_stats, sep="\t", header=None, index_col=0
    )

    df_checkm_out = pd.DataFrame(index=df_checkm_input.index, columns=checkm_columns)
    df_checkm_out.index.name = "genome_id"

    for genome_id in df_checkm_out.index:
        json_genome_id = json.loads(df_checkm_input.loc[genome_id, 1].replace("'", '"'))
        json_path = os.path.join(checkm_json_folder, genome_id + ".json")

        with open(json_path, "w") as outfile:
            json.dump(json_genome_id, outfile)
        for col in df_checkm_out.columns:
            df_checkm_out.loc[genome_id, col] = json_genome_id[col]

    df_checkm_out.to_csv(checkm_output_stats)

    return df_checkm_out


if __name__ == "__main__":
    get_checkm_data(sys.argv[1], sys.argv[2], sys.argv[3])
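A minimal usage sketch, assuming the function above is importable; all paths are placeholders following the docstring above:

```python
# Convert CheckM's bin_stats_ext.tsv into per-genome JSON files plus a merged table.
get_checkm_data(
    checkm_input_stats="results/checkm/storage/bin_stats_ext.tsv",
    checkm_json_folder="results/checkm/json",
    checkm_output_stats="tables/df_checkm_stats.csv",
)
```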
import json
import sys

import yaml

# list of the main dependencies used in the workflow
dependencies = {
    "antismash": r"workflow/envs/antismash.yaml",
    "prokka": r"workflow/envs/prokka.yaml",
    "mlst": r"workflow/envs/mlst.yaml",
    "eggnog-mapper": r"workflow/envs/eggnog.yaml",
    "roary": r"workflow/envs/roary.yaml",
    "refseq_masher": r"workflow/envs/refseq_masher.yaml",
    "seqfu": r"workflow/envs/seqfu.yaml",
}


def get_dependency_version(dep, dep_key):
    """
    return dependency version tags given a dictionary (dep) and its key (dep_key)
    """
    with open(dep[dep_key]) as file:
        result = []
        documents = yaml.full_load(file)
        for i in documents["dependencies"]:
            if type(i) == str:
                if i.startswith(dep_key):
                    result = i.split("=")[-1]
            elif type(i) == dict:
                assert list(i.keys()) == ["pip"], i.keys()
                for p in i["pip"]:
                    if dep_key in p:
                        if p.startswith("git+"):
                            result = p.split("@")[-1]
                            result = result.replace("-", ".")
                        else:
                            result = p.split("=")[-1]
    return str(result)


def write_dependecies_to_json(outfile, dep=dependencies):
    """
    write dependency version to a json file
    """
    with open(outfile, "w") as file:
        dv = {}
        for ky in dep.keys():
            vr = get_dependency_version(dep, ky)
            dv[ky] = vr
        json.dump(
            dv,
            file,
            indent=2,
        )
        file.close()
    return dv


if __name__ == "__main__":
    write_dependecies_to_json(sys.argv[1])
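A minimal usage sketch; the resulting JSON maps each tool listed in the `dependencies` dictionary above to the version string parsed from its conda environment file (the versions shown in the comment are purely illustrative):

```python
versions = write_dependecies_to_json("results/metadata/dependency_versions.json")
# e.g. {"antismash": "7.0.0", "prokka": "1.14.6", ...}  # actual values depend on the env files
```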
import json
import logging
import os
import sys

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def extract_mibig_info(mibig_json_path, mibig_bgc_table):
    """
    Returns mibig BGC information critical for the known bgc table from JSON files

    Parameters
    ----------
    1. mibig_json_path : str / path
        Location of the resources folder with all MIBIG JSON files extracted

    Returns
    -------
    1. mibig_bgc_table: str / path
        Dataframe with MIBIG BGCs information
    2. mibig_compound_table: str / path
        Dataframe with compounds information found at MIBIG
    """
    logging.info("Reading MIBIG files...")
    if not os.path.isdir(mibig_json_path):
        raise FileNotFoundError(f"No such file or directory: {mibig_json_path}")

    df_mibig_bgcs = pd.DataFrame(columns=["biosyn_class", "compounds"])
    # df_mibig_compounds = pd.DataFrame(columns=[])

    mibig_list = [
        mibig_file.split(".")[0]
        for mibig_file in os.listdir(mibig_json_path)
        if ".json" in mibig_file
    ]
    logging.debug(f"MIBIG entry has {len(mibig_list)} BGCs")
    for mibig_id in mibig_list:
        mibig_id_file = os.path.join(mibig_json_path, mibig_id + ".json")
        with open(mibig_id_file, "r") as json_obj:
            mibig_data = json.load(json_obj)
            df_mibig_bgcs.loc[mibig_id, "biosyn_class"] = ";".join(
                mibig_data.get("cluster").get("biosyn_class")
            )

            compounds_list = mibig_data.get("cluster").get("compounds")
            df_mibig_bgcs.loc[mibig_id, "compounds"] = ";".join(
                [compound.get("compound") for compound in compounds_list]
            )

            chem_acts_list = []
            for compound in mibig_data.get("cluster").get("compounds"):
                if "chem_acts" in compound.keys():
                    chem_acts = compound.get("chem_acts")
                    for activity in chem_acts:
                        if activity not in chem_acts_list:
                            chem_acts_list.append(activity)

            # logging.debug(f"{chem_acts_list}")
            # handle MIBIG 2.0 --> MIBIG 3.1 format difference
            chem_acts_list_clean = []
            for item in chem_acts_list:
                if type(item) is dict:
                    action = list(item.values())
                    for a in action:
                        chem_acts_list_clean.append(a)
                elif type(item) is str:
                    chem_acts_list_clean.append(item)

            # logging.debug(f"{chem_acts_list_clean}")
            df_mibig_bgcs.loc[mibig_id, "chem_acts"] = ";".join(chem_acts_list_clean)

            if "accession" in mibig_data.get("cluster").get("loci").keys():
                df_mibig_bgcs.loc[mibig_id, "accession"] = (
                    mibig_data.get("cluster").get("loci").get("accession")
                )
            if "completeness" in mibig_data.get("cluster").get("loci").keys():
                df_mibig_bgcs.loc[mibig_id, "completeness"] = (
                    mibig_data.get("cluster").get("loci").get("completeness")
                )
            if "evidence" in mibig_data.get("cluster").get("loci").keys():
                df_mibig_bgcs.loc[mibig_id, "evidence"] = ";".join(
                    mibig_data.get("cluster").get("loci").get("evidence")
                )
            if "organism_name" in mibig_data.get("cluster").keys():
                df_mibig_bgcs.loc[mibig_id, "organism_name"] = mibig_data.get(
                    "cluster"
                ).get("organism_name")
            if "ncbi_tax_id" in mibig_data.get("cluster").keys():
                df_mibig_bgcs.loc[mibig_id, "ncbi_tax_id"] = mibig_data.get(
                    "cluster"
                ).get("ncbi_tax_id")
            if "publications" in mibig_data.get("cluster").keys():
                df_mibig_bgcs.loc[mibig_id, "publications"] = ";".join(
                    mibig_data.get("cluster").get("publications")
                )

    df_mibig_bgcs.sort_index(inplace=True)
    df_mibig_bgcs.index.name = "mibig_id"

    df_mibig_bgcs.to_csv(mibig_bgc_table)

    return df_mibig_bgcs


if __name__ == "__main__":
    extract_mibig_info(sys.argv[1], sys.argv[2])
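A minimal usage sketch; the input directory is a placeholder that should contain the extracted MIBiG JSON entries (`BGC*.json`):

```python
# Summarize MIBiG entries into a single BGC metadata table.
df_mibig = extract_mibig_info("resources/mibig/json", "tables/df_mibig_bgcs.csv")
```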
import os
import sys

import pandas as pd


def extract_org_info(genome_id, samples_path, assembly_report_path, prokka_dir):
    """
    Returns organism_info.txt with genus, species, and strain information required for prokka run inputs.
    This function returns values from provided sample.csv or extracted NCBI assembly reports in json format
    generated by get_assembly_information.py for "ncbi" source type.

    Parameters
    ----------
    1. genome_id : str
        Genome accession id
    2. samples_path : str / path (space-separated for multiple paths)
        Location of the csv file(s) containing sample information as defined in samples.schema.yaml.
        For multiple sample csv files, quote the paths ("") and separate them with spaces.
    3. assembly_report_path : str / path
        Location of the assembly information json file generated by get_assembly_information.py
    4. prokka_dir : str / path
        Location of the prokka directory from the given NCBI assembly id

    Returns
    -------
    1. {prokka_dir}/organism_info.txt
        A one lined text file with genus, species, and strain information of a given genome_id
    """
    # wrap single or multiple inputs & generate dataframe
    shell_input = samples_path.split()
    dfList = [pd.read_csv(s).set_index("genome_id", drop=False) for s in shell_input]
    df_samples = pd.concat(dfList, axis=0).fillna("")

    if df_samples.loc[genome_id, "source"] == "ncbi":
        # assembly_file = os.path.join(assembly_report_path, f"{genome_id}.txt")
        extract_ncbi_org_info(prokka_dir, genome_id, assembly_report_path)
    else:
        extract_samples_org_info(prokka_dir, genome_id, df_samples)
    return None


def extract_ncbi_org_info(prokka_dir, genome_id, assembly_report_path):
    """
    Returns organism_info.txt with genus, species, and strain information required for prokka run inputs.
    This function returns values from extracted NCBI assembly reports in json format generated by get_assembly_information.py.

    Parameters
    ----------
    1. prokka_dir : str / path
        Location of the prokka directory from the given NCBI assembly id
    2. genome_id : str
        NCBI assembly id (GCA... / GCF...)
    3. assembly_report_path : str / path
        Location of the assembly information json file generated by get_assembly_information.py

    Returns
    -------
    1. {prokka_dir}/organism_info.txt
        A one lined text file with genus, species, and strain information of a given NCBI assembly id
    """
    assembly_report_json = os.path.join(assembly_report_path, f"{genome_id}.json")
    df_ncbi_meta = pd.read_json(assembly_report_json).T
    df_ncbi_meta = df_ncbi_meta.fillna("")

    for idx in df_ncbi_meta.index:
        GENUS = str(df_ncbi_meta.loc[idx, "genus"])
        SPECIES = str(df_ncbi_meta.loc[idx, "species"])
        STRAIN_ID = str(df_ncbi_meta.loc[idx, "strain"])

        if not os.path.isdir(os.path.join(prokka_dir, idx)):
            os.mkdir(os.path.join(prokka_dir, idx))
        org_info_path = os.path.join(prokka_dir, idx, "organism_info.txt")
        with open(org_info_path, "w") as file_obj:
            file_obj.write(",".join([GENUS, SPECIES, STRAIN_ID]))

    return None


def extract_samples_org_info(prokka_dir, genome_id, df_samples):
    """
    Returns organism_info.txt with genus, species, and strain information required for prokka run inputs.
    This function returns values from provided sample.csv excluding "ncbi" source type.

    Parameters
    ----------
    1. prokka_dir : str / path
        Location of the prokka directory from the given NCBI assembly id
    2. genome_id : str
        Genome accession id
    3. df_sample : pd.DataFrame object
        A dataframe generated by the csv file containing sample information as defined in samples.schema.yaml

    Returns
    -------
    1. {prokka_dir}/organism_info.txt
        A one lined text file with genus, species, and strain information of a given genome_id
    """

    GENUS = str(df_samples.loc[genome_id, "genus"])
    SPECIES = str(df_samples.loc[genome_id, "species"])
    STRAIN_ID = str(df_samples.loc[genome_id, "strain"])

    if not os.path.isdir(os.path.join(prokka_dir, genome_id)):
        os.mkdir(os.path.join(prokka_dir, genome_id))
    org_info_path = os.path.join(prokka_dir, genome_id, "organism_info.txt")
    with open(org_info_path, "w") as file_obj:
        file_obj.write(",".join([GENUS, SPECIES, STRAIN_ID]))

    return None


if __name__ == "__main__":
    extract_org_info(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
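A minimal usage sketch, assuming the functions above are importable; all paths are placeholders. The resulting `organism_info.txt` is a single comma-separated line of the form `Genus,species,strain`:

```python
extract_org_info(
    genome_id="GCF_000001",
    samples_path="config/samples.csv",
    assembly_report_path="results/ncbi_meta",
    prokka_dir="results/prokka",
)
# writes results/prokka/GCF_000001/organism_info.txt, e.g. "Lactobacillus,delbrueckii,strain_xyz"
```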
import json
import logging
import sys
from pathlib import Path

import peppy
import yaml

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def refine_bgcflow_project(p_bgcflow, p):
    """
    Refine a pep bgcflow project created from sample table.
    Exist for back compatibility with bgcflow=<0.3.3

    Arguments:
        p_bgcflow:
        p:

    Returns:
        p_bgcflow:
    """
    for k in p.keys():
        if k == "samples":
            p_bgcflow.config["sample_table"] = p[k]
            p_bgcflow["_config_file"] = str(Path(p[k]).parent / "project_config.yaml")
        elif k in p_bgcflow.keys():
            p_bgcflow[k] = p[k]
        elif k == "rules":
            with open(p["rules"], "r") as file:
                rules = yaml.safe_load(file)
                p_bgcflow.config[k] = rules["rules"]
        else:
            p_bgcflow.config[k] = Path(p[k]).name

    return p_bgcflow


def get_bgcflow_metadata(bgcflow_path="."):
    # BGCFlow config
    BGCFlow_path = Path(bgcflow_path).resolve()

    logging.info("Getting config metadata...")
    config_path = BGCFlow_path / "config/config.yaml"
    with open(config_path, "r") as f:
        config = yaml.safe_load(f)

    logging.info("Getting rules information...")
    # rules_dict
    rules_dict_path = BGCFlow_path / "workflow/rules.yaml"
    with open(rules_dict_path, "r") as f:
        rules_dict = yaml.safe_load(f)

    return BGCFlow_path, config, rules_dict


def get_all_metadata(config, rules_dict, BGCFlow_path, bgcflow_version):
    # Get Metadata
    logging.info("Getting metadata from projects...")
    project_metadata = {}

    for p in config["projects"]:
        # mask pep and name as alias
        if "pep" in p.keys():
            p["name"] = p.pop("pep")

        # process only pep projects
        if p["name"].endswith(".yaml"):
            project = peppy.Project(
                str(BGCFlow_path / p["name"]), sample_table_index="genome_id"
            )
            if "description" in project.config.keys():
                name = project["name"]
                description = project.config["description"]
                project_metadata[name] = {"description": description}
        else:
            # non peppy input
            name = p["name"]
            project = peppy.Project(
                str(BGCFlow_path / p["samples"]), sample_table_index="genome_id"
            )
            p = refine_bgcflow_project(project, p)
            project_metadata[name] = {"description": "No description provided."}

        # get what rules are being used
        rule_used = {}
        if "rules" in project.config.keys():
            rules = project.config["rules"]
        else:
            rules = config["rules"]
        for r in rules.keys():
            if rules[r]:
                if r in rules_dict.keys():
                    bgcflow_rules = rules_dict[r]
                    rule_used[r] = bgcflow_rules
        project_metadata[name].update({"rule_used": rule_used})

        # get sample size
        project_metadata[name]["sample_size"] = len(project.sample_table)

        # get citations
        citation_all = []
        for r in rule_used:
            citations = rules_dict[r]["references"]
            citation_all.extend(citations)
        citation_all.sort()
        project_metadata[name].update({"references": citation_all})
        project_metadata[name]["references"] = list(
            set(project_metadata[name]["references"])
        )

        # get bgcflow_version
        project_metadata[name]["bgcflow_version"] = bgcflow_version
    return project_metadata


def get_project_metadata(
    project_name, outfile, bgcflow_path=".", bgcflow_version="unknown"
):
    BGCFlow_path, config, rules_dict = get_bgcflow_metadata(bgcflow_path)
    all_metadata = get_all_metadata(config, rules_dict, BGCFlow_path, bgcflow_version)
    logging.info(f"Extracting project {project_name} metadata to {outfile}")
    with open(outfile, "w") as f:
        json.dump({project_name: all_metadata[project_name]}, f, indent=2)
    return


if __name__ == "__main__":
    get_project_metadata(sys.argv[1], sys.argv[2], bgcflow_version=sys.argv[3])
import json
import logging
import os
import sys

import pandas as pd
import requests

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def gtdb_prep(
    genome_id, outfile, samples_table, tax_path, release="R207"
):  # what happens if nothing is found?
    """
    Given a genome id and the samples Pandas dataframe, write a JSON file containing taxonomic information from the GTDB API.
    The script first searches using the closest taxonomic placement (NCBI accession id), then falls back to the genus information provided by the user.
    If no information is available, an empty taxonomic placement is returned.
    """

    class EmptyTaxError(Exception):
        """Raised when this script returns empty dict"""

        pass

    class PlacementError(Exception):
        """Raised when this script returns empty dict"""

        pass

    def empty_result(genome_id):
        """
        Helper script for creating empty result
        """
        logging.info(
            f"No taxonomic information found, returning empty values for {genome_id}."
        )
        gtdb_tax = {}
        gtdb_tax["genome_id"] = genome_id
        gtdb_tax["gtdb_taxonomy"] = {
            "domain": "d__",
            "phylum": "p__",
            "class": "c__",
            "order": "o__",
            "family": "f__",
            "genus": "g__",
            "species": "s__",
        }
        return gtdb_tax

    def find_taxonomy(query, genome_id, gtdb_tax):
        """
        Helper script to decide taxonomic placement for a given query
        """
        # If closest placement reference is provided, try finding taxonomic information from GTDB API
        if query.closest_placement_reference.values[0] != "":
            try:
                logging.info(
                    "Inferring taxonomic placement from provided closest reference...."
                )
                gtdb_tax = get_ncbi_taxon_GTDB(
                    query.closest_placement_reference.values[0], release
                )
                gtdb_tax["genome_id"] = genome_id
            except KeyError as e:
                raise PlacementError(
                    f"Cannot infer taxonomic placement from provided closest reference. Make sure the accession id: {query.closest_placement_reference.values[0]} is part of GTDB release: {e}"
                )

        # If NCBI accession is provided, try to find taxonomic information from GTDB API
        elif query.source.values[0] == "ncbi":
            try:
                logging.info("Inferring taxonomic placement from NCBI accession....")
                gtdb_tax = get_ncbi_taxon_GTDB(query.genome_id.values[0], release)
            except KeyError:
                if query.genus.values[0] != "":
                    logging.info(
                        "Inferring taxonomic placement from provided genus information...."
                    )
                    gtdb_tax["genome_id"] = genome_id
                    gtdb_tax.update(
                        get_parent_taxon_GTDB(query.genus.values[0], "genus", release)
                    )
                    gtdb_tax["gtdb_taxonomy"][
                        "species"
                    ] = f"s__{gtdb_tax['gtdb_taxonomy']['genus'].split('__')[-1]} sp."
                else:
                    gtdb_tax = empty_result(genome_id)

        # Try to get taxonomic information from genus information
        elif query.genus.values[0] != "":
            logging.info(
                "Inferring taxonomic placement from provided genus information...."
            )
            gtdb_tax["genome_id"] = genome_id
            gtdb_tax.update(
                get_parent_taxon_GTDB(query.genus.values[0], "genus", release)
            )
            try:
                gtdb_tax["gtdb_taxonomy"][
                    "species"
                ] = f"s__{gtdb_tax['gtdb_taxonomy']['genus'].split('__')[-1]} sp."
            except KeyError:
                gtdb_tax = empty_result(genome_id)

        # If no information is found, return an empty dict
        else:
            gtdb_tax = empty_result(genome_id)

        return gtdb_tax

    # get query by subsetting samples df with genome id
    shell_input = samples_table.split()
    dfList = [pd.read_csv(s).set_index("genome_id", drop=False) for s in shell_input]
    df_samples = pd.concat(dfList, axis=0)
    query = df_samples[df_samples.loc[:, "genome_id"] == genome_id].fillna("")

    # create empty container
    gtdb_tax = {}

    # Starting process
    logging.info(f"Fetching taxonomic information for {genome_id}...")
    # Go through user provided taxonomic placement
    if any(os.path.isfile(t) for t in tax_path.split()):
        try:
            gtdb_tax = get_user_defined_classification(genome_id, tax_path)
        except KeyError:
            logging.warning(
                f"{genome_id}: Not found in user provided taxonomic placement..."
            )
            gtdb_tax = find_taxonomy(query, genome_id, gtdb_tax)
    else:
        gtdb_tax = find_taxonomy(query, genome_id, gtdb_tax)

    if gtdb_tax == {}:
        raise EmptyTaxError(
            "Oops, this shouldn't happen. It returns an empty dict. Something is wrong with the script."
        )

    logging.info(f"Writing results to {outfile}...")
    with open(outfile, "w") as file:
        json.dump(gtdb_tax, file, indent=2)

    logging.info("Job finished!")
    return


def get_user_defined_classification(genome_id, tax_path):
    """
    Get taxonomic information from user provided GTDB-like output
    """
    shell_input = tax_path.split()
    dfList = [
        pd.read_csv(s, sep="\t").set_index("user_genome", drop=False)
        for s in shell_input
    ]
    df_tax_raw = pd.concat(dfList, axis=0)

    # drop duplicates! causes error
    logging.debug(f"Checking user provided taxonomy table from: {shell_input}")
    logging.info(
        "Checking if user provided taxonomy table contains duplicated genome ids.."
    )
    df_tax_dup = df_tax_raw["user_genome"].duplicated()
    if df_tax_dup.any():
        logging.warning(
            f"Found duplicated genome ids: {list(df_tax_raw[df_tax_dup]['user_genome'].unique())}"
        )
        duplicates_mask = df_tax_raw.duplicated(keep="first")
        df_tax = df_tax_raw[~duplicates_mask].copy()
        logging.debug("Making sure duplicated genome ids values are identical...")
        assert (
            not df_tax.duplicated().any()
        ), "Two or more genome ids have more than one taxonomic placement! Please check your taxonomy files!"
    else:
        df_tax = df_tax_raw.copy()

    level_dict = {
        "d": "domain",
        "p": "phylum",
        "c": "class",
        "o": "order",
        "f": "family",
        "g": "genus",
        "s": "species",
    }

    query = df_tax.loc[genome_id, "classification"].split(";")

    result = {}
    result["genome_id"] = genome_id
    result["gtdb_url"] = "user provided classification"
    result["gtdb_release"] = "unknown"
    result["gtdb_taxonomy"] = {level_dict[q.split("__")[0]]: q for q in query}
    logging.info("Using user provided GTDB classification.")
    return result


def get_ncbi_taxon_GTDB(accession, release="R207"):
    """
    Given an NCBI accession, return a json object of taxonomic information from GTDB API
    """

    def gtdb_api_request(accession, api_type):
        if api_type == "taxonomy":
            api_url = (
                f"https://api.gtdb.ecogenomic.org/genome/{accession}/taxon-history"
            )
        elif api_type == "summary":
            api_url = f"https://api.gtdb.ecogenomic.org/genome/{accession}/card"
        response = requests.get(api_url)

        try:
            js = response.json()
        except json.JSONDecodeError:
            logging.critical(
                f"Cannot decode response from GTDB API. Make sure this is a valid url: {api_url}"
            )
            raise

        return js, api_url

    # Mapping to bgcflow format
    level_dict = {
        "d": "domain",
        "p": "phylum",
        "c": "class",
        "o": "order",
        "f": "family",
        "g": "genus",
        "s": "species",
    }

    # get taxonomic information
    js_tax, api_url_tax = gtdb_api_request(accession, "taxonomy")
    result = {}
    result["genome_id"] = accession
    result["gtdb_url"] = api_url_tax
    result["gtdb_release"] = release
    card_detail = "Genome found"
    try:
        if js_tax == []:
            logging.warning(
                f"Genome id: {accession} is in GTDB but has no taxonomic placement. Returning empty values."
            )
            result["gtdb_taxonomy"] = {
                level_dict[k]: f"{k}__" for k in level_dict.keys()
            }
            card_detail = (
                f"Genome not found - no taxonomic assignment in release {release}"
            )
        else:
            logging.info(js_tax)
            tax = [tax for tax in js_tax if tax["release"] == release]
            if len(tax) == 1:
                result["gtdb_taxonomy"] = tax[0]
            elif len(tax) == 0:
                logging.warning(
                    f"Genome id: {accession} is in GTDB but has no taxonomic placement in release {release}. Returning empty values."
                )
                result["gtdb_taxonomy"] = {
                    level_dict[k]: f"{k}__" for k in level_dict.keys()
                }
                card_detail = (
                    f"Genome not found - no taxonomic assignment in release {release}"
                )
            else:
                raise ValueError(
                    f"Multiple taxonomic placements found for {accession} in GTDB release {release}"
                )
            result["gtdb_taxonomy"].pop("release", None)
            result["gtdb_taxonomy"] = {
                level_dict[k]: result["gtdb_taxonomy"][k]
                for k in result["gtdb_taxonomy"].keys()
            }
    except KeyError as err:
        if err.args[0] == "gtdb_taxonomy":
            logging.critical(
                f"Malformed genome id: {accession}. Make sure to use the right NCBI genome accession format."
            )
            raise
        elif err.args[0] == release:
            logging.critical(f"Cannot find genome id: {accession} in GTDB API.")
            raise

    # get other metadata from genome summary
    js_sum, api_url_sum = gtdb_api_request(accession, "summary")
    result["metadata_url"] = api_url_sum
    result["metadata"] = js_sum

    if "detail" in result["metadata"].keys():
        pass
    else:
        result["metadata"]["detail"] = card_detail

    return result


def get_parent_taxon_GTDB(taxon, level, release="R207"):
    """
    Given a taxon and its level, return a json object of parent taxons from GTDB API
    """
    level_dict = {
        "domain": "d",
        "phylum": "p",
        "class": "c",
        "order": "o",
        "family": "f",
        "genus": "g",
        "species": "s",
    }

    try:
        query = f"{level_dict[level]}__{taxon}"
    except KeyError:
        logging.critical(
            f"Incorrect taxon level format. Please choose from available format: {list(level_dict.keys())}."
        )
        raise

    api_url = f"https://api.gtdb.ecogenomic.org/taxonomy/partial/{query}/all-releases"
    response = requests.get(api_url)
    js = response.json()
    result = {}
    result["gtdb_url"] = api_url
    result["gtdb_release"] = release
    result["gtdb_taxonomy"] = {}

    level_dict_rev = {
        "d": "domain",
        "p": "phylum",
        "c": "class",
        "o": "order",
        "f": "family",
        "g": "genus",
        "s": "species",
    }

    try:
        for item in js:
            if release in item["release"]:
                t = item["taxonomy"]
                taxonomy = {level_dict_rev[k]: t[k] for k in t.keys()}

                result["gtdb_taxonomy"] = taxonomy
    except TypeError:
        logging.critical(f"{js['message']}")
        raise

    return result


if __name__ == "__main__":
    gtdb_prep(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5])
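A minimal usage sketch with placeholder paths, assuming the functions above are importable; the resulting JSON follows the shape built above (`genome_id`, `gtdb_url`, `gtdb_release`, and a `gtdb_taxonomy` dictionary keyed by rank):

```python
gtdb_prep(
    genome_id="GCF_000001",
    outfile="results/gtdb/GCF_000001.json",
    samples_table="config/samples.csv",
    tax_path="config/gtdbtk.bac120.summary.tsv",
    release="R207",
)
```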
import json
import logging
import sys
from pathlib import Path

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def assess_gtdb_json_file(item):
    """
    Given a json file, assess whether the accession can be found via GTDB-API or not.
    Return a genome_id when accession cannot be found.
    """
    logging.info(f"Assessing {item}")
    with open(item, "r") as json_file:
        data = json.load(json_file)
        genome_id = data["genome_id"]
        try:
            gtdb_release = data["gtdb_release"]
            metadata = data["metadata"]
            if "Genome not found" in metadata["detail"]:
                logging.debug(
                    f" - {genome_id} : {metadata['detail']} in GTDB-API release {gtdb_release}"
                )
                return genome_id
            elif type(metadata["genome"]["accession"]) == str:
                logging.debug(
                    f" - {genome_id} can be found via GTDB-API release {gtdb_release}"
                )
                return None

        except KeyError:
            logging.debug(f" - {genome_id} does not have metadata")
            return genome_id


def generate_symlink_gtdbtk(input_fna, gtdb_json, outdir):
    """
    Given a gtdb_json file and an input_fna file, generate a symlink to a desired location
    if genome_id cannot be found via GTDB API
    """
    input_fna = Path(input_fna).resolve()
    gtdb_json = Path(gtdb_json).resolve()
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    assert input_fna.is_file() and gtdb_json.is_file()

    genome_id = assess_gtdb_json_file(gtdb_json)
    if genome_id is not None:
        outfile = outdir / f"{genome_id}.fna"
        logging.info(f"Generating input files for GTDB-tk: {outfile}")
        outfile.symlink_to(input_fna)
    return None


def input_handling(input_list, category, suffix=".json"):
    input_list = Path(input_list)
    if input_list.is_file() and input_list.suffix == suffix:
        logging.info(f"Getting {category} from a single file: {input_list}")
        input_list_files = input_list

    elif input_list.is_file() and input_list.suffix == ".txt":
        logging.info(f"Getting {category} from a text file: {input_list}")
        with open(input_list, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                logging.info(
                    " - Detecting space-separated input in a single line format."
                )
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                logging.info(" - Detecting input in a multi-line format.")
                paths = file_content
            input_list_files = [
                Path(path) for path in paths if Path(path).suffix == suffix
            ]
    else:
        input_list_files = [
            Path(file)
            for file in str(input_list).split()
            if Path(file).suffix == suffix
        ]
        logging.info(
            f"Getting {category} from the given list of {len(input_list_files)} files..."
        )
    return input_list_files


def gtdbtk_prep(fna_list, json_list, outdir):
    """
    Given a list of gtdb_json files and a list of fna files, generate symlinks in the desired location
    when the genome_id cannot be found via the GTDB API
    """
    shell_json_input = input_handling(json_list, "taxonomy json")
    shell_fna_input = input_handling(fna_list, "fna files", suffix=".fna")

    for gtdb_json in shell_json_input:
        gid = Path(gtdb_json).stem
        input_fna = [fna for fna in shell_fna_input if gid in Path(fna).stem]
        logging.info(
            f"Found {gid} in {[str(i) for i in input_fna]}. Generating symlink..."
        )
        generate_symlink_gtdbtk(str(input_fna[0]), str(gtdb_json), str(outdir))
    return


if __name__ == "__main__":
    gtdbtk_prep(sys.argv[1], sys.argv[2], sys.argv[3])
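A hedged usage sketch of `gtdbtk_prep`; the paths below are hypothetical placeholders, since the workflow normally supplies them through `sys.argv`:

# Hypothetical paths; genomes whose taxonomy JSON reports "Genome not found"
# get a <genome_id>.fna symlink in the output directory for GTDB-Tk to classify.
gtdbtk_prep(
    fna_list="data/interim/fasta/genome_A.fna data/interim/fasta/genome_B.fna",
    json_list="data/interim/gtdb/genome_A.json data/interim/gtdb/genome_B.json",
    outdir="data/interim/gtdbtk/fasta_input",
)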
import logging
import os
import sys
from collections import OrderedDict
from datetime import datetime
from pathlib import Path

import networkx as nx
import pandas as pd
from alive_progress import alive_bar

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def get_cluster_dataframes(
    df_genomes, df_nodes, mibig_bgc_table, as_dir="../data/antismash/"
):
    """
    Returns two dataframes of clusters with information from genomes and MIBIG
    """

    df_clusters = pd.DataFrame(
        columns=["product", "bigscape_class", "genome_id", "accn_id"]
    )
    df_known = pd.DataFrame(
        columns=[
            "product",
            "bigscape_class",
            "biosyn_class",
            "compounds",
            "chem_acts",
            "accession",
            "completeness",
            "organism_name",
            "ncbi_tax_id",
            "publications",
            "evidence",
        ]
    )
    df_mibig_bgcs = pd.read_csv(mibig_bgc_table, index_col="mibig_id")

    # Generate bgcs dataframe with metadata from df_nodes and df_genomes
    logging.info(
        "Generating bgcs dataframe with metadata from df_nodes and df_genomes..."
    )
    with alive_bar(len(df_genomes.index)) as bar:
        for genome_id in df_genomes.index:
            logging.info(f"Processing BGCs in the genome: {genome_id}")
            if genome_id in os.listdir(as_dir):
                genome_dir = os.path.join(as_dir, genome_id)
                bgc_id_list = [
                    region[:-4]
                    for region in os.listdir(genome_dir)
                    if "region0" in region
                ]
                for bgc_id in bgc_id_list:
                    logging.debug(f"Processing {bgc_id}")
                    if bgc_id in df_nodes.index:
                        df_clusters.loc[bgc_id, "genome_id"] = genome_id
                        df_clusters.loc[bgc_id, "product"] = df_nodes.loc[
                            bgc_id, "Product Prediction"
                        ]
                        df_clusters.loc[bgc_id, "bigscape_class"] = df_nodes.loc[
                            bgc_id, "BiG-SCAPE class"
                        ]
                        df_clusters.loc[bgc_id, "accn_id"] = df_nodes.loc[
                            bgc_id, "Accession ID"
                        ]
                    else:
                        logging.debug(f"{bgc_id} not in df_nodes")
            else:
                logging.warning(f"{genome_id} not in directory!")
            bar()

    # Generate separate table for known BGCs from MIBIG
    logging.info("Generating separate table for known BGCs from MIBIG")
    with alive_bar(len(df_nodes.index)) as bar:
        for bgc_id in df_nodes.index:
            if "BGC0" in bgc_id:
                df_known.loc[bgc_id, "product"] = df_nodes.loc[
                    bgc_id, "Product Prediction"
                ]
                df_known.loc[bgc_id, "bigscape_class"] = df_nodes.loc[
                    bgc_id, "BiG-SCAPE class"
                ]
                if "." in bgc_id:
                    mibig_id = bgc_id.split(".")[0]
                else:
                    mibig_id = bgc_id

                if mibig_id in df_mibig_bgcs.index:
                    for col in df_mibig_bgcs.columns:
                        df_known.loc[bgc_id, col] = df_mibig_bgcs.loc[mibig_id, col]
                else:
                    logging.debug(f"{bgc_id} not in df_mibig_bgcs dataframe index")
            bar()

    return df_known, df_clusters


def add_bigscape_families(df_clusters, df_known, net_data_path):
    """
    Adds GCC and GCF numbers detected by BiG-SCAPE clustering for different cut-offs
    """

    cluster_class_set = [
        cluster_class
        for cluster_class in os.listdir(net_data_path)
        if ".tsv" not in cluster_class
    ]

    for cluster_class in cluster_class_set:
        logging.info(f"Processing all BGCs from {cluster_class}")
        class_dir = os.path.join(net_data_path, cluster_class)
        gcf_cutoffs_files = [
            file for file in os.listdir(class_dir) if "_clustering_" in file
        ]
        for gcf_file in gcf_cutoffs_files:
            cutoff = gcf_file[-8:-4]
            gcf_path = os.path.join(class_dir, gcf_file)
            df_clusters = read_gcf_df(df_clusters, gcf_path, cutoff)
            df_known = read_gcf_df(df_known, gcf_path, cutoff)

        clan_files = [file for file in os.listdir(class_dir) if "_clans_" in file]
        if len(clan_files) == 1:
            clan_select = clan_files[0]
            clan_path = os.path.join(class_dir, clan_select)

            df_clusters = read_gcc_df(df_clusters, clan_path)
            df_known = read_gcc_df(df_known, clan_path)

    return df_clusters, df_known


def read_gcf_df(df_clusters, gcf_path, cutoff):
    """
    Adds GCF (Gene Cluster Family) number for each BGC
    """

    df_gcf = pd.read_csv(gcf_path, sep="\t", index_col="#BGC Name", dtype=str)
    col_name = "gcf_" + cutoff

    for bgc in df_clusters.index:
        if bgc in df_gcf.index:
            df_clusters.loc[bgc, col_name] = df_gcf.loc[bgc, "Family Number"]

    return df_clusters


def read_gcc_df(df_clusters, clan_path):
    """
    Adds GCC (Gene Cluster Clan) number for each BGC
    """

    df_gcc = pd.read_csv(clan_path, sep="\t", index_col="#BGC Name", dtype=str)
    col_name = "Clan Number"

    for bgc in df_clusters.index:
        if bgc in df_gcc.index:
            df_clusters.loc[bgc, col_name] = df_gcc.loc[bgc, "Clan Number"]

    return df_clusters


def get_bigscape_network(net_data_path, cutoff="0.30"):
    """
    Reads the similarity network for a particular cutoff into a dataframe
    """

    cluster_class_set = [
        cluster_class
        for cluster_class in os.listdir(net_data_path)
        if ".tsv" not in cluster_class
    ]

    df_network = pd.DataFrame()
    for cluster_class in cluster_class_set:
        class_dir = os.path.join(net_data_path, cluster_class)
        net_file = cluster_class + "_c" + cutoff + ".network"
        net_path = os.path.join(class_dir, net_file)
        if os.path.isfile(net_path):
            df_class_net = pd.read_csv(net_path, sep="\t")
            df_network = pd.concat([df_network, df_class_net], ignore_index=True)

    return df_network


def get_network_graph(df_network, weight="Raw distance"):
    """
    Returns the networkX graph of BGC similarities and its connected components (families)
    """

    G_clusters = nx.from_pandas_edgelist(
        df_network, "Clustername 1", "Clustername 2", weight
    )
    G_families = nx.connected_components(G_clusters)
    family_nodes = [c for c in sorted(G_families, key=len, reverse=True)]

    return G_clusters, family_nodes


def remove_single_mibig(df_network, df_known, family_nodes):
    """
    Removes MIBIG-only families (known BGCs with no similarity to any query BGC) from the network
    """
    logging.info("Removing MIBIG-only BGC families from the network")
    nodes_to_remove = []

    for fam in family_nodes:
        single_mibig = True

        for node in fam:
            if "BGC" not in node:
                single_mibig = False

        if single_mibig:
            for node in fam:
                if node not in nodes_to_remove:
                    nodes_to_remove.append(node)
    logging.debug(
        f"{len(nodes_to_remove)} MIBIG nodes will be removed from the analysis because they show no similarity to the dataset"
    )
    df_network = df_network[~df_network["Clustername 1"].isin(nodes_to_remove)]
    df_network = df_network[~df_network["Clustername 2"].isin(nodes_to_remove)]
    for node in nodes_to_remove:
        if node in df_known.index:
            df_known = df_known.drop(node)

    return df_network, df_known


def get_family_graph(G_clusters):
    """
    Returns the list of families (connected components) ordered by decreasing size
    """

    # Find connected components or cluster families
    Families_list = list(nx.connected_components(G_clusters))
    # Sort families in decreasing order of occurrence
    family_size = [len(family) for family in Families_list]
    orig_index = list(range(len(family_size)))
    orig_index_fam_size_dict = dict(zip(orig_index, family_size))

    sorted_index_fam_size_dict = OrderedDict(
        sorted(orig_index_fam_size_dict.items(), key=lambda x: x[1])
    )
    new_index = list(range(len(family_size) - 1, -1, -1))
    orig_index_sorted = list(sorted_index_fam_size_dict.keys())
    new_orig_index_dict = dict(zip(new_index, orig_index_sorted))

    # Ordered family graphs
    family_graphs = [
        Families_list[new_orig_index_dict[fam_id]]
        for fam_id in range(len(Families_list))
    ]

    return family_graphs


def update_cluster_family(
    df_clusters, df_known, family_nodes, mibig_bgc_table, cutoff="0.30"
):
    """
    Updates df_clusters with family ids (connected components)
    """
    df_mibig_bgcs = pd.read_csv(mibig_bgc_table, index_col="mibig_id")

    df_families = pd.DataFrame(
        columns=["fam_type", "fam_name", "clusters_in_fam", "mibig_ids"]
    )
    for cntr in range(len(family_nodes)):
        fam_id = cntr + 1
        family = family_nodes[cntr]
        known_bgcs = [bgc for bgc in family if bgc in df_mibig_bgcs.index]  #

        if len(known_bgcs) > 0:
            df_families.loc[fam_id, "fam_type"] = "known_family"
            logging.debug(
                f'Family {fam_id} has {len(known_bgcs)} known BGCs: {", ".join(known_bgcs)}'
            )

            # handle missing entry from MIBIG release
            try:
                known_compounds = ";".join(
                    df_known.loc[known_bgcs, "compounds"].tolist()
                )
            except TypeError:
                logging.warning(
                    f"MIBIG entries {known_bgcs} returned no compounds: {df_known.loc[known_bgcs, 'compounds'].values}"
                )
                logging.warning(
                    "This is likely because of missing JSON entry in the downloaded MIBIG release."
                )
                logging.warning(
                    "Check if this is the case in the resources/mibig/df_mibig_bgcs.csv file"
                )
                logging.warning("Replacing compound value with link to MIBIG...")
                for h in known_bgcs:
                    link_mibig = f"https://mibig.secondarymetabolites.org/repository/{h.split('.')[0]}/index.html"
                    logging.warning(
                        f"{h}: Replacing compound {df_known.loc[h, 'compounds']} with {link_mibig}"
                    )
                    df_known.loc[h, "compounds"] = link_mibig

            df_families.loc[fam_id, "fam_name"] = known_compounds
            df_families.loc[fam_id, "clusters_in_fam"] = len(family)
            df_families.loc[fam_id, "mibig_ids"] = ";".join(known_bgcs)
        else:
            df_families.loc[fam_id, "fam_type"] = "unknown_family"
            bgc_class = ",".join(
                df_clusters.loc[list(family), "bigscape_class"].unique().tolist()
            )
            df_families.loc[fam_id, "fam_name"] = "u_" + bgc_class + "_" + str(fam_id)
            df_families.loc[fam_id, "clusters_in_fam"] = len(family)

        for bgc in family:
            if bgc in df_clusters.index:
                df_clusters.loc[bgc, "fam_id_" + cutoff] = str(fam_id)
                if len(known_bgcs) > 0:
                    df_clusters.loc[bgc, "fam_type_" + cutoff] = "known_family"
                    known_compounds = ";".join(
                        df_known.loc[known_bgcs, "compounds"].tolist()
                    )
                    df_clusters.loc[
                        bgc, "fam_known_compounds_" + cutoff
                    ] = known_compounds
                else:
                    df_clusters.loc[bgc, "fam_type_" + cutoff] = "unknown_family"
                    df_clusters.loc[bgc, "fam_known_compounds_" + cutoff] = (
                        "u_" + bgc_class + "_" + str(fam_id)
                    )

            elif bgc in df_known.index:
                df_known.loc[bgc, "fam_id_" + cutoff] = str(fam_id)
                df_known.loc[bgc, "fam_type_" + cutoff] = "known_family"
                known_compounds = ";".join(
                    df_known.loc[known_bgcs, "compounds"].tolist()
                )
                df_known.loc[bgc, "fam_known_compounds_" + cutoff] = known_compounds

    return df_clusters, df_known, df_families


def get_family_presence(df_clusters, df_genomes, df_families, cutoff):
    """
    Returns BGC family presence absence matrix across genomes
    """

    df_family_presence = pd.DataFrame(
        0, index=df_genomes.index, columns=df_families.index.astype(str)
    )
    for bgc_id in df_clusters.index:
        fam_id = str(df_clusters.loc[bgc_id, "fam_id_" + cutoff])
        genome_id = df_clusters.loc[bgc_id, "genome_id"]
        df_family_presence.loc[genome_id, fam_id] = 1

    return df_family_presence


def run_family_analysis(
    cutoff,
    net_data_path,
    df_clusters,
    df_genomes,
    df_known_all,
    output_dir,
    mibig_table,
    query_name,
):
    logging.info(f"Processing data from BiG-SCAPE with cutoff {cutoff}")
    df_network = get_bigscape_network(net_data_path, cutoff=cutoff)
    G_clusters, family_nodes = get_network_graph(df_network, weight="Raw distance")
    df_network, df_known = remove_single_mibig(df_network, df_known_all, family_nodes)
    G_clusters, family_nodes = get_network_graph(df_network, weight="Raw distance")
    singleton_bgc = [list(fam)[0] for fam in family_nodes if len(fam) == 1]
    family_graphs = get_family_graph(G_clusters)
    df_clusters, df_known, df_families = update_cluster_family(
        df_clusters, df_known, family_nodes, mibig_table, cutoff=cutoff
    )
    df_family_presence = get_family_presence(
        df_clusters, df_genomes, df_families, cutoff=cutoff
    )
    logging.debug(f"Number of genomes: {df_genomes.shape[0]}")
    logging.debug(f"Number of BGCs: {df_clusters.shape[0]}")
    logging.debug(f"Number of edges in the network: {df_network.shape[0]}")
    logging.debug(f"Number of total families: {len(family_nodes)}")
    logging.debug(
        f"Number of total non-single families: {len(family_nodes) - len(singleton_bgc)}"
    )
    logging.debug(f"Number of singleton families in the network: {len(singleton_bgc)}")
    logging.debug(
        f"Number of families with known BGCs: {df_families[df_families.fam_type == 'known_family'].shape[0]}"
    )
    logging.debug(f"Number of known BGCs in the network: {df_known.shape[0]}")
    logging.debug(
        f"Number of BGCs in the top 10 families: {[len(fam) for fam in family_graphs[:10]]}"
    )
    df_known_families = df_families[df_families.fam_type == "known_family"]
    logging.debug(
        f"Some of the common known BGCs:{chr(10)}{chr(10)}{df_known_families.iloc[:20,1:3]}{chr(10)}"
    )
    df_unknown_families = df_families[df_families.fam_type == "unknown_family"]
    logging.debug(
        f"Some of the common unknown BGCs:{chr(10)}{chr(10)}{df_unknown_families.iloc[:20,1:3]}{chr(10)}"
    )

    # Assign index name
    df_network.index.name = "bigscape_edge_id"
    df_known.index.name = "bgc_id"
    df_families.index.name = f"fam_id_{cutoff}"
    df_clusters.index.name = "bgc_id"
    df_family_presence.index.name = "genome_id"

    # Save all the dataframes
    df_network.to_csv(f"{output_dir}/{query_name}_df_network_" + cutoff + ".csv")
    df_known.to_csv(f"{output_dir}/{query_name}_df_known_" + cutoff + ".csv")
    df_families.to_csv(f"{output_dir}/{query_name}_df_families_" + cutoff + ".csv")
    df_clusters.to_csv(f"{output_dir}/{query_name}_df_clusters_" + cutoff + ".csv")
    df_family_presence.to_csv(
        f"{output_dir}/{query_name}_df_family_presence_" + cutoff + ".csv"
    )

    return df_clusters, df_families, df_network


def process_bigscape_output(
    bigscape_directory, as_dir, df_genomes_path, mibig_bgc_table, output_dir
):
    bigscape_directory = Path(bigscape_directory)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # get the latest run
    bigscape_runs = {}
    logging.info("Getting BiG-SCAPE runs...")
    for net_data_path in bigscape_directory.glob("network_files/*"):
        logging.debug(f"Found network data: {net_data_path}")
        selected_run_folder = net_data_path.name
        selected_run_time = selected_run_folder.split("_glocal")[0]
        selected_run_time = datetime.strptime(selected_run_time, "%Y-%m-%d_%H-%M-%S")
        bigscape_runs[selected_run_time] = net_data_path
    run_times = list(bigscape_runs.keys())
    run_times.sort(reverse=True)

    # make sure run times has values
    assert len(run_times) > 0
    selected_run = run_times[0]
    net_data_path = bigscape_runs[selected_run]

    logging.info(f"Processing {selected_run}")
    node_annot_path = (
        net_data_path / "Network_Annotations_Full.tsv"
    )  # Read the BGC table

    df_nodes = pd.read_csv(node_annot_path, index_col="BGC", sep="\t")

    # Generate df_clusters and df_known dataframe
    df_genomes = pd.read_csv(df_genomes_path).set_index("genome_id", drop=False)
    df_genomes.to_csv(f"{output_dir}/df_genome_antismash_summary.csv")

    df_known_all, df_clusters = get_cluster_dataframes(
        df_genomes, df_nodes, mibig_bgc_table, as_dir
    )
    assert (
        len(df_clusters) > 0
    ), f"df_clusters is empty {df_clusters.shape}. Check that the folder data/interim/bgcs contains proper antiSMASH region GenBank files."
    # Enrich dataframes with BiG-SCAPE information on GCCs and GCFs with cutoffs
    df_clusters, df_known_all = add_bigscape_families(
        df_clusters, df_known_all, net_data_path
    )

    # Get GCF data as per the cutoff
    for cutoff in ["0.30", "0.40", "0.50"]:
        df_clusters, df_families, df_network = run_family_analysis(
            cutoff,
            net_data_path,
            df_clusters,
            df_genomes,
            df_known_all,
            output_dir,
            mibig_bgc_table,
            str(selected_run).replace(":", "_"),
        )
    return


if __name__ == "__main__":
    process_bigscape_output(
        sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5]
    )
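As a rough sketch of how this entry point might be called outside of Snakemake; all paths are hypothetical placeholders except `resources/mibig/df_mibig_bgcs.csv`, which the script itself refers to:

# Hypothetical invocation; in the workflow these arguments come from sys.argv.
# The BiG-SCAPE directory is expected to contain network_files/<timestamp>_glocal*/,
# and the most recent run is selected automatically.
process_bigscape_output(
    bigscape_directory="data/interim/bigscape/my_project",
    as_dir="data/interim/antismash/",
    df_genomes_path="data/processed/my_project/df_antismash_summary.csv",
    mibig_bgc_table="resources/mibig/df_mibig_bgcs.csv",
    output_dir="data/processed/my_project/tables",
)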
import json
import logging
import sys
from pathlib import Path

import pandas as pd
from alive_progress import alive_bar

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def combine_bgc_counts(input_json, filter_str="_bgc_counts"):
    container = {}
    logging.info("Reading json files...")
    with alive_bar(len(input_json), title="Merging BGC counts:") as bar:
        for item in input_json:
            item = Path(item)
            genome_id = item.stem
            if filter_str in genome_id:
                genome_id = genome_id.replace(filter_str, "")
            logging.debug(f"Processing {genome_id}")
            with open(item, "r") as f:
                data = json.load(f)
                products = data[genome_id].pop("products")
                data[genome_id] = data[genome_id] | products
                container.update(data)
            bar()
    return container


def write_genome_table(input_json, samples_table, genome_table):
    """
    Write df_genomes.csv table in processed data
    """
    # Accommodate multiple inputs to generate the dataframe
    shell_input = samples_table.split()
    logging.info(f"Reading samples table: {shell_input}")
    dfList = [pd.read_csv(s).set_index("genome_id", drop=False) for s in shell_input]
    df_samples = pd.concat(dfList, axis=0)

    # Handle multiple json
    input_json = Path(input_json)
    logging.info(input_json)
    if input_json.is_file() and input_json.suffix == ".json":
        logging.info(f"Getting BGC overview from a single file: {input_json}")
        input_json_files = input_json

    elif input_json.is_file() and input_json.suffix == ".txt":
        logging.info(f"Getting BGC overview  from a text file: {input_json}")
        with open(input_json, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            input_json_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        input_json_files = [
            Path(file)
            for file in str(input_json).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Getting BGC overview from the given list of {len(input_json_files)} files..."
        )

    bgc_counts = combine_bgc_counts(input_json_files)
    bgc_counts = pd.DataFrame.from_dict(bgc_counts).T

    logging.debug(f"Writing file to: {genome_table}")

    # Generate dataframe
    df_genomes = pd.concat([df_samples, bgc_counts], axis=1)

    # Save dataframes to csv tables
    genome_table = Path(genome_table)
    genome_table.parent.mkdir(parents=True, exist_ok=True)
    df_genomes.to_csv(genome_table, index=False)
    logging.info("Job done")
    return None


if __name__ == "__main__":
    write_genome_table(sys.argv[1], sys.argv[2], sys.argv[3])
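A hedged usage sketch of `write_genome_table`; the file names are hypothetical, and, as in the script, multiple JSON inputs can be passed as a space-separated string or via a `.txt` file listing one path per line:

# Hypothetical paths; each *_bgc_counts.json holds per-genome BGC counts keyed by genome_id.
write_genome_table(
    input_json="genome_A_bgc_counts.json genome_B_bgc_counts.json",
    samples_table="config/my_project/samples.csv",
    genome_table="data/processed/my_project/tables/df_genomes.csv",
)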
import logging
import os
import sys
from shutil import copyfile

import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from Bio import Phylo, SeqIO

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def get_roary_data(
    roary_interim_folder, roary_processed_folder, core=0.99, softcore=0.95, shell=0.15
):
    """
    Copy important files from Roary interim to processed directory

    Parameters
    ----------
    1. roary_interim_folder : str / path
        Location of the output from Roary in the interim directory
    2. roary_processed_folder : str / path
        Location of the processed output directory for important Roary results
    3. core: float (0 to 1)
        default = 0.99
        Fraction of genomes a gene must be present in to be grouped as core (99% <= strains <= 100%)
    4. softcore: float (0 to 1)
        default = 0.95
        Fraction of genomes a gene must be present in to be grouped as softcore (95% <= strains < 99%)
    5. shell: float (0 to 1)
        default = 0.15
        Fraction of genomes a gene must be present in to be grouped as shell (15% <= strains < 95%)
        Remaining genes are assigned to cloud
    Returns
    -------
    1. roary_processed_folder : str / path
        Location of the processed output directory for important Roary results
    """

    gene_presence_path = os.path.join(roary_interim_folder, "gene_presence_absence.csv")
    df_gene_presence_summary = pd.read_csv(
        gene_presence_path, index_col="Gene", low_memory=False
    )

    # Extract gene annotation columns to separate dataframe
    logging.info("Extract gene annotation columns to separate dataframe")
    gene_summary_columns = [
        "Non-unique Gene name",
        "Annotation",
        "No. isolates",
        "No. sequences",
        "Avg sequences per isolate",
        "Genome Fragment",
        "Order within Fragment",
        "Accessory Fragment",
        "Accessory Order with Fragment",
        "QC",
        "Min group size nuc",
        "Max group size nuc",
        "Avg group size nuc",
    ]

    gene_summary_out_interim = os.path.join(
        roary_interim_folder, "df_pangene_summary.csv"
    )
    gene_summary_out_processed = os.path.join(
        roary_processed_folder, "df_pangene_summary.csv"
    )
    df_gene_summary = df_gene_presence_summary[gene_summary_columns].fillna("")

    # Extract locus tags
    logging.info("Extract locus tags")
    df_gene_presence = df_gene_presence_summary.drop(
        columns=gene_summary_columns
    ).fillna("")
    gene_presence_locustag_out_interim = os.path.join(
        roary_interim_folder, "df_gene_presence_locustag.csv"
    )
    gene_presence_locustag_out_processed = os.path.join(
        roary_processed_folder, "df_gene_presence_locustag.csv"
    )
    df_gene_presence.to_csv(gene_presence_locustag_out_interim)
    df_gene_presence.to_csv(gene_presence_locustag_out_processed)

    # Save the gene presence absence binary matrix
    logging.info("Save the gene presence absence binary matrix")
    gene_presence_binary_path = os.path.join(
        roary_interim_folder, "gene_presence_absence.Rtab"
    )
    gene_presence_binary_out_interim = os.path.join(
        roary_interim_folder, "df_gene_presence_binary.csv"
    )
    gene_presence_binary_out_processed = os.path.join(
        roary_processed_folder, "df_gene_presence_binary.csv"
    )

    df_gene_presence_binary = pd.read_csv(
        gene_presence_binary_path, index_col="Gene", sep="\t"
    )
    df_gene_presence_binary.to_csv(gene_presence_binary_out_interim)
    df_gene_presence_binary.to_csv(gene_presence_binary_out_processed)

    # Add locus_tag and pangenome_id from pan_genome_reference.fa file
    logging.info("Add locus_tag and pangenome_id from pan_genome_reference.fa file")
    pan_fasta_path = os.path.join(roary_interim_folder, "pan_genome_reference.fa")
    records = SeqIO.parse(pan_fasta_path, format="fasta")
    for rec in records:
        locus_tag = rec.id
        pan_gene_desc = rec.description.split(" ")
        pan_gene_id = " ".join([pan_gene_id for pan_gene_id in pan_gene_desc[1:]])
        if pan_gene_id in df_gene_summary.index:
            df_gene_summary.loc[pan_gene_id, "locus_tag"] = locus_tag

    # Add pangenome class and save gene summary table
    logging.info("Add pangenome class and save gene summary table")
    total_genomes = df_gene_presence.shape[1]
    for pan_gene_id in df_gene_summary.index:
        no_isolates = df_gene_summary.loc[pan_gene_id, "No. isolates"]
        if no_isolates >= core * total_genomes:
            df_gene_summary.loc[pan_gene_id, "pangenome_class"] = "core"
        elif no_isolates >= softcore * total_genomes:
            df_gene_summary.loc[pan_gene_id, "pangenome_class"] = "softcore"
        elif no_isolates >= shell * total_genomes:
            df_gene_summary.loc[pan_gene_id, "pangenome_class"] = "shell"
        else:
            df_gene_summary.loc[pan_gene_id, "pangenome_class"] = "cloud"
    df_gene_summary.to_csv(gene_summary_out_interim)
    df_gene_summary.to_csv(gene_summary_out_processed)

    # Copy other output files to processed directory
    logging.info("Copy other output files to processed directory")
    copyfile(
        os.path.join(roary_interim_folder, "conserved_vs_total_genes.png"),
        os.path.join(roary_processed_folder, "conserved_vs_total_genes.png"),
    )

    copyfile(
        os.path.join(roary_interim_folder, "Rplots.pdf"),
        os.path.join(roary_processed_folder, "Rplots.pdf"),
    )

    copyfile(
        os.path.join(roary_interim_folder, "pan_genome_reference.fa"),
        os.path.join(roary_processed_folder, "pan_genome_reference.fa"),
    )

    copyfile(
        os.path.join(roary_interim_folder, "summary_statistics.txt"),
        os.path.join(roary_processed_folder, "summary_statistics.txt"),
    )

    copyfile(
        os.path.join(roary_interim_folder, "number_of_conserved_genes.Rtab"),
        os.path.join(roary_processed_folder, "number_of_conserved_genes.Rtab"),
    )

    copyfile(
        os.path.join(roary_interim_folder, "number_of_genes_in_pan_genome.Rtab"),
        os.path.join(roary_processed_folder, "number_of_genes_in_pan_genome.Rtab"),
    )
    return df_gene_presence, df_gene_presence_binary, df_gene_summary


def plot_core_pan_curve(roary_interim_folder, roary_processed_folder):
    """
    Plots pangenome curves for core, pan, new and unique gene against number of genomes

    Parameters
    ----------
    1. roary_interim_folder : str / path
        Location of the output from Roary in the interim directory

    Returns
    -------
    1. roary_processed_folder : str / path
        Location of the processed output directory where two plots will be saved
    """

    core_path = os.path.join(roary_interim_folder, "number_of_conserved_genes.Rtab")
    pan_path = os.path.join(roary_interim_folder, "number_of_genes_in_pan_genome.Rtab")
    new_path = os.path.join(roary_interim_folder, "number_of_new_genes.Rtab")
    uniq_path = os.path.join(roary_interim_folder, "number_of_unique_genes.Rtab")

    df_core_cnt = pd.read_table(core_path, sep="\t")
    df_pan_cnt = pd.read_table(pan_path, sep="\t")
    df_new_cnt = pd.read_csv(new_path, sep="\t")
    df_uniq_cnt = pd.read_csv(uniq_path, sep="\t")

    avg_core = df_core_cnt.sum(0) / df_core_cnt.shape[0]
    max_core = df_core_cnt.max(0)
    min_core = df_core_cnt.min(0)

    avg_pan = df_pan_cnt.sum(0) / df_core_cnt.shape[0]
    max_pan = df_pan_cnt.max(0)
    min_pan = df_pan_cnt.min(0)

    avg_new = df_new_cnt.sum(0) / df_uniq_cnt.shape[0]
    max_new = df_new_cnt.max(0)
    min_new = df_new_cnt.min(0)

    avg_uniq = df_uniq_cnt.sum(0) / df_uniq_cnt.shape[0]
    max_uniq = df_uniq_cnt.max(0)
    min_uniq = df_uniq_cnt.min(0)

    genome_numbers = list(range(1, df_core_cnt.shape[1] + 1))

    # Plot pan and core genome curves in JPEG format
    fig_pan_core = go.Figure()
    fig_pan_core.add_trace(
        go.Scatter(
            x=genome_numbers,
            y=avg_core,
            fill="tonexty",
            mode="lines+markers",
            name="Core",
            hoverinfo="name+x+y",
            error_y=dict(
                type="data",
                symmetric=False,
                array=max_core - avg_core,
                arrayminus=avg_core - min_core,
            ),
        )
    )  # fill down to xaxis

    fig_pan_core.add_trace(
        go.Scatter(
            x=genome_numbers,
            y=avg_pan,
            fill="tonexty",
            mode="lines+markers",
            name="Pan",
            hoverinfo="name+x+y",
            error_y=dict(
                type="data",
                symmetric=False,
                array=max_pan - avg_pan,
                arrayminus=avg_pan - min_pan,
            ),
        )
    )  # fill to trace0 y

    fig_pan_core.update_layout(
        autosize=False,
        width=800,
        height=500,
        margin=dict(l=20, r=20, b=30, t=30, pad=4),
        title_text="Pangenome and coregenome curve",
        xaxis_title_text="#Genomes",
        yaxis_title_text="#Genes",
        paper_bgcolor="White",
    )

    fig_pan_core_path = os.path.join(roary_processed_folder, "pan_core_curve.jpeg")
    if not os.path.isfile(fig_pan_core_path):
        fig_pan_core.write_image(fig_pan_core_path)

    # Plot new and unique genome curves in JPEG format
    fig_new_uniq = go.Figure()
    fig_new_uniq.add_trace(
        go.Scatter(
            x=genome_numbers,
            y=avg_new,
            fill="tonexty",
            mode="lines+markers",
            name="New",
            hoverinfo="name+x+y",
            error_y=dict(
                type="data",
                symmetric=False,
                array=max_new - avg_new,
                arrayminus=avg_new - min_new,
            ),
        )
    )  # fill down to xaxis

    fig_new_uniq.add_trace(
        go.Scatter(
            x=genome_numbers,
            y=avg_uniq,
            fill="tonexty",
            mode="lines+markers",
            name="Unique",
            hoverinfo="name+x+y",
            error_y=dict(
                type="data",
                symmetric=False,
                array=max_uniq - avg_uniq,
                arrayminus=avg_uniq - min_uniq,
            ),
        )
    )  # fill to trace0 y

    fig_new_uniq.update_layout(
        autosize=False,
        width=800,
        height=500,
        margin=dict(l=20, r=20, b=30, t=30, pad=4),
        title_text="New and unique genes curve",
        xaxis_title_text="#Genomes",
        yaxis_title_text="#Genes",
        paper_bgcolor="White",
    )

    fig_new_uniq_path = os.path.join(roary_processed_folder, "new_unique_curve.jpeg")
    if not os.path.isfile(fig_new_uniq_path):
        fig_new_uniq.write_image(fig_new_uniq_path)

    return


def plot_pan_freq_plot(df_gene_presence_binary, roary_processed_folder):
    """
    Plots the pangenome gene frequency histogram (number of genes against the number of genomes they are present in)

    Parameters
    ----------
    1. df_gene_presence_binary : pd.DataFrame
        Dataframe with presence absence matrix of genes in the pangenome

    Returns
    -------
    1. roary_processed_folder : str / path
        Location of the processed output directory where gene frequency plot will be saved
    """

    df_freq = pd.DataFrame(index=df_gene_presence_binary.index, columns=["#Genomes"])
    df_freq["#Genomes"] = df_gene_presence_binary.sum(axis=1)
    nbins = df_gene_presence_binary.shape[1]
    fig = px.histogram(df_freq, x="#Genomes", nbins=nbins)

    fig.update_layout(
        autosize=False,
        width=800,
        height=500,
        margin=dict(l=20, r=20, b=30, t=30, pad=4),
        title_text="Pangenome frequency plot",  # title of plot
        xaxis_title_text="#Genomes",  # xaxis label
        yaxis_title_text="#Genes",  # yaxis label
    )
    fig_out_path = os.path.join(roary_processed_folder, "gene_frequency.jpeg")
    if not os.path.isfile(fig_out_path):
        fig.write_image(fig_out_path)

    return


def plot_pan_pie_chart(
    df_gene_summary,
    total_genomes,
    roary_processed_folder,
    core=0.99,
    softcore=0.95,
    shell=0.15,
):
    """
    Plots a pie chart of the pangenome classes (core, softcore, shell, cloud)

    Parameters
    ----------
    1. df_gene_summary : pd.DataFrame
        Dataframe with gene summary including pangenome_class
    2. total_genomes : int
        Number of total genomes in the pangenome
    3. roary_processed_folder : str / path
        Location of the processed output directory for Roary
    4. core: float (0 to 1)
        default = 0.99
        Fraction of genomes a gene must be present in to be grouped as core (99% <= strains <= 100%)
    5. softcore: float (0 to 1)
        default = 0.95
        Fraction of genomes a gene must be present in to be grouped as softcore (95% <= strains < 99%)
    6. shell: float (0 to 1)
        default = 0.15
        Fraction of genomes a gene must be present in to be grouped as shell (15% <= strains < 95%)
        Remaining genes are assigned to cloud

    Returns
    -------
    1. roary_processed_folder : str / path
        Location of the processed output directory where pangenome pie chart plot will be saved
    """

    # Plot the pangenome pie chart
    plt.figure(figsize=(10, 10))

    core_genes = df_gene_summary[df_gene_summary["pangenome_class"] == "core"].shape[0]
    softcore_genes = df_gene_summary[
        df_gene_summary["pangenome_class"] == "softcore"
    ].shape[0]
    shell_genes = df_gene_summary[df_gene_summary["pangenome_class"] == "shell"].shape[
        0
    ]
    cloud_genes = df_gene_summary[df_gene_summary["pangenome_class"] == "cloud"].shape[
        0
    ]
    total_genes = df_gene_summary.shape[0]

    def my_autopct(pct):
        val = int(pct * total_genes / 100.0)
        return "{v:d}".format(v=val)

    fig = plt.pie(
        [core_genes, softcore_genes, shell_genes, cloud_genes],
        labels=[
            "core  (strains >= %d)" % (total_genomes * core),
            "soft-core  (%d <= strains < %d)"
            % (total_genomes * softcore, total_genomes * core),
            "shell\n(%d <= strains < %d)"
            % (total_genomes * shell, total_genomes * softcore),
            "cloud\n(strains < %d)" % (total_genomes * shell),
        ],
        explode=[0.1, 0.05, 0.02, 0],
        radius=0.9,
        colors=[
            (0, 0, 1, float(x) / total_genes)
            for x in (core_genes, softcore_genes, shell_genes, cloud_genes)
        ],
        autopct=my_autopct,
        textprops={"fontsize": 10, "fontweight": "bold"},
    )

    fig_out_path = os.path.join(roary_processed_folder, "pangenome_pie.jpeg")
    if not os.path.isfile(fig_out_path):
        plt.savefig(
            fig_out_path,
            facecolor="w",
            edgecolor="w",
            orientation="portrait",
            format="jpeg",
            transparent=False,
            bbox_inches="tight",
            pad_inches=0.1,
        )

    return


def plot_tree_presence_map(
    t, df_gene_presence_binary, df_genomes_tree, df_roary_sorted, roary_processed_folder
):
    """
    Plots gene presence heatmap with phylogenetic tree

    Parameters
    ----------
    1. t : Bio.Phylo tree
        Phylogenetic tree object of Bio.Phylo from autoMLST
    2. df_gene_presence_binary : pd.DataFrame
        Dataframe with presence absence matrix of genes in the pangenome
    3. df_genomes_tree : pd.DataFrame
        Dataframe with genomes ordered phylogenetically
    4. df_roary_sorted : pd.DataFrame
        Sorted matrix with gene presence across the genomes
    5. roary_processed_folder : str / path
        Location of the processed output directory for Roary

    Returns
    -------
    1. heatmap : jpeg
        Heatmap of gene presence with genomes in phylogenetic order
    """

    # Max distance to create better plot
    mdist = max([t.distance(t.root, x) for x in t.get_terminals()])

    # TODO: Consider clustering methods for sorting the matrix

    # Plot presence/absence matrix against the tree
    fig = plt.figure(figsize=(30, df_roary_sorted.shape[1] / 6))

    ax1 = plt.subplot2grid((1, 40), (0, 10), colspan=30)
    a = ax1.imshow(
        df_roary_sorted.T,
        cmap=plt.cm.Blues,
        vmin=0,
        vmax=1,
        aspect="auto",
        interpolation="none",
    )
    ax1.set_yticks([])
    ax1.set_xticks([])
    ax1.axis("off")

    ax = fig.add_subplot(1, 2, 1)
    ax = plt.subplot2grid((1, 40), (0, 0), colspan=10)

    fig.subplots_adjust(wspace=1, hspace=0)

    ax1.set_title("Roary matrix\n(%d gene clusters)" % df_roary_sorted.shape[0])

    strain_dict = dict()
    for x in t.get_terminals():
        if x.name in df_genomes_tree.index:
            strain_dict[x.name] = df_genomes_tree.loc[x.name, "strain"]
        else:
            genome_id = x.name[:-1] + "." + x.name[-1]
            strain_dict[x.name] = df_genomes_tree.loc[genome_id, "strain"]

    # Tree labels are extracted from the column 'strain' of df_genomes_tree
    Phylo.draw(
        t,
        axes=ax,
        show_confidence=False,
        label_func=lambda x: strain_dict[x.name] if x.is_terminal() else "",
        xticks=([],),
        yticks=([],),
        ylabel=("",),
        xlabel=("",),
        xlim=(-0.01, mdist + 0.01),
        axis=("off",),
        title=("autoMLST tree\n(%d strains)" % df_roary_sorted.shape[1],),
    )

    fig_out_path = os.path.join(roary_processed_folder, "phylo_presence_heatmap.jpeg")
    if not os.path.isfile(fig_out_path):
        plt.savefig(
            fig_out_path,
            facecolor="w",
            edgecolor="w",
            orientation="portrait",
            format="jpeg",
            transparent=False,
            bbox_inches="tight",
            pad_inches=0.1,
        )

    return plt.show()


def sort_roary_matrix(
    df_gene_presence_binary,
    df_genomes_tree,
    roary_interim_folder,
    roary_processed_folder,
):
    """
    Sort the Roary presence matrix: genomes in phylogenetic order, genes in decreasing order of the number of strains they are present in

    Parameters
    ----------
    1. df_gene_presence_binary : pd.DataFrame
        Dataframe with presence absence matrix of genes in the pangenome
    2. df_genomes_tree : pd.DataFrame
        Dataframe with genomes ordered phylogenetically

    Returns
    -------
    1. df_roary_sorted : pd.DataFrame
        Sorted matrix with gene presence across the genomes
        Saves sorted matrix as csv table
    """

    # Gene presence matrix
    df_roary = df_gene_presence_binary.copy()
    # Sort the matrix by the sum of strains presence
    sorted_idx = df_roary.sum(axis=1).sort_values(ascending=False).index
    df_roary_sorted = df_roary.loc[sorted_idx]
    # Sort the matrix according to phylogeny
    df_roary_sorted = df_roary_sorted[df_genomes_tree.index]

    roary_sorted_interim = os.path.join(
        roary_interim_folder, "df_roary_tree_sum_sorted.csv"
    )
    roary_sorted_processed = os.path.join(
        roary_processed_folder, "df_roary_tree_sum_sorted.csv"
    )

    df_roary_sorted.to_csv(roary_sorted_interim)
    df_roary_sorted.to_csv(roary_sorted_processed)

    # TODO: Consider clustering methods for sorting the matrix

    return df_roary_sorted


if __name__ == "__main__":
    roary_interim_folder = sys.argv[1]
    roary_processed_folder = sys.argv[2]
    automlst_interim_folder = sys.argv[3]
    df_gene_presence, df_gene_presence_binary, df_gene_summary = get_roary_data(
        roary_interim_folder,
        roary_processed_folder,
        core=0.99,
        softcore=0.95,
        shell=0.15,
    )
    plot_core_pan_curve(roary_interim_folder, roary_processed_folder)
    plot_pan_freq_plot(df_gene_presence_binary, roary_processed_folder)
    total_genomes = float(df_gene_presence_binary.shape[1])
    plot_pan_pie_chart(
        df_gene_summary,
        total_genomes,
        roary_processed_folder,
        core=0.99,
        softcore=0.95,
        shell=0.15,
    )

    # Read tree file
    newick_path = os.path.join(automlst_interim_folder, "final.newick")
    t = Phylo.read(newick_path, "newick")
    genome_tree_table_path = os.path.join(
        automlst_interim_folder, "df_genomes_tree.csv"
    )
    df_genomes_tree = pd.read_csv(genome_tree_table_path, index_col="genome_id")

    # Sort the matrix according to phylogeny
    df_roary_sorted = sort_roary_matrix(
        df_gene_presence_binary,
        df_genomes_tree,
        roary_interim_folder,
        roary_processed_folder,
    )
    plot_tree_presence_map(
        t,
        df_gene_presence_binary,
        df_genomes_tree,
        df_roary_sorted,
        roary_processed_folder,
    )
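The pangenome class assignment above boils down to a threshold rule on the `No. isolates` column. A minimal, self-contained sketch using the script's default cutoffs:

def classify_gene(no_isolates, total_genomes, core=0.99, softcore=0.95, shell=0.15):
    """Assign a pangenome class from the number of genomes a gene occurs in."""
    if no_isolates >= core * total_genomes:
        return "core"
    if no_isolates >= softcore * total_genomes:
        return "softcore"
    if no_isolates >= shell * total_genomes:
        return "shell"
    return "cloud"

# With 100 genomes: present in all 100 -> core, present in only 10 -> cloud
assert classify_gene(100, 100) == "core"
assert classify_gene(10, 100) == "cloud"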
import json
import os
import sys
from shutil import copyfile

import pandas as pd
from Bio import Phylo


def copy_automlst_out(automlst_interim_folder, automlst_processed_folder):
    """
    Copy important files from autoMLST interim to processed directory

    Parameters
    ----------
    1. automlst_interim_folder : str / path
        Location of the output from autoMLST in the interim directory

    Returns
    -------
    1. automlst_processed_folder : str / path
        Location of the processed output directory for important autoMLST results
    """

    files_to_copy = [
        "raxmlpart.txt",
        "raxmlpart.txt.bionj",
        "raxmlpart.txt.contree",
        "raxmlpart.txt.iqtree",
        "raxmlpart.txt.log",
        "raxmlpart.txt.mldist",
        "raxmlpart.txt.treefile",
    ]

    for tree_file in files_to_copy:
        from_path = os.path.join(automlst_interim_folder, tree_file)
        to_path = os.path.join(automlst_processed_folder, tree_file)

        if os.path.isfile(from_path):
            copyfile(from_path, to_path)

    # Newick tree file selected for downstream analysis by the default automlst_wrapper
    treefile = os.path.join(automlst_interim_folder, "raxmlpart.txt.treefile")
    out_newick_processed = os.path.join(automlst_processed_folder, "final.newick")
    out_newick_interim = os.path.join(automlst_interim_folder, "final.newick")

    if os.path.isfile(treefile):
        copyfile(treefile, out_newick_interim)
        copyfile(treefile, out_newick_processed)

    # Save list of MLST genes in pandas table (currently with only index with TIGR IDs)
    aligned_dir = os.path.join(automlst_interim_folder, "aligned")
    mlst_out_path_processed = os.path.join(
        automlst_processed_folder, "df_mlst_genes.csv"
    )
    mlst_out_path_interim = os.path.join(automlst_interim_folder, "df_mlst_genes.csv")

    mlst_gene_list = os.listdir(aligned_dir)

    df_mlst = pd.DataFrame(index=mlst_gene_list, columns=["aligned_path"])
    for mlst_gene in df_mlst.index:
        df_mlst.loc[mlst_gene, "aligned_path"] = os.path.join(
            aligned_dir, mlst_gene + ".faa"
        )
    df_mlst.index.name = "mlst_gene"

    df_mlst.to_csv(mlst_out_path_interim)
    df_mlst.to_csv(mlst_out_path_processed)

    return None


def get_genome_tree_table(
    automlst_processed_folder, prokka_interim_folder, gtdb_interim_folder
):
    """
    Build a combined genome table ordered by the phylogenetic tree, merging organism info (from Prokka) and GTDB taxonomy

    Parameters
    ----------
    1. automlst_processed_folder : str / path
        Location of the processed output directory for important autoMLST results
    2. prokka_interim_folder : str/ path
        Path to get organism information for each genome as used in prokka
    3. gtdb_interim_folder : str/ path
        Path to get json files with gtdb information

    Returns
    -------
    1. df_genomes_tree : pd.DataFrame
        Combined dataframe of organism info ordered by the phylogenetic tree (final.newick file)
    """

    newick_path = os.path.join(automlst_processed_folder, "final.newick")

    t = Phylo.read(newick_path, "newick")

    # Max distance to create better plot
    # mdist = max([t.distance(t.root, x) for x in t.get_terminals()])

    # Sort the matrix according to tip labels in the tree
    genome_ids_list = [
        x.name
        if x.name in os.listdir(prokka_interim_folder)
        else x.name[:-1] + "." + x.name[-1]
        for x in t.get_terminals()
    ]
    # The above is required in the default version of automlst_wrapper
    # To be fixed in the original repo

    columns_list = [
        "genus_original",
        "species_original",
        "strain",
        "domain",
        "phylum",
        "class",
        "order",
        "family",
        "genus",
        "species",
    ]
    df_genomes_tree = pd.DataFrame(index=genome_ids_list, columns=columns_list)
    df_genomes_tree.index.name = "genome_id"

    for genome_id in df_genomes_tree.index:
        # Reading organism info used for the prokka run, including strain ID
        org_info_path = os.path.join(
            prokka_interim_folder, genome_id, "organism_info.txt"
        )
        with open(org_info_path, "r") as file_obj:
            org_info = file_obj.readlines()[0]
            df_genomes_tree.loc[genome_id, "genus_original"] = org_info.split(",")[0]
            df_genomes_tree.loc[genome_id, "species_original"] = org_info.split(",")[1]
            df_genomes_tree.loc[genome_id, "strain"] = org_info.split(",")[2]

        # Reading gtdb metadata from JSON files
        gtdb_json_path = os.path.join(gtdb_interim_folder, genome_id + ".json")
        with open(gtdb_json_path, "r") as f:
            gtdb_dict = json.load(f)
            df_genomes_tree.loc[genome_id, "domain"] = gtdb_dict["gtdb_taxonomy"][
                "domain"
            ]
            df_genomes_tree.loc[genome_id, "phylum"] = gtdb_dict["gtdb_taxonomy"][
                "phylum"
            ]
            df_genomes_tree.loc[genome_id, "class"] = gtdb_dict["gtdb_taxonomy"][
                "class"
            ]
            df_genomes_tree.loc[genome_id, "order"] = gtdb_dict["gtdb_taxonomy"][
                "order"
            ]
            df_genomes_tree.loc[genome_id, "family"] = gtdb_dict["gtdb_taxonomy"][
                "family"
            ]
            df_genomes_tree.loc[genome_id, "genus"] = gtdb_dict["gtdb_taxonomy"][
                "genus"
            ]
            df_genomes_tree.loc[genome_id, "organism"] = gtdb_dict["gtdb_taxonomy"][
                "species"
            ]
            try:
                df_genomes_tree.loc[genome_id, "species"] = gtdb_dict["gtdb_taxonomy"][
                    "species"
                ].split(" ")[1]
            except IndexError:  # leave blank for empty taxonomy
                df_genomes_tree.loc[genome_id, "species"] = ""

    genomes_tree_path_interim = os.path.join(
        automlst_processed_folder, "df_genomes_tree.csv"
    )
    genomes_tree_path_processed = os.path.join(
        automlst_processed_folder, "df_genomes_tree.csv"
    )

    df_genomes_tree.to_csv(genomes_tree_path_interim)
    df_genomes_tree.to_csv(genomes_tree_path_processed)

    return df_genomes_tree


if __name__ == "__main__":
    automlst_interim_folder = sys.argv[1]
    automlst_processed_folder = sys.argv[2]
    prokka_interim_folder = sys.argv[3]
    gtdb_interim_folder = sys.argv[4]
    copy_automlst_out(automlst_interim_folder, automlst_processed_folder)
    df_genomes_tree = get_genome_tree_table(
        automlst_processed_folder, prokka_interim_folder, gtdb_interim_folder
    )
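
As a usage sketch: inside the workflow, this script is invoked by the make_phylo_tree rule (shown further below among the rule snippets) with four positional arguments. A hypothetical standalone call, using placeholder folder names in place of the project-specific paths, could look like this:

python workflow/bgcflow/bgcflow/data/make_phylo_tree.py \
    <automlst_interim_folder> \
    <automlst_processed_folder> \
    <prokka_interim_folder> \
    <gtdb_interim_folder>

The interim folder holds the raw autoMLST output (including final.newick), while the Prokka and GTDB folders provide the per-genome organism_info.txt and taxonomy JSON files read above.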
import glob
import gzip
import io
import logging
import sys
from pathlib import Path

import ncbi_genome_download as ngd
import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.INFO)


def build_prokka_refgbff(name, prokka_db_table, outfile):
    """
    Given a project name and a csv/json table path, download all NCBI accessions
    and write a concatenated GenBank file for the prokka --proteins parameter.
    """
    download_dir = Path(f"resources/prokka_db/{name}")
    prokka_db_table = Path(prokka_db_table).resolve()

    # Generate download directory
    download_dir.mkdir(parents=True, exist_ok=True)

    # Get lists of accession ids and their sources
    logging.info(f"Reading input file: {prokka_db_table}...")
    if prokka_db_table.suffix == ".csv":
        df = pd.read_csv(prokka_db_table)
    elif prokka_db_table.suffix == ".json":
        df = pd.read_json(prokka_db_table)
    else:
        raise ValueError(
            f"Unsupported input format: {prokka_db_table.suffix}. Expected .csv or .json."
        )

    ngd_input = {"refseq": [], "genbank": []}

    logging.info("Donwloading genomes from NCBI...")
    for acc in df.Accession:
        if acc.startswith("GCF"):
            ngd_input["refseq"].append(acc)
        elif acc.startswith("GCA"):
            ngd_input["genbank"].append(acc)
        else:
            raise

    # Download gbff with NCBI genome download
    for s in ngd_input.keys():
        if ngd_input[s]:
            acc_list = ",".join(ngd_input[s])
            logging.debug(f"Downloading {s} genome: {acc}")
            ngd.download(
                section=s,
                file_formats="genbank",
                assembly_accessions=acc_list,
                output=download_dir,
                groups="bacteria",
            )

    # Concatenate gbff
    logging.info("Concatenating genomes into a single file...")
    reference_files = glob.glob(f"{download_dir}/*/*/*/*.gbff.gz")

    with open(outfile, "w") as f_out:
        for names in reference_files:
            with io.TextIOWrapper(gzip.open(names, "r")) as infile:
                f = infile.read()
                f_out.write(f)
            f_out.write("\n")

        logging.info(f"Reference genomes saved as: {str(f_out)}")

    return


if __name__ == "__main__":
    build_prokka_refgbff(sys.argv[1], sys.argv[2], sys.argv[3])
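
For reference, the make_prokka_db rule (shown later among the prokka rule snippets) calls this script with three positional arguments: the custom reference set name, the accession table, and the output GenBank file. A hedged example with placeholder names:

python workflow/bgcflow/bgcflow/data/make_prokka_db.py \
    <prokka_db_name> \
    <prokka_db_table>.csv \
    <output_refgbff>.gbff

Downloads are staged under resources/prokka_db/<prokka_db_name>/ before the downloaded .gbff.gz files are concatenated into the single output file intended for prokka --proteins.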
import json
import logging
import sys
from pathlib import Path

import pandas as pd
from alive_progress import alive_bar

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def gather_seqfu_json(seqfu_input, output_file):
    """
    Merge seqfu json input into one csv table
    """
    # Accommodate multiple inputs to generate the dataframe
    shell_input = Path(seqfu_input)
    logging.info(shell_input)
    if shell_input.is_file() and shell_input.suffix == ".json":
        logging.info(f"Getting SeqFu overview from a single file: {shell_input}")
        shell_input_files = [shell_input]

    elif shell_input.is_file() and shell_input.suffix == ".txt":
        logging.info(f"Getting SeqFu overview from a text file: {shell_input}")
        with open(shell_input, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            shell_input_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        shell_input_files = [
            Path(file)
            for file in str(shell_input).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Getting SeqFu overview from the given list of {len(shell_input_files)} files..."
        )

    logging.info("Merging SeqFu results...")
    container = {}
    with alive_bar(len(shell_input_files), title="Updating SeqFu information:") as bar:
        for d in shell_input_files:
            logging.debug(f"Reading input: {Path(d).name}")
            with open(d, "r") as f:
                read = json.load(f)
                genome_id = read[0]["Filename"]
                read[0].pop("Filename")
                container[genome_id] = read[0]
            bar()
    logging.info("Converting dictionary to table...")
    df = pd.DataFrame.from_dict(container).T
    df.index.name = "genome_id"
    df.to_csv(output_file)
    logging.info("Job done!")
    return


if __name__ == "__main__":
    gather_seqfu_json(sys.argv[1], sys.argv[2])
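
In the workflow, the SeqFu gathering rule (see the seqfu snippet near the end of the rules section) first writes the list of per-genome JSON paths to a temporary text file and then passes that file together with the output table path. A sketch of the same invocation with illustrative paths:

python workflow/bgcflow/bgcflow/data/make_seqfu_table.py \
    data/interim/tmp/<project_name>/df_seqfu.txt \
    <output_csv>

Here df_seqfu.txt is a plain-text file listing the SeqFu JSON outputs, either space-separated on one line or one path per line, matching the two branches handled above.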
import json
import logging
import sqlite3
import sys
from pathlib import Path

import pandas as pd
from alive_progress import alive_bar

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def gcf_hits(df_gcf, cutoff=900):
    """
    Filter the BiG-SLICE query result by the distance-to-model threshold and return the filtered membership table.
    """
    mask = df_gcf.loc[:, "membership_value"] <= cutoff
    df_gcf_filtered = df_gcf[mask]
    bgcs = df_gcf_filtered.bgc_id.unique()
    gcfs = df_gcf_filtered.gcf_id.unique()
    logging.debug(f"BiG-SLICE query with BiG-FAM run 6, distance cutoff {cutoff}")
    logging.debug(f"Number of bgc hits : {len(bgcs)}/{len(df_gcf.bgc_id.unique())}")
    logging.debug(f"Number of GCF hits : {len(gcfs)}")
    return df_gcf_filtered


def grab_gcf_model_summary(gcf_query, database):
    """
    Summarize GCF models from a given list of BiG-FAM GCF ids
    gcf_query = list of BiG-FAM GCF ids
    database = the full run database of BiG-SLICE
    """

    def grab_mibig_members(q, mibig_bigfam):
        """
        Summarize information of mibig members in a model
        """
        mibig_members = list(set(q.bgc_id) & set(mibig_bigfam.id))
        mibig_members = [mibig_bigfam.loc[i, "name"] for i in mibig_members]
        return mibig_members

    # initiate connection
    logging.info(f"Initiating SQL connection to {database}...")
    conn = sqlite3.connect(database)

    # load main tables
    logging.info("Reading gcf_membership table...")
    df_bigfam = pd.read_sql_query("select * from gcf_membership;", conn)
    logging.info("Reading bgc table...")
    df_bgc_bigfam = pd.read_sql_query("select * from bgc;", conn)

    # Load BiG-FAM dataset
    mibig_bigfam = df_bgc_bigfam[df_bgc_bigfam.loc[:, "type"] == "mibig"].set_index(
        "id", drop=False
    )

    # Filter for hit queries
    # df_bigfam_query = df_bigfam[df_bigfam.loc[:, "gcf_id"].isin(gcf_query)]

    # return information of each model
    logging.info(f"Summarizing information for {len(gcf_query)} GCF models...")
    summary = {}
    with alive_bar(len(gcf_query)) as bar:
        for g in gcf_query:
            values = {}

            q = df_bigfam[df_bigfam.loc[:, "gcf_id"] == g]
            q = q[q.loc[:, "rank"] == 0]  # select only first hits

            # get core member info
            q_core = q[q.loc[:, "membership_value"] <= 900]
            q_core_mibig = grab_mibig_members(q_core, mibig_bigfam)

            # get putative member info
            q_putative = q[q.loc[:, "membership_value"] > 900]
            q_putative_mibig = grab_mibig_members(q_putative, mibig_bigfam)

            values["core_member"] = len(q_core)
            values["putative_member"] = len(q_putative)
            values["core_member_mibig"] = q_core_mibig
            values["putative_member_mibig"] = q_putative_mibig
            values["core_member_mibig_count"] = len(q_core_mibig)
            values["core_member_mibig_bool"] = len(q_core_mibig) > 0
            values["putative_member_mibig_count"] = len(q_putative_mibig)
            values["putative_member_mibig_bool"] = len(q_putative_mibig) > 0
            values[
                "link to BiG-FAM"
            ] = f"https://bigfam.bioinformatics.nl/run/6/gcf/{g}"
            summary[str(g)] = values

            bar()

    return summary


def summarize_bigslice_query(
    bigslice_query_path,
    output_path,
    database_path="resources/bigslice/full_run_result/result/data.db",
    cutoff=900,
):
    """
    Summarize bigslice query result.

    Input:
    - bigslice_query_path

    Output:
    - A folder to store two output files:
        1. query_network.csv to build Cytoscape Network
        2. JSON file summarizing GCF model hits
    """
    bigslice_query = Path(bigslice_query_path)
    # database = Path(database_path)
    output = Path(output_path)
    output.mkdir(parents=True, exist_ok=True)

    df_gcf_membership = pd.read_csv(bigslice_query / "gcf_membership.csv")
    df_bgc = pd.read_csv(bigslice_query / "bgc.csv")

    # Create edge table network for cytoscape enriched with ids for mapping metadata (bgc_name, genome_id)
    logging.info("Generating network table of BiG-SLICE hits...")
    data = gcf_hits(df_gcf_membership, cutoff)
    bgc_info = df_bgc.set_index("id", drop=True)
    bgc_info["bgc_id"] = [str(i).replace(".gbk", "") for i in bgc_info["orig_filename"]]
    for i in data.index:
        bgc_id = data.loc[i, "bgc_id"]
        data.loc[i, "bgc_id"] = str(bgc_info.loc[bgc_id, "bgc_id"])
    data.to_csv(output / "query_network.csv", index=False)

    # Summarizing GCF model hits
    logging.info("Summarizing GCF model hits...")
    gcf_query = list(data.gcf_id.unique())
    gcf_summary = grab_gcf_model_summary(gcf_query, database_path)

    # outputting as json
    with open(output / "gcf_summary.json", "w") as outfile:
        json.dump(gcf_summary, outfile, indent=4)

    # outputting as csv table
    as_table = pd.DataFrame.from_dict(gcf_summary).T
    as_table.index.name = "gcf_id"
    as_table.to_csv(output / "gcf_summary.csv")

    logging.info("Job done")
    return


if __name__ == "__main__":
    summarize_bigslice_query(
        sys.argv[1], sys.argv[2], database_path=sys.argv[3], cutoff=float(sys.argv[4])
    )
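
As wired up in the BiG-SLICE query summary rule (shown later), the script receives the query result folder, an output folder, the BiG-FAM database path, and the distance cutoff. An illustrative invocation using the default database location and cutoff defined above, with placeholder folders:

python workflow/bgcflow/bgcflow/data/summarize_bigslice_query.py \
    <bigslice_query_folder> \
    <output_folder> \
    resources/bigslice/full_run_result/result/data.db \
    900

The query folder is expected to contain the gcf_membership.csv and bgc.csv tables produced by the bigslice --query run.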
import sys

from Bio import SeqIO


def update_gbk_automlst(input_gbk, out_gbk, genome_id):
    """
    Update organism field with accession ID for autoMLST run
    """
    new_seq_records = []

    with open(input_gbk, "r") as input_handle:
        for seq_record in SeqIO.parse(input_handle, "genbank"):
            seq_record.annotations["organism"] = genome_id
            new_seq_records.append(seq_record)

    SeqIO.write(new_seq_records, out_gbk, "genbank")

    return None


if __name__ == "__main__":
    update_gbk_automlst(sys.argv[1], sys.argv[2], sys.argv[3])
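
Within the workflow, the autoMLST preparation rule (see the automlst wrapper snippets below) calls this script once per genome with the input GenBank file, the output GenBank file, and the genome ID to write into the organism field. A minimal, hypothetical invocation with placeholder paths:

python workflow/bgcflow/bgcflow/features/prep_automlst.py \
    <input_genbank>.gbk \
    <output_genbank>.gbk \
    <genome_id>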
shell:
    """
    download-antismash-databases --database-dir resources/antismash_db 2>> {log}
    antismash --version >> {log}
    antismash --check-prereqs >> {log}
    """
shell:
    """
    antismash --genefinding-tool {params.genefinding} --output-dir {params.folder} --cb-general --cb-subclusters --cb-knownclusters -c {threads} {input.gbk} --logfile {log} 2>> {log}
    """
shell:
    """
    download-antismash-databases --database-dir {output} &>> {log}
    antismash --version >> {log}
    antismash --database {output} --prepare-data &>> {log}
    """
shell:
    """
    set +e

    # Find the latest existing JSON output for this strain
    latest_version=$(ls -d data/interim/antismash/*/{wildcards.strains}/{wildcards.strains}.json | grep {wildcards.strains} | sort -r | head -n 1 | cut -d '/' -f 4) 2>> {log}

    if [ -n "$latest_version" ]; then
        # Use existing JSON result as starting point
        old_json="data/interim/antismash/$latest_version/{wildcards.strains}/{wildcards.strains}.json"
        echo "Using existing JSON from $old_json as starting point..." >> {log}
        antismash_input="--reuse-result $old_json"
    else
        # No existing JSON result found, use genbank input
        echo "No existing JSON result found, starting AntiSMASH from scratch..." >> {log}
        antismash_input="{input.gbk}"
    fi

    # Run AntiSMASH
    antismash --genefinding-tool {params.genefinding} --output-dir {params.folder} \
        --database {input.resources} \
        --cb-general --cb-subclusters --cb-knownclusters -c {threads} $antismash_input --logfile {log} 2>> {log}

    # Check if the run failed due to changed detection results
    if grep -q "ValueError: Detection results have changed. No results can be reused" {log}; then
        # Use genbank input instead
        echo "Previous JSON result is invalid, starting AntiSMASH from scratch..." >> {log}
        antismash --genefinding-tool {params.genefinding} --output-dir {params.folder} \
            --database {input.resources} \
            --cb-general --cb-subclusters --cb-knownclusters -c {threads} {input.gbk} --logfile {log} 2>> {log}
    fi
    """
shell:
    """
    base_dir=$PWD
    mkdir {output.dir}
    (cd {output.dir} && for item in $(ls $base_dir/{input.dir}); do ln -s $base_dir/{input.dir}/$item $(basename $item); done) 2>> {log}
    """
shell:
    """
    mkdir -p data/interim/arts/antismash-{wildcards.version}/{wildcards.strains}
    python {params.resources}/artspipeline1.py {input.antismash} {params.ref} -rd data/interim/arts/antismash-{wildcards.version}/{wildcards.strains} -cpu {threads} -opt kres,phyl,duf -khmms {params.khmms} 2>> {log}
    """
SnakeMake From line 15 of rules/arts.smk
shell:
    """
    python workflow/bgcflow/bgcflow/data/arts_extract.py {input.arts}/tables/bgctable.tsv {input.json} {output.json} {wildcards.strains} 2>> {log}
    """
SnakeMake From line 31 of rules/arts.smk
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}/{wildcards.version}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_arts.txt"
    echo '{input.json}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/arts_gather.py $INPUT_JSON {input.changelog} {output.table} 2>> {log}
    rm $INPUT_JSON
    """
SnakeMake From line 50 of rules/arts.smk
shell:
    """
    set -e
    mkdir -p resources 2>> {log}
    wget {params.source}/archive/refs/tags/v{params.version}.zip -O resources/automlst-simplified-wrapper-v{params.version}.zip 2>> {log}
    (cd resources && unzip -o automlst-simplified-wrapper-v{params.version}.zip && rm automlst-simplified-wrapper-v{params.version}.zip) &>> {log}
    (cd resources/automlst-simplified-wrapper-{params.version} && unzip -o reducedcore.zip) &>> {log}
    cp -r resources/automlst-simplified-wrapper-{params.version}/* {output.folder}/. 2>> {log}
    rm -rf resources/automlst-simplified-wrapper-{params.version} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/features/prep_automlst.py {input.gbk} {output.auto_gbk} {wildcards.strains} 2>> {log}
    """
shell:
    """
    mkdir -p "data/interim/automlst_wrapper/{wildcards.name}/singles" 2>> {log}
    python resources/automlst-simplified-wrapper-main/simplified_wrapper.py data/interim/automlst_wrapper/{wildcards.name} {threads} &>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/make_phylo_tree.py {params.automlst_interim} {output.automlst_processed} {params.prokka_interim} {params.gtdb_interim} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/get_bgc_counts.py {input.antismash} {wildcards.strains} {output.bgc_count} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/get_antismash_overview.py {input.antismash} {output.bgc_table} {wildcards.strains} 2>> {log}
    """
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}/{wildcards.version}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_regions_antismash.txt"
    echo '{input.bgc_overview}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/get_antismash_overview_gather.py \
        $INPUT_JSON {input.mapping_dir} {output.df_bgc} 2>> {log}
    rm $INPUT_JSON
    """
shell:
    """
    mkdir -p {output.mapping_dir} 2>> {log}
    cp {input.mapping_dir}/*/*-change_log.json {output.mapping_dir}/. 2>> {log}
    """
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}/{wildcards.version}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_bgc_counts.txt"
    echo '{input.bgc_count}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/make_genome_dataset.py $INPUT_JSON '{params.df_samples}' {output.df_antismash_summary} 2>> {log}
    rm $INPUT_JSON
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/get_dependencies.py {output} 2> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/get_project_metadata.py {wildcards.name} {output} {params.bgcflow_version} 2> {log}
    """
shell:
    """
    echo "Preparing BGCs for {wildcards.name} downstream analysis..." 2>> {log.general}
    #mkdir -p {output.outdir} 2>> {log.general}
    # Generate symlink for each regions in genomes in dataset
    for i in $(dirname {input.gbk})
    do
        echo Processing $i 2>> {log.symlink}
        python workflow/bgcflow/bgcflow/data/bgc_downstream_prep.py $i {output.outdir} 2>> {log.symlink}
    done
    # generate taxonomic information for dataset
    python workflow/bgcflow/bgcflow/data/bigslice_prep.py {input.table} {output.taxonomy} 2>> {log.taxonomy}
    # append new dataset information
    ## check if previous dataset exists
    if [[ -s {params.dataset} ]]
    then
        echo "Previous dataset detected, appending dataset information for {wildcards.name}..."
        sed -i 'a {wildcards.name}_antismash_{wildcards.version}\t{wildcards.name}_antismash_{wildcards.version}\ttaxonomy/taxonomy_{wildcards.name}_antismash_{wildcards.version}.tsv\t{wildcards.name}' {params.dataset} 2>> {log.general}
    else
        echo "No previous dataset detected, generating dataset information for {wildcards.name}..."
        echo -e '# Dataset name\tPath to folder\tPath to taxonomy\tDescription' > {params.dataset} 2>> {log.general}
        sed -i 'a {wildcards.name}_antismash_{wildcards.version}\t{wildcards.name}_antismash_{wildcards.version}\ttaxonomy/taxonomy_{wildcards.name}_antismash_{wildcards.version}.tsv\t{wildcards.name}' {params.dataset} 2>> {log.general}
    fi
    # generate mapping for visualization
    python workflow/bgcflow/bgcflow/data/get_bigscape_mapping.py {output.outdir} {output.bgc_mapping} 2>> {log.general}
    """
SnakeMake From line 19 of rules/bgc.smk
shell:
    """
    (cd resources && wget https://github.com/medema-group/BiG-SCAPE/archive/refs/tags/v{params.release}.zip) &>> {log}
    (cd resources && unzip -o v{params.release}.zip && mv BiG-SCAPE-{params.release}/ BiG-SCAPE/ && rm v{params.release}.zip) &>> {log}
    (cd resources/BiG-SCAPE && wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.hmm.gz && gunzip Pfam-A.hmm.gz) &>> {log}
    (cd resources/BiG-SCAPE && hmmpress Pfam-A.hmm) &>> {log}
    """
shell:
    """
    (cd resources && wget  https://dl.secondarymetabolites.org/mibig/mibig_json_{params.mibig_version}.tar.gz) &>> {log}
    (cd resources && tar -xvf mibig_json_{params.mibig_version}.tar.gz && mkdir -p mibig && mv mibig_json_{params.mibig_version}/ mibig/json && rm mibig_json_{params.mibig_version}.tar.gz) &>> {log}
    python workflow/bgcflow/bgcflow/data/get_mibig_data.py {output.mibig_json_folder} {output.mibig_bgc_table} 2>> {log}
    """
shell:
    """
    python {input.bigscape}/bigscape.py -i {input.antismash_dir} -o {params.bigscape_dir} -c {threads} --cutoff 0.3 0.4 0.5 --include_singletons --label {params.label} --hybrids-off --mibig --verbose &>> {log}
    """
shell:
    """
    python {input.bigscape}/bigscape.py -i {input.antismash_dir} -o {params.bigscape_dir} -c {threads} --cutoff 0.3 0.4 0.5 --include_singletons --label {params.label} --hybrids-off --verbose --no_classify --mix &>> {log}
    """
shell:
    """
    topdir=$PWD
    (cd data/interim/bigscape && zip -r $topdir/{output.zip} no_mibig_{wildcards.name}_antismash_{wildcards.version} \
        -x {wildcards.name}_antismash_{wildcards.version}/cache/**\* &>> $topdir/{log})
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/make_bigscape_to_cytoscape.py data/interim/bigscape/{wildcards.name}_antismash_{wildcards.version}/ {input.as_dir} {input.df_genomes_path} {input.mibig_bgc_table} {output.output_dir} 2>> {log}
    cp {input.bgc_mapping} {output.bgc_mapping}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/bigscape_copy.py {input.index} data/processed/{wildcards.name}/bigscape/result_as{wildcards.version} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/bigslice_prep_single_project.py {input.dir} {input.taxonomy} {params.project_name} {output.folder} 2>> {log}
    """
shell:
    """
    bigslice -i {input.dir} {output.folder} --threshold {params.threshold} -t {threads} &>> {log}
    """
shell:
    """
    (cd resources/bigslice && wget -c -nc http://bioinformatics.nl/~kauts001/ltr/bigslice/paper_data/data/full_run_result.zip && unzip full_run_result.zip) &>> {log}
    rm resources/bigslice/full_run_result.zip &>> {log}
    """
shell:
    """
    TIMESTAMP=$(date --iso-8601=hours)
    bigslice --query {input.tmp_dir} --n_ranks {params.n_ranks} {input.bigslice_dir} -t {threads} --query_name {params.query_name}_$TIMESTAMP --run_id {params.run_id} &>> {log}
    python workflow/bgcflow/bgcflow/data/get_bigslice_query_result.py {params.query_name} {output.folder} {input.bigslice_dir} &>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/summarize_bigslice_query.py {input.query_dir} {output.folder} {params.bigfam_db_path} {params.cutoff} 2>> {log}
    """
shell:
    """
    cblaster --version >> {output.version}
    diamond --version >> {output.version}
    cat {output.version} >> {log}
    cblaster config --email [email protected] 2>> {log}
    cblaster makedb --cpus {threads} -b {params.batch_size} -n {params.db_prefix} {input.gbk} 2>> {log}
    cp -r {output.folder_interim} {output.folder_processed} 2>> {log}
    """
shell:
    """
    cblaster config --email [email protected]
    cblaster makedb --cpus {threads} -b {params.batch_size} -n {params.db_prefix} {params.antismash_dir} 2>> {log}
    cp -r {output.folder_interim} {output.folder_processed} 2>> {log}
    """
shell:
    """
    (cd resources && wget -O checkm_data_2015_01_16.tar.gz https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz) &>> {log}
    (cd resources && mkdir -p checkm && tar -xvf checkm_data_2015_01_16.tar.gz -C checkm && rm checkm_data_2015_01_16.tar.gz) &>> {log}
    checkm data setRoot resources/checkm &>> {log}
    """
shell:
    """
    mkdir -p {output.fna}
    for f in {input.fna}; do cp $f {output.fna}/.; done
    checkm lineage_wf -t {threads} --reduced_tree -x fna {output.fna} {output.checkm_dir} &>> {log}
    """
shell:
    """
    mkdir -p {params.checkm_json}
    python workflow/bgcflow/bgcflow/data/get_checkm_data.py {input.stat} {params.checkm_json} {output.stat_processed} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/database/csv_to_parquet.py data/processed/{wildcards.name} 2>> {log}
    touch {output.parquet}
    """
shell:
    """
    git clone https://bitbucket.org/kaistsystemsbiology/deeptfactor.git {output.folder} 2>> {log}
    """
shell:
    """
    workdir=$PWD
    mkdir -p data/interim/deeptfactor/{wildcards.strains} 2>> {log}
    (cd {input.resource} && python tf_running.py \
        -i $workdir/{input.faa} -o $workdir/{params.outdir} \
        -g cpu -cpu {threads}) 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/deeptfactor_scatter.py {input.deeptfactor} {output.deeptfactor_json} 2>> {log}
    """
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_deeptfactor.txt"
    echo '{input.deeptfactor_json}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/deeptfactor_gather.py $INPUT_JSON {output.df_deeptfactor} 2>> {log}
    rm $INPUT_JSON
    """
shell:
    """
    cat {input.faa} > {output.faa_compiled} 2>> {log}
    diamond makedb --in {output.faa_compiled} -d {params.diamond} -p {threads} 2>> {log}
    cp {output.diamond_interim} {output.diamond_processed} 2>> {log}
    """
shell:
    """
    download_eggnog_data.py --data_dir {output.eggnog_db} -y &>> {log}
    create_dbs.py -m diamond --dbname bacteria --taxa Bacteria --data_dir {output.eggnog_db} -y &>> {log}
    """
shell:
    """
    mkdir -p {output.eggnog_dir}
    emapper.py -i {input.faa} --decorate_gff "yes" --excel --cpu {threads} -o {wildcards.strains} --output_dir {output.eggnog_dir} --data_dir {input.eggnog_db} &>> {log}
    """
shell:
    """
    for fna in {input.fna}
    do
        echo $fna >> {output.fastani_infile}
    done
    fastANI --ql {output.fastani_infile} --rl {output.fastani_infile} -t {threads} --matrix -o {output.fastani_out} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/convert_triangular_matrix.py {input.fastani_matrix} {output.df_fastani} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/gtdb_prep.py {wildcards.strains} {output.gtdb_json} '{params.samples_path}' '{params.gtdb_paths}' {params.version} 2> {log}
    """
SnakeMake From line 24 of rules/gtdb.smk
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_gtdb.txt"
    echo '{input.json_list}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/fix_gtdb_taxonomy.py $INPUT_JSON {output.meta} 2> {log}
    rm $INPUT_JSON
    """
SnakeMake From line 47 of rules/gtdb.smk
shell:
    """
    (cd resources && wget https://data.gtdb.ecogenomic.org/releases/release{params.release_major}/{params.release_major}.{params.release_minor}/auxillary_files/gtdbtk_{params.release_version}_data.tar.gz -nc) 2>> {log}
    (cd resources && mkdir -p gtdbtk && tar -xvzf gtdbtk_{params.release_version}_data.tar.gz -C "gtdbtk" --strip 1 && rm gtdbtk_{params.release_version}_data.tar.gz) &>> {log}
    """
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}"
    mkdir -p $TMPDIR
    INPUT_FNA="$TMPDIR/df_fna_gtdbtk.txt"
    INPUT_JSON="$TMPDIR/df_json_gtdbtk.txt"
    echo '{input.fna}' > $INPUT_FNA
    echo '{input.json_list}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/gtdbtk_prep.py $INPUT_FNA $INPUT_JSON {output.fnadir} 2>> {log}
    rm $INPUT_FNA
    rm $INPUT_JSON
    """
shell:
    """
    mkdir -p {output.tmpdir}
    gtdbtk classify_wf --genome_dir {input.fnadir} --out_dir {output.gtdbtk_dir} --cpus {threads} --pplacer_cpus 1 --tmpdir {output.tmpdir} {params.ani_screen} &>> {log}
    cp {output.summary_interim} {output.summary_processed}
    """
shell:
    """
    for fna in {input.fna}
    do
        echo $fna >> {output.mash_infile}
    done
    (mash triangle -p {threads} -l {output.mash_infile} >> {output.triangle_dist}) 2>> {log}
    """
SnakeMake From line 12 of rules/mash.smk
shell:
    """
    python workflow/bgcflow/bgcflow/data/convert_triangular_matrix.py {input.mash_matrix} {output.df_mash} 2>> {log}
    """
SnakeMake From line 31 of rules/mash.smk
shell:
    """
    mlst --csv {input.fna} > {output.csv} 2>> {log}
    """
shell:
    """
    if [[ {wildcards.ncbi} == GCF* ]]
    then
        source="refseq"
    elif [[ {wildcards.ncbi} == GCA* ]]
    then
        source="genbank"
    else
        echo "accession must start with GCA or GCF" >> {log}
    fi
    ncbi-genome-download -s $source -F fasta,assembly-report -A {wildcards.ncbi} -o data/raw/ncbi/download -P -N --verbose bacteria 2>> {log}
    gunzip -c data/raw/ncbi/download/$source/bacteria/{wildcards.ncbi}/*.fna.gz > {output.fna}
    cp data/raw/ncbi/download/$source/bacteria/{wildcards.ncbi}/*report.txt {output.assembly_report}
    rm -rf data/raw/ncbi/download/$source/bacteria/{wildcards.ncbi}
    python workflow/bgcflow/bgcflow/data/get_assembly_information.py {output.assembly_report} {output.json_report} {wildcards.ncbi} 2>> {log}
    """
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_ncbi_meta_input.txt"
    echo '{input.all_json}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/extract_ncbi_information.py \
        $INPUT_JSON {output.ncbi_meta_path} 2>> {log}
    rm $INPUT_JSON
    """
SnakeMake From line 48 of rules/ncbi.smk
shell:
    """
    wget ftp://ftp.patricbrc.org/RELEASE_NOTES/genome_summary -O {output.patric_genome_summary} 2>> {log}
    wget ftp://ftp.patricbrc.org/RELEASE_NOTES/genome_metadata -O {output.patric_genome_metadata} 2>> {log}
    """
SnakeMake From line 67 of rules/ncbi.smk
shell:
    """
    python workflow/bgcflow/bgcflow/data/extract_patric_meta.py \
        {input.ncbi_meta_path} {input.patric_genome_summary} {input.patric_genome_metadata} {output.patric_meta_path} 2>> {log}
    """
SnakeMake From line 89 of rules/ncbi.smk
shell:
    """
    (cd data/interim/fasta && wget -qN "ftp://ftp.patricbrc.org/genomes/{wildcards.patric}/{wildcards.patric}.fna" 2> ../../../{log})
    """
shell:
    """
    ln -s $PWD/resources/RNAmmer/rnammer $CONDA_PREFIX/bin/rnammer 2>> {log}
    rnammer -S bac -m lsu,ssu,tsu -gff - resources/RNAmmer/example/ecoli.fsa >> {output}
    """
shell:
    """
    if [[ {input} == *.fna || {input} == *.fasta || {input} == *.fa ]]
    then
        cp {input} {output} 2>> {log}
    else
        echo "ERROR: Wrong Extension:" {input} >> {log}
        exit 1
    fi
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/make_prokka_db.py {wildcards.prokka_db} \
        {input.table} {output.refgbff} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/get_organism_info.py {wildcards.strains} \
        "{params.samples_path}" data/interim/assembly_report/ data/interim/prokka/ 2>> {log}
    """
shell:
    """
    prokka --outdir data/interim/prokka/{wildcards.strains} --force \
        {params.refgbff} --prefix {wildcards.strains} --genus "`cut -d "," -f 1 {input.org_info}`" \
        --species "`cut -d "," -f 2 {input.org_info}`" --strain "`cut -d "," -f 3 {input.org_info}`" \
        --cdsrnaolap --cpus {threads} {params.rna_detection} --increment {params.increment} \
        {params.use_pfam} --evalue {params.evalue} {input.fna} &> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/format_genbank_meta.py {input.gbk_prokka} \
        {params.version} {input.gtdb_json} {wildcards.strains} {output.gbk_processed} 2> {log}
    """
SnakeMake From line 125 of rules/prokka.smk
shell:
    """
    cp {input.gbk} {output.gbk} 2>> {log}
    cp {input.summary} {output.summary} 2>> {log}
    cp {input.tsv} {output.tsv} 2>> {log}
    """
SnakeMake From line 144 of rules/prokka.smk
shell:
    """
    refseq_masher matches --output-type {params.output_type} --top-n-results \
        {params.top_n_results} --output {output.masher_csv} {input.fasta} 2>> {log}
    """
shell:
    """
    roary -p {threads} -f {output.roary_dir} -i {params.i} -g {params.g} -e -n -r -v {input.gff} &>> {log}
    """
shell:
    """
    mkdir -p {output.eggnog_dir}
    emapper.py -i {params.faa} --translate --itype "CDS" --excel --cpu {threads} -o {wildcards.name} --output_dir {output.eggnog_dir} --data_dir {input.eggnog_db} &>> {log}
    """
shell:
    """
    (cd {input.resource} && python tf_running.py \
        -i {params.faa} -o {params.outdir} \
        -g cpu -cpu {threads}) 2>> {log}
    """
shell:
    """
    diamond makedb --in {params.faa} -d {output.diamond_interim} -p {threads} &>> {log}
    cp {output.diamond_interim} {output.diamond_processed} &>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/make_pangenome_dataset.py {input.roary_interim_dir} {output.roary_processed_dir} {input.automlst_processed_dir} 2>> {log}
    """
shell:
    """
    seqfu stats {input.fna} --json -b --gc --precision 3 > {output.json} 2>> {log}
    """
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_seqfu.txt"
    echo '{input.json}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/make_seqfu_table.py $INPUT_JSON {output.all_csv} &>> {log}
    rm $INPUT_JSON
    """
Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/NBChub/bgcflow
Name: bgcflow
Version: v0.7.4
Copyright: Public Domain
License: MIT License