Snakemake workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes)


PEP compatible

BGCFlow is a systematic workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes) from internal & public datasets.

At present, BGCFlow has only been tested and confirmed to work on Linux systems with the conda / mamba package manager.

Publication

Matin Nuhamunada, Omkar S. Mohite, Patrick V. Phaneuf, Bernhard O. Palsson, and Tilmann Weber. (2023). BGCFlow: Systematic pangenome workflow for the analysis of biosynthetic gene clusters across large genomic datasets. bioRxiv 2023.06.14.545018; doi: https://doi.org/10.1101/2023.06.14.545018

Pre-requisites

BGCFlow requires gcc and the conda / mamba package manager. See the installation instructions for details.

Please use the latest version of BGCFlow available.

Quick Start

A quick and easy way to use BGCFlow is via bgcflow_wrapper.

  1. Create a conda environment and install the BGCFlow Python wrapper:
# create and activate a new conda environment
conda create -n bgcflow pip -y
conda activate bgcflow
# install `BGCFlow` wrapper
pip install git+https://github.com/NBChub/bgcflow_wrapper.git
# make sure to use bgcflow_wrapper version >= 0.2.7
bgcflow --version
  2. Additional pre-requisites: with the environment activated, install or set up the following configurations:
  • Set the conda channel priority to disabled
conda config --set channel_priority disabled
conda config --describe channel_priority
  • Java (required to run Metabase)
conda install openjdk 
  3. Deploy and run BGCFlow, changing the <your_bgcflow_directory> value accordingly:
# Deploy and run BGCFlow
bgcflow clone <your_bgcflow_directory> # clone `BGCFlow` to your_bgcflow_directory
cd <your_bgcflow_directory> # move to BGCFLOW_PATH
bgcflow init # initiate `BGCFlow` config and examples from template
bgcflow run -n # do a dry run, remove the flag "-n" to run the example dataset
  4. Build and serve the interactive report (after bgcflow run has finished). The report will be served at http://localhost:8001/ :
# build a report
bgcflow build report
# show available projects
bgcflow serve
# serve interactive report
bgcflow serve --project Lactobacillus_delbrueckii
  • For detailed usage and configuration, have a look at the BGCFlow WIKI (under development) :warning:

  • Read more about bgcflow_wrapper for a detailed overview of the command line interface.

asciicast

Workflow overview

The main Snakefile workflow comprises various pipelines for data selection, functional annotation, phylogenetic analysis, genome mining, and comparative genomics of prokaryotic datasets.

dag

Available pipelines in the main Snakefile can be checked using the following command:

bgcflow pipelines

List of Available Pipelines

Here you can find the pipeline keywords that you can run using the main Snakefile of BGCFlow.

Keyword | Description | Links
eggnog | Annotate samples with eggNOG database (http://eggnog5.embl.de) | eggnog-mapper
mash | Calculate distance estimation for all samples using MinHash. | Mash
fastani | Do pairwise Average Nucleotide Identity (ANI) calculation across all samples. | FastANI
automlst-wrapper | Simplified Tree building using autoMLST | automlst-simplified-wrapper
roary | Build pangenome using Roary. | Roary
eggnog-roary | Annotate Roary output using eggNOG mapper | eggnog-mapper
seqfu | Calculate sequence statistics using SeqFu. | seqfu2
bigslice | Cluster BGCs using BiG-SLiCE (https://github.com/medema-group/bigslice) | bigslice
query-bigslice | Map BGCs to BiG-FAM database (https://bigfam.bioinformatics.nl/) | bigfam.bioinformatics.nl
checkm | Assess genome quality with CheckM. | CheckM
gtdbtk | Taxonomic placement with GTDB-Tk | GTDBTk
prokka-gbk | Copy annotated genbank results. | prokka
antismash | Summarizes antiSMASH result. | antismash
arts | Run Antibiotic Resistant Target Seeker (ARTS) on samples. | arts
deeptfactor | Use deep learning to find Transcription Factors. | deeptfactor
deeptfactor-roary | Use DeepTFactor on Roary outputs. | Roary
cblaster-genome | Build diamond database of genomes for cblaster search. | cblaster
cblaster-bgc | Build diamond database of BGCs for cblaster search. | cblaster
bigscape | Cluster BGCs using BiG-SCAPE | BiG-SCAPE

References

  • eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Carlos P. Cantalapiedra, Ana Hernandez-Plaza, Ivica Letunic, Peer Bork, Jaime Huerta-Cepas. 2021. Molecular Biology and Evolution, msab293

  • eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Jaime Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars J Jensen, Christian von Mering, Peer Bork Nucleic Acids Res. 2019 Jan 8; 47(Database issue): D309–D314. doi: 10.1093/nar/gky1085

  • Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x.

  • Mash Screen: high-throughput sequence containment estimation for genome discovery. Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM. Genome Biol. 2019 Nov 5;20(1):232. doi: 10.1186/s13059-019-1841-x.

  • Jain, C., Rodriguez-R, L.M., Phillippy, A.M. et al. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 9, 5114 (2018). https://doi.org/10.1038/s41467-018-07641-9

  • Mohammad Alanjary, Katharina Steinke, Nadine Ziemert, AutoMLST: an automated web server for generating multi-locus species trees highlighting natural product potential, Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W276–W282

  • Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill, 'Roary: Rapid large-scale prokaryote pan genome analysis', Bioinformatics, 2015;31(22):3691-3693 doi:10.1093/bioinformatics/btv421

  • Telatin, A., Birolo, G., & Fariselli, P. SeqFu [Computer software]. GITHUB: https://github.com/telatin/seqfu2

  • Satria A Kautsar, Justin J J van der Hooft, Dick de Ridder, Marnix H Medema, BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters, GigaScience, Volume 10, Issue 1, January 2021, giaa154

  • Satria A Kautsar, Kai Blin, Simon Shaw, Tilmann Weber, Marnix H Medema, BiG-FAM: the biosynthetic gene cluster families database, Nucleic Acids Research, gkaa812, https://doi.org/10.1093/nar/gkaa812

  • Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2014. Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25: 1043-1055.

  • Chaumeil PA, et al. 2019. GTDB-Tk: A toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics, btz848.

  • Parks DH, et al. 2020. A complete domain-to-species taxonomy for Bacteria and Archaea. Nature Biotechnology, https://doi.org/10.1038/s41587-020-0501-8 .

  • Parks DH, et al. 2018. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature Biotechnology, http://dx.doi.org/10.1038/nbt.4229 .

  • Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014 Jul 15;30(14):2068-9. PMID:24642063

  • antiSMASH 6.0: improving cluster detection and comparison capabilities. Kai Blin, Simon Shaw, Alexander M Kloosterman, Zach Charlop-Powers, Gilles P van Weezel, Marnix H Medema, & Tilmann Weber. Nucleic Acids Research (2021) doi: 10.1093/nar/gkab335.

  • Mungan,M.D., Alanjary,M., Blin,K., Weber,T., Medema,M.H. and Ziemert,N. (2020) ARTS 2.0: feature updates and expansion of the Antibiotic Resistant Target Seeker for comparative genome mining. Nucleic Acids Res.,10.1093/nar/gkaa374

  • Alanjary,M., Kronmiller,B., Adamek,M., Blin,K., Weber,T., Huson,D., Philmus,B. and Ziemert,N. (2017) The Antibiotic Resistant Target Seeker (ARTS), an exploration engine for antibiotic cluster prioritization and novel drug target discovery. Nucleic Acids Res.,10.1093/nar/gkx360

  • Kim G.B., Gao Y., Palsson B.O., Lee S.Y. 2020. DeepTFactor: A deep learning-based tool for the prediction of transcription factors. PNAS. doi: 10.1073/pnas.2021171118

  • Gilchrist, C., Booth, T. J., van Wersch, B., van Grieken, L., Medema, M. H., & Chooi, Y. (2021). cblaster: a remote search tool for rapid identification and visualisation of homologous gene clusters (Version 1.3.9) [Computer software]. https://doi.org/10.1101/2020.11.08.370601

  • Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015) .

  • Navarro-Muñoz, J.C., Selem-Mojica, N., Mullowney, M.W. et al. A computational framework to explore large-scale biosynthetic diversity. Nat Chem Biol 16, 60–68 (2020)

Code Snippets

import json
import logging
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def generate_arts_dict(df_arts):
    """
    Convert arts table to dictionary
    """
    arts_dict = {}
    arts_scaffold_count = df_arts.Source.value_counts().to_dict()
    for k in arts_scaffold_count.keys():
        arts_dict[k] = {"counts": arts_scaffold_count[k]}
        arts_dict[k]["regions"] = []
        for i in df_arts[df_arts.loc[:, "Source"] == k].index:
            regions = {
                "cluster_id": df_arts.loc[i, "#Cluster"],
                "products": df_arts.loc[i, "Type"],
                "location": df_arts.loc[i, "Location"],
            }
            arts_dict[k]["regions"].append(regions)
    return arts_dict


def generate_arts_mapping(df_arts, as_json):
    arts_dict = generate_arts_dict(df_arts)
    # container for final result
    hit_mapper = {}
    contig_mapper = {}

    # iterate antismash json records
    for num, r in enumerate(as_json["records"]):
        # count number of detected regions per record
        region_count = len(r["areas"])
        # if antismash detects region
        if region_count > 0:
            contig_id = r["id"]
            logging.info(
                f"Contig {contig_id} has {region_count} regions detected. Finding corresponding scaffold in ARTS2..."
            )
            # find arts scaffold with the same region number detected
            arts_match = [
                k
                for k in arts_dict.keys()
                if int(arts_dict[k]["counts"]) == int(region_count)
            ]
            logging.debug(f"Finding matches from: {arts_match}")
            for n, a in enumerate(r["areas"]):
                bgc_id = f"{contig_id}.region{str(n+1).zfill(3)}"
                location = f"{a['start']} - {a['end']}"
                products = ",".join(a["products"])
                logging.debug(f"Finding match for: {bgc_id} | {location} | {products}")
                for m in arts_match:
                    bgc_match = arts_dict[m]["regions"][n]
                    if (bgc_match["location"] == location) and (
                        bgc_match["products"] == products
                    ):
                        logging.debug(
                            f"Found match! {bgc_match['cluster_id']} | {bgc_match['location']} | {bgc_match['products']}"
                        )
                        # logging.debug(f"Found match! {contig_id} == ARTS2 {m}")
                        hit_mapper[bgc_match["cluster_id"]] = bgc_id
                        contig_mapper[bgc_match["cluster_id"]] = contig_id
    return hit_mapper, arts_dict, contig_mapper


def extract_arts_table(input_table, input_as_json, outfile, genome_id=None):
    """
    Given an ARTS2 bgc table, map the corresponding cluster hits to antismash regions
    """
    df_arts = pd.read_csv(input_table, sep="\t")

    with open(input_as_json, "r") as f:
        as_json = json.load(f)

    if genome_id is None:
        logging.info("Assuming genome_id from ARTS2 input")
        genome_id = Path(input_table).parents[1].stem

    logging.info(f"Extracting ARTS2 BGC hits from: {genome_id}")

    logging.info("Generating ARTS2 to antiSMASH region mapper...")
    hit_mapper, arts_dict, contig_mapper = generate_arts_mapping(df_arts, as_json)

    logging.info("Mapping ARTS2 cluster to antiSMASH regions...")
    df = pd.DataFrame(columns=df_arts.columns)
    for i in df_arts.index:
        # only extract hits
        if len(df_arts.loc[i, "Genelist"]) > 2:
            cluster_id = df_arts.loc[i, "#Cluster"]
            bgc_id = hit_mapper[cluster_id]
            contig_id = contig_mapper[cluster_id]
            df.loc[i, :] = df_arts.loc[i, :]
            df.loc[i, "#Cluster"] = bgc_id
            df.loc[i, "Source"] = contig_id
            df.loc[i, "genome_id"] = genome_id
    df = df.rename(columns={"#Cluster": "bgc_id"}).set_index("bgc_id")

    logging.info(f"Writing results to {outfile}")
    df.T.to_json(outfile)
    return


if __name__ == "__main__":
    extract_arts_table(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
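
The script above maps ARTS2 cluster hits onto antiSMASH region ids for one genome. Below is a minimal usage sketch mirroring the __main__ block; all file paths are hypothetical placeholders (in the workflow, Snakemake supplies the real paths).

# Hypothetical example call; equivalent to:
#   python <this_script>.py <arts_bgctable.tsv> <antismash.json> <out.json> <genome_id>
extract_arts_table(
    "arts/GCF_000000000.1/tables/bgctable.tsv",        # ARTS2 BGC table (placeholder path)
    "antismash/GCF_000000000.1/GCF_000000000.1.json",  # antiSMASH JSON (placeholder path)
    "arts/GCF_000000000.1_arts.json",                  # output JSON
    genome_id="GCF_000000000.1",                       # optional; otherwise deduced from the input path
)
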
import json
import logging
import sys
from pathlib import Path

import pandas as pd
from alive_progress import alive_bar

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def generate_change_dict(change_log):
    logging.info("Generate BGC ids mapping dict...")
    change_log = Path(change_log)
    change_log = [i for i in change_log.glob("*/*.json")]
    change_dict = {}
    for i in change_log:
        with open(i, "r") as f:
            data = json.load(f)
            for k in data.keys():
                change_dict[k] = data[k]
    return change_dict


def correct_arts_bgc_ids(arts_json, change_dict, genome_id):
    logging.debug(f"Correcting BGC ids for {genome_id}")
    output = {}
    with open(arts_json, "r") as f:
        query = json.load(f)
        for k in query.keys():
            correct_bgc_id = Path(
                change_dict[genome_id][f"{k}.gbk"]["symlink_path"]
            ).stem
            output[correct_bgc_id] = query[k]
    return output


def combine_arts_json(input_json, change_log_path, table):
    logging.info("Combining and correcting ARTS output...")

    input_json = Path(input_json)
    logging.info(input_json)
    if input_json.is_file() and input_json.suffix == ".json":
        logging.info(f"Getting ARTS overview from a single file: {input_json}")
        input_json_files = [input_json]  # wrap in a list so the merge loop below can iterate over it

    elif input_json.is_file() and input_json.suffix == ".txt":
        logging.info(f"Getting ARTS overview  from a text file: {input_json}")
        with open(input_json, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            input_json_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        input_json_files = [
            Path(file)
            for file in str(input_json).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Getting ARTS overview from the given list of {len(input_json_files)} files..."
        )

    container = {}
    change_dict = generate_change_dict(change_log_path)

    with alive_bar(len(input_json_files), title="Merging json:") as bar:
        for j in input_json_files:
            arts_json = Path(j)
            genome_id = arts_json.stem
            value = correct_arts_bgc_ids(arts_json, change_dict, genome_id)
            container.update(value)
            bar()

    df = pd.DataFrame.from_dict(container).T
    df.index.name = "bgc_id"

    logging.debug(f"Writing file to: {table}")

    # Save dataframes to csv tables
    df_table = Path(table)
    df_table.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(table)
    logging.info("Job done")
    return None


if __name__ == "__main__":
    combine_arts_json(sys.argv[1], sys.argv[2], sys.argv[3])
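
As the branching above shows, the first argument can be a single .json file, a .txt file listing JSON paths (one per line or space-separated), or a space-separated string of paths. A sketch with hypothetical arguments:

# Hypothetical example call: merge per-genome ARTS JSONs into one table,
# using the antiSMASH change-log directory to restore the original BGC ids.
combine_arts_json(
    "arts/json_paths.txt",         # text file listing the per-genome ARTS JSONs (placeholder)
    "antismash/change_logs/",      # directory with <genome_id>/<genome_id>-change_log.json files
    "tables/df_arts_allhits.csv",  # output csv (placeholder name)
)
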
import logging
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.INFO)


def csv_to_parquet(project_folder, output_folder="."):
    """
    Given a list of csv files, convert into parquet files
    """
    project_folder = Path(project_folder)
    if output_folder == ".":
        output_folder = project_folder / "data_warehouse"
    else:
        pass

    logging.info(
        f"Grabbing all csv from folder {project_folder} and saving parquets in {output_folder}"
    )
    for i in project_folder.rglob("*.csv"):
        if "docs" in str(i):
            pass
        elif "dbt" in str(i):
            pass
        elif "assets" in str(i):
            pass
        elif "ipynb_checkpoints" in str(i):
            pass
        else:
            category = str(i).split("/")[3]
            output_subfolder = output_folder / category
            df = pd.read_csv(i)
            # df = df.fillna("")
            # df.columns = [c.replace(".", "_") for c in df.columns]
            output_parquet = output_subfolder / f"{i.stem}.parquet"
            logging.info(f"Converting {i} to {output_parquet}")
            output_subfolder.mkdir(parents=True, exist_ok=True)
            df.to_parquet(output_parquet)

    return


if __name__ == "__main__":
    csv_to_parquet(sys.argv[1])
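
Note that the category for each parquet file is taken from the fourth path segment (str(i).split("/")[3]), so the script assumes it is run from a location where the csv files sit at that path depth. A hypothetical call:

# Hypothetical example call: mirror every csv under the project folder as parquet
# files inside <project_folder>/data_warehouse/<category>/.
csv_to_parquet("data/processed/Lactobacillus_delbrueckii")  # placeholder project folder
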
import json
import logging
import sys
from pathlib import Path

from Bio import SeqIO

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def bgc_downstream_prep(input_dir, output_dir, selected_bgcs=False):
    """
    Given an antiSMASH directory, check for changed name
    """
    logging.info(f"Reading input directory: {input_dir}")
    path = Path(input_dir)

    if not path.is_dir():
        raise FileNotFoundError(f"No such file or directory: {path}")

    genome_id = path.name
    logging.debug(f"Deducting genome id as {genome_id}")

    change_log = {genome_id: {}}

    for gbk in path.glob("*.gbk"):
        logging.info(f"Parsing file: {selected_bgcs}")
        logging.info(f"Parsing file: {gbk.name}")
        region = SeqIO.parse(str(gbk), "genbank")
        for record in region:
            record_log = {}
            if "comment" in record.annotations:
                filename = gbk.name
                try:
                    original_id = record.annotations["structured_comment"][
                        "antiSMASH-Data"
                    ]["Original ID"].split()[0]
                except KeyError:
                    original_id = record.id
                    logging.warning(
                        f"Found shortened record.id: {record.id} <- {original_id}."
                    )

                # generate symlink
                outpath = Path(output_dir) / genome_id
                outpath.mkdir(parents=True, exist_ok=True)

                new_filename = filename.replace(record.id, original_id)
                target_path = Path.cwd() / gbk  # target for symlink

                link = outpath / new_filename

                logging.info(f"Generating symlink: {link}")
                try:
                    link.symlink_to(target_path)
                except FileExistsError:
                    logging.warning(
                        f"Previous symlink exist, updating target: {link} -> {target_path}"
                    )
                    link.unlink()
                    link.symlink_to(target_path)

                record_log["record_id"] = record.id
                record_log["original_id"] = original_id
                record_log["target_path"] = str(gbk)
                record_log["symlink_path"] = str(link)

            change_log[genome_id][filename] = record_log

    with open(
        outpath / f"{genome_id}-change_log.json", "w", encoding="utf8"
    ) as json_file:
        json.dump(change_log, json_file, indent=4)

    logging.info(f"{genome_id}: Job done!\n")
    return


if __name__ == "__main__":
    bgc_downstream_prep(sys.argv[1], sys.argv[2])
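
The input is a per-genome antiSMASH result directory (the directory name is used as the genome id); the output directory receives symlinks to the region GenBank files under their original record ids plus a change-log JSON. A sketch with placeholder paths:

# Hypothetical example call:
bgc_downstream_prep(
    "antismash/GCF_000000000.1",  # antiSMASH output folder for one genome (placeholder)
    "bgcs/",                      # destination for renamed symlinks and the change-log JSON
)
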
import logging
import sys
from pathlib import Path

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def bigscape_copy(input_index, output_main):
    input = Path(input_index)
    output = Path(output_main)
    output.mkdir(parents=True, exist_ok=True)
    for item in input.parent.glob("*"):
        target = output / item.name
        logging.debug(f"Generating symlink for: {target} --> {item.resolve()}")
        try:
            target.symlink_to(item.resolve(), target_is_directory=True)
        except FileExistsError as e:
            logging.debug(f"Got error:\n{e}")
    return


if __name__ == "__main__":
    bigscape_copy(sys.argv[1], sys.argv[2])
import sys

import pandas as pd


def prep_bigslice(df_gtdb_output, df_bigslice_output):
    df = pd.read_csv(df_gtdb_output)
    # Format to BiG-Slice
    df_bigslice = df.rename(columns={"genome_id": "#Genome Folder"})
    df_bigslice["#Genome Folder"] = df_bigslice["#Genome Folder"].apply(
        lambda x: f"{x}/"
    )
    df_bigslice.to_csv(df_bigslice_output, index=False, sep="\t")
    return None


if __name__ == "__main__":
    prep_bigslice(sys.argv[1], sys.argv[2])
import logging
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)

# create intermediate folder


def prepare_bigslice(input_folder, taxonomy_file, project_name, output_folder):
    """
    Preparing bigslice input folder for a single project
    """
    logging.info(f"Preparing bigslice input folder for {project_name}")
    input_folder = Path(input_folder)
    output_folder = Path(output_folder)
    assert input_folder.is_dir()
    output_folder.mkdir(parents=True, exist_ok=True)

    # create symlink to antismash
    logging.info("Creating symlink to AntiSMASH BGCs...")
    antismash_dir = output_folder / project_name
    try:
        antismash_dir.symlink_to(input_folder.resolve(), target_is_directory=True)
    except FileExistsError:
        pass

    # create tax info
    logging.info("Getting taxonomic information...")
    taxonomy_dir = output_folder / "taxonomy"
    taxonomy_dir.mkdir(parents=True, exist_ok=True)
    df_tax = pd.read_csv(taxonomy_file, sep="\t")
    df_tax = df_tax.loc[
        :,
        [
            "#Genome Folder",
            "Domain",
            "Phylum",
            "Class",
            "Order",
            "Family",
            "Genus",
            "Species",
            "Organism",
        ],
    ]
    df_tax.to_csv(taxonomy_dir / f"{project_name}.tsv", sep="\t", index=False)

    # build metadata
    logging.info("Preparing metadata...")
    metadata = {
        "# Dataset name": project_name,
        "Path to folder": f"{output_folder.name}/",
        "Path to taxonomy": f"taxonomy/{project_name}.tsv",
        "Description": project_name,
    }
    df_metadata = pd.DataFrame.from_dict(metadata, orient="index").T
    df_metadata.to_csv(output_folder / "datasets.tsv", sep="\t", index=False)
    logging.info("Job done!")
    return


if __name__ == "__main__":
    prepare_bigslice(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
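
This prepares the folder layout BiG-SLiCE expects: a symlink to the antiSMASH results, a taxonomy/ table, and a datasets.tsv metadata file. The taxonomy file is the tab-separated table with a "#Genome Folder" column produced by the small reformatting script above. A sketch with placeholder arguments:

# Hypothetical example call:
prepare_bigslice(
    "antismash/",                   # folder with antiSMASH results (placeholder)
    "tables/df_gtdb_bigslice.tsv",  # taxonomy table with a "#Genome Folder" column (placeholder)
    "Lactobacillus_delbrueckii",    # dataset / project name
    "bigslice/input/",              # output folder to be fed to BiG-SLiCE
)
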
import logging
import sys

import numpy as np
import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def triangle_convert(matrix, outfile):
    df_raw = pd.read_csv(matrix, index_col=0)
    genome_id_list = []
    for idx in df_raw.index:
        genome_id = idx.split("\t")[0].split("/")[-1].split(".fna")[0]
        genome_id_list.append(genome_id)

    df = pd.DataFrame(0, index=genome_id_list, columns=genome_id_list)
    for idx in df_raw.index:
        genome_id = idx.split("\t")[0].split("/")[-1].split(".fna")[0]
        for cntr in range(len(idx.split("\t"))):
            if cntr > 0:
                value = idx.split("\t")[cntr]

                # if value is NA, replace with numpy null
                if value == "NA":
                    value = np.nan
                else:
                    value = float(value)

                df.loc[genome_id, genome_id_list[cntr - 1]] = value
                df.loc[genome_id_list[cntr - 1], genome_id] = value
    df.index.name = "genome_id"
    df.to_csv(outfile)
    return


if __name__ == "__main__":
    triangle_convert(sys.argv[1], sys.argv[2])
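
The parsing above (splitting each index entry on tabs and stripping a .fna suffix) suggests the input is the lower-triangular distance matrix written by mash triangle; the script mirrors it into a full square matrix indexed by genome_id. A sketch, assuming that input format and placeholder paths:

# Hypothetical example call: square-up a mash triangle output.
triangle_convert(
    "mash/triangle_distance_matrix.tsv",  # lower-triangular matrix (placeholder path)
    "tables/df_mash.csv",                 # symmetric genome_id x genome_id matrix (placeholder)
)
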
import json
import logging
import sys
from pathlib import Path

import pandas as pd
from alive_progress import alive_bar

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def combine_deeptfactor_prediction(input_json, filter_str="_deeptfactor"):
    container = {}
    logging.info("Reading json files...")
    with alive_bar(len(input_json), title="Merging DeepTFactor prediction:") as bar:
        for item in input_json:
            item = Path(item)
            genome_id = item.stem
            if filter_str in genome_id:
                genome_id = genome_id.replace(filter_str, "")
            logging.debug(f"Processing {genome_id}")
            with open(item, "r") as f:
                data = json.load(f)
                container.update(data)
            bar()
    return container


def write_deeptf_table(input_json, deeptf_table):
    """
    Write df_deeptfactor.csv table in processed data
    """
    # Handle multiple json
    input_json = Path(input_json)
    logging.info(input_json)
    if input_json.is_file() and input_json.suffix == ".json":
        logging.info(f"Getting deepTFactor overview from a single file: {input_json}")
        input_json_files = [input_json]  # wrap in a list so it can be iterated below

    elif input_json.is_file() and input_json.suffix == ".txt":
        logging.info(f"Getting deepTFactor overview from a text file: {input_json}")
        with open(input_json, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            input_json_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        input_json_files = [
            Path(file)
            for file in str(input_json).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Getting deepTFactor overview from the given list of {len(input_json_files)} files..."
        )

    df = combine_deeptfactor_prediction(input_json_files)
    df = pd.DataFrame.from_dict(df).T
    df.index.name = "locus_tag"

    logging.debug(f"Writing file to: {deeptf_table}")

    # Save dataframes to csv tables
    df_table = Path(deeptf_table)
    df_table.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(deeptf_table, index=True)
    logging.info("Job done")
    return None


if __name__ == "__main__":
    write_deeptf_table(sys.argv[1], sys.argv[2])
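
Input handling mirrors the other merge scripts: a single .json, a .txt listing JSON paths, or a space-separated string of paths. A sketch with hypothetical paths:

# Hypothetical example call: merge per-genome DeepTFactor JSONs into one csv.
write_deeptf_table(
    "deeptfactor/json_paths.txt",  # text file listing per-genome prediction JSONs (placeholder)
    "tables/df_deeptfactor.csv",   # merged table indexed by locus_tag
)
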
import sys
from pathlib import Path

import pandas as pd


def deeptfactor_to_json(prediction, outfile, keep_false=True):
    genome_id = Path(prediction).parent.stem
    df = pd.read_csv(prediction, sep="\t")
    df.loc[:, "genome_id"] = genome_id
    # rename columns
    df = df.rename(
        columns={"prediction": "deeptfactor_prediction", "score": "deeptfactor_score"}
    )
    df = df.set_index("sequence_ID", drop=True)
    if keep_false:
        pass
    else:
        df = df[df["deeptfactor_prediction"]]  # keep only rows predicted to be transcription factors
    df.T.to_json(outfile)
    return


if __name__ == "__main__":
    deeptfactor_to_json(sys.argv[1], sys.argv[2])
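
The prediction argument is the tab-separated DeepTFactor result table (with sequence_ID, prediction, and score columns); the genome id is taken from the parent directory name. A sketch with placeholder paths:

# Hypothetical example call: convert one genome's DeepTFactor table to JSON,
# keeping only positive predictions.
deeptfactor_to_json(
    "deeptfactor/GCF_000000000.1/prediction_result.txt",  # placeholder path; parent folder = genome id
    "deeptfactor/GCF_000000000.1.json",
    keep_false=False,
)
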
import logging
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def extract_ncbi_information(assembly_report_path, outfile):
    """
    Extract NCBI assembly json reports generated by get_assembly_information.py in a given directory into a csv file

    Parameters
    ----------
    1. assembly_report_path : str / path
        Four different input types can be used:
        - A directory containing assembly information json files generated by get_assembly_information.py
        - A single json file generated by get_assembly_information.py
        - A list of files as string separated by spaces (put the string inside '' in bash expression)
        - A text file (.txt) containing the paths of the json files (one per line, or space-separated on a single line)
    2. outfile : str / path
        Location of the output csv file

    Returns
    -------
    1. outfile : .csv file
        A comma-separated table summarizing the NCBI assembly reports metadata
    """

    assembly_report_path = Path(assembly_report_path)
    outfile = Path(outfile)

    if assembly_report_path.is_dir():
        logging.info(
            f"Summarizing NCBI assembly information from {assembly_report_path}..."
        )
        ncbi_json_files = [
            file
            for file in assembly_report_path.iterdir()
            if file.is_file() and file.suffix == ".json"
        ]
    elif assembly_report_path.is_file() and assembly_report_path.suffix == ".json":
        logging.info(
            f"Getting NCBI assembly information from a single file: {assembly_report_path}"
        )
        ncbi_json_files = [assembly_report_path]
    elif assembly_report_path.is_file() and assembly_report_path.suffix == ".txt":
        logging.info(
            f"Reading NCBI assembly information from a text file: {assembly_report_path}"
        )
        with open(assembly_report_path, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            ncbi_json_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        ncbi_json_files = [
            Path(file)
            for file in str(assembly_report_path).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Summarizing NCBI assembly information from the given list of {len(ncbi_json_files)} files..."
        )

    dfs = []  # an empty list to store the data frames
    for file in ncbi_json_files:
        data = pd.read_json(file).T  # read data frame from json file
        dfs.append(data)  # append the data frame to the list

    df_ncbi = pd.concat(dfs)  # concatenate all the data frames in the list.
    df_ncbi.reset_index(inplace=True)
    df_ncbi = df_ncbi.rename(columns={"index": "genome_id"})

    logging.info(f"Summarized {len(df_ncbi.index)} NCBI assembly information.")
    df_ncbi.to_csv(outfile, index=False)
    return None


if __name__ == "__main__":
    extract_ncbi_information(sys.argv[1], sys.argv[2])
import sys

import pandas as pd


def extract_ncbi_information(
    ncbi_meta_path, patric_genome_summary, patric_genome_metadata, patric_meta_path
):
    """
    Extract PATRIC assembly json reports generated by get_assembly_information.py in a given directory into a csv file

    Parameters
    ----------
    1. assembly_report_path : str / path
        Three different input types can be used:
        - A directory containing assembly information json files generated by get_assembly_information.py
        - A single json file generated by get_assembly_information.py
        - A list of files as string separated by spaces (put the string inside '' in bash expression)
    2. outfile : str / path
        Location of the output csv file

    Returns
    -------
    1. outfile : .csv file
        A comma separated table summarizing the NCBI assembly reports metadata
    '''
    """

    df_ncbi = pd.read_csv(ncbi_meta_path, index_col="genome_id")
    df_patric_genome_summary = pd.read_csv(
        patric_genome_summary,
        sep="\t",
        dtype={"genome_id": str, "genome_name": str, "taxon_id": str, "strain": str},
    )
    df_patric_genome_summary.set_index("genome_id", inplace=True)

    df_patric_genome_metadata = pd.read_csv(
        patric_genome_metadata,
        sep="\t",
        dtype={"genome_id": str, "genome_name": str, "taxon_id": str},
    )
    df_patric_genome_metadata.set_index("genome_id", inplace=True)

    df_patric_genome_metadata = df_patric_genome_metadata[
        df_patric_genome_metadata["genome_status"] != "Plasmid"
    ]

    df_patric_meta = pd.DataFrame()

    for genome_id in df_ncbi.index:
        refseq = df_ncbi.loc[genome_id, "refseq"]
        genbank = df_ncbi.loc[genome_id, "genbank"]
        if refseq in df_patric_genome_metadata["assembly_accession"].tolist():
            patric_genome_ids_list = df_patric_genome_metadata[
                df_patric_genome_metadata["assembly_accession"] == refseq
            ].index.tolist()
        elif genbank in df_patric_genome_metadata["assembly_accession"].tolist():
            patric_genome_ids_list = df_patric_genome_metadata[
                df_patric_genome_metadata["assembly_accession"] == genbank
            ].index.tolist()

        for patric_genome_id in patric_genome_ids_list:
            df_patric_meta.loc[patric_genome_id, "bgcflow_genome_id"] = genome_id
            df_patric_meta.loc[
                patric_genome_id, df_patric_genome_metadata.columns
            ] = df_patric_genome_metadata.loc[
                patric_genome_id, df_patric_genome_metadata.columns
            ]
            if patric_genome_id in df_patric_genome_summary.index:
                for col in df_patric_genome_summary.columns:
                    if col not in df_patric_genome_metadata.columns:
                        df_patric_meta.loc[
                            patric_genome_id, col
                        ] = df_patric_genome_summary.loc[patric_genome_id, col]

    df_patric_meta.index.name = "patric_genome_id"
    df_patric_meta.to_csv(patric_meta_path)

    return None


if __name__ == "__main__":
    extract_ncbi_information(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
import json
import logging
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def summarize_gtdb_json(accession_list, df_gtdb_output):
    """
    Given a string of json filepaths generated by gtdb_prep.py,
    merge them together into one csv table
    """

    # Reading input
    logging.info("Reading GTDB metadata .json files...")

    input_json = Path(accession_list)

    if input_json.is_file() and input_json.suffix == ".json":
        logging.info(f"Getting GTDB taxonomy from a single file: {input_json}")
        input_json_files = [input_json]  # wrap in a list so the loop below can iterate over it

    elif input_json.is_file() and input_json.suffix == ".txt":
        logging.info(f"Getting GTDB taxonomy from a text file: {input_json}")
        with open(input_json, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            input_json_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        input_json_files = [
            Path(file)
            for file in str(input_json).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Getting GTDB_taxonomy from the given list of {len(input_json_files)} files..."
        )

    accession = input_json_files
    out = []
    for a in accession:
        logging.info(f"Reading {a}...")
        with open(a, "r") as f:
            out.append(json.load(f))
    df = pd.DataFrame(out).set_index("genome_id", drop=False)

    # Getting taxonomic information
    logging.info("Getting taxonomic information...")
    df_taxonomy = pd.DataFrame.from_dict(
        {i: df.loc[i, "gtdb_taxonomy"] for i in df.index}
    ).T
    # Split between species (specific epithet) and organism
    df_taxonomy["organism"] = df_taxonomy["species"]
    for idx in df_taxonomy.index:
        try:
            df_taxonomy.loc[idx, "species"] = df_taxonomy.loc[idx, "species"].split(
                " "
            )[1]
        except IndexError:  # leave blank for empty taxonomy
            pass
    # Tidying up
    df_taxonomy.columns = [c.title() for c in df_taxonomy.columns]

    # Getting other metadata
    try:
        logging.info("Getting metadata into table...")
        if "metadata" not in df.columns:
            logging.warning(
                "metadata is not in the column information. Adding default values..."
            )
            df["metadata"] = [{"genome_id": genome_id} for genome_id in df.index]
        if "gtdb_release" not in df.columns:
            logging.warning(
                "gtdb_release is not in the column information. Adding default values..."
            )
            df["gtdb_release"] = "unknown"
        metadata = pd.DataFrame.from_dict(
            {i: df.loc[i, "metadata"] for i in df.index}
        ).T

        metadata_container = []
        for c in metadata.columns:
            value = {i: metadata.loc[i, c] for i in metadata.index}
            try:
                metadata_container.append(pd.DataFrame.from_dict(value).T)
            except ValueError:
                metadata_container.append(
                    pd.DataFrame([value]).T.rename(columns={0: c})
                )
        df_metadata = pd.concat(metadata_container, axis=1)

        logging.info("Finalizing table...")
        df_final = pd.concat(
            [df.loc[:, ["genome_id", "gtdb_release"]], df_taxonomy, df_metadata], axis=1
        )

    except KeyError:
        logging.info("No additional metadata found.")
        logging.info("Finalizing table...")
        df_final = pd.concat(
            [df.loc[:, ["genome_id", "gtdb_release"]], df_taxonomy], axis=1
        )

    # save to file
    logging.info(f"Writing to file: {df_gtdb_output}")
    df_final.to_csv(df_gtdb_output, index=False)
    return None


if __name__ == "__main__":
    summarize_gtdb_json(sys.argv[1], sys.argv[2])
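
The accession_list argument follows the same convention as the other merge scripts (a single .json, a .txt of paths, or a space-separated string of paths), where each JSON is the per-genome metadata written by gtdb_prep.py. A sketch with placeholder paths:

# Hypothetical example call: merge per-genome GTDB JSONs into one taxonomy table.
summarize_gtdb_json(
    "gtdb/json_paths.txt",      # text file listing the gtdb_prep.py outputs (placeholder)
    "tables/df_gtdb_meta.csv",  # merged taxonomy / metadata table (placeholder name)
)
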
import json
import logging
import re
import subprocess
import sys
from datetime import datetime
from pathlib import Path

import pandas as pd
from Bio import SeqIO

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def get_git_version():
    """
    Get the sha1 of the current git version
    """

    git_version = ""
    try:
        version_cmd = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            universal_newlines=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        status_cmd = subprocess.run(
            ["git", "status", "--porcelain"],
            universal_newlines=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )

        git_version = str(version_cmd.stdout.strip())
        changes = str(status_cmd.stdout).strip()
        if changes:
            git_version += "(changed)"
    except OSError:
        pass

    return git_version


def get_version(version):
    """
    Get the current version string
    """

    git_version = get_git_version()
    if git_version:
        version += "-%s" % git_version
    return version


def add_bgcflow_comments(gbk_in_path, version, json_path, genome_id, gbk_out_path):
    """
    Add bgcflow meta-annotation to genbank output
    """

    version = get_version(version)

    logging.info(f"Formatting genbank file: {gbk_in_path}")

    temp_gbk = Path(gbk_in_path).parent / f"{genome_id}-change_log.gbk"
    try:
        [record for record in SeqIO.parse(gbk_in_path, "genbank")]
    except ValueError as e:
        logging.warning(f"Parsing fail: {e}")
        logging.info("Attempting to fix genbank file...")
        change_log_path = Path(gbk_in_path).parent / f"{genome_id}-change_log.json"
        change_log = correct_collided_headers(
            gbk_in_path, genome_id, temp_gbk, json_dump=change_log_path
        )
        logging.info("Retry parsing with Bio.SeqIO...")

    if temp_gbk.is_file():
        records = SeqIO.parse(temp_gbk, "genbank")
        temp_gbk.unlink()
    else:
        records = SeqIO.parse(gbk_in_path, "genbank")

    df_gtdb_meta = pd.read_json(json_path, orient="index").T
    df_gtdb_meta.fillna("Unclassified", inplace=True)
    df_tax = pd.DataFrame([i for i in df_gtdb_meta.gtdb_taxonomy])
    for col in df_tax.columns:
        df_tax[col] = df_tax[col].apply(lambda x: x.split("__")[1])
    df_tax.columns = [c.title() for c in df_tax.columns]
    tax_levels = ["Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"]
    taxonomy_str = df_tax.loc[0, tax_levels].tolist()
    logging.debug(f"Taxonomy found: {taxonomy_str}")

    bgcflow_comment = (
        "##BGCflow-Data-START##\n"
        "Version      :: {version}\n"
        "Run date     :: {date}\n"
    ).format(
        version=version,
        date=str(datetime.now().strftime("%Y-%m-%d %H:%M:%S")),
    )

    new_records = []
    ctr = 0
    for record in records:
        logging.info(f"Formatting record {ctr}...")
        comment = bgcflow_comment
        if "comment" in record.annotations:
            record.annotations["comment"] += "\n" + comment
        else:
            record.annotations["comment"] = comment

        try:
            if not change_log[ctr]["accept_format"]:
                record.annotations[
                    "comment"
                ] += f"Original ID  :: {change_log[ctr]['original_id']}\n"
        except UnboundLocalError:
            pass

        record.annotations["comment"] += "##BGCflow-Data-END##"

        if "organism" in record.annotations:
            organism = record.annotations["organism"]
            if "Unclassified" in organism:
                record.annotations["organism"] = organism.split(" Unclassified")[
                    0
                ].strip()

        record.annotations["taxonomy"] = taxonomy_str

        new_records.append(record)

        ctr = ctr + 1

    logging.info(f"Writing final output: {gbk_out_path}")
    with open(gbk_out_path, "w") as output_handle:
        SeqIO.write(new_records, output_handle, "genbank")


# -- Correcting IDs --#
# 1. Traverse text, make a library of record header, length, and ids
# 2. Check the library for collided locus name and length
# 3. Propose changes
# 4. Make sure changes is unique
# 5. Write a new genbank file and recorded changes as json file


def correct_collided_headers(
    gbk_in_path, accession_id, outfile, json_dump="header.json"
):
    record_headers = get_record_headers(gbk_in_path)
    record_headers = shorten_record_headers(record_headers, accession_id)

    with open(json_dump, "w", encoding="utf8") as json_file:
        json.dump(record_headers, json_file, indent=4)

    logging.info(f"Writing result to: {outfile}")
    with open(gbk_in_path) as f:
        s = f.read()

    with open(outfile, "w") as f:
        for k in record_headers.keys():
            if not record_headers[k]["accept_format"]:
                s = s.replace(
                    record_headers[k]["original_header"],
                    record_headers[k]["new_header"],
                )
        f.write(s)
    return record_headers


def get_record_headers(gbk_in_path):
    """
    Get the header information for each records in a genbank file.
    Returns a json-like python dictionary.
    """
    record_headers = {}

    # Find all headers
    with open(gbk_in_path, "r") as file:
        logging.info(f"Reading file as text: {gbk_in_path}")
        ctr = 0
        for line in file:
            if line.startswith("LOCUS"):
                record_headers[ctr] = {"original_header": line}
            if "source" in line:
                length = line.split()[-1].split("..")[-1]
                record_headers[ctr]["length"] = length
                ctr = ctr + 1
    logging.debug(f"Found {len(record_headers)} records.")

    # Check for collided headers
    logging.info("Checking header format...")
    for k in record_headers.keys():
        query = record_headers[k]["original_header"].split()
        # a fused locus name and length shifts the "bp" token out of its usual position
        if query[3] != "bp":
            logging.warning("Record name and length collide in the LOCUS line")
            query_length = record_headers[k]["length"]
            if query_length in query[1]:
                original_id = query[1].replace(query_length, "")
                record_headers[k]["original_id"] = original_id
                record_headers[k]["accept_format"] = False
        else:
            record_headers[k]["original_id"] = query[1]
            record_headers[k]["accept_format"] = True
    return record_headers


def modify_id(accession_id, original_id, locus_index, record_counts, id_collection):
    """
    Attempt to fix collided name and length in the locus name by shortening id
    """
    logging.info(f"{original_id}: Attempting to change locus name...")

    # For refseq record, remove the accession prefix (first two digits) from string
    if accession_id.startswith("GCF"):
        logging.debug(
            f"{original_id}: Assuming locus came from Refseq. Removing refseq accession prefix."
        )
        refseq_type, refseq_number, refseq_version = re.split("_|[.]", original_id)
        new_id = f"{refseq_number}.{refseq_version}"

    elif accession_id.startswith("GCA"):
        logging.info(
            f"{original_id}: Assuming locus came from Genbank. Removing version from locus name..."
        )
        new_id, genbank_version = re.split("[.]", original_id)

    # For unknown source remove last 4 digit, add the contig index in front.
    # example: NZAJABAQG010000001.1 --> c1|NZAJABAQG010000
    else:
        logging.info(
            f"{original_id}: Cannot determine source. Shortening locus name..."
        )
        digit = len(str(record_counts))
        contig_number = str(locus_index + 1)

        new_id = f"c{contig_number.zfill(digit)}_{original_id[:-(digit + 4)]}"

    # make sure new_id is unique
    logging.info(f"{original_id}: Making sure id is unique...")
    if new_id in id_collection:
        raise ValueError(f"{original_id}: shortened id {new_id} is not unique")
    else:
        logging.debug(f"{original_id}: shortened to {new_id}")
    return new_id


def shorten_record_headers(record_headers, accession_id):
    """
    Shorten record headers
    """
    record_counts = len(record_headers.keys())
    id_collection = [record_headers[k]["original_id"] for k in record_headers.keys()]

    for k in record_headers.keys():
        if record_headers[k]["accept_format"]:
            pass
        else:
            original_id = record_headers[k]["original_id"]
            old_header = record_headers[k]["original_header"]

            # Correct header name
            new_id = modify_id(
                accession_id, original_id, k, record_counts, id_collection
            )
            new_id_replacement_value = (
                f"{new_id}{(len(original_id) - len(new_id)) * ' '}"
            )
            new_header = old_header.replace(original_id, new_id_replacement_value)

            # Add value to dict
            record_headers[k]["new_id"] = new_id
            record_headers[k]["new_header"] = new_header

    return record_headers


if __name__ == "__main__":
    add_bgcflow_comments(
        sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5]
    )
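
The positional arguments mirror the __main__ block: the input GenBank file to annotate, a version string, the GTDB metadata JSON for the genome (used to fill the taxonomy), the genome id, and the output GenBank path. A sketch with placeholder values:

# Hypothetical example call: add BGCflow comments and GTDB taxonomy to a GenBank file.
add_bgcflow_comments(
    "genbank/GCF_000000000.1.gbk",  # input GenBank (placeholder path)
    "0.7.4",                        # workflow version string (placeholder)
    "gtdb/GCF_000000000.1.json",    # GTDB metadata JSON with gtdb_taxonomy (placeholder)
    "GCF_000000000.1",              # genome id
    "processed/GCF_000000000.1.gbk",  # output GenBank (placeholder)
)
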
import json
import logging
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def correct_bgc_id_overview(overview_file, mapping_file, genome_id=False):
    """
    Use a mapping file to correct bgc_ids

    Parameters
    ----------
    overview_file : str / path
        Path to the BGC overview file in JSON format

    mapping_file : str / path
        Path to the mapping file in JSON format

    genome_id : str, optional
        Genome ID to correct BGC IDs for, if not provided, it is extracted from the overview file name

    Returns
    -------
    new_dict : dict
        Corrected BGC overview dictionary with updated BGC IDs
    """
    logging.info(f"Correcting shortened bgc ids for {genome_id}...")
    overview_path = Path(overview_file)
    mapping_path = Path(mapping_file)

    with open(overview_path, "r") as f:
        overview = json.load(f)

    with open(mapping_path, "r") as f:
        mapping = json.load(f)

    if not genome_id:
        # remove only the "_bgc_overview" suffix (replace, not strip)
        genome_id = overview_path.stem.replace("_bgc_overview", "")

    new_dict = {}

    for bgc_id in overview.keys():
        for m in mapping[genome_id].keys():
            if bgc_id in m:
                log_change = mapping[genome_id][m]
                correct_bgc_id = Path(log_change["symlink_path"]).stem
                if log_change["record_id"] != log_change["original_id"]:
                    logging.debug(f"Replacing {bgc_id} to {correct_bgc_id}")
                    new_dict[correct_bgc_id] = overview[bgc_id]
                    new_dict[correct_bgc_id]["accession"] = log_change["original_id"]
                else:
                    new_dict[correct_bgc_id] = overview[bgc_id]
                new_dict[correct_bgc_id]["source"] = "bgcflow"
                new_dict[correct_bgc_id]["gbk_path"] = log_change["target_path"]
    logging.info("Done!")
    return new_dict


def gather_bgc_overview(input_json, mapping_dir, table):
    """
    Gather BGC overview data from multiple JSON files and create a merged table

    Parameters
    ----------
    input_json : str
        Two different input types can be used:
        - Space-separated paths to BGC overview JSON files (put the string inside '' in bash expression)
        - A text file (.txt) containing the paths of the json files (one per line, or space-separated on a single line)
    mapping_dir : str / path
        Directory containing mapping files

    table : str / path
        Path to the output merged table

    Returns
    -------
    None
    """
    input_json = Path(input_json)
    logging.info(input_json)
    if input_json.is_file() and input_json.suffix == ".json":
        logging.info(f"Getting BGC overview from a single file: {input_json}")
        input_json_files = input_json

    elif input_json.is_file() and input_json.suffix == ".txt":
        logging.info(f"Getting BGC overview  from a text file: {input_json}")
        with open(input_json, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            input_json_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        input_json_files = [
            Path(file)
            for file in str(input_json).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Getting BGC overview from the given list of {len(input_json_files)} files..."
        )

    merged_dict = {}
    for j in input_json_files:
        mapping_file = Path(j)
        genome_id = mapping_file.name.replace("_bgc_overview.json", "")
        mapping_path = Path(mapping_dir) / f"{genome_id}/{genome_id}-change_log.json"
        corrected = correct_bgc_id_overview(mapping_file, mapping_path, genome_id)
        merged_dict.update(corrected)

    df = pd.DataFrame.from_dict(merged_dict).T
    df.index.name = "bgc_id"

    logging.debug(f"Writing file to: {table}")

    # Save dataframes to csv tables
    df_table = Path(table)
    df_table.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(table)
    logging.info("Job done")
    return None


if __name__ == "__main__":
    gather_bgc_overview(sys.argv[1], sys.argv[2], sys.argv[3])
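A minimal invocation sketch, assuming the functions above are importable; every path below is a placeholder that mirrors the expected layout (`<genome_id>_bgc_overview.json` inputs and `<mapping_dir>/<genome_id>/<genome_id>-change_log.json` mapping files):

```python
# Merge two per-genome BGC overviews into a single CSV table.
gather_bgc_overview(
    input_json="results/GCF_000001_bgc_overview.json results/GCF_000002_bgc_overview.json",
    mapping_dir="results/antismash_changelogs",
    table="tables/df_bgc_overview.csv",
)
```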
import json
import logging
import sys
from pathlib import Path

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def region_table_builder(f, accession):
    """
    Given a feature of a record, return value to build database table
    """
    # grab region values
    region_number = f["qualifiers"]["region_number"][0]
    region_id = f"{accession}.region{str(region_number).zfill(3)}"
    location = f["location"].strip("[").strip("]")
    start_pos, end_pos = location.split(":")
    contig_edge = f["qualifiers"]["contig_edge"][0]
    # fill values
    value = {
        region_id: {
            "accession": accession,
            "region_number": region_number,
            "location": location,
            "start_pos": start_pos,
            "end_pos": end_pos,
            "contig_edge": contig_edge,
            "product": f["qualifiers"]["product"],
            "rules": f["qualifiers"]["rules"],
        }
    }
    return value


def get_antismash_overview(json_path, outfile, genome_id=False, n_hits=1):
    """
    Read antismash json output and get the summary page into a json format
    """
    path = Path(json_path)
    with open(path, "r") as f:
        data = json.load(f)

    logging.info(f"Processing: {json_path}, custom genome_id: {genome_id}")

    if not genome_id:
        # remove only the ".gbk" extension (replace, not strip, to avoid trimming id characters)
        genome_id = data["input_file"].replace(".gbk", "")

    logging.debug(f"Genome id: {genome_id}")

    # iterating over record
    output = {}
    for r, record in enumerate(data["records"]):
        logging.info(f"Getting antismash regions from record: {record['id']}")
        region_db = {}

        region_feat = [i for i in record["features"] if i["type"] == "region"]
        for f in region_feat:
            region_db.update(region_table_builder(f, record["id"]))

        for c, area in enumerate(record["areas"]):
            cluster_id = f"{r+1}.{c+1}"
            output_cluster = {}
            logging.info(f"Grabbing information from region {cluster_id}")
            # _from = area['start']
            # _to = area['end']
            # products = area['products']

            knownclusterblast = record["modules"]["antismash.modules.clusterblast"][
                "knowncluster"
            ]["results"][c]

            assert n_hits > 0

            output_hits = []

            for n, hits in enumerate(knownclusterblast["ranking"]):
                if n + 1 <= (n_hits):
                    most_similar_mibig_id = hits[0]["accession"]
                    most_similar_mibig_description = hits[0]["description"]
                    most_similar_mibig_clustertype = hits[0]["cluster_type"]
                    n_genes_in_target = len(hits[0]["tags"])
                    n_genes_hits = hits[1]["hits"]
                    hit_similarity = n_genes_hits / n_genes_in_target
                    output_hits.append(
                        {
                            "most_similar_known_cluster_id": most_similar_mibig_id,
                            "most_similar_known_cluster_description": most_similar_mibig_description,
                            "most_similar_known_cluster_type": most_similar_mibig_clustertype,
                            "similarity": hit_similarity,
                        }
                    )
                else:
                    pass

            bgc_id = f"{record['id']}.region{str(c+1).zfill(3)}"
            output_cluster = {
                "genome_id": genome_id,
                "region": cluster_id,
            }

            for column in [
                "accession",
                "start_pos",
                "end_pos",
                "contig_edge",
                "product",
            ]:
                output_cluster[column] = region_db[bgc_id][column]
            try:
                output_cluster["region_length"] = int(output_cluster["end_pos"]) - int(
                    output_cluster["start_pos"]
                )
            except ValueError:
                logging.warning(
                    f'Error calculating region length. Region might be incomplete: {output_cluster["start_pos"]}:{output_cluster["end_pos"]}'
                )
                start_pos = "".join(
                    [s for s in output_cluster["start_pos"] if s.isdigit()]
                )
                logging.warning(
                    f'Correcting start position from {output_cluster["start_pos"]} to {start_pos}'
                )
                output_cluster["start_pos"] = start_pos
                output_cluster["region_length"] = int(output_cluster["end_pos"]) - int(
                    output_cluster["start_pos"]
                )

            if len(output_hits) == 1:
                for k in output_hits[0].keys():
                    output_cluster[k] = output_hits[0][k]

            output[bgc_id] = output_cluster

    with open(outfile, "w") as f:
        logging.info(f"Writing results to {outfile}")
        json.dump(output, f, indent=2)

    return output


if __name__ == "__main__":
    get_antismash_overview(sys.argv[1], sys.argv[2], sys.argv[3])
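A minimal usage sketch with placeholder paths, assuming the functions above are importable; `n_hits` keeps only the top knownclusterblast hit by default:

```python
# Summarize the antiSMASH regions of one genome into a compact JSON overview.
get_antismash_overview(
    json_path="results/antismash/GCF_000001/GCF_000001.json",
    outfile="results/antismash/GCF_000001_bgc_overview.json",
    genome_id="GCF_000001",
)
```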
import os
import sys

import pandas as pd


def get_ncbi_meta(assembly_report_path, outfile=None, genome_id=None):
    """
    Read an NCBI assembly report downloaded with ncbi-genome-download (https://github.com/kblin/ncbi-genome-download) and convert it to a json format.
    The assembly report can be downloaded by passing assembly_report to the --formats option of ncbi-genome-download.

    Parameters
    ----------
    1. assembly_report_path : str / path
        Location of the NCBI assembly report file downloaded with ncbi-genome-download
    2. outfile : str / path
        Location of the output json file
    3. genome_id : str
        Genome accession id

    Returns
    -------
    1. outfile : .json file
        A json file summarizing the NCBI assembly reports metadata.
        If genome_id is not provided, the script will determine the genome id from the file basename.
        If outfile is not defined, it will generate a json file in the current directory with this format: {genome_id}.json.
    """
    if genome_id is None:
        genome_id = os.path.splitext(os.path.basename(assembly_report_path))[0]

    if outfile is None:
        outfile = f"{genome_id}.json"

    # List of columns in df_ncbi_meta
    ncbi_meta_columns = [
        "assembly",
        "organism",
        "genus",
        "species",
        "strain",
        "tax_id",
        "refseq_category",
        "refseq",
        "genbank",
        "assembly_type",
        "release_type",
        "assembly_level",
        "genome_representation",
        "refseq_genbank_identity",
        "biosample",
        "submitter",
        "date",
    ]

    df_ncbi_meta = pd.DataFrame(index=[genome_id], columns=ncbi_meta_columns)
    df_ncbi_meta.index.name = "genome_id"

    with open(assembly_report_path, "r") as report_file:
        lines = report_file.readlines()
        strain_found = False

        for line in lines:
            if line.startswith("# Assembly name:"):
                assembly_accn = line.split("Assembly name:")[1].strip()
                df_ncbi_meta.loc[genome_id, "assembly"] = assembly_accn
            elif line.startswith("# Organism name"):
                organism = line.split("Organism name:")[1].strip()
                df_ncbi_meta.loc[genome_id, "organism"] = organism
                df_ncbi_meta.loc[genome_id, "genus"] = organism.strip().split(" ")[0]
                df_ncbi_meta.loc[genome_id, "species"] = organism.strip().split(" ")[1]
            elif line.startswith("# Infraspecific name:"):
                strain = line.split("Infraspecific name:")[1].strip()
                if "strain=" in strain:
                    df_ncbi_meta.loc[genome_id, "strain"] = strain.split("strain=")[1]
                    strain_found = True
            elif line.startswith("# Isolate:"):
                if strain_found is False:
                    strain = line.split("Isolate:")[1].strip()
                    df_ncbi_meta.loc[genome_id, "strain"] = strain
                    strain_found = True
            elif line.startswith("# Taxid"):
                tax_id = line.split("Taxid:")[1].strip()
                df_ncbi_meta.loc[genome_id, "tax_id"] = tax_id
            elif line.startswith("# BioSample"):
                biosample = line.split("BioSample:")[1].strip()
                df_ncbi_meta.loc[genome_id, "biosample"] = biosample
            elif line.startswith("# BioProject"):
                bioproject = line.split("# BioProject:")[1].strip()
                df_ncbi_meta.loc[genome_id, "BioProject"] = bioproject
            elif line.startswith("# Submitter"):
                submitter = line.split("Submitter:")[1].strip()
                df_ncbi_meta.loc[genome_id, "submitter"] = submitter
            elif line.startswith("# Date"):
                date = line.split("Date:")[1].strip()
                df_ncbi_meta.loc[genome_id, "date"] = date
            elif line.startswith("# Assembly type:"):
                assembly_type = line.split("Assembly type:")[1].strip()
                df_ncbi_meta.loc[genome_id, "assembly_type"] = assembly_type
            elif line.startswith("# Release type:"):
                release_type = line.split("Release type:")[1].strip()
                df_ncbi_meta.loc[genome_id, "release_type"] = release_type
            elif line.startswith("# Assembly level:"):
                assembly_level = line.split("Assembly level:")[1].strip()
                df_ncbi_meta.loc[genome_id, "assembly_level"] = assembly_level
            elif line.startswith("# Genome representation:"):
                genome_representation = line.split("Genome representation:")[1].strip()
                df_ncbi_meta.loc[
                    genome_id, "genome_representation"
                ] = genome_representation
            elif line.startswith("# RefSeq category"):
                refseq_category = line.split("RefSeq category:")[1].strip()
                df_ncbi_meta.loc[genome_id, "refseq_category"] = refseq_category
            elif line.startswith("# RefSeq assembly accession"):
                refseq = line.split("RefSeq assembly accession:")[1].strip()
                df_ncbi_meta.loc[genome_id, "refseq"] = refseq
            elif line.startswith("# GenBank assembly accession"):
                genbank = line.split("GenBank assembly accession:")[1].strip()
                df_ncbi_meta.loc[genome_id, "genbank"] = genbank
            elif line.startswith("# RefSeq assembly and GenBank assemblies identical"):
                refseq_genbank_identity = line.split(
                    "RefSeq assembly and GenBank assemblies identical:"
                )[1].strip()
                df_ncbi_meta.loc[
                    genome_id, "refseq_genbank_identity"
                ] = refseq_genbank_identity

        if strain_found is False:
            df_ncbi_meta.loc[genome_id, "strain"] = genome_id

    df_ncbi_meta.to_json(outfile, orient="index", indent=4)
    return None


if __name__ == "__main__":
    get_ncbi_meta(sys.argv[1], sys.argv[2], sys.argv[3])
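A minimal usage sketch, assuming the function above is importable; both paths are placeholders for an assembly report downloaded as described in the docstring and the desired output location:

```python
# Parse an NCBI assembly report into a per-genome JSON metadata file.
get_ncbi_meta(
    assembly_report_path="resources/ncbi/GCF_000001.txt",
    outfile="results/ncbi_meta/GCF_000001.json",
    genome_id="GCF_000001",
)
```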
import json
import logging
import sys
from pathlib import Path

from Bio import SeqIO

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def count_bgcs(gbk_file_path, genome_id=False, outfile=False):
    gbk_file_path = Path(gbk_file_path)

    if not isinstance(genome_id, str):
        genome_id = gbk_file_path.stem

    logging.debug(f"Summarize BGCs info for {gbk_file_path}")
    logging.debug(f"Genome id is: {genome_id}")
    records_list = SeqIO.parse(gbk_file_path, "genbank")

    bgc_stats = {}

    # statistics counter
    bgc_cntr = 0
    protoclusters_cntr = 0
    cand_clusters_cntr = 0
    contig_edge_cntr = 0

    # product capture
    bgc_type_dict = {}

    for rec in records_list:
        # Information on the number of BGCs, protoclusters and candidate clusters
        for feat in rec.features:
            if feat.type == "region":
                # get region counts
                bgc_cntr = bgc_cntr + 1
                if feat.qualifiers["contig_edge"][0] == "True":
                    contig_edge_cntr = contig_edge_cntr + 1

                # get product counts
                bgc_type = ".".join(sorted(feat.qualifiers["product"]))
                if bgc_type in bgc_type_dict.keys():
                    bgc_type_dict[bgc_type] = bgc_type_dict[bgc_type] + 1
                else:
                    bgc_type_dict[bgc_type] = 1
            # protocluster and cand_cluster are separate feature types,
            # so they are counted outside the "region" branch
            elif feat.type == "protocluster":
                protoclusters_cntr = protoclusters_cntr + 1
            elif feat.type == "cand_cluster":
                cand_clusters_cntr = cand_clusters_cntr + 1

    bgc_stats["bgcs_count"] = bgc_cntr
    bgc_stats["bgcs_on_contig_edge"] = contig_edge_cntr
    bgc_stats["protoclusters_count"] = protoclusters_cntr
    bgc_stats["cand_clusters_count"] = cand_clusters_cntr
    result = {genome_id: bgc_stats | {"products": bgc_type_dict}}

    if not isinstance(outfile, str):
        return result
    else:
        logging.debug(f"Writing output to: {outfile}")
        with open(outfile, "w") as f:
            json.dump(result, f, indent=2)
        return


if __name__ == "__main__":
    count_bgcs(sys.argv[1], sys.argv[2], sys.argv[3])
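A minimal usage sketch on a single antiSMASH GenBank file (the path is a placeholder); when `outfile` is not a string, the summary dictionary is returned instead of being written:

```python
stats = count_bgcs("results/antismash/GCF_000001/GCF_000001.gbk", genome_id="GCF_000001")
# Expected shape (values are illustrative):
# {"GCF_000001": {"bgcs_count": 12, "bgcs_on_contig_edge": 3,
#                 "protoclusters_count": 14, "cand_clusters_count": 13,
#                 "products": {"NRPS": 4, "terpene": 2}}}
```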
import sys
from pathlib import Path

import pandas as pd


def get_bigscape_mapping(path, outfile):
    """
    Given a path containing directories of antiSMASH result, return a table
    which maps BGC regions with its parent folder (aka accession ids)
    """
    container = {}
    for x in Path(path).iterdir():
        if x.is_dir():
            genome_id = x.name
            bgcs = {item.stem: genome_id for item in list(x.glob("*.region*.gbk"))}
            container.update(bgcs)
    df = pd.DataFrame.from_dict(container, orient="index", columns=["genome_id"])
    df.index.name = "bgc_id"
    df.to_csv(outfile)
    return None


if __name__ == "__main__":
    get_bigscape_mapping(sys.argv[1], sys.argv[2])
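A minimal usage sketch; the input directory is a placeholder expected to contain one antiSMASH result folder per genome, each holding `*.region*.gbk` files:

```python
# Build a bgc_id -> genome_id mapping table for BiG-SCAPE.
get_bigscape_mapping("results/antismash", "tables/df_bigscape_mapping.csv")
```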
import logging
import shutil
import sqlite3
import sys
from pathlib import Path

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def get_bigslice_query(
    project_name, output_folder, result_folder="resources/bigslice/full_run_result/"
):
    bigslice_full_run = Path(result_folder)
    query_id = find_query_result(project_name, bigslice_full_run)
    fetch_query_tables(query_id, output_folder, bigslice_full_run)
    logging.info("Copying SQL database of the run...")
    shutil.copy(
        bigslice_full_run / f"reports/{query_id}/data.db",
        Path(output_folder) / f"{query_id}.db",
    )
    logging.info("Job done!")
    return


def find_query_result(project_name, bigslice_full_run):
    logging.info("Connecting to BiG-SLICE SQL Reports...")
    conn = sqlite3.connect(bigslice_full_run / "reports/reports.db")
    df = pd.read_sql_query("select * from reports;", conn)

    # select only run result of the project
    match = []
    for i in df.index:
        if project_name in df.loc[i, "name"]:
            match.append(i)
    df = df.loc[match, :]
    logging.debug(f"Found resulting match to project {project_name}:\n{df}")

    # select the latest run for the project
    df["creation_date"] = pd.to_datetime(df["creation_date"])
    filter_time = df.creation_date == df.creation_date.max()
    df = df.where(filter_time).dropna()
    logging.debug(f"Extracting information from the latest run:{df.name.values}")
    return int(df.id.values)


def fetch_query_tables(query_id, output_folder, bigslice_full_run):
    # Connect to database
    conn = sqlite3.connect(bigslice_full_run / f"reports/{query_id}/data.db")

    # get all table name from sql
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    sql_table_names = [i[0] for i in cursor.fetchall()]

    output_folder = Path(output_folder)

    try:
        output_folder.mkdir(parents=True, exist_ok=False)
    except FileExistsError:
        logging.debug(f"Directory {output_folder} already exist")
    else:
        logging.debug(f"Creating directory: {output_folder}...")

    logging.debug(f"Extracting tables to {output_folder}...")

    # convert table to pandas df
    df_tables = {}
    for t in sql_table_names:
        df = pd.read_sql_query(f"select * from {t};", conn)
        outfile = output_folder / f"{t}.csv"
        df.to_csv(outfile, index=False)
        df_tables.update({t: df})
    return


if __name__ == "__main__":
    get_bigslice_query(sys.argv[1], sys.argv[2], sys.argv[3])
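A minimal usage sketch with placeholder names; `project_name` must match (as a substring) a run name recorded in `reports/reports.db` inside the BiG-SLiCE result folder:

```python
# Export the query tables of the latest BiG-SLiCE run for a project as CSV files.
get_bigslice_query(
    project_name="Lactobacillus_delbrueckii",
    output_folder="results/bigslice/query_tables",
    result_folder="resources/bigslice/full_run_result/",
)
```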
import json
import os
import sys

import pandas as pd


def get_checkm_data(checkm_input_stats, checkm_json_folder, checkm_output_stats):
    """
    Read checkm output stats.tsv file, return json output for individual genomes and output stats table

    Parameters
    ----------
    1. checkm_input_stats : str / path
        Location of the output from checkm software 'storage/bin_stats_ext.tsv'
    2. checkm_json_folder : str / path
        Location of the json folder to store output for each genome
    3. checkm_output_stats : str / path
        Location of the output csv file in the processed/tables directory

    Returns
    -------
    1. json : .json file
        A json file for each genome summarizing the checkm output
    2. checkm_output_stats : str / path
        Output csv file in the processed/tables directory with checkm information for all genomes
    """

    # List of columns in df_checkm_stats.csv
    checkm_columns = [
        "Completeness",
        "Contamination",
        "GC",
        "GC std",
        "Genome size",
        "# ambiguous bases",
        "# scaffolds",
        "# contigs",
        "Longest scaffold",
        "Longest contig",
        "N50 (scaffolds)",
        "N50 (contigs)",
        "Mean scaffold length",
        "Mean contig length",
        "Coding density",
        "Translation table",
        "# predicted genes",
    ]

    df_checkm_input = pd.read_csv(
        checkm_input_stats, sep="\t", header=None, index_col=0
    )

    df_checkm_out = pd.DataFrame(index=df_checkm_input.index, columns=checkm_columns)
    df_checkm_out.index.name = "genome_id"

    for genome_id in df_checkm_out.index:
        json_genome_id = json.loads(df_checkm_input.loc[genome_id, 1].replace("'", '"'))
        json_path = os.path.join(checkm_json_folder, genome_id + ".json")

        with open(json_path, "w") as outfile:
            json.dump(json_genome_id, outfile)
        for col in df_checkm_out.columns:
            df_checkm_out.loc[genome_id, col] = json_genome_id[col]

    df_checkm_out.to_csv(checkm_output_stats)

    return df_checkm_out


if __name__ == "__main__":
    get_checkm_data(sys.argv[1], sys.argv[2], sys.argv[3])
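A minimal usage sketch, assuming the function above is importable; all paths are placeholders following the docstring above:

```python
# Convert CheckM's bin_stats_ext.tsv into per-genome JSON files plus a merged table.
get_checkm_data(
    checkm_input_stats="results/checkm/storage/bin_stats_ext.tsv",
    checkm_json_folder="results/checkm/json",
    checkm_output_stats="tables/df_checkm_stats.csv",
)
```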
import json
import sys

import yaml

# list of the main dependencies used in the workflow
dependencies = {
    "antismash": r"workflow/envs/antismash.yaml",
    "prokka": r"workflow/envs/prokka.yaml",
    "mlst": r"workflow/envs/mlst.yaml",
    "eggnog-mapper": r"workflow/envs/eggnog.yaml",
    "roary": r"workflow/envs/roary.yaml",
    "refseq_masher": r"workflow/envs/refseq_masher.yaml",
    "seqfu": r"workflow/envs/seqfu.yaml",
}


def get_dependency_version(dep, dep_key):
    """
    return dependency version tags given a dictionary (dep) and its key (dep_key)
    """
    with open(dep[dep_key]) as file:
        result = []
        documents = yaml.full_load(file)
        for i in documents["dependencies"]:
            if type(i) == str:
                if i.startswith(dep_key):
                    result = i.split("=")[-1]
            elif type(i) == dict:
                assert list(i.keys()) == ["pip"], i.keys()
                for p in i["pip"]:
                    if dep_key in p:
                        if p.startswith("git+"):
                            result = p.split("@")[-1]
                            result = result.replace("-", ".")
                        else:
                            result = p.split("=")[-1]
    return str(result)


def write_dependecies_to_json(outfile, dep=dependencies):
    """
    write dependency version to a json file
    """
    with open(outfile, "w") as file:
        dv = {}
        for ky in dep.keys():
            vr = get_dependency_version(dep, ky)
            dv[ky] = vr
        json.dump(
            dv,
            file,
            indent=2,
        )
        file.close()
    return dv


if __name__ == "__main__":
    write_dependecies_to_json(sys.argv[1])
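A minimal usage sketch; the resulting JSON maps each tool listed in the `dependencies` dictionary above to the version string parsed from its conda environment file (the versions shown in the comment are purely illustrative):

```python
versions = write_dependecies_to_json("results/metadata/dependency_versions.json")
# e.g. {"antismash": "7.0.0", "prokka": "1.14.6", ...}  # actual values depend on the env files
```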
import json
import logging
import os
import sys

import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def extract_mibig_info(mibig_json_path, mibig_bgc_table):
    """
    Returns mibig BGC information critical for the known bgc table from JSON files

    Parameters
    ----------
    1. mibig_json_path : str / path
        Location of the resources folder with all MIBIG JSON files extracted

    Returns
    -------
    1. mibig_bgc_table: str / path
        Dataframe with MIBIG BGCs information
    2. mibig_compound_table: str / path
        Dataframe with compounds information found at MIBIG
    """
    logging.info("Reading MIBIG files...")
    if not os.path.isdir(mibig_json_path):
        raise FileNotFoundError(f"No such file or directory: {mibig_json_path}")

    df_mibig_bgcs = pd.DataFrame(columns=["biosyn_class", "compounds"])
    # df_mibig_compounds = pd.DataFrame(columns=[])

    mibig_list = [
        mibig_file.split(".")[0]
        for mibig_file in os.listdir(mibig_json_path)
        if ".json" in mibig_file
    ]
    logging.debug(f"MIBIG entry has {len(mibig_list)} BGCs")
    for mibig_id in mibig_list:
        mibig_id_file = os.path.join(mibig_json_path, mibig_id + ".json")
        with open(mibig_id_file, "r") as json_obj:
            mibig_data = json.load(json_obj)
            df_mibig_bgcs.loc[mibig_id, "biosyn_class"] = ";".join(
                mibig_data.get("cluster").get("biosyn_class")
            )

            compounds_list = mibig_data.get("cluster").get("compounds")
            df_mibig_bgcs.loc[mibig_id, "compounds"] = ";".join(
                [compound.get("compound") for compound in compounds_list]
            )

            chem_acts_list = []
            for compound in mibig_data.get("cluster").get("compounds"):
                if "chem_acts" in compound.keys():
                    chem_acts = compound.get("chem_acts")
                    for activity in chem_acts:
                        if activity not in chem_acts_list:
                            chem_acts_list.append(activity)

            # logging.debug(f"{chem_acts_list}")
            # handle MIBIG 2.0 --> MIBIG 3.1 format difference
            chem_acts_list_clean = []
            for item in chem_acts_list:
                if type(item) is dict:
                    action = list(item.values())
                    for a in action:
                        chem_acts_list_clean.append(a)
                elif type(item) is str:
                    chem_acts_list_clean.append(item)

            # logging.debug(f"{chem_acts_list_clean}")
            df_mibig_bgcs.loc[mibig_id, "chem_acts"] = ";".join(chem_acts_list_clean)

            if "accession" in mibig_data.get("cluster").get("loci").keys():
                df_mibig_bgcs.loc[mibig_id, "accession"] = (
                    mibig_data.get("cluster").get("loci").get("accession")
                )
            if "completeness" in mibig_data.get("cluster").get("loci").keys():
                df_mibig_bgcs.loc[mibig_id, "completeness"] = (
                    mibig_data.get("cluster").get("loci").get("completeness")
                )
            if "evidence" in mibig_data.get("cluster").get("loci").keys():
                df_mibig_bgcs.loc[mibig_id, "evidence"] = ";".join(
                    mibig_data.get("cluster").get("loci").get("evidence")
                )
            if "organism_name" in mibig_data.get("cluster").keys():
                df_mibig_bgcs.loc[mibig_id, "organism_name"] = mibig_data.get(
                    "cluster"
                ).get("organism_name")
            if "ncbi_tax_id" in mibig_data.get("cluster").keys():
                df_mibig_bgcs.loc[mibig_id, "ncbi_tax_id"] = mibig_data.get(
                    "cluster"
                ).get("ncbi_tax_id")
            if "publications" in mibig_data.get("cluster").keys():
                df_mibig_bgcs.loc[mibig_id, "publications"] = ";".join(
                    mibig_data.get("cluster").get("publications")
                )

    df_mibig_bgcs.sort_index(inplace=True)
    df_mibig_bgcs.index.name = "mibig_id"

    df_mibig_bgcs.to_csv(mibig_bgc_table)

    return df_mibig_bgcs


if __name__ == "__main__":
    extract_mibig_info(sys.argv[1], sys.argv[2])
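A minimal usage sketch; the input directory is a placeholder that should contain the extracted MIBiG JSON entries (`BGC*.json`):

```python
# Summarize MIBiG entries into a single BGC metadata table.
df_mibig = extract_mibig_info("resources/mibig/json", "tables/df_mibig_bgcs.csv")
```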
import os
import sys

import pandas as pd


def extract_org_info(genome_id, samples_path, assembly_report_path, prokka_dir):
    """
    Returns organism_info.txt with genus, species, and strain information required for prokka run inputs.
    This function returns values from provided sample.csv or extracted NCBI assembly reports in json format
    generated by get_assembly_information.py for "ncbi" source type.

    Parameters
    ----------
    1. genome_id : str
        Genome accession id
    2. samples_path : str / path (space-separated for multiple paths)
        Location of the csv file(s) containing sample information as defined in samples.schema.yaml.
        For multiple sample csv files, quote the paths ("") and separate them with spaces.
    3. assembly_report_path : str / path
        Location of the assembly information json file generated by get_assembly_information.py
    4. prokka_dir : str / path
        Location of the prokka directory from the given NCBI assembly id

    Returns
    -------
    1. {prokka_dir}/organism_info.txt
        A one lined text file with genus, species, and strain information of a given genome_id
    """
    # wrap single or multiple inputs & generate dataframe
    shell_input = samples_path.split()
    dfList = [pd.read_csv(s).set_index("genome_id", drop=False) for s in shell_input]
    df_samples = pd.concat(dfList, axis=0).fillna("")

    if df_samples.loc[genome_id, "source"] == "ncbi":
        # assembly_file = os.path.join(assembly_report_path, f"{genome_id}.txt")
        extract_ncbi_org_info(prokka_dir, genome_id, assembly_report_path)
    else:
        extract_samples_org_info(prokka_dir, genome_id, df_samples)
    return None


def extract_ncbi_org_info(prokka_dir, genome_id, assembly_report_path):
    """
    Returns organism_info.txt with genus, species, and strain information required for prokka run inputs.
    This function returns values from extracted NCBI assembly reports in json format generated by get_assembly_information.py.

    Parameters
    ----------
    1. prokka_dir : str / path
        Location of the prokka directory from the given NCBI assembly id
    2. genome_id : str
        NCBI assembly id (GCA... / GCF...)
    3. assembly_report_path : str / path
        Location of the assembly information json file generated by get_assembly_information.py

    Returns
    -------
    1. {prokka_dir}/organism_info.txt
        A one lined text file with genus, species, and strain information of a given NCBI assembly id
    """
    assembly_report_json = os.path.join(assembly_report_path, f"{genome_id}.json")
    df_ncbi_meta = pd.read_json(assembly_report_json).T
    df_ncbi_meta = df_ncbi_meta.fillna("")

    for idx in df_ncbi_meta.index:
        GENUS = str(df_ncbi_meta.loc[idx, "genus"])
        SPECIES = str(df_ncbi_meta.loc[idx, "species"])
        STRAIN_ID = str(df_ncbi_meta.loc[idx, "strain"])

        if not os.path.isdir(os.path.join(prokka_dir, idx)):
            os.mkdir(os.path.join(prokka_dir, idx))
        org_info_path = os.path.join(prokka_dir, idx, "organism_info.txt")
        with open(org_info_path, "w") as file_obj:
            file_obj.write(",".join([GENUS, SPECIES, STRAIN_ID]))

    return None


def extract_samples_org_info(prokka_dir, genome_id, df_samples):
    """
    Returns organism_info.txt with genus, species, and strain information required for prokka run inputs.
    This function returns values from provided sample.csv excluding "ncbi" source type.

    Parameters
    ----------
    1. prokka_dir : str / path
        Location of the prokka directory from the given NCBI assembly id
    2. genome_id : str
        Genome accession id
    3. df_sample : pd.DataFrame object
        A dataframe generated by the csv file containing sample information as defined in samples.schema.yaml

    Returns
    -------
    1. {prokka_dir}/organism_info.txt
        A one lined text file with genus, species, and strain information of a given genome_id
    """

    GENUS = str(df_samples.loc[genome_id, "genus"])
    SPECIES = str(df_samples.loc[genome_id, "species"])
    STRAIN_ID = str(df_samples.loc[genome_id, "strain"])

    if not os.path.isdir(os.path.join(prokka_dir, genome_id)):
        os.mkdir(os.path.join(prokka_dir, genome_id))
    org_info_path = os.path.join(prokka_dir, genome_id, "organism_info.txt")
    with open(org_info_path, "w") as file_obj:
        file_obj.write(",".join([GENUS, SPECIES, STRAIN_ID]))

    return None


if __name__ == "__main__":
    extract_org_info(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
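A minimal usage sketch, assuming the functions above are importable; all paths are placeholders. The resulting `organism_info.txt` is a single comma-separated line of the form `Genus,species,strain`:

```python
extract_org_info(
    genome_id="GCF_000001",
    samples_path="config/samples.csv",
    assembly_report_path="results/ncbi_meta",
    prokka_dir="results/prokka",
)
# writes results/prokka/GCF_000001/organism_info.txt, e.g. "Lactobacillus,delbrueckii,strain_xyz"
```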
import json
import logging
import sys
from pathlib import Path

import peppy
import yaml

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def refine_bgcflow_project(p_bgcflow, p):
    """
    Refine a pep bgcflow project created from sample table.
    Exist for back compatibility with bgcflow=<0.3.3

    Arguments:
        p_bgcflow:
        p:

    Returns:
        p_bgcflow:
    """
    for k in p.keys():
        if k == "samples":
            p_bgcflow.config["sample_table"] = p[k]
            p_bgcflow["_config_file"] = str(Path(p[k]).parent / "project_config.yaml")
        elif k in p_bgcflow.keys():
            p_bgcflow[k] = p[k]
        elif k == "rules":
            with open(p["rules"], "r") as file:
                rules = yaml.safe_load(file)
                p_bgcflow.config[k] = rules["rules"]
        else:
            p_bgcflow.config[k] = Path(p[k]).name

    return p_bgcflow


def get_bgcflow_metadata(bgcflow_path="."):
    # BGCFlow config
    BGCFlow_path = Path(bgcflow_path).resolve()

    logging.info("Getting config metadata...")
    config_path = BGCFlow_path / "config/config.yaml"
    with open(config_path, "r") as f:
        config = yaml.safe_load(f)

    logging.info("Getting rules information...")
    # rules_dict
    rules_dict_path = BGCFlow_path / "workflow/rules.yaml"
    with open(rules_dict_path, "r") as f:
        rules_dict = yaml.safe_load(f)

    return BGCFlow_path, config, rules_dict


def get_all_metadata(config, rules_dict, BGCFlow_path, bgcflow_version):
    # Get Metadata
    logging.info("Getting metadata from projects...")
    project_metadata = {}

    for p in config["projects"]:
        # mask pep and name as alias
        if "pep" in p.keys():
            p["name"] = p.pop("pep")

        # process only pep projects
        if p["name"].endswith(".yaml"):
            project = peppy.Project(
                str(BGCFlow_path / p["name"]), sample_table_index="genome_id"
            )
            if "description" in project.config.keys():
                name = project["name"]
                description = project.config["description"]
                project_metadata[name] = {"description": description}
        else:
            # non peppy input
            name = p["name"]
            project = peppy.Project(
                str(BGCFlow_path / p["samples"]), sample_table_index="genome_id"
            )
            p = refine_bgcflow_project(project, p)
            project_metadata[name] = {"description": "No description provided."}

        # get what rules are being used
        rule_used = {}
        if "rules" in project.config.keys():
            rules = project.config["rules"]
        else:
            rules = config["rules"]
        for r in rules.keys():
            if rules[r]:
                if r in rules_dict.keys():
                    bgcflow_rules = rules_dict[r]
                    rule_used[r] = bgcflow_rules
        project_metadata[name].update({"rule_used": rule_used})

        # get sample size
        project_metadata[name]["sample_size"] = len(project.sample_table)

        # get citations
        citation_all = []
        for r in rule_used:
            citations = rules_dict[r]["references"]
            citation_all.extend(citations)
        citation_all.sort()
        project_metadata[name].update({"references": citation_all})
        project_metadata[name]["references"] = list(
            set(project_metadata[name]["references"])
        )

        # get bgcflow_version
        project_metadata[name]["bgcflow_version"] = bgcflow_version
    return project_metadata


def get_project_metadata(
    project_name, outfile, bgcflow_path=".", bgcflow_version="unknown"
):
    BGCFlow_path, config, rules_dict = get_bgcflow_metadata(bgcflow_path)
    all_metadata = get_all_metadata(config, rules_dict, BGCFlow_path, bgcflow_version)
    logging.info(f"Extracting project {project_name} metadata to {outfile}")
    with open(outfile, "w") as f:
        json.dump({project_name: all_metadata[project_name]}, f, indent=2)
    return


if __name__ == "__main__":
    get_project_metadata(sys.argv[1], sys.argv[2], bgcflow_version=sys.argv[3])
import json
import logging
import os
import sys

import pandas as pd
import requests

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def gtdb_prep(
    genome_id, outfile, samples_table, tax_path, release="R207"
):  # what happens if nothing is found?
    """
    Given a genome id and the samples Pandas dataframe, write a JSON file containing taxonomic information from the GTDB API.
    The script first searches using the closest taxonomic placement (NCBI accession id), then falls back to the genus information provided by the user.
    If no information is available, an empty taxonomic placement is returned.
    """

    class EmptyTaxError(Exception):
        """Raised when this script returns empty dict"""

        pass

    class PlacementError(Exception):
        """Raised when this script returns empty dict"""

        pass

    def empty_result(genome_id):
        """
        Helper script for creating empty result
        """
        logging.info(
            f"No taxonomic information found, returning empty values for {genome_id}."
        )
        gtdb_tax = {}
        gtdb_tax["genome_id"] = genome_id
        gtdb_tax["gtdb_taxonomy"] = {
            "domain": "d__",
            "phylum": "p__",
            "class": "c__",
            "order": "o__",
            "family": "f__",
            "genus": "g__",
            "species": "s__",
        }
        return gtdb_tax

    def find_taxonomy(query, genome_id, gtdb_tax):
        """
        Helper script to decide taxonomic placement for a given query
        """
        # If closest placement reference is provided, try finding taxonomic information from GTDB API
        if query.closest_placement_reference.values[0] != "":
            try:
                logging.info(
                    "Inferring taxonomic placement from provided closest reference...."
                )
                gtdb_tax = get_ncbi_taxon_GTDB(
                    query.closest_placement_reference.values[0], release
                )
                gtdb_tax["genome_id"] = genome_id
            except KeyError as e:
                raise PlacementError(
                    f"Cannot infer taxonomic placement from provided closest reference. Make sure the accession id: {query.closest_placement_reference.values[0]} is part of GTDB release: {e}"
                )

        # If NCBI accession is provided, try to find taxonomic information from GTDB API
        elif query.source.values[0] == "ncbi":
            try:
                logging.info("Inferring taxonomic placement from NCBI accession....")
                gtdb_tax = get_ncbi_taxon_GTDB(query.genome_id.values[0], release)
            except KeyError:
                if query.genus.values[0] != "":
                    logging.info(
                        "Inferring taxonomic placement from provided genus information...."
                    )
                    gtdb_tax["genome_id"] = genome_id
                    gtdb_tax.update(
                        get_parent_taxon_GTDB(query.genus.values[0], "genus", release)
                    )
                    gtdb_tax["gtdb_taxonomy"][
                        "species"
                    ] = f"s__{gtdb_tax['gtdb_taxonomy']['genus'].split('__')[-1]} sp."
                else:
                    gtdb_tax = empty_result(genome_id)

        # Try to get taxonomic information from genus information
        elif query.genus.values[0] != "":
            logging.info(
                "Inferring taxonomic placement from provided genus information...."
            )
            gtdb_tax["genome_id"] = genome_id
            gtdb_tax.update(
                get_parent_taxon_GTDB(query.genus.values[0], "genus", release)
            )
            try:
                gtdb_tax["gtdb_taxonomy"][
                    "species"
                ] = f"s__{gtdb_tax['gtdb_taxonomy']['genus'].split('__')[-1]} sp."
            except KeyError:
                gtdb_tax = empty_result(genome_id)

        # If no information is found, return an empty dict
        else:
            gtdb_tax = empty_result(genome_id)

        return gtdb_tax

    # get query by subsetting samples df with genome id
    shell_input = samples_table.split()
    dfList = [pd.read_csv(s).set_index("genome_id", drop=False) for s in shell_input]
    df_samples = pd.concat(dfList, axis=0)
    query = df_samples[df_samples.loc[:, "genome_id"] == genome_id].fillna("")

    # create empty container
    gtdb_tax = {}

    # Starting process
    logging.info(f"Fetching taxonomic information for {genome_id}...")
    # Go through user provided taxonomic placement
    if any(os.path.isfile(t) for t in tax_path.split()):
        try:
            gtdb_tax = get_user_defined_classification(genome_id, tax_path)
        except KeyError:
            logging.warning(
                f"{genome_id}: Not found in user provided taxonomic placement..."
            )
            gtdb_tax = find_taxonomy(query, genome_id, gtdb_tax)
    else:
        gtdb_tax = find_taxonomy(query, genome_id, gtdb_tax)

    if gtdb_tax == {}:
        raise EmptyTaxError(
            "Oops, this shouldn't happen. It returns an empty dict. Something is wrong with the script."
        )

    logging.info(f"Writing results to {outfile}...")
    with open(outfile, "w") as file:
        json.dump(gtdb_tax, file, indent=2)

    logging.info("Job finished!")
    return


def get_user_defined_classification(genome_id, tax_path):
    """
    Get taxonomic information from user provided GTDB-like output
    """
    shell_input = tax_path.split()
    dfList = [
        pd.read_csv(s, sep="\t").set_index("user_genome", drop=False)
        for s in shell_input
    ]
    df_tax_raw = pd.concat(dfList, axis=0)

    # drop duplicates! causes error
    logging.debug(f"Checking user provided taxonomy table from: {shell_input}")
    logging.info(
        "Checking if user provided taxonomy table contains duplicated genome ids.."
    )
    df_tax_dup = df_tax_raw["user_genome"].duplicated()
    if df_tax_dup.any():
        logging.warning(
            f"Found duplicated genome ids: {list(df_tax_raw[df_tax_dup]['user_genome'].unique())}"
        )
        duplicates_mask = df_tax_raw.duplicated(keep="first")
        df_tax = df_tax_raw[~duplicates_mask].copy()
        logging.debug("Making sure duplicated genome ids values are identical...")
        assert (
            not df_tax.duplicated().any()
        ), "Two or more genome ids have more than one taxonomic placement! Please check your taxonomy files!"
    else:
        df_tax = df_tax_raw.copy()

    level_dict = {
        "d": "domain",
        "p": "phylum",
        "c": "class",
        "o": "order",
        "f": "family",
        "g": "genus",
        "s": "species",
    }

    query = df_tax.loc[genome_id, "classification"].split(";")

    result = {}
    result["genome_id"] = genome_id
    result["gtdb_url"] = "user provided classification"
    result["gtdb_release"] = "unknown"
    result["gtdb_taxonomy"] = {level_dict[q.split("__")[0]]: q for q in query}
    logging.info("Using user provided GTDB classification.")
    return result


def get_ncbi_taxon_GTDB(accession, release="R207"):
    """
    Given an NCBI accession, return a json object of taxonomic information from GTDB API
    """

    def gtdb_api_request(accession, api_type):
        if api_type == "taxonomy":
            api_url = (
                f"https://api.gtdb.ecogenomic.org/genome/{accession}/taxon-history"
            )
        elif api_type == "summary":
            api_url = f"https://api.gtdb.ecogenomic.org/genome/{accession}/card"
        response = requests.get(api_url)

        try:
            js = response.json()
        except json.JSONDecodeError:
            logging.critical(
                f"Cannot decode response from GTDB API. Make sure this is a valid url: {api_url}"
            )
            raise

        return js, api_url

    # Mapping to bgcflow format
    level_dict = {
        "d": "domain",
        "p": "phylum",
        "c": "class",
        "o": "order",
        "f": "family",
        "g": "genus",
        "s": "species",
    }

    # get taxonomic information
    js_tax, api_url_tax = gtdb_api_request(accession, "taxonomy")
    result = {}
    result["genome_id"] = accession
    result["gtdb_url"] = api_url_tax
    result["gtdb_release"] = release
    card_detail = "Genome found"
    try:
        if js_tax == []:
            logging.warning(
                f"Genome id: {accession} is in GTDB but has no taxonomic placement. Returning empty values."
            )
            result["gtdb_taxonomy"] = {
                level_dict[k]: f"{k}__" for k in level_dict.keys()
            }
            card_detail = (
                f"Genome not found - no taxonomic assignment in release {release}"
            )
        else:
            logging.info(js_tax)
            tax = [tax for tax in js_tax if tax["release"] == release]
            if len(tax) == 1:
                result["gtdb_taxonomy"] = tax[0]
            elif len(tax) == 0:
                logging.warning(
                    f"Genome id: {accession} is in GTDB but has no taxonomic placement in release {release}. Returning empty values."
                )
                result["gtdb_taxonomy"] = {
                    level_dict[k]: f"{k}__" for k in level_dict.keys()
                }
                card_detail = (
                    f"Genome not found - no taxonomic assignment in release {release}"
                )
            else:
                raise ValueError(
                    f"Multiple taxonomic placements found for {accession} in GTDB release {release}"
                )
            result["gtdb_taxonomy"].pop("release", None)
            result["gtdb_taxonomy"] = {
                level_dict[k]: result["gtdb_taxonomy"][k]
                for k in result["gtdb_taxonomy"].keys()
            }
    except KeyError as err:
        if err.args[0] == "gtdb_taxonomy":
            logging.critical(
                f"Malformed genome id: {accession}. Make sure to use the right NCBI genome accession format."
            )
            raise
        elif err.args[0] == release:
            logging.critical(f"Cannot find genome id: {accession} in GTDB API.")
            raise

    # get other metadata from genome summary
    js_sum, api_url_sum = gtdb_api_request(accession, "summary")
    result["metadata_url"] = api_url_sum
    result["metadata"] = js_sum

    if "detail" in result["metadata"].keys():
        pass
    else:
        result["metadata"]["detail"] = card_detail

    return result


def get_parent_taxon_GTDB(taxon, level, release="R207"):
    """
    Given a taxon and its level, return a json object of parent taxons from GTDB API
    """
    level_dict = {
        "domain": "d",
        "phylum": "p",
        "class": "c",
        "order": "o",
        "family": "f",
        "genus": "g",
        "species": "s",
    }

    try:
        query = f"{level_dict[level]}__{taxon}"
    except KeyError:
        logging.critical(
            f"Incorrect taxon level format. Please choose from available format: {list(level_dict.keys())}."
        )
        raise

    api_url = f"https://api.gtdb.ecogenomic.org/taxonomy/partial/{query}/all-releases"
    response = requests.get(api_url)
    js = response.json()
    result = {}
    result["gtdb_url"] = api_url
    result["gtdb_release"] = release
    result["gtdb_taxonomy"] = {}

    level_dict_rev = {
        "d": "domain",
        "p": "phylum",
        "c": "class",
        "o": "order",
        "f": "family",
        "g": "genus",
        "s": "species",
    }

    try:
        for item in js:
            if release in item["release"]:
                t = item["taxonomy"]
                taxonomy = {level_dict_rev[k]: t[k] for k in t.keys()}

                result["gtdb_taxonomy"] = taxonomy
    except TypeError:
        logging.critical(f"{js['message']}")
        raise

    return result


if __name__ == "__main__":
    gtdb_prep(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5])
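A minimal usage sketch with placeholder paths, assuming the functions above are importable; the resulting JSON follows the shape built above (`genome_id`, `gtdb_url`, `gtdb_release`, and a `gtdb_taxonomy` dictionary keyed by rank):

```python
gtdb_prep(
    genome_id="GCF_000001",
    outfile="results/gtdb/GCF_000001.json",
    samples_table="config/samples.csv",
    tax_path="config/gtdbtk.bac120.summary.tsv",
    release="R207",
)
```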
import json
import logging
import sys
from pathlib import Path

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def assess_gtdb_json_file(item):
    """
    Given a json file, assess whether the accession can be found via GTDB-API or not.
    Return a genome_id when accession cannot be found.
    """
    logging.info(f"Assessing {item}")
    with open(item, "r") as json_file:
        data = json.load(json_file)
        genome_id = data["genome_id"]
        try:
            gtdb_release = data["gtdb_release"]
            metadata = data["metadata"]
            if "Genome not found" in metadata["detail"]:
                logging.debug(
                    f" - {genome_id} : {metadata['detail']} in GTDB-API release {gtdb_release}"
                )
                return genome_id
            elif type(metadata["genome"]["accession"]) == str:
                logging.debug(
                    f" - {genome_id} can be found via GTDB-API release {gtdb_release}"
                )
                return None

        except KeyError:
            logging.debug(f" - {genome_id} does not have metadata")
            return genome_id


def generate_symlink_gtdbtk(input_fna, gtdb_json, outdir):
    """
    Given a gtdb_json file and an input_fna file, generate a symlink to a desired location
    if genome_id cannot be found via GTDB API
    """
    input_fna = Path(input_fna).resolve()
    gtdb_json = Path(gtdb_json).resolve()
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    assert input_fna.is_file() and gtdb_json.is_file()

    genome_id = assess_gtdb_json_file(gtdb_json)
    if genome_id is not None:
        outfile = outdir / f"{genome_id}.fna"
        logging.info(f"Generating input files for GTDB-tk: {outfile}")
        outfile.symlink_to(input_fna)
    return None


def input_handling(input_list, category, suffix=".json"):
    input_list = Path(input_list)
    if input_list.is_file() and input_list.suffix == suffix:
        logging.info(f"Getting {category} from a single file: {input_list}")
        input_list_files = input_list

    elif input_list.is_file() and input_list.suffix == ".txt":
        logging.info(f"Getting {category} from a text file: {input_list}")
        with open(input_list, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                logging.info(
                    " - Detecting space-separated input in a single line format."
                )
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                logging.info(" - Detecting input in a multi-line format.")
                paths = file_content
            input_list_files = [
                Path(path) for path in paths if Path(path).suffix == suffix
            ]
    else:
        input_list_files = [
            Path(file)
            for file in str(input_list).split()
            if Path(file).suffix == suffix
        ]
        logging.info(
            f"Getting {category} from the given list of {len(input_list_files)} files..."
        )
    return input_list_files


def gtdbtk_prep(fna_list, json_list, outdir):
    """
    Given a list of gtdb_json files and a list of fna files, generate symlinks in the desired location
    when the genome_id cannot be found via the GTDB API
    """
    shell_json_input = input_handling(json_list, "taxonomy json")
    shell_fna_input = input_handling(fna_list, "fna files", suffix=".fna")

    for gtdb_json in shell_json_input:
        gid = Path(gtdb_json).stem
        input_fna = [fna for fna in shell_fna_input if gid in Path(fna).stem]
        logging.info(
            f"Found {gid} in {[str(i) for i in input_fna]}. Generating symlink..."
        )
        generate_symlink_gtdbtk(str(input_fna[0]), str(gtdb_json), str(outdir))
    return


if __name__ == "__main__":
    gtdbtk_prep(sys.argv[1], sys.argv[2], sys.argv[3])
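A hedged usage sketch of `gtdbtk_prep`; the paths below are hypothetical placeholders, since the workflow normally supplies them through `sys.argv`:

# Hypothetical paths; genomes whose taxonomy JSON reports "Genome not found"
# get a <genome_id>.fna symlink in the output directory for GTDB-Tk to classify.
gtdbtk_prep(
    fna_list="data/interim/fasta/genome_A.fna data/interim/fasta/genome_B.fna",
    json_list="data/interim/gtdb/genome_A.json data/interim/gtdb/genome_B.json",
    outdir="data/interim/gtdbtk/fasta_input",
)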
import logging
import os
import sys
from collections import OrderedDict
from datetime import datetime
from pathlib import Path

import networkx as nx
import pandas as pd
from alive_progress import alive_bar

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def get_cluster_dataframes(
    df_genomes, df_nodes, mibig_bgc_table, as_dir="../data/antismash/"
):
    """
    Returns two dataframes of clusters with information from genomes and MIBIG
    """

    df_clusters = pd.DataFrame(
        columns=["product", "bigscape_class", "genome_id", "accn_id"]
    )
    df_known = pd.DataFrame(
        columns=[
            "product",
            "bigscape_class",
            "biosyn_class",
            "compounds",
            "chem_acts",
            "accession",
            "completeness",
            "organism_name",
            "ncbi_tax_id",
            "publications",
            "evidence",
        ]
    )
    df_mibig_bgcs = pd.read_csv(mibig_bgc_table, index_col="mibig_id")

    # Generate bgcs dataframe with metadata from df_nodes and df_genomes
    logging.info(
        "Generating bgcs dataframe with metadata from df_nodes and df_genomes..."
    )
    with alive_bar(len(df_genomes.index)) as bar:
        for genome_id in df_genomes.index:
            logging.info(f"Processing BGCs in the genome: {genome_id}")
            if genome_id in os.listdir(as_dir):
                genome_dir = os.path.join(as_dir, genome_id)
                bgc_id_list = [
                    region[:-4]
                    for region in os.listdir(genome_dir)
                    if "region0" in region
                ]
                for bgc_id in bgc_id_list:
                    logging.debug(f"Processing {bgc_id}")
                    if bgc_id in df_nodes.index:
                        df_clusters.loc[bgc_id, "genome_id"] = genome_id
                        df_clusters.loc[bgc_id, "product"] = df_nodes.loc[
                            bgc_id, "Product Prediction"
                        ]
                        df_clusters.loc[bgc_id, "bigscape_class"] = df_nodes.loc[
                            bgc_id, "BiG-SCAPE class"
                        ]
                        df_clusters.loc[bgc_id, "accn_id"] = df_nodes.loc[
                            bgc_id, "Accession ID"
                        ]
                    else:
                        logging.debug(f"{bgc_id} not in df_nodes")
            else:
                logging.warning(f"{genome_id} not in directory!")
            bar()

    # Generate separate table for known BGCs from MIBIG
    logging.info("Generating separate table for known BGCs from MIBIG")
    with alive_bar(len(df_nodes.index)) as bar:
        for bgc_id in df_nodes.index:
            if "BGC0" in bgc_id:
                df_known.loc[bgc_id, "product"] = df_nodes.loc[
                    bgc_id, "Product Prediction"
                ]
                df_known.loc[bgc_id, "bigscape_class"] = df_nodes.loc[
                    bgc_id, "BiG-SCAPE class"
                ]
                if "." in bgc_id:
                    mibig_id = bgc_id.split(".")[0]
                else:
                    mibig_id = bgc_id

                if mibig_id in df_mibig_bgcs.index:
                    for col in df_mibig_bgcs.columns:
                        df_known.loc[bgc_id, col] = df_mibig_bgcs.loc[mibig_id, col]
                else:
                    logging.debug(f"{bgc_id} not in df_mibig_bgcs dataframe index")
            bar()

    return df_known, df_clusters


def add_bigscape_families(df_clusters, df_known, net_data_path):
    """
    Adds GCC and GCF numbers detected by BiG-SCAPE clustering for different cut-offs
    """

    cluster_class_set = [
        cluster_class
        for cluster_class in os.listdir(net_data_path)
        if ".tsv" not in cluster_class
    ]

    for cluster_class in cluster_class_set:
        logging.info(f"Processing all BGCs from {cluster_class}")
        class_dir = os.path.join(net_data_path, cluster_class)
        gcf_cutoffs_files = [
            file for file in os.listdir(class_dir) if "_clustering_" in file
        ]
        for gcf_file in gcf_cutoffs_files:
            cutoff = gcf_file[-8:-4]
            gcf_path = os.path.join(class_dir, gcf_file)
            df_clusters = read_gcf_df(df_clusters, gcf_path, cutoff)
            df_known = read_gcf_df(df_known, gcf_path, cutoff)

        clan_files = [file for file in os.listdir(class_dir) if "_clans_" in file]
        if len(clan_files) == 1:
            clan_select = clan_files[0]
            clan_path = os.path.join(class_dir, clan_select)

            df_clusters = read_gcc_df(df_clusters, clan_path)
            df_known = read_gcc_df(df_known, clan_path)

    return df_clusters, df_known


def read_gcf_df(df_clusters, gcf_path, cutoff):
    """
    Adds GCF (Gene Cluster Family) number for each BGC
    """

    df_gcf = pd.read_csv(gcf_path, sep="\t", index_col="#BGC Name", dtype=str)
    col_name = "gcf_" + cutoff

    for bgc in df_clusters.index:
        if bgc in df_gcf.index:
            df_clusters.loc[bgc, col_name] = df_gcf.loc[bgc, "Family Number"]

    return df_clusters


def read_gcc_df(df_clusters, clan_path):
    """
    Adds GCC (Gene Cluster Clan) number for each BGC
    """

    df_gcc = pd.read_csv(clan_path, sep="\t", index_col="#BGC Name", dtype=str)
    col_name = "Clan Number"

    for bgc in df_clusters.index:
        if bgc in df_gcc.index:
            df_clusters.loc[bgc, col_name] = df_gcc.loc[bgc, "Clan Number"]

    return df_clusters


def get_bigscape_network(net_data_path, cutoff="0.30"):
    """
    Reads the similarity network for a particular cutoff into a dataframe
    """

    cluster_class_set = [
        cluster_class
        for cluster_class in os.listdir(net_data_path)
        if ".tsv" not in cluster_class
    ]

    df_network = pd.DataFrame()
    for cluster_class in cluster_class_set:
        class_dir = os.path.join(net_data_path, cluster_class)
        net_file = cluster_class + "_c" + cutoff + ".network"
        net_path = os.path.join(class_dir, net_file)
        if os.path.isfile(net_path):
            df_class_net = pd.read_csv(net_path, sep="\t")
            df_network = pd.concat([df_network, df_class_net], ignore_index=True)

    return df_network


def get_network_graph(df_network, weight="Raw distance"):
    """
    Returns the networkX graph of BGC similarities and its connected components (families)
    """

    G_clusters = nx.from_pandas_edgelist(
        df_network, "Clustername 1", "Clustername 2", weight
    )
    G_families = nx.connected_components(G_clusters)
    family_nodes = [c for c in sorted(G_families, key=len, reverse=True)]

    return G_clusters, family_nodes


def remove_single_mibig(df_network, df_known, family_nodes):
    """
    Removes MIBIG-only families (known BGCs with no similarity to any query BGC) from the network
    """
    logging.info("Removing MIBIG-only BGC families from the network")
    nodes_to_remove = []

    for fam in family_nodes:
        single_mibig = True

        for node in fam:
            if "BGC" not in node:
                single_mibig = False

        if single_mibig:
            for node in fam:
                if node not in nodes_to_remove:
                    nodes_to_remove.append(node)
    logging.debug(
        f"{len(nodes_to_remove)} MIBIG nodes will be removed from the analysis because they show no similarity to the dataset"
    )
    df_network = df_network[~df_network["Clustername 1"].isin(nodes_to_remove)]
    df_network = df_network[~df_network["Clustername 2"].isin(nodes_to_remove)]
    for node in nodes_to_remove:
        if node in df_known.index:
            df_known = df_known.drop(node)

    return df_network, df_known


def get_family_graph(G_clusters):
    """
    Returns the list of families (connected components) ordered by decreasing size
    """

    # Find connected components or cluster families
    Families_list = list(nx.connected_components(G_clusters))
    # Sort families in decreasing order of occurrence
    family_size = [len(family) for family in Families_list]
    orig_index = list(range(len(family_size)))
    orig_index_fam_size_dict = dict(zip(orig_index, family_size))

    sorted_index_fam_size_dict = OrderedDict(
        sorted(orig_index_fam_size_dict.items(), key=lambda x: x[1])
    )
    new_index = list(range(len(family_size) - 1, -1, -1))
    orig_index_sorted = list(sorted_index_fam_size_dict.keys())
    new_orig_index_dict = dict(zip(new_index, orig_index_sorted))

    # Ordered family graphs
    family_graphs = [
        Families_list[new_orig_index_dict[fam_id]]
        for fam_id in range(len(Families_list))
    ]

    return family_graphs


def update_cluster_family(
    df_clusters, df_known, family_nodes, mibig_bgc_table, cutoff="0.30"
):
    """
    Updates df_clusters with family ids (connected components)
    """
    df_mibig_bgcs = pd.read_csv(mibig_bgc_table, index_col="mibig_id")

    df_families = pd.DataFrame(
        columns=["fam_type", "fam_name", "clusters_in_fam", "mibig_ids"]
    )
    for cntr in range(len(family_nodes)):
        fam_id = cntr + 1
        family = family_nodes[cntr]
        known_bgcs = [bgc for bgc in family if bgc in df_mibig_bgcs.index]  #

        if len(known_bgcs) > 0:
            df_families.loc[fam_id, "fam_type"] = "known_family"
            logging.debug(
                f'Family {fam_id} has {len(known_bgcs)} known BGCs: {", ".join(known_bgcs)}'
            )

            # handle missing entry from MIBIG release
            try:
                known_compounds = ";".join(
                    df_known.loc[known_bgcs, "compounds"].tolist()
                )
            except TypeError:
                logging.warning(
                    f"MIBIG entries {known_bgcs} returned no compounds: {df_known.loc[known_bgcs, 'compounds'].values}"
                )
                logging.warning(
                    "This is likely because of missing JSON entry in the downloaded MIBIG release."
                )
                logging.warning(
                    "Check if this is the case in the resources/mibig/df_mibig_bgcs.csv file"
                )
                logging.warning("Replacing compound value with link to MIBIG...")
                for h in known_bgcs:
                    link_mibig = f"https://mibig.secondarymetabolites.org/repository/{h.split('.')[0]}/index.html"
                    logging.warning(
                        f"{h}: Replacing compound {df_known.loc[h, 'compounds']} with {link_mibig}"
                    )
                    df_known.loc[h, "compounds"] = link_mibig

            df_families.loc[fam_id, "fam_name"] = known_compounds
            df_families.loc[fam_id, "clusters_in_fam"] = len(family)
            df_families.loc[fam_id, "mibig_ids"] = ";".join(known_bgcs)
        else:
            df_families.loc[fam_id, "fam_type"] = "unknown_family"
            bgc_class = ",".join(
                df_clusters.loc[list(family), "bigscape_class"].unique().tolist()
            )
            df_families.loc[fam_id, "fam_name"] = "u_" + bgc_class + "_" + str(fam_id)
            df_families.loc[fam_id, "clusters_in_fam"] = len(family)

        for bgc in family:
            if bgc in df_clusters.index:
                df_clusters.loc[bgc, "fam_id_" + cutoff] = str(fam_id)
                if len(known_bgcs) > 0:
                    df_clusters.loc[bgc, "fam_type_" + cutoff] = "known_family"
                    known_compounds = ";".join(
                        df_known.loc[known_bgcs, "compounds"].tolist()
                    )
                    df_clusters.loc[
                        bgc, "fam_known_compounds_" + cutoff
                    ] = known_compounds
                else:
                    df_clusters.loc[bgc, "fam_type_" + cutoff] = "unknown_family"
                    df_clusters.loc[bgc, "fam_known_compounds_" + cutoff] = (
                        "u_" + bgc_class + "_" + str(fam_id)
                    )

            elif bgc in df_known.index:
                df_known.loc[bgc, "fam_id_" + cutoff] = str(fam_id)
                df_known.loc[bgc, "fam_type_" + cutoff] = "known_family"
                known_compounds = ";".join(
                    df_known.loc[known_bgcs, "compounds"].tolist()
                )
                df_known.loc[bgc, "fam_known_compounds_" + cutoff] = known_compounds

    return df_clusters, df_known, df_families


def get_family_presence(df_clusters, df_genomes, df_families, cutoff):
    """
    Returns BGC family presence absence matrix across genomes
    """

    df_family_presence = pd.DataFrame(
        0, index=df_genomes.index, columns=df_families.index.astype(str)
    )
    for bgc_id in df_clusters.index:
        fam_id = str(df_clusters.loc[bgc_id, "fam_id_" + cutoff])
        genome_id = df_clusters.loc[bgc_id, "genome_id"]
        df_family_presence.loc[genome_id, fam_id] = 1

    return df_family_presence


def run_family_analysis(
    cutoff,
    net_data_path,
    df_clusters,
    df_genomes,
    df_known_all,
    output_dir,
    mibig_table,
    query_name,
):
    logging.info(f"Processing data from BiG-SCAPE with cutoff {cutoff}")
    df_network = get_bigscape_network(net_data_path, cutoff=cutoff)
    G_clusters, family_nodes = get_network_graph(df_network, weight="Raw distance")
    df_network, df_known = remove_single_mibig(df_network, df_known_all, family_nodes)
    G_clusters, family_nodes = get_network_graph(df_network, weight="Raw distance")
    singleton_bgc = [list(fam)[0] for fam in family_nodes if len(fam) == 1]
    family_graphs = get_family_graph(G_clusters)
    df_clusters, df_known, df_families = update_cluster_family(
        df_clusters, df_known, family_nodes, mibig_table, cutoff=cutoff
    )
    df_family_presence = get_family_presence(
        df_clusters, df_genomes, df_families, cutoff=cutoff
    )
    logging.debug(f"Number of genomes: {df_genomes.shape[0]}")
    logging.debug(f"Number of BGCs: {df_clusters.shape[0]}")
    logging.debug(f"Number of edges in the network: {df_network.shape[0]}")
    logging.debug(f"Number of total families: {len(family_nodes)}")
    logging.debug(
        f"Number of total non-single families: {len(family_nodes) - len(singleton_bgc)}"
    )
    logging.debug(f"Number of singleton families in the network: {len(singleton_bgc)}")
    logging.debug(
        f"Number of families with known BGCs: {df_families[df_families.fam_type == 'known_family'].shape[0]}"
    )
    logging.debug(f"Number of known BGCs in the network: {df_known.shape[0]}")
    logging.debug(
        f"Number of BGCs in the top 10 families: {[len(fam) for fam in family_graphs[:10]]}"
    )
    df_known_families = df_families[df_families.fam_type == "known_family"]
    logging.debug(
        f"Some of the common known BGCs:{chr(10)}{chr(10)}{df_known_families.iloc[:20,1:3]}{chr(10)}"
    )
    df_unknown_families = df_families[df_families.fam_type == "unknown_family"]
    logging.debug(
        f"Some of the common unknown BGCs:{chr(10)}{chr(10)}{df_unknown_families.iloc[:20,1:3]}{chr(10)}"
    )

    # Assign index name
    df_network.index.name = "bigscape_edge_id"
    df_known.index.name = "bgc_id"
    df_families.index.name = f"fam_id_{cutoff}"
    df_clusters.index.name = "bgc_id"
    df_family_presence.index.name = "genome_id"

    # Save all the dataframes
    df_network.to_csv(f"{output_dir}/{query_name}_df_network_" + cutoff + ".csv")
    df_known.to_csv(f"{output_dir}/{query_name}_df_known_" + cutoff + ".csv")
    df_families.to_csv(f"{output_dir}/{query_name}_df_families_" + cutoff + ".csv")
    df_clusters.to_csv(f"{output_dir}/{query_name}_df_clusters_" + cutoff + ".csv")
    df_family_presence.to_csv(
        f"{output_dir}/{query_name}_df_family_presence_" + cutoff + ".csv"
    )

    return df_clusters, df_families, df_network


def process_bigscape_output(
    bigscape_directory, as_dir, df_genomes_path, mibig_bgc_table, output_dir
):
    bigscape_directory = Path(bigscape_directory)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # get the latest run
    bigscape_runs = {}
    logging.info("Getting BiG-SCAPE runs...")
    for net_data_path in bigscape_directory.glob("network_files/*"):
        logging.debug(f"Found network data: {net_data_path}")
        selected_run_folder = net_data_path.name
        selected_run_time = selected_run_folder.split("_glocal")[0]
        selected_run_time = datetime.strptime(selected_run_time, "%Y-%m-%d_%H-%M-%S")
        bigscape_runs[selected_run_time] = net_data_path
    run_times = list(bigscape_runs.keys())
    run_times.sort(reverse=True)

    # make sure run times has values
    assert len(run_times) > 0
    selected_run = run_times[0]
    net_data_path = bigscape_runs[selected_run]

    logging.info(f"Processing {selected_run}")
    node_annot_path = (
        net_data_path / "Network_Annotations_Full.tsv"
    )  # Read the BGC table

    df_nodes = pd.read_csv(node_annot_path, index_col="BGC", sep="\t")

    # Generate df_clusters and df_known dataframe
    df_genomes = pd.read_csv(df_genomes_path).set_index("genome_id", drop=False)
    df_genomes.to_csv(f"{output_dir}/df_genome_antismash_summary.csv")

    df_known_all, df_clusters = get_cluster_dataframes(
        df_genomes, df_nodes, mibig_bgc_table, as_dir
    )
    assert (
        len(df_clusters) > 0
    ), f"df_clusters is empty {df_clusters.shape}. Check that the folder data/interim/bgcs contains proper antiSMASH region GenBank files."
    # Enrich dataframes with BiG-SCAPE information on GCCs and GCFs with cutoffs
    df_clusters, df_known_all = add_bigscape_families(
        df_clusters, df_known_all, net_data_path
    )

    # Get GCF data as per the cutoff
    for cutoff in ["0.30", "0.40", "0.50"]:
        df_clusters, df_families, df_network = run_family_analysis(
            cutoff,
            net_data_path,
            df_clusters,
            df_genomes,
            df_known_all,
            output_dir,
            mibig_bgc_table,
            str(selected_run).replace(":", "_"),
        )
    return


if __name__ == "__main__":
    process_bigscape_output(
        sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5]
    )
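As a rough sketch of how this entry point might be called outside of Snakemake; all paths are hypothetical placeholders except `resources/mibig/df_mibig_bgcs.csv`, which the script itself refers to:

# Hypothetical invocation; in the workflow these arguments come from sys.argv.
# The BiG-SCAPE directory is expected to contain network_files/<timestamp>_glocal*/,
# and the most recent run is selected automatically.
process_bigscape_output(
    bigscape_directory="data/interim/bigscape/my_project",
    as_dir="data/interim/antismash/",
    df_genomes_path="data/processed/my_project/df_antismash_summary.csv",
    mibig_bgc_table="resources/mibig/df_mibig_bgcs.csv",
    output_dir="data/processed/my_project/tables",
)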
import json
import logging
import sys
from pathlib import Path

import pandas as pd
from alive_progress import alive_bar

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def combine_bgc_counts(input_json, filter_str="_bgc_counts"):
    container = {}
    logging.info("Reading json files...")
    with alive_bar(len(input_json), title="Merging BGC counts:") as bar:
        for item in input_json:
            item = Path(item)
            genome_id = item.stem
            if filter_str in genome_id:
                genome_id = genome_id.replace(filter_str, "")
            logging.debug(f"Processing {genome_id}")
            with open(item, "r") as f:
                data = json.load(f)
                products = data[genome_id].pop("products")
                data[genome_id] = data[genome_id] | products
                container.update(data)
            bar()
    return container


def write_genome_table(input_json, samples_table, genome_table):
    """
    Write df_genomes.csv table in processed data
    """
    # Accommodate multiple inputs to generate the dataframe
    shell_input = samples_table.split()
    logging.info(f"Reading samples table: {shell_input}")
    dfList = [pd.read_csv(s).set_index("genome_id", drop=False) for s in shell_input]
    df_samples = pd.concat(dfList, axis=0)

    # Handle multiple json
    input_json = Path(input_json)
    logging.info(input_json)
    if input_json.is_file() and input_json.suffix == ".json":
        logging.info(f"Getting BGC overview from a single file: {input_json}")
        input_json_files = input_json

    elif input_json.is_file() and input_json.suffix == ".txt":
        logging.info(f"Getting BGC overview  from a text file: {input_json}")
        with open(input_json, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            input_json_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        input_json_files = [
            Path(file)
            for file in str(input_json).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Getting BGC overview from the given list of {len(input_json_files)} files..."
        )

    bgc_counts = combine_bgc_counts(input_json_files)
    bgc_counts = pd.DataFrame.from_dict(bgc_counts).T

    logging.debug(f"Writing file to: {genome_table}")

    # Generate dataframe
    df_genomes = pd.concat([df_samples, bgc_counts], axis=1)

    # Save dataframes to csv tables
    genome_table = Path(genome_table)
    genome_table.parent.mkdir(parents=True, exist_ok=True)
    df_genomes.to_csv(genome_table, index=False)
    logging.info("Job done")
    return None


if __name__ == "__main__":
    write_genome_table(sys.argv[1], sys.argv[2], sys.argv[3])
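A hedged usage sketch of `write_genome_table`; the file names are hypothetical, and, as in the script, multiple JSON inputs can be passed as a space-separated string or via a `.txt` file listing one path per line:

# Hypothetical paths; each *_bgc_counts.json holds per-genome BGC counts keyed by genome_id.
write_genome_table(
    input_json="genome_A_bgc_counts.json genome_B_bgc_counts.json",
    samples_table="config/my_project/samples.csv",
    genome_table="data/processed/my_project/tables/df_genomes.csv",
)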
import logging
import os
import sys
from shutil import copyfile

import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from Bio import Phylo, SeqIO

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def get_roary_data(
    roary_interim_folder, roary_processed_folder, core=0.99, softcore=0.95, shell=0.15
):
    """
    Copy important files from Roary interim to processed directory

    Parameters
    ----------
    1. roary_interim_folder : str / path
        Location of the output from Roary in the interim directory
    2. roary_processed_folder : str / path
        Location of the processed output directory for important Roary results
    3. core: float (0 to 1)
        default = 0.99
        Fraction of genomes a gene must be present in to be grouped as core (99% <= strains <= 100%)
    4. softcore: float (0 to 1)
        default = 0.95
        Fraction of genomes a gene must be present in to be grouped as softcore (95% <= strains < 99%)
    5. shell: float (0 to 1)
        default = 0.15
        Fraction of genomes a gene must be present in to be grouped as shell (15% <= strains < 95%)
        Remaining genes are assigned to cloud
    Returns
    -------
    1. roary_processed_folder : str / path
        Location of the processed output directory for important Roary results
    """

    gene_presence_path = os.path.join(roary_interim_folder, "gene_presence_absence.csv")
    df_gene_presence_summary = pd.read_csv(
        gene_presence_path, index_col="Gene", low_memory=False
    )

    # Extract gene annotation columns to separate dataframe
    logging.info("Extract gene annotation columns to separate dataframe")
    gene_summary_columns = [
        "Non-unique Gene name",
        "Annotation",
        "No. isolates",
        "No. sequences",
        "Avg sequences per isolate",
        "Genome Fragment",
        "Order within Fragment",
        "Accessory Fragment",
        "Accessory Order with Fragment",
        "QC",
        "Min group size nuc",
        "Max group size nuc",
        "Avg group size nuc",
    ]

    gene_summary_out_interim = os.path.join(
        roary_interim_folder, "df_pangene_summary.csv"
    )
    gene_summary_out_processed = os.path.join(
        roary_processed_folder, "df_pangene_summary.csv"
    )
    df_gene_summary = df_gene_presence_summary[gene_summary_columns].fillna("")

    # Extract locus tags
    logging.info("Extract locus tags")
    df_gene_presence = df_gene_presence_summary.drop(
        columns=gene_summary_columns
    ).fillna("")
    gene_presence_locustag_out_interim = os.path.join(
        roary_interim_folder, "df_gene_presence_locustag.csv"
    )
    gene_presence_locustag_out_processed = os.path.join(
        roary_processed_folder, "df_gene_presence_locustag.csv"
    )
    df_gene_presence.to_csv(gene_presence_locustag_out_interim)
    df_gene_presence.to_csv(gene_presence_locustag_out_processed)

    # Save the gene presence absence binary matrix
    logging.info("Save the gene presence absence binary matrix")
    gene_presence_binary_path = os.path.join(
        roary_interim_folder, "gene_presence_absence.Rtab"
    )
    gene_presence_binary_out_interim = os.path.join(
        roary_interim_folder, "df_gene_presence_binary.csv"
    )
    gene_presence_binary_out_processed = os.path.join(
        roary_processed_folder, "df_gene_presence_binary.csv"
    )

    df_gene_presence_binary = pd.read_csv(
        gene_presence_binary_path, index_col="Gene", sep="\t"
    )
    df_gene_presence_binary.to_csv(gene_presence_binary_out_interim)
    df_gene_presence_binary.to_csv(gene_presence_binary_out_processed)

    # Add locus_tag and pangenome_id from pan_genome_reference.fa file
    logging.info("Add locus_tag and pangenome_id from pan_genome_reference.fa file")
    pan_fasta_path = os.path.join(roary_interim_folder, "pan_genome_reference.fa")
    records = SeqIO.parse(pan_fasta_path, format="fasta")
    for rec in records:
        locus_tag = rec.id
        pan_gene_desc = rec.description.split(" ")
        pan_gene_id = " ".join([pan_gene_id for pan_gene_id in pan_gene_desc[1:]])
        if pan_gene_id in df_gene_summary.index:
            df_gene_summary.loc[pan_gene_id, "locus_tag"] = locus_tag

    # Add pangenome class and save gene summary table
    logging.info("Add pangenome class and save gene summary table")
    total_genomes = df_gene_presence.shape[1]
    for pan_gene_id in df_gene_summary.index:
        no_isolates = df_gene_summary.loc[pan_gene_id, "No. isolates"]
        if no_isolates >= core * total_genomes:
            df_gene_summary.loc[pan_gene_id, "pangenome_class"] = "core"
        elif no_isolates >= softcore * total_genomes:
            df_gene_summary.loc[pan_gene_id, "pangenome_class"] = "softcore"
        elif no_isolates >= shell * total_genomes:
            df_gene_summary.loc[pan_gene_id, "pangenome_class"] = "shell"
        else:
            df_gene_summary.loc[pan_gene_id, "pangenome_class"] = "cloud"
    df_gene_summary.to_csv(gene_summary_out_interim)
    df_gene_summary.to_csv(gene_summary_out_processed)

    # Copy other output files to processed directory
    logging.info("Copy other output files to processed directory")
    copyfile(
        os.path.join(roary_interim_folder, "conserved_vs_total_genes.png"),
        os.path.join(roary_processed_folder, "conserved_vs_total_genes.png"),
    )

    copyfile(
        os.path.join(roary_interim_folder, "Rplots.pdf"),
        os.path.join(roary_processed_folder, "Rplots.pdf"),
    )

    copyfile(
        os.path.join(roary_interim_folder, "pan_genome_reference.fa"),
        os.path.join(roary_processed_folder, "pan_genome_reference.fa"),
    )

    copyfile(
        os.path.join(roary_interim_folder, "summary_statistics.txt"),
        os.path.join(roary_processed_folder, "summary_statistics.txt"),
    )

    copyfile(
        os.path.join(roary_interim_folder, "number_of_conserved_genes.Rtab"),
        os.path.join(roary_processed_folder, "number_of_conserved_genes.Rtab"),
    )

    copyfile(
        os.path.join(roary_interim_folder, "number_of_genes_in_pan_genome.Rtab"),
        os.path.join(roary_processed_folder, "number_of_genes_in_pan_genome.Rtab"),
    )
    return df_gene_presence, df_gene_presence_binary, df_gene_summary


def plot_core_pan_curve(roary_interim_folder, roary_processed_folder):
    """
    Plots pangenome curves for core, pan, new and unique gene against number of genomes

    Parameters
    ----------
    1. roary_interim_folder : str / path
        Location of the output from Roary in the interim directory

    Returns
    -------
    1. roary_processed_folder : str / path
        Location of the processed output directory where two plots will be saved
    """

    core_path = os.path.join(roary_interim_folder, "number_of_conserved_genes.Rtab")
    pan_path = os.path.join(roary_interim_folder, "number_of_genes_in_pan_genome.Rtab")
    new_path = os.path.join(roary_interim_folder, "number_of_new_genes.Rtab")
    uniq_path = os.path.join(roary_interim_folder, "number_of_unique_genes.Rtab")

    df_core_cnt = pd.read_table(core_path, sep="\t")
    df_pan_cnt = pd.read_table(pan_path, sep="\t")
    df_new_cnt = pd.read_csv(new_path, sep="\t")
    df_uniq_cnt = pd.read_csv(uniq_path, sep="\t")

    avg_core = df_core_cnt.sum(0) / df_core_cnt.shape[0]
    max_core = df_core_cnt.max(0)
    min_core = df_core_cnt.min(0)

    avg_pan = df_pan_cnt.sum(0) / df_core_cnt.shape[0]
    max_pan = df_pan_cnt.max(0)
    min_pan = df_pan_cnt.min(0)

    avg_new = df_new_cnt.sum(0) / df_uniq_cnt.shape[0]
    max_new = df_new_cnt.max(0)
    min_new = df_new_cnt.min(0)

    avg_uniq = df_uniq_cnt.sum(0) / df_uniq_cnt.shape[0]
    max_uniq = df_uniq_cnt.max(0)
    min_uniq = df_uniq_cnt.min(0)

    genome_numbers = list(range(1, df_core_cnt.shape[1] + 1))

    # Plot pan and core genome curves in JPEG format
    fig_pan_core = go.Figure()
    fig_pan_core.add_trace(
        go.Scatter(
            x=genome_numbers,
            y=avg_core,
            fill="tonexty",
            mode="lines+markers",
            name="Core",
            hoverinfo="name+x+y",
            error_y=dict(
                type="data",
                symmetric=False,
                array=max_core - avg_core,
                arrayminus=avg_core - min_core,
            ),
        )
    )  # fill down to xaxis

    fig_pan_core.add_trace(
        go.Scatter(
            x=genome_numbers,
            y=avg_pan,
            fill="tonexty",
            mode="lines+markers",
            name="Pan",
            hoverinfo="name+x+y",
            error_y=dict(
                type="data",
                symmetric=False,
                array=max_pan - avg_pan,
                arrayminus=avg_pan - min_pan,
            ),
        )
    )  # fill to trace0 y

    fig_pan_core.update_layout(
        autosize=False,
        width=800,
        height=500,
        margin=dict(l=20, r=20, b=30, t=30, pad=4),
        title_text="Pangenome and coregenome curve",
        xaxis_title_text="#Genomes",
        yaxis_title_text="#Genes",
        paper_bgcolor="White",
    )

    fig_pan_core_path = os.path.join(roary_processed_folder, "pan_core_curve.jpeg")
    if not os.path.isfile(fig_pan_core_path):
        fig_pan_core.write_image(fig_pan_core_path)

    # Plot new and unique genome curves in JPEG format
    fig_new_uniq = go.Figure()
    fig_new_uniq.add_trace(
        go.Scatter(
            x=genome_numbers,
            y=avg_new,
            fill="tonexty",
            mode="lines+markers",
            name="New",
            hoverinfo="name+x+y",
            error_y=dict(
                type="data",
                symmetric=False,
                array=max_new - avg_new,
                arrayminus=avg_new - min_new,
            ),
        )
    )  # fill down to xaxis

    fig_new_uniq.add_trace(
        go.Scatter(
            x=genome_numbers,
            y=avg_uniq,
            fill="tonexty",
            mode="lines+markers",
            name="Unique",
            hoverinfo="name+x+y",
            error_y=dict(
                type="data",
                symmetric=False,
                array=max_uniq - avg_uniq,
                arrayminus=avg_uniq - min_uniq,
            ),
        )
    )  # fill to trace0 y

    fig_new_uniq.update_layout(
        autosize=False,
        width=800,
        height=500,
        margin=dict(l=20, r=20, b=30, t=30, pad=4),
        title_text="New and unique genes curve",
        xaxis_title_text="#Genomes",
        yaxis_title_text="#Genes",
        paper_bgcolor="White",
    )

    fig_new_uniq_path = os.path.join(roary_processed_folder, "new_unique_curve.jpeg")
    if not os.path.isfile(fig_new_uniq_path):
        fig_new_uniq.write_image(fig_new_uniq_path)

    return


def plot_pan_freq_plot(df_gene_presence_binary, roary_processed_folder):
    """
    Plots the pangenome gene frequency histogram (number of genes against the number of genomes they are present in)

    Parameters
    ----------
    1. df_gene_presence_binary : pd.DataFrame
        Dataframe with presence absence matrix of genes in the pangenome

    Returns
    -------
    1. roary_processed_folder : str / path
        Location of the processed output directory where gene frequency plot will be saved
    """

    df_freq = pd.DataFrame(index=df_gene_presence_binary.index, columns=["#Genomes"])
    df_freq["#Genomes"] = df_gene_presence_binary.sum(axis=1)
    nbins = df_gene_presence_binary.shape[1]
    fig = px.histogram(df_freq, x="#Genomes", nbins=nbins)

    fig.update_layout(
        autosize=False,
        width=800,
        height=500,
        margin=dict(l=20, r=20, b=30, t=30, pad=4),
        title_text="Pangenome frequency plot",  # title of plot
        xaxis_title_text="#Genomes",  # xaxis label
        yaxis_title_text="#Genes",  # yaxis label
    )
    fig_out_path = os.path.join(roary_processed_folder, "gene_frequency.jpeg")
    if not os.path.isfile(fig_out_path):
        fig.write_image(fig_out_path)

    return


def plot_pan_pie_chart(
    df_gene_summary,
    total_genomes,
    roary_processed_folder,
    core=0.99,
    softcore=0.95,
    shell=0.15,
):
    """
    Plots a pie chart of the pangenome classes (core, softcore, shell, cloud)

    Parameters
    ----------
    1. df_gene_summary : pd.DataFrame
        Dataframe with gene summary including pangenome_class
    2. total_genomes : int
        Number of total genomes in the pangenome
    3. roary_processed_folder : str / path
        Location of the processed output directory for Roary
    4. core: float (0 to 1)
        default = 0.99
        Fraction of genomes a gene must be present in to be grouped as core (99% <= strains <= 100%)
    5. softcore: float (0 to 1)
        default = 0.95
        Fraction of genomes a gene must be present in to be grouped as softcore (95% <= strains < 99%)
    6. shell: float (0 to 1)
        default = 0.15
        Fraction of genomes a gene must be present in to be grouped as shell (15% <= strains < 95%)
        Remaining genes are assigned to cloud

    Returns
    -------
    1. roary_processed_folder : str / path
        Location of the processed output directory where pangenome pie chart plot will be saved
    """

    # Plot the pangenome pie chart
    plt.figure(figsize=(10, 10))

    core_genes = df_gene_summary[df_gene_summary["pangenome_class"] == "core"].shape[0]
    softcore_genes = df_gene_summary[
        df_gene_summary["pangenome_class"] == "softcore"
    ].shape[0]
    shell_genes = df_gene_summary[df_gene_summary["pangenome_class"] == "shell"].shape[
        0
    ]
    cloud_genes = df_gene_summary[df_gene_summary["pangenome_class"] == "cloud"].shape[
        0
    ]
    total_genes = df_gene_summary.shape[0]

    def my_autopct(pct):
        val = int(pct * total_genes / 100.0)
        return "{v:d}".format(v=val)

    fig = plt.pie(
        [core_genes, softcore_genes, shell_genes, cloud_genes],
        labels=[
            "core  (strains >= %d)" % (total_genomes * core),
            "soft-core  (%d <= strains < %d)"
            % (total_genomes * softcore, total_genomes * core),
            "shell\n(%d <= strains < %d)"
            % (total_genomes * shell, total_genomes * softcore),
            "cloud\n(strains < %d)" % (total_genomes * shell),
        ],
        explode=[0.1, 0.05, 0.02, 0],
        radius=0.9,
        colors=[
            (0, 0, 1, float(x) / total_genes)
            for x in (core_genes, softcore_genes, shell_genes, cloud_genes)
        ],
        autopct=my_autopct,
        textprops={"fontsize": 10, "fontweight": "bold"},
    )

    fig_out_path = os.path.join(roary_processed_folder, "pangenome_pie.jpeg")
    if not os.path.isfile(fig_out_path):
        plt.savefig(
            fig_out_path,
            facecolor="w",
            edgecolor="w",
            orientation="portrait",
            format="jpeg",
            transparent=False,
            bbox_inches="tight",
            pad_inches=0.1,
        )

    return


def plot_tree_presence_map(
    t, df_gene_presence_binary, df_genomes_tree, df_roary_sorted, roary_processed_folder
):
    """
    Plots gene presence heatmap with phylogenetic tree

    Parameters
    ----------
    1. t : Bio.Phylo tree
        Phylogenetic tree object of Bio.Phylo from autoMLST
    2. df_gene_presence_binary : pd.DataFrame
        Dataframe with presence absence matrix of genes in the pangenome
    3. df_genomes_tree : pd.DataFrame
        Dataframe with genomes ordered phylogenetically
    4. df_roary_sorted : pd.DataFrame
        Sorted matrix with gene presence across the genomes
    5. roary_processed_folder : str / path
        Location of the processed output directory for Roary

    Returns
    -------
    1. heatmap : jpeg
        Heatmap of gene presence with genomes in phylogenetic order
    """

    # Max distance to create better plot
    mdist = max([t.distance(t.root, x) for x in t.get_terminals()])

    # TODO: Consider clustering methods for sorting the matrix

    # Plot presence/absence matrix against the tree
    fig = plt.figure(figsize=(30, df_roary_sorted.shape[1] / 6))

    ax1 = plt.subplot2grid((1, 40), (0, 10), colspan=30)
    a = ax1.imshow(
        df_roary_sorted.T,
        cmap=plt.cm.Blues,
        vmin=0,
        vmax=1,
        aspect="auto",
        interpolation="none",
    )
    ax1.set_yticks([])
    ax1.set_xticks([])
    ax1.axis("off")

    ax = fig.add_subplot(1, 2, 1)
    ax = plt.subplot2grid((1, 40), (0, 0), colspan=10)

    fig.subplots_adjust(wspace=1, hspace=0)

    ax1.set_title("Roary matrix\n(%d gene clusters)" % df_roary_sorted.shape[0])

    strain_dict = dict()
    for x in t.get_terminals():
        if x.name in df_genomes_tree.index:
            strain_dict[x.name] = df_genomes_tree.loc[x.name, "strain"]
        else:
            genome_id = x.name[:-1] + "." + x.name[-1]
            strain_dict[x.name] = df_genomes_tree.loc[genome_id, "strain"]

    # Tree labels are extracted from the column 'strain' of df_genomes_tree
    Phylo.draw(
        t,
        axes=ax,
        show_confidence=False,
        label_func=lambda x: strain_dict[x.name] if x.is_terminal() else "",
        xticks=([],),
        yticks=([],),
        ylabel=("",),
        xlabel=("",),
        xlim=(-0.01, mdist + 0.01),
        axis=("off",),
        title=("autoMLST tree\n(%d strains)" % df_roary_sorted.shape[1],),
    )

    fig_out_path = os.path.join(roary_processed_folder, "phylo_presence_heatmap.jpeg")
    if not os.path.isfile(fig_out_path):
        plt.savefig(
            fig_out_path,
            facecolor="w",
            edgecolor="w",
            orientation="portrait",
            format="jpeg",
            transparent=False,
            bbox_inches="tight",
            pad_inches=0.1,
        )

    return plt.show()


def sort_roary_matrix(
    df_gene_presence_binary,
    df_genomes_tree,
    roary_interim_folder,
    roary_processed_folder,
):
    """
    Sort the Roary presence matrix: genomes in phylogenetic order, genes in decreasing order of the number of strains they are present in

    Parameters
    ----------
    1. df_gene_presence_binary : pd.DataFrame
        Dataframe with presence absence matrix of genes in the pangenome
    2. df_genomes_tree : pd.DataFrame
        Dataframe with genomes ordered phylogenetically

    Returns
    -------
    1. df_roary_sorted : pd.DataFrame
        Sorted matrix with gene presence across the genomes
        Saves sorted matrix as csv table
    """

    # Gene presence matrix
    df_roary = df_gene_presence_binary.copy()
    # Sort the matrix by the sum of strains presence
    sorted_idx = df_roary.sum(axis=1).sort_values(ascending=False).index
    df_roary_sorted = df_roary.loc[sorted_idx]
    # Sort the matrix according to phylogeny
    df_roary_sorted = df_roary_sorted[df_genomes_tree.index]

    roary_sorted_interim = os.path.join(
        roary_interim_folder, "df_roary_tree_sum_sorted.csv"
    )
    roary_sorted_processed = os.path.join(
        roary_processed_folder, "df_roary_tree_sum_sorted.csv"
    )

    df_roary_sorted.to_csv(roary_sorted_interim)
    df_roary_sorted.to_csv(roary_sorted_processed)

    # TODO: Consider clustering methods for sorting the matrix

    return df_roary_sorted


if __name__ == "__main__":
    roary_interim_folder = sys.argv[1]
    roary_processed_folder = sys.argv[2]
    automlst_interim_folder = sys.argv[3]
    df_gene_presence, df_gene_presence_binary, df_gene_summary = get_roary_data(
        roary_interim_folder,
        roary_processed_folder,
        core=0.99,
        softcore=0.95,
        shell=0.15,
    )
    plot_core_pan_curve(roary_interim_folder, roary_processed_folder)
    plot_pan_freq_plot(df_gene_presence_binary, roary_processed_folder)
    total_genomes = float(df_gene_presence_binary.shape[1])
    plot_pan_pie_chart(
        df_gene_summary,
        total_genomes,
        roary_processed_folder,
        core=0.99,
        softcore=0.95,
        shell=0.15,
    )

    # Read tree file
    newick_path = os.path.join(automlst_interim_folder, "final.newick")
    t = Phylo.read(newick_path, "newick")
    genome_tree_table_path = os.path.join(
        automlst_interim_folder, "df_genomes_tree.csv"
    )
    df_genomes_tree = pd.read_csv(genome_tree_table_path, index_col="genome_id")

    # Sort the matrix according to phylogeny
    df_roary_sorted = sort_roary_matrix(
        df_gene_presence_binary,
        df_genomes_tree,
        roary_interim_folder,
        roary_processed_folder,
    )
    plot_tree_presence_map(
        t,
        df_gene_presence_binary,
        df_genomes_tree,
        df_roary_sorted,
        roary_processed_folder,
    )
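The pangenome class assignment above boils down to a threshold rule on the `No. isolates` column. A minimal, self-contained sketch using the script's default cutoffs:

def classify_gene(no_isolates, total_genomes, core=0.99, softcore=0.95, shell=0.15):
    """Assign a pangenome class from the number of genomes a gene occurs in."""
    if no_isolates >= core * total_genomes:
        return "core"
    if no_isolates >= softcore * total_genomes:
        return "softcore"
    if no_isolates >= shell * total_genomes:
        return "shell"
    return "cloud"

# With 100 genomes: present in all 100 -> core, present in only 10 -> cloud
assert classify_gene(100, 100) == "core"
assert classify_gene(10, 100) == "cloud"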
import json
import os
import sys
from shutil import copyfile

import pandas as pd
from Bio import Phylo


def copy_automlst_out(automlst_interim_folder, automlst_processed_folder):
    """
    Copy important files from autoMLST interim to processed directory

    Parameters
    ----------
    1. automlst_interim_folder : str / path
        Location of the output from autoMLST in the interim directory

    Returns
    -------
    1. automlst_processed_folder : str / path
        Location of the processed output directory for important autoMLST results
    """

    files_to_copy = [
        "raxmlpart.txt",
        "raxmlpart.txt.bionj",
        "raxmlpart.txt.contree",
        "raxmlpart.txt.iqtree",
        "raxmlpart.txt.log",
        "raxmlpart.txt.mldist",
        "raxmlpart.txt.treefile",
    ]

    for tree_file in files_to_copy:
        from_path = os.path.join(automlst_interim_folder, tree_file)
        to_path = os.path.join(automlst_processed_folder, tree_file)

        if os.path.isfile(from_path):
            copyfile(from_path, to_path)

    # Newick tree file selected for downstream analysis by the default automlst_wrapper
    treefile = os.path.join(automlst_interim_folder, "raxmlpart.txt.treefile")
    out_newick_processed = os.path.join(automlst_processed_folder, "final.newick")
    out_newick_interim = os.path.join(automlst_interim_folder, "final.newick")

    if os.path.isfile(treefile):
        copyfile(treefile, out_newick_interim)
        copyfile(treefile, out_newick_processed)

    # Save list of MLST genes in pandas table (currently with only index with TIGR IDs)
    aligned_dir = os.path.join(automlst_interim_folder, "aligned")
    mlst_out_path_processed = os.path.join(
        automlst_processed_folder, "df_mlst_genes.csv"
    )
    mlst_out_path_interim = os.path.join(automlst_interim_folder, "df_mlst_genes.csv")

    mlst_gene_list = os.listdir(aligned_dir)

    df_mlst = pd.DataFrame(index=mlst_gene_list, columns=["aligned_path"])
    for mlst_gene in df_mlst.index:
        df_mlst.loc[mlst_gene, "aligned_path"] = os.path.join(
            aligned_dir, mlst_gene + ".faa"
        )
    df_mlst.index.name = "mlst_gene"

    df_mlst.to_csv(mlst_out_path_interim)
    df_mlst.to_csv(mlst_out_path_processed)

    return None


def get_genome_tree_table(
    automlst_processed_folder, prokka_interim_folder, gtdb_interim_folder
):
    """
    Build a combined genome table ordered by the phylogenetic tree, merging organism info (from Prokka) and GTDB taxonomy

    Parameters
    ----------
    1. automlst_processed_folder : str / path
        Location of the processed output directory for important autoMLST results
    2. prokka_interim_folder : str/ path
        Path to get organism information for each genome as used in prokka
    3. gtdb_interim_folder : str/ path
        Path to get json files with gtdb information

    Returns
    -------
    1. df_genomes_tree : pd.DataFrame
        Combined dataframe of organism info ordered by the phylogenetic tree (final.newick file)
    """

    newick_path = os.path.join(automlst_processed_folder, "final.newick")

    t = Phylo.read(newick_path, "newick")

    # Max distance to create better plot
    # mdist = max([t.distance(t.root, x) for x in t.get_terminals()])

    # Sort the matrix according to tip labels in the tree
    genome_ids_list = [
        x.name
        if x.name in os.listdir(prokka_interim_folder)
        else x.name[:-1] + "." + x.name[-1]
        for x in t.get_terminals()
    ]
    # The above is required in the default version of automlst_wrapper
    # To be fixed in the original repo

    columns_list = [
        "genus_original",
        "species_original",
        "strain",
        "domain",
        "phylum",
        "class",
        "order",
        "family",
        "genus",
        "species",
    ]
    df_genomes_tree = pd.DataFrame(index=genome_ids_list, columns=columns_list)
    df_genomes_tree.index.name = "genome_id"

    for genome_id in df_genomes_tree.index:
        # Reading organism info used for the prokka run, including strain ID
        org_info_path = os.path.join(
            prokka_interim_folder, genome_id, "organism_info.txt"
        )
        with open(org_info_path, "r") as file_obj:
            org_info = file_obj.readlines()[0]
            df_genomes_tree.loc[genome_id, "genus_original"] = org_info.split(",")[0]
            df_genomes_tree.loc[genome_id, "species_original"] = org_info.split(",")[1]
            df_genomes_tree.loc[genome_id, "strain"] = org_info.split(",")[2]

        # Reading gtdb metadata from JSON files
        gtdb_json_path = os.path.join(gtdb_interim_folder, genome_id + ".json")
        with open(gtdb_json_path, "r") as f:
            gtdb_dict = json.load(f)
            df_genomes_tree.loc[genome_id, "domain"] = gtdb_dict["gtdb_taxonomy"][
                "domain"
            ]
            df_genomes_tree.loc[genome_id, "phylum"] = gtdb_dict["gtdb_taxonomy"][
                "phylum"
            ]
            df_genomes_tree.loc[genome_id, "class"] = gtdb_dict["gtdb_taxonomy"][
                "class"
            ]
            df_genomes_tree.loc[genome_id, "order"] = gtdb_dict["gtdb_taxonomy"][
                "order"
            ]
            df_genomes_tree.loc[genome_id, "family"] = gtdb_dict["gtdb_taxonomy"][
                "family"
            ]
            df_genomes_tree.loc[genome_id, "genus"] = gtdb_dict["gtdb_taxonomy"][
                "genus"
            ]
            df_genomes_tree.loc[genome_id, "organism"] = gtdb_dict["gtdb_taxonomy"][
                "species"
            ]
            try:
                df_genomes_tree.loc[genome_id, "species"] = gtdb_dict["gtdb_taxonomy"][
                    "species"
                ].split(" ")[1]
            except IndexError:  # leave blank for empty taxonomy
                df_genomes_tree.loc[genome_id, "species"] = ""

    genomes_tree_path_interim = os.path.join(
        automlst_processed_folder, "df_genomes_tree.csv"
    )
    genomes_tree_path_processed = os.path.join(
        automlst_processed_folder, "df_genomes_tree.csv"
    )

    df_genomes_tree.to_csv(genomes_tree_path_interim)
    df_genomes_tree.to_csv(genomes_tree_path_processed)

    return df_genomes_tree


if __name__ == "__main__":
    automlst_interim_folder = sys.argv[1]
    automlst_processed_folder = sys.argv[2]
    prokka_interim_folder = sys.argv[3]
    gtdb_interim_folder = sys.argv[4]
    copy_automlst_out(automlst_interim_folder, automlst_processed_folder)
    df_genomes_tree = get_genome_tree_table(
        automlst_processed_folder, prokka_interim_folder, gtdb_interim_folder
    )
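
As a usage sketch: inside the workflow, this script is invoked by the make_phylo_tree rule (shown further below among the rule snippets) with four positional arguments. A hypothetical standalone call, using placeholder folder names in place of the project-specific paths, could look like this:

python workflow/bgcflow/bgcflow/data/make_phylo_tree.py \
    <automlst_interim_folder> \
    <automlst_processed_folder> \
    <prokka_interim_folder> \
    <gtdb_interim_folder>

The interim folder holds the raw autoMLST output (including final.newick), while the Prokka and GTDB folders provide the per-genome organism_info.txt and taxonomy JSON files read above.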
import glob
import gzip
import io
import logging
import sys
from pathlib import Path

import ncbi_genome_download as ngd
import pandas as pd

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.INFO)


def build_prokka_refgbff(name, prokka_db_table, outfile):
    """
    Given a project name and a csv/json table path, download all NCBI accessions
    and write a concatenated GenBank file for the prokka --proteins parameter.
    """
    download_dir = Path(f"resources/prokka_db/{name}")
    prokka_db_table = Path(prokka_db_table).resolve()

    # Generate download directory
    download_dir.mkdir(parents=True, exist_ok=True)

    # Get lists of accession ids and their sources
    logging.info(f"Reading input file: {prokka_db_table}...")
    if prokka_db_table.suffix == ".csv":
        df = pd.read_csv(prokka_db_table)
    elif prokka_db_table.suffix == ".json":
        df = pd.read_json(prokka_db_table)
    else:
        raise ValueError(
            f"Unsupported input format: {prokka_db_table.suffix}. Expected .csv or .json."
        )

    ngd_input = {"refseq": [], "genbank": []}

    logging.info("Donwloading genomes from NCBI...")
    for acc in df.Accession:
        if acc.startswith("GCF"):
            ngd_input["refseq"].append(acc)
        elif acc.startswith("GCA"):
            ngd_input["genbank"].append(acc)
        else:
            raise

    # Download gbff with NCBI genome download
    for s in ngd_input.keys():
        if ngd_input[s]:
            acc_list = ",".join(ngd_input[s])
            logging.debug(f"Downloading {s} genome: {acc}")
            ngd.download(
                section=s,
                file_formats="genbank",
                assembly_accessions=acc_list,
                output=download_dir,
                groups="bacteria",
            )

    # Concatenate gbff
    logging.info("Concatenating genomes into a single file...")
    reference_files = glob.glob(f"{download_dir}/*/*/*/*.gbff.gz")

    with open(outfile, "w") as f_out:
        for names in reference_files:
            with io.TextIOWrapper(gzip.open(names, "r")) as infile:
                f = infile.read()
                f_out.write(f)
            f_out.write("\n")

        logging.info(f"Reference genomes saved as: {str(f_out)}")

    return


if __name__ == "__main__":
    build_prokka_refgbff(sys.argv[1], sys.argv[2], sys.argv[3])
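
For reference, the make_prokka_db rule (shown later among the prokka rule snippets) calls this script with three positional arguments: the custom reference set name, the accession table, and the output GenBank file. A hedged example with placeholder names:

python workflow/bgcflow/bgcflow/data/make_prokka_db.py \
    <prokka_db_name> \
    <prokka_db_table>.csv \
    <output_refgbff>.gbff

Downloads are staged under resources/prokka_db/<prokka_db_name>/ before the downloaded .gbff.gz files are concatenated into the single output file intended for prokka --proteins.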
import json
import logging
import sys
from pathlib import Path

import pandas as pd
from alive_progress import alive_bar

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def gather_seqfu_json(seqfu_input, output_file):
    """
    Merge seqfu json input into one csv table
    """
    # Accommodate multiple inputs to generate the dataframe
    shell_input = Path(seqfu_input)
    logging.info(shell_input)
    if shell_input.is_file() and shell_input.suffix == ".json":
        logging.info(f"Getting SeqFu overview from a single file: {shell_input}")
        shell_input_files = [shell_input]

    elif shell_input.is_file() and shell_input.suffix == ".txt":
        logging.info(f"Getting SeqFu overview from a text file: {shell_input}")
        with open(shell_input, "r") as file:
            file_content = [i.strip("\n") for i in file.readlines()]
            if len(file_content) == 1:
                # Paths space-separated on a single line
                paths = file_content[0].split()
            else:
                # Paths written on separate lines
                paths = file_content
            shell_input_files = [
                Path(path) for path in paths if Path(path).suffix == ".json"
            ]
    else:
        shell_input_files = [
            Path(file)
            for file in str(shell_input).split()
            if Path(file).suffix == ".json"
        ]
        logging.info(
            f"Getting SeqFu overview from the given list of {len(shell_input_files)} files..."
        )

    logging.info("Merging SeqFu results...")
    container = {}
    with alive_bar(len(shell_input_files), title="Updating SeqFu information:") as bar:
        for d in shell_input_files:
            logging.debug(f"Reading input: {Path(d).name}")
            with open(d, "r") as f:
                read = json.load(f)
                genome_id = read[0]["Filename"]
                read[0].pop("Filename")
                container[genome_id] = read[0]
            bar()
    logging.info("Converting dictionary to table...")
    df = pd.DataFrame.from_dict(container).T
    df.index.name = "genome_id"
    df.to_csv(output_file)
    logging.info("Job done!")
    return


if __name__ == "__main__":
    gather_seqfu_json(sys.argv[1], sys.argv[2])
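
In the workflow, the SeqFu gathering rule (see the seqfu snippet near the end of the rules section) first writes the list of per-genome JSON paths to a temporary text file and then passes that file together with the output table path. A sketch of the same invocation with illustrative paths:

python workflow/bgcflow/bgcflow/data/make_seqfu_table.py \
    data/interim/tmp/<project_name>/df_seqfu.txt \
    <output_csv>

Here df_seqfu.txt is a plain-text file listing the SeqFu JSON outputs, either space-separated on one line or one path per line, matching the two branches handled above.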
import json
import logging
import sqlite3
import sys
from pathlib import Path

import pandas as pd
from alive_progress import alive_bar

log_format = "%(levelname)-8s %(asctime)s   %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


def gcf_hits(df_gcf, cutoff=900):
    """
    Filter the BiG-SLICE query result by the distance-to-model threshold and return the filtered membership table.
    """
    mask = df_gcf.loc[:, "membership_value"] <= cutoff
    df_gcf_filtered = df_gcf[mask]
    bgcs = df_gcf_filtered.bgc_id.unique()
    gcfs = df_gcf_filtered.gcf_id.unique()
    logging.debug(f"BiG-SLICE query with BiG-FAM run 6, distance cutoff {cutoff}")
    logging.debug(f"Number of bgc hits : {len(bgcs)}/{len(df_gcf.bgc_id.unique())}")
    logging.debug(f"Number of GCF hits : {len(gcfs)}")
    return df_gcf_filtered


def grab_gcf_model_summary(gcf_query, database):
    """
    Summarize GCF models from a given list of BiG-FAM GCF ids
    gcf_query = list of BiG-FAM GCF ids
    database = the full run database of BiG-SLICE
    """

    def grab_mibig_members(q, mibig_bigfam):
        """
        Summarize information of mibig members in a model
        """
        mibig_members = list(set(q.bgc_id) & set(mibig_bigfam.id))
        mibig_members = [mibig_bigfam.loc[i, "name"] for i in mibig_members]
        return mibig_members

    # initiate connection
    logging.info(f"Initiating SQL connection to {database}...")
    conn = sqlite3.connect(database)

    # load main tables
    logging.info("Reading gcf_membership table...")
    df_bigfam = pd.read_sql_query("select * from gcf_membership;", conn)
    logging.info("Reading bgc table...")
    df_bgc_bigfam = pd.read_sql_query("select * from bgc;", conn)

    # Load BiG-FAM dataset
    mibig_bigfam = df_bgc_bigfam[df_bgc_bigfam.loc[:, "type"] == "mibig"].set_index(
        "id", drop=False
    )

    # Filter for hit queries
    # df_bigfam_query = df_bigfam[df_bigfam.loc[:, "gcf_id"].isin(gcf_query)]

    # return information of each model
    logging.info(f"Summarizing information for {len(gcf_query)} GCF models...")
    summary = {}
    with alive_bar(len(gcf_query)) as bar:
        for g in gcf_query:
            values = {}

            q = df_bigfam[df_bigfam.loc[:, "gcf_id"] == g]
            q = q[q.loc[:, "rank"] == 0]  # select only first hits

            # get core member info
            q_core = q[q.loc[:, "membership_value"] <= 900]
            q_core_mibig = grab_mibig_members(q_core, mibig_bigfam)

            # get putative member info
            q_putative = q[q.loc[:, "membership_value"] > 900]
            q_putative_mibig = grab_mibig_members(q_putative, mibig_bigfam)

            values["core_member"] = len(q_core)
            values["putative_member"] = len(q_putative)
            values["core_member_mibig"] = q_core_mibig
            values["putative_member_mibig"] = q_putative_mibig
            values["core_member_mibig_count"] = len(q_core_mibig)
            values["core_member_mibig_bool"] = len(q_core_mibig) > 0
            values["putative_member_mibig_count"] = len(q_putative_mibig)
            values["putative_member_mibig_bool"] = len(q_putative_mibig) > 0
            values[
                "link to BiG-FAM"
            ] = f"https://bigfam.bioinformatics.nl/run/6/gcf/{g}"
            summary[str(g)] = values

            bar()

    return summary


def summarize_bigslice_query(
    bigslice_query_path,
    output_path,
    database_path="resources/bigslice/full_run_result/result/data.db",
    cutoff=900,
):
    """
    Summarize bigslice query result.

    Input:
    - bigslice_query_path

    Output:
    - A folder to store two output files:
        1. query_network.csv to build Cytoscape Network
        2. JSON file summarizing GCF model hits
    """
    bigslice_query = Path(bigslice_query_path)
    # database = Path(database_path)
    output = Path(output_path)
    output.mkdir(parents=True, exist_ok=True)

    df_gcf_membership = pd.read_csv(bigslice_query / "gcf_membership.csv")
    df_bgc = pd.read_csv(bigslice_query / "bgc.csv")

    # Create edge table network for cytoscape enriched with ids for mapping metadata (bgc_name, genome_id)
    logging.info("Generating network table of BiG-SLICE hits...")
    data = gcf_hits(df_gcf_membership, cutoff)
    bgc_info = df_bgc.set_index("id", drop=True)
    bgc_info["bgc_id"] = [str(i).replace(".gbk", "") for i in bgc_info["orig_filename"]]
    for i in data.index:
        bgc_id = data.loc[i, "bgc_id"]
        data.loc[i, "bgc_id"] = str(bgc_info.loc[bgc_id, "bgc_id"])
    data.to_csv(output / "query_network.csv", index=False)

    # Summarizing GCF model hits
    logging.info("Summarizing GCF model hits...")
    gcf_query = list(data.gcf_id.unique())
    gcf_summary = grab_gcf_model_summary(gcf_query, database_path)

    # outputting as json
    with open(output / "gcf_summary.json", "w") as outfile:
        json.dump(gcf_summary, outfile, indent=4)

    # outputting as csv table
    as_table = pd.DataFrame.from_dict(gcf_summary).T
    as_table.index.name = "gcf_id"
    as_table.to_csv(output / "gcf_summary.csv")

    logging.info("Job done")
    return


if __name__ == "__main__":
    summarize_bigslice_query(
        sys.argv[1], sys.argv[2], database_path=sys.argv[3], cutoff=float(sys.argv[4])
    )
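
As wired up in the BiG-SLICE query summary rule (shown later), the script receives the query result folder, an output folder, the BiG-FAM database path, and the distance cutoff. An illustrative invocation using the default database location and cutoff defined above, with placeholder folders:

python workflow/bgcflow/bgcflow/data/summarize_bigslice_query.py \
    <bigslice_query_folder> \
    <output_folder> \
    resources/bigslice/full_run_result/result/data.db \
    900

The query folder is expected to contain the gcf_membership.csv and bgc.csv tables produced by the bigslice --query run.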
import sys

from Bio import SeqIO


def update_gbk_automlst(input_gbk, out_gbk, genome_id):
    """
    Update organism field with accession ID for autoMLST run
    """
    new_seq_records = []

    with open(input_gbk, "r") as input_handle:
        for seq_record in SeqIO.parse(input_handle, "genbank"):
            seq_record.annotations["organism"] = genome_id
            new_seq_records.append(seq_record)

    SeqIO.write(new_seq_records, out_gbk, "genbank")

    return None


if __name__ == "__main__":
    update_gbk_automlst(sys.argv[1], sys.argv[2], sys.argv[3])
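
Within the workflow, the autoMLST preparation rule (see the automlst wrapper snippets below) calls this script once per genome with the input GenBank file, the output GenBank file, and the genome ID to write into the organism field. A minimal, hypothetical invocation with placeholder paths:

python workflow/bgcflow/bgcflow/features/prep_automlst.py \
    <input_genbank>.gbk \
    <output_genbank>.gbk \
    <genome_id>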
shell:
    """
    download-antismash-databases --database-dir resources/antismash_db 2>> {log}
    antismash --version >> {log}
    antismash --check-prereqs >> {log}
    """
shell:
    """
    antismash --genefinding-tool {params.genefinding} --output-dir {params.folder} --cb-general --cb-subclusters --cb-knownclusters -c {threads} {input.gbk} --logfile {log} 2>> {log}
    """
shell:
    """
    download-antismash-databases --database-dir {output} &>> {log}
    antismash --version >> {log}
    antismash --database {output} --prepare-data &>> {log}
    """
shell:
    """
    set +e

    # Find the latest existing JSON output for this strain
    latest_version=$(ls -d data/interim/antismash/*/{wildcards.strains}/{wildcards.strains}.json | grep {wildcards.strains} | sort -r | head -n 1 | cut -d '/' -f 4) 2>> {log}

    if [ -n "$latest_version" ]; then
        # Use existing JSON result as starting point
        old_json="data/interim/antismash/$latest_version/{wildcards.strains}/{wildcards.strains}.json"
        echo "Using existing JSON from $old_json as starting point..." >> {log}
        antismash_input="--reuse-result $old_json"
    else
        # No existing JSON result found, use genbank input
        echo "No existing JSON result found, starting AntiSMASH from scratch..." >> {log}
        antismash_input="{input.gbk}"
    fi

    # Run AntiSMASH
    antismash --genefinding-tool {params.genefinding} --output-dir {params.folder} \
        --database {input.resources} \
        --cb-general --cb-subclusters --cb-knownclusters -c {threads} $antismash_input --logfile {log} 2>> {log}

    # Check if the run failed due to changed detection results
    if grep -q "ValueError: Detection results have changed. No results can be reused" {log}; then
        # Use genbank input instead
        echo "Previous JSON result is invalid, starting AntiSMASH from scratch..." >> {log}
        antismash --genefinding-tool {params.genefinding} --output-dir {params.folder} \
            --database {input.resources} \
            --cb-general --cb-subclusters --cb-knownclusters -c {threads} {input.gbk} --logfile {log} 2>> {log}
    fi
    """
shell:
    """
    base_dir=$PWD
    mkdir {output.dir}
    (cd {output.dir} && for item in $(ls $base_dir/{input.dir}); do ln -s $base_dir/{input.dir}/$item $(basename $item); done) 2>> {log}
    """
shell:
    """
    mkdir -p data/interim/arts/antismash-{wildcards.version}/{wildcards.strains}
    python {params.resources}/artspipeline1.py {input.antismash} {params.ref} -rd data/interim/arts/antismash-{wildcards.version}/{wildcards.strains} -cpu {threads} -opt kres,phyl,duf -khmms {params.khmms} 2>> {log}
    """
SnakeMake From line 15 of rules/arts.smk
shell:
    """
    python workflow/bgcflow/bgcflow/data/arts_extract.py {input.arts}/tables/bgctable.tsv {input.json} {output.json} {wildcards.strains} 2>> {log}
    """
SnakeMake From line 31 of rules/arts.smk
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}/{wildcards.version}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_arts.txt"
    echo '{input.json}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/arts_gather.py $INPUT_JSON {input.changelog} {output.table} 2>> {log}
    rm $INPUT_JSON
    """
SnakeMake From line 50 of rules/arts.smk
shell:
    """
    set -e
    mkdir -p resources 2>> {log}
    wget {params.source}/archive/refs/tags/v{params.version}.zip -O resources/automlst-simplified-wrapper-v{params.version}.zip 2>> {log}
    (cd resources && unzip -o automlst-simplified-wrapper-v{params.version}.zip && rm automlst-simplified-wrapper-v{params.version}.zip) &>> {log}
    (cd resources/automlst-simplified-wrapper-{params.version} && unzip -o reducedcore.zip) &>> {log}
    cp -r resources/automlst-simplified-wrapper-{params.version}/* {output.folder}/. 2>> {log}
    rm -rf resources/automlst-simplified-wrapper-{params.version} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/features/prep_automlst.py {input.gbk} {output.auto_gbk} {wildcards.strains} 2>> {log}
    """
shell:
    """
    mkdir -p "data/interim/automlst_wrapper/{wildcards.name}/singles" 2>> {log}
    python resources/automlst-simplified-wrapper-main/simplified_wrapper.py data/interim/automlst_wrapper/{wildcards.name} {threads} &>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/make_phylo_tree.py {params.automlst_interim} {output.automlst_processed} {params.prokka_interim} {params.gtdb_interim} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/get_bgc_counts.py {input.antismash} {wildcards.strains} {output.bgc_count} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/get_antismash_overview.py {input.antismash} {output.bgc_table} {wildcards.strains} 2>> {log}
    """
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}/{wildcards.version}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_regions_antismash.txt"
    echo '{input.bgc_overview}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/get_antismash_overview_gather.py \
        $INPUT_JSON {input.mapping_dir} {output.df_bgc} 2>> {log}
    rm $INPUT_JSON
    """
shell:
    """
    mkdir -p {output.mapping_dir} 2>> {log}
    cp {input.mapping_dir}/*/*-change_log.json {output.mapping_dir}/. 2>> {log}
    """
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}/{wildcards.version}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_bgc_counts.txt"
    echo '{input.bgc_count}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/make_genome_dataset.py $INPUT_JSON '{params.df_samples}' {output.df_antismash_summary} 2>> {log}
    rm $INPUT_JSON
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/get_dependencies.py {output} 2> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/get_project_metadata.py {wildcards.name} {output} {params.bgcflow_version} 2> {log}
    """
shell:
    """
    echo "Preparing BGCs for {wildcards.name} downstream analysis..." 2>> {log.general}
    #mkdir -p {output.outdir} 2>> {log.general}
    # Generate symlink for each regions in genomes in dataset
    for i in $(dirname {input.gbk})
    do
        echo Processing $i 2>> {log.symlink}
        python workflow/bgcflow/bgcflow/data/bgc_downstream_prep.py $i {output.outdir} 2>> {log.symlink}
    done
    # generate taxonomic information for dataset
    python workflow/bgcflow/bgcflow/data/bigslice_prep.py {input.table} {output.taxonomy} 2>> {log.taxonomy}
    # append new dataset information
    ## check if previous dataset exists
    if [[ -s {params.dataset} ]]
    then
        echo "Previous dataset detected, appending dataset information for {wildcards.name}..."
        sed -i 'a {wildcards.name}_antismash_{wildcards.version}\t{wildcards.name}_antismash_{wildcards.version}\ttaxonomy/taxonomy_{wildcards.name}_antismash_{wildcards.version}.tsv\t{wildcards.name}' {params.dataset} 2>> {log.general}
    else
        echo "No previous dataset detected, generating dataset information for {wildcards.name}..."
        echo -e '# Dataset name\tPath to folder\tPath to taxonomy\tDescription' > {params.dataset} 2>> {log.general}
        sed -i 'a {wildcards.name}_antismash_{wildcards.version}\t{wildcards.name}_antismash_{wildcards.version}\ttaxonomy/taxonomy_{wildcards.name}_antismash_{wildcards.version}.tsv\t{wildcards.name}' {params.dataset} 2>> {log.general}
    fi
    # generate mapping for visualization
    python workflow/bgcflow/bgcflow/data/get_bigscape_mapping.py {output.outdir} {output.bgc_mapping} 2>> {log.general}
    """
SnakeMake From line 19 of rules/bgc.smk
shell:
    """
    (cd resources && wget https://github.com/medema-group/BiG-SCAPE/archive/refs/tags/v{params.release}.zip) &>> {log}
    (cd resources && unzip -o v{params.release}.zip && mv BiG-SCAPE-{params.release}/ BiG-SCAPE/ && rm v{params.release}.zip) &>> {log}
    (cd resources/BiG-SCAPE && wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.hmm.gz && gunzip Pfam-A.hmm.gz) &>> {log}
    (cd resources/BiG-SCAPE && hmmpress Pfam-A.hmm) &>> {log}
    """
shell:
    """
    (cd resources && wget  https://dl.secondarymetabolites.org/mibig/mibig_json_{params.mibig_version}.tar.gz) &>> {log}
    (cd resources && tar -xvf mibig_json_{params.mibig_version}.tar.gz && mkdir -p mibig && mv mibig_json_{params.mibig_version}/ mibig/json && rm mibig_json_{params.mibig_version}.tar.gz) &>> {log}
    python workflow/bgcflow/bgcflow/data/get_mibig_data.py {output.mibig_json_folder} {output.mibig_bgc_table} 2>> {log}
    """
shell:
    """
    python {input.bigscape}/bigscape.py -i {input.antismash_dir} -o {params.bigscape_dir} -c {threads} --cutoff 0.3 0.4 0.5 --include_singletons --label {params.label} --hybrids-off --mibig --verbose &>> {log}
    """
shell:
    """
    python {input.bigscape}/bigscape.py -i {input.antismash_dir} -o {params.bigscape_dir} -c {threads} --cutoff 0.3 0.4 0.5 --include_singletons --label {params.label} --hybrids-off --verbose --no_classify --mix &>> {log}
    """
shell:
    """
    topdir=$PWD
    (cd data/interim/bigscape && zip -r $topdir/{output.zip} no_mibig_{wildcards.name}_antismash_{wildcards.version} \
        -x {wildcards.name}_antismash_{wildcards.version}/cache/**\* &>> $topdir/{log})
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/make_bigscape_to_cytoscape.py data/interim/bigscape/{wildcards.name}_antismash_{wildcards.version}/ {input.as_dir} {input.df_genomes_path} {input.mibig_bgc_table} {output.output_dir} 2>> {log}
    cp {input.bgc_mapping} {output.bgc_mapping}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/bigscape_copy.py {input.index} data/processed/{wildcards.name}/bigscape/result_as{wildcards.version} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/bigslice_prep_single_project.py {input.dir} {input.taxonomy} {params.project_name} {output.folder} 2>> {log}
    """
shell:
    """
    bigslice -i {input.dir} {output.folder} --threshold {params.threshold} -t {threads} &>> {log}
    """
shell:
    """
    (cd resources/bigslice && wget -c -nc http://bioinformatics.nl/~kauts001/ltr/bigslice/paper_data/data/full_run_result.zip && unzip full_run_result.zip) &>> {log}
    rm resources/bigslice/full_run_result.zip &>> {log}
    """
shell:
    """
    TIMESTAMP=$(date --iso-8601=hours)
    bigslice --query {input.tmp_dir} --n_ranks {params.n_ranks} {input.bigslice_dir} -t {threads} --query_name {params.query_name}_$TIMESTAMP --run_id {params.run_id} &>> {log}
    python workflow/bgcflow/bgcflow/data/get_bigslice_query_result.py {params.query_name} {output.folder} {input.bigslice_dir} &>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/summarize_bigslice_query.py {input.query_dir} {output.folder} {params.bigfam_db_path} {params.cutoff} 2>> {log}
    """
shell:
    """
    cblaster --version >> {output.version}
    diamond --version >> {output.version}
    cat {output.version} >> {log}
    cblaster config --email [email protected] 2>> {log}
    cblaster makedb --cpus {threads} -b {params.batch_size} -n {params.db_prefix} {input.gbk} 2>> {log}
    cp -r {output.folder_interim} {output.folder_processed} 2>> {log}
    """
shell:
    """
    cblaster config --email [email protected]
    cblaster makedb --cpus {threads} -b {params.batch_size} -n {params.db_prefix} {params.antismash_dir} 2>> {log}
    cp -r {output.folder_interim} {output.folder_processed} 2>> {log}
    """
shell:
    """
    (cd resources && wget -O checkm_data_2015_01_16.tar.gz https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz) &>> {log}
    (cd resources && mkdir -p checkm && tar -xvf checkm_data_2015_01_16.tar.gz -C checkm && rm checkm_data_2015_01_16.tar.gz) &>> {log}
    checkm data setRoot resources/checkm &>> {log}
    """
shell:
    """
    mkdir -p {output.fna}
    for f in {input.fna}; do cp $f {output.fna}/.; done
    checkm lineage_wf -t {threads} --reduced_tree -x fna {output.fna} {output.checkm_dir} &>> {log}
    """
shell:
    """
    mkdir -p {params.checkm_json}
    python workflow/bgcflow/bgcflow/data/get_checkm_data.py {input.stat} {params.checkm_json} {output.stat_processed} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/database/csv_to_parquet.py data/processed/{wildcards.name} 2>> {log}
    touch {output.parquet}
    """
shell:
    """
    git clone https://bitbucket.org/kaistsystemsbiology/deeptfactor.git {output.folder} 2>> {log}
    """
shell:
    """
    workdir=$PWD
    mkdir -p data/interim/deeptfactor/{wildcards.strains} 2>> {log}
    (cd {input.resource} && python tf_running.py \
        -i $workdir/{input.faa} -o $workdir/{params.outdir} \
        -g cpu -cpu {threads}) 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/deeptfactor_scatter.py {input.deeptfactor} {output.deeptfactor_json} 2>> {log}
    """
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_deeptfactor.txt"
    echo '{input.deeptfactor_json}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/deeptfactor_gather.py $INPUT_JSON {output.df_deeptfactor} 2>> {log}
    rm $INPUT_JSON
    """
shell:
    """
    cat {input.faa} > {output.faa_compiled} 2>> {log}
    diamond makedb --in {output.faa_compiled} -d {params.diamond} -p {threads} 2>> {log}
    cp {output.diamond_interim} {output.diamond_processed} 2>> {log}
    """
shell:
    """
    download_eggnog_data.py --data_dir {output.eggnog_db} -y &>> {log}
    create_dbs.py -m diamond --dbname bacteria --taxa Bacteria --data_dir {output.eggnog_db} -y &>> {log}
    """
shell:
    """
    mkdir -p {output.eggnog_dir}
    emapper.py -i {input.faa} --decorate_gff "yes" --excel --cpu {threads} -o {wildcards.strains} --output_dir {output.eggnog_dir} --data_dir {input.eggnog_db} &>> {log}
    """
shell:
    """
    for fna in {input.fna}
    do
        echo $fna >> {output.fastani_infile}
    done
    fastANI --ql {output.fastani_infile} --rl {output.fastani_infile} -t {threads} --matrix -o {output.fastani_out} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/convert_triangular_matrix.py {input.fastani_matrix} {output.df_fastani} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/gtdb_prep.py {wildcards.strains} {output.gtdb_json} '{params.samples_path}' '{params.gtdb_paths}' {params.version} 2> {log}
    """
SnakeMake From line 24 of rules/gtdb.smk
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_gtdb.txt"
    echo '{input.json_list}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/fix_gtdb_taxonomy.py $INPUT_JSON {output.meta} 2> {log}
    rm $INPUT_JSON
    """
SnakeMake From line 47 of rules/gtdb.smk
shell:
    """
    (cd resources && wget https://data.gtdb.ecogenomic.org/releases/release{params.release_major}/{params.release_major}.{params.release_minor}/auxillary_files/gtdbtk_{params.release_version}_data.tar.gz -nc) 2>> {log}
    (cd resources && mkdir -p gtdbtk && tar -xvzf gtdbtk_{params.release_version}_data.tar.gz -C "gtdbtk" --strip 1 && rm gtdbtk_{params.release_version}_data.tar.gz) &>> {log}
    """
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}"
    mkdir -p $TMPDIR
    INPUT_FNA="$TMPDIR/df_fna_gtdbtk.txt"
    INPUT_JSON="$TMPDIR/df_json_gtdbtk.txt"
    echo '{input.fna}' > $INPUT_FNA
    echo '{input.json_list}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/gtdbtk_prep.py $INPUT_FNA $INPUT_JSON {output.fnadir} 2>> {log}
    rm $INPUT_FNA
    rm $INPUT_JSON
    """
shell:
    """
    mkdir -p {output.tmpdir}
    gtdbtk classify_wf --genome_dir {input.fnadir} --out_dir {output.gtdbtk_dir} --cpus {threads} --pplacer_cpus 1 --tmpdir {output.tmpdir} {params.ani_screen} &>> {log}
    cp {output.summary_interim} {output.summary_processed}
    """
shell:
    """
    for fna in {input.fna}
    do
        echo $fna >> {output.mash_infile}
    done
    (mash triangle -p {threads} -l {output.mash_infile} >> {output.triangle_dist}) 2>> {log}
    """
SnakeMake From line 12 of rules/mash.smk
shell:
    """
    python workflow/bgcflow/bgcflow/data/convert_triangular_matrix.py {input.mash_matrix} {output.df_mash} 2>> {log}
    """
SnakeMake From line 31 of rules/mash.smk
shell:
    """
    mlst --csv {input.fna} > {output.csv} 2>> {log}
    """
shell:
    """
    if [[ {wildcards.ncbi} == GCF* ]]
    then
        source="refseq"
    elif [[ {wildcards.ncbi} == GCA* ]]
    then
        source="genbank"
    else
        echo "accession must start with GCA or GCF" >> {log}
    fi
    ncbi-genome-download -s $source -F fasta,assembly-report -A {wildcards.ncbi} -o data/raw/ncbi/download -P -N --verbose bacteria 2>> {log}
    gunzip -c data/raw/ncbi/download/$source/bacteria/{wildcards.ncbi}/*.fna.gz > {output.fna}
    cp data/raw/ncbi/download/$source/bacteria/{wildcards.ncbi}/*report.txt {output.assembly_report}
    rm -rf data/raw/ncbi/download/$source/bacteria/{wildcards.ncbi}
    python workflow/bgcflow/bgcflow/data/get_assembly_information.py {output.assembly_report} {output.json_report} {wildcards.ncbi} 2>> {log}
    """
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_ncbi_meta_input.txt"
    echo '{input.all_json}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/extract_ncbi_information.py \
        $INPUT_JSON {output.ncbi_meta_path} 2>> {log}
    rm $INPUT_JSON
    """
SnakeMake From line 48 of rules/ncbi.smk
shell:
    """
    wget ftp://ftp.patricbrc.org/RELEASE_NOTES/genome_summary -O {output.patric_genome_summary} 2>> {log}
    wget ftp://ftp.patricbrc.org/RELEASE_NOTES/genome_metadata -O {output.patric_genome_metadata} 2>> {log}
    """
SnakeMake From line 67 of rules/ncbi.smk
shell:
    """
    python workflow/bgcflow/bgcflow/data/extract_patric_meta.py \
        {input.ncbi_meta_path} {input.patric_genome_summary} {input.patric_genome_metadata} {output.patric_meta_path} 2>> {log}
    """
SnakeMake From line 89 of rules/ncbi.smk
shell:
    """
    (cd data/interim/fasta && wget -qN "ftp://ftp.patricbrc.org/genomes/{wildcards.patric}/{wildcards.patric}.fna" 2> ../../../{log})
    """
shell:
    """
    ln -s $PWD/resources/RNAmmer/rnammer $CONDA_PREFIX/bin/rnammer 2>> {log}
    rnammer -S bac -m lsu,ssu,tsu -gff - resources/RNAmmer/example/ecoli.fsa >> {output}
    """
shell:
    """
    if [[ {input} == *.fna || {input} == *.fasta || {input} == *.fa ]]
    then
        cp {input} {output} 2>> {log}
    else
        echo "ERROR: Wrong Extension:" {input} >> {log}
        exit 1
    fi
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/make_prokka_db.py {wildcards.prokka_db} \
        {input.table} {output.refgbff} 2>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/get_organism_info.py {wildcards.strains} \
        "{params.samples_path}" data/interim/assembly_report/ data/interim/prokka/ 2>> {log}
    """
shell:
    """
    prokka --outdir data/interim/prokka/{wildcards.strains} --force \
        {params.refgbff} --prefix {wildcards.strains} --genus "`cut -d "," -f 1 {input.org_info}`" \
        --species "`cut -d "," -f 2 {input.org_info}`" --strain "`cut -d "," -f 3 {input.org_info}`" \
        --cdsrnaolap --cpus {threads} {params.rna_detection} --increment {params.increment} \
        {params.use_pfam} --evalue {params.evalue} {input.fna} &> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/format_genbank_meta.py {input.gbk_prokka} \
        {params.version} {input.gtdb_json} {wildcards.strains} {output.gbk_processed} 2> {log}
    """
SnakeMake From line 125 of rules/prokka.smk
shell:
    """
    cp {input.gbk} {output.gbk} 2>> {log}
    cp {input.summary} {output.summary} 2>> {log}
    cp {input.tsv} {output.tsv} 2>> {log}
    """
SnakeMake From line 144 of rules/prokka.smk
shell:
    """
    refseq_masher matches --output-type {params.output_type} --top-n-results \
        {params.top_n_results} --output {output.masher_csv} {input.fasta} 2>> {log}
    """
shell:
    """
    roary -p {threads} -f {output.roary_dir} -i {params.i} -g {params.g} -e -n -r -v {input.gff} &>> {log}
    """
shell:
    """
    mkdir -p {output.eggnog_dir}
    emapper.py -i {params.faa} --translate --itype "CDS" --excel --cpu {threads} -o {wildcards.name} --output_dir {output.eggnog_dir} --data_dir {input.eggnog_db} &>> {log}
    """
shell:
    """
    (cd {input.resource} && python tf_running.py \
        -i {params.faa} -o {params.outdir} \
        -g cpu -cpu {threads}) 2>> {log}
    """
shell:
    """
    diamond makedb --in {params.faa} -d {output.diamond_interim} -p {threads} &>> {log}
    cp {output.diamond_interim} {output.diamond_processed} &>> {log}
    """
shell:
    """
    python workflow/bgcflow/bgcflow/data/make_pangenome_dataset.py {input.roary_interim_dir} {output.roary_processed_dir} {input.automlst_processed_dir} 2>> {log}
    """
shell:
    """
    seqfu stats {input.fna} --json -b --gc --precision 3 > {output.json} 2>> {log}
    """
shell:
    """
    TMPDIR="data/interim/tmp/{wildcards.name}"
    mkdir -p $TMPDIR
    INPUT_JSON="$TMPDIR/df_seqfu.txt"
    echo '{input.json}' > $INPUT_JSON
    python workflow/bgcflow/bgcflow/data/make_seqfu_table.py $INPUT_JSON {output.all_csv} &>> {log}
    rm $INPUT_JSON
    """
Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/NBChub/bgcflow
Name: bgcflow
Version: v0.7.4
Copyright: Public Domain
License: MIT License