Snakemake workflow: megadudes-evaluation


This is the template for a new Snakemake workflow. Replace this text with a comprehensive description covering the purpose and domain. Insert your code into the respective folders, i.e., scripts, rules, and envs. Define the entry point of the workflow in the Snakefile and the main configuration in the config.yaml file. A sketch of the expected layout is shown below.
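For orientation, a sketch of the layout this template expects (the config/ files are the ones referenced in Step 4; the comments are illustrative):

├── Snakefile          # entry point of the workflow
├── config/
│   ├── config.yaml    # main workflow configuration
│   └── samples.tsv    # sample sheet
├── rules/             # additional rule definitions
├── scripts/           # helper scripts (see Code Snippets below)
└── envs/              # conda environment definitions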

Authors

  • Henning Schiebenhoefer (@rababerladuseladim)

Usage

If you use this workflow in a paper, give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).

Step 1: Obtain a copy of this workflow

  1. Create a new GitHub repository, using this workflow as a template.

  2. Clone the newly created repository to your local system, into the place where you want to perform the data analysis.

Step 2: Install Snakemake

Install Snakemake using conda:

conda env create -f env.yaml

Set up the pre-commit hooks by running

pre-commit install

For installation details, see the instructions in the Snakemake documentation.

Step 3: Download Resources

Download resources by executing:

./download_resources.sh

Step 4: Configure workflow

Configure the workflow according to your needs by editing the files in the config/ folder: adjust config.yaml to configure the workflow execution, and samples.tsv to specify your sample setup. A sketch of how these files are typically consumed is shown below.
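A minimal sketch of how a Snakefile commonly consumes these two files; the column name "sample" is an assumption, not necessarily this workflow's exact code:

configfile: "config/config.yaml"

import pandas as pd

# sample sheet: one row per sample; the index supplies the {sample} wildcard values
samples = pd.read_table("config/samples.tsv").set_index("sample", drop=False)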

Step 5: Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

or

snakemake --use-conda --drmaa --jobs 100

If you want to fix not only the software stack but also the underlying OS, use

snakemake --use-conda --use-singularity

in combination with any of the modes above. See the Snakemake documentation for further details.
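For example, a local run that fixes both the software stack and the OS could look like:

snakemake --use-conda --use-singularity --cores $N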

Step 6: Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here.

Step 7: Commit changes

Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:

git commit -a
git push

Updating

  • update the local environment: conda update -n snakemake --all

  • update env file: conda env export -n snakemake > env.yaml

  • update lock-file: conda list --explicit -n snakemake > spec-file.txt

  • update pre-commit hooks: pre-commit autoupdate

Testing

Test cases are in the subfolder test. They are automatically executed via continuous integration with GitHub Actions.
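The helper scripts below also contain pytest-style test functions (e.g. test_qc_plots, test_sample_peptides). Assuming pytest is available in the environment, the tests can be run locally via:

pytest test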

Code Snippets

Selected rule bodies and helper scripts from the workflow:

Convert the sampled-peptide table to FASTA (skip the header row, keep the first column, and emit each peptide as its own header and sequence):
shell:
    "cat {input} | tail -n +2 | cut -f1 | sed 's/.*/>&\\n&/' > {output} 2>{log}"

Plot DIAMOND QC:
script:
    "../scripts/plot_diamond_qc.py"

Plot megadudes QC:
script:
    "../scripts/plot_megadudes_qc.py"

Build the DIAMOND database from the gzipped FASTA input:
shell:
    "zcat {input} | diamond makedb --threads {threads} --db {output} > {log} 2>&1"

Build the DUDes database from UniProt FASTAs, the idmapping file, and NCBI taxonomy nodes/names:
shell:
    """
    dudesdb -m up \
    -f {input.uniprot_fastas} \
    -g {input.idmap} \
    -n {input.ncbi_nodes} \
    -a {input.ncbi_names} \
    -o {params.db_base_name} \
    -t {threads} > {log.stdout} 2> {log.stderr}
    """

Run DUDes/megadudes on the DIAMOND result:
shell:
    """
    dudes \
    -c {input.diamond_result} \
    -d {input.dudes_db} \
    -t {threads} \
    -o {params.result_wo_ext}\
    --debug_plots_dir {output.plots} \
    --debug > {log.stdout} 2> {log.stderr}
    """

Sample taxa from the UniProt species list:
script:
    "../scripts/sample_taxons.py"

Map sampled TaxIDs to UniProt accessions:
script:
    "../scripts/map_taxids_to_uniprot_accessions.py"

Sample peptides for the mapped accessions:
script:
    "../scripts/sample_peptides.py"

Convert the sampled peptides to FASTA (each peptide becomes its own header and sequence):
shell:
    "cat {input} | sed 's/.*/>&\\n&/' > {output} 2>{log}"

scripts/map_taxids_to_uniprot_accessions.py:
import json

import pandas as pd


def map_taxids_to_uniprot_accessions(idmapping_selected_file, taxid_file, output_json, log_handle):
    # read the requested NCBI TaxIDs, one per line, skipping blank lines
    taxids = []
    with open(taxid_file) as infile:
        for line in infile:
            content = line.strip()
            if content:
                taxids.append(int(content))

    # idmapping_selected.tab: column 0 holds the UniProtKB accession and
    # column 12 the NCBI taxon; the converter maps taxids outside the
    # requested set to -1 so they can be dropped below
    df = pd.read_csv(
        idmapping_selected_file,
        compression="gzip" if idmapping_selected_file.endswith(".gz") else None,
        sep="\t",
        header=None,
        usecols=[0, 12],
        names=["acc", "taxid"],
        converters={12: lambda x: int(x) if int(x) in taxids else -1},
    )
    # keep only accessions belonging to one of the requested taxids
    acc2taxid = df[df["taxid"] != -1]

    tax2acc_map = {}
    missing_taxids = []
    for taxid in taxids:
        if taxid in acc2taxid["taxid"].values:
            tax2acc_map[taxid] = acc2taxid[acc2taxid["taxid"] == taxid]["acc"].to_list()
        else:
            missing_taxids.append(taxid)
    if missing_taxids:
        print(f"Taxids missing in idmapping file: {missing_taxids}", file=log_handle)
    with open(output_json, "w") as outfile:
        json.dump(tax2acc_map, outfile)
        outfile.write("\n")


def main():
    with open(snakemake.log[0], "w") as log_handle:
        map_taxids_to_uniprot_accessions(
            idmapping_selected_file=snakemake.input["idmap"],
            taxid_file=snakemake.input["taxids"],
            output_json=snakemake.output[0],
            log_handle=log_handle
        )


if "snakemake" in globals():
    main()
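
A minimal standalone invocation, outside Snakemake (all paths are illustrative placeholders):

import sys

map_taxids_to_uniprot_accessions(
    idmapping_selected_file="idmapping_selected.tab.gz",  # gzipped UniProt idmapping dump
    taxid_file="taxids.txt",                              # one NCBI TaxID per line
    output_json="tax2accessions.json",
    log_handle=sys.stderr,
)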

scripts/plot_diamond_qc.py:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [16 * 0.6, 9 * 0.6]
# context keywords: notebook, talk, paper, poster
sns.set_theme(context="talk", rc={"axes.grid": False})


def qc_plots(diamond_hits_file, input_fasta_file, output):
    # DIAMOND hits table: column 0 is assumed to be the query id (here: the
    # peptide sequence itself) and column 4 the e-value of the hit
    df_hits = pd.read_table(
        diamond_hits_file,
        usecols=[0, 4],
        header=None,
        dtype={"query": str, "evalue": float},
        names=["query", "evalue"],
    )
    # queries are the peptide sequences, so their string length is the peptide length
    df_hits["length"] = df_hits["query"].str.len()

    input_peptide_lengths = []
    with open(input_fasta_file) as fh:
        for line in fh:
            if line.startswith(">") or len(line.strip()) == 0:
                continue
            else:
                input_peptide_lengths.append(len(line.strip()))

    df_lengths = pd.concat(
        [
            pd.DataFrame({"length": df_hits.drop_duplicates(subset="query").loc[:, "length"],
                          "source": "hits"}),
            pd.DataFrame({"length": input_peptide_lengths,
                          "source": "input"}),
        ],
        ignore_index=True
    )

    # plot
    f, (ax0, ax1) = plt.subplots(2, 1, squeeze=True)
    sns.histplot(df_lengths, x="length", hue="source", ax=ax0)
    sns.histplot(df_hits, x="evalue", ax=ax1, log_scale=True)
    f.tight_layout()
    f.savefig(output)
    return f


def test_qc_plots(tmpdir):
    input_fasta_f = "/home/hennings/Projects/megadudes-evaluation/results/fastas/sample_kleiner_equal_protein.fasta"
    diamond_hits_f = "/home/hennings/Projects/megadudes-evaluation/results/diamond/sample_kleiner_equal_protein.tsv"
    output_plot = tmpdir / "hist_of_hits.svg"
    f = qc_plots(
        diamond_hits_file=diamond_hits_f,
        input_fasta_file=input_fasta_f,
        output=output_plot,
    )
    f.show()


if "snakemake" in globals():
    qc_plots(
        diamond_hits_file=snakemake.input[0],
        input_fasta_file=snakemake.input[1],
        output=snakemake.output[0],
    )

scripts/plot_megadudes_qc.py:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path


plt.rcParams["figure.figsize"] = [16 * 0.6, 9 * 0.6]
# context keywords: notebook, talk, paper, poster
sns.set_theme(context="talk", rc={"axes.grid": False})

TAX_LEVELS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species", "subspecies"]


def read_ground_truth_file(ground_truth_file):
    df_t_taxids = pd.read_csv(
        ground_truth_file,
        usecols=TAX_LEVELS,
        sep="\t",
        dtype={
            c: pd.Int64Dtype()  # support nan values
            for c in TAX_LEVELS
        },
    )
    return df_t_taxids


def get_value_overlap(hits, ground_truth):
    """
    Build a series of values from both input series and annotate them with TP, FP, or FN.

    :param hits: taxa predicted by a method
    :param ground_truth: taxa expected from the ground truth
    :return: series whose index is the union of values from hits and ground_truth;
        values are TP if the index occurs in both, FP if only in hits, FN if only
        in ground_truth
    """
    left = {t for t in hits if (not pd.isna(t) and (t != "-"))}
    right = {t for t in ground_truth if (not pd.isna(t) and (t != "-"))}
    taxa = left.union(right)
    dct = dict()
    for t in taxa:
        found = t in left
        expected = t in right
        if found and expected:
            dct[t] = "TP"
        elif found:
            dct[t] = "FP"
        elif expected:
            dct[t] = "FN"
        else:
            raise KeyError(f"Taxon '{t}' missing in both Series!")
    return pd.Series(dct)
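
# Illustration (hypothetical taxids):
#   get_value_overlap(hits=pd.Series([1224, 562]), ground_truth=pd.Series([562, 2]))
#   -> pd.Series({562: "TP", 1224: "FP", 2: "FN"})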


def get_unipept_hit_counts(f, ground_truth_df_tax_ids, method_name):
    df_uni = pd.read_csv(
        f,
        usecols=[f"{t}_id" for t in TAX_LEVELS],
        dtype=pd.Int64Dtype()
    )
    # drop the "_id" suffix from the column names; str.removesuffix avoids the
    # character-set semantics of str.strip("_id"), which can over-trim
    df_uni.rename(lambda x: x.removesuffix("_id"), axis="columns", inplace=True)
    df = pd.DataFrame(columns=TAX_LEVELS, index=["TP", "FP", "FN"],)
    for t in TAX_LEVELS:
        s = get_value_overlap(df_uni[t], ground_truth_df_tax_ids[t])
        df[t] = s.value_counts()
    df = df.fillna(0)
    df["eval"] = df.index
    df["method"] = method_name
    return df


def get_megadudes_hit_counts(f, ground_truth_df_tax_ids, method_name):
    df_mega = pd.read_csv(f, sep="\t", skiprows=5)
    df_mega = df_mega.rename(lambda c: c.lower().strip("@"), axis=1)
    df = pd.DataFrame(columns=TAX_LEVELS, index=["TP", "FP", "FN"])
    for t in TAX_LEVELS:
        s_mega = (
            df_mega.loc[df_mega["rank"] == t, "taxpath"]
            .map(lambda x: x.split("|")[-1])
            .astype(int)
        )
        s = get_value_overlap(s_mega, ground_truth_df_tax_ids[t])
        s_classification = s.value_counts()
        for k in ["TP", "FP", "FN"]:
            if k not in s_classification:
                s_classification[k] = 0
        df[t] = s_classification
    df = df.fillna(0)
    df["eval"] = df.index
    df["method"] = method_name
    return df


def calc_eval_metrics(df_hits):
    df_hits_long = df_hits.melt(id_vars=["method", "eval"], var_name="taxon_level")
    df_hits_long["value"] = df_hits_long["value"].apply(int)
    grouped = df_hits_long.groupby(["method", "taxon_level"], sort=False)

    # sensitivity (recall): TP / (TP + FN), in percent
    df_sens_g = grouped.apply(
        lambda df: df[df["eval"] == "TP"]["value"].values[0]
        * 100
        / df[df["eval"].isin(["TP", "FN"])]["value"].sum()
    )
    # specificity: TN / (TN + FP)
    # how to access TN?
    # precision: TP / (TP + FP)
    df_precision_g = grouped.apply(
        lambda df: df[df["eval"] == "TP"]["value"].values[0]
        * 100
        / df[df["eval"].isin(["TP", "FP"])]["value"].sum()
    )
    df_f_one_g = (2 * df_precision_g * df_sens_g) / (df_precision_g + df_sens_g)
    # false discovery rate: FP / (TP + FP), in percent
    df_fdr_g = grouped.apply(
        lambda df: df[df["eval"] == "FP"]["value"].values[0]
        * 100
        / df[df["eval"].isin(["TP", "FP"])]["value"].sum()
    )
    df_eval = pd.DataFrame(
        {
            "sensitivity": df_sens_g,
            "precision": df_precision_g,
            "F1-score": df_f_one_g,
            "fdr": df_fdr_g,
        }
    )
    df_eval = df_eval.reset_index()
    return df_eval
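
# Illustrative df_eval row (numbers are hypothetical):
#   method     taxon_level  sensitivity  precision  F1-score   fdr
#   unipept-x  phylum              90.0       85.7      87.8  14.3
# with sensitivity = TP/(TP+FN)*100, precision = TP/(TP+FP)*100,
# F1 = 2*precision*sensitivity/(precision+sensitivity), fdr = FP/(TP+FP)*100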


def plot_qc(df_plt, output):
    metrics = ["sensitivity", "precision", "F1-score"]
    f, axes = plt.subplots(ncols=3, figsize=(15, 9), sharex="all", sharey="all")
    for metric, ax in zip(metrics, axes):
        sns.lineplot(
            data=df_plt,
            x="taxon_level",
            y=metric,
            hue="method",
            ax=ax,
            # legend only in first plot
            legend=bool(metric == metrics[0]),
            marker="s",
        )
        ax.set_title(metric)
        ax.set_ylim(0, 105)
        plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
        ax.xaxis.label.set_visible(False)
    axes[0].yaxis.label.set_visible(False)
    axes[-1].tick_params(labelright=True)
    f.savefig(output)
    return f


def qc_plots(ground_truth_file, unipept_file_list, megadudes_file_list, output):
    # read ground truth
    df_gt_taxids = read_ground_truth_file(ground_truth_file)
    # get unipept hits
    lst_df_uni = [
        get_unipept_hit_counts(f, df_gt_taxids, method_name=Path(f).stem)
        for f in unipept_file_list
    ]
    # get megadudes hits
    lst_df_mega = [
        get_megadudes_hit_counts(
            f, df_gt_taxids, method_name=("megadudes-" + Path(f).parts[-1])
        )
        for f in megadudes_file_list
    ]
    # build TP/FP/FN dataframe
    df_hits = pd.concat([*lst_df_uni, *lst_df_mega], ignore_index=True)
    # build evaluation dataframe, containing sensitivity, precision, f1-score, fdr
    df_eval = calc_eval_metrics(df_hits)
    # plot
    # drop subspecies, as no method has any true hits on this level
    df_plt = df_eval[df_eval["taxon_level"] != "subspecies"]
    return plot_qc(df_plt, output), df_eval


def test_qc_plots(tmpdir):
    ground_truth = "/home/hennings/Projects/megadudes-evaluation/resources/kleiner_ground_truth-equal_protein-full_taxid_lineage.csv"
    unipept_results = [
        "/home/hennings/Nextcloud/PhD/projects/20210100-taxFDR/01-researchgap/02-tax_analysis/unipept_results-Kleiner_P_msfragger.csv"
    ]
    megadudes_file_list = [
        "/home/hennings/Nextcloud/PhD/projects/20210100-taxFDR/02-megadudes/v0.8/megadudes-result.out",
        "/home/hennings/Nextcloud/PhD/projects/20210100-taxFDR/02-megadudes/v0.9/megadudes-result.out",
    ]
    output_plot = tmpdir / "qc_plot.svg"
    print(output_plot)
    f, df_eval = qc_plots(
        ground_truth_file=ground_truth,
        unipept_file_list=unipept_results,
        megadudes_file_list=megadudes_file_list,
        output=output_plot,
    )
    f.show()


if "snakemake" in globals():
    with open(snakemake.log[0], "w") as log_handle:
        LOG_HANDLE = log_handle
        qc_plots(
            ground_truth_file=snakemake.input.ground_truth,
            unipept_file_list=snakemake.input.unipept_results,
            megadudes_file_list=snakemake.input.megadudes_results,
            output=snakemake.output[0],
        )

scripts/sample_peptides.py:
import random
from itertools import groupby
from pathlib import Path
from typing import cast, Dict, Iterable
from pyteomics.parser import cleave
import requests


import json
import sys

LOG_HANDLE = sys.stderr


def sample_peptides(accessions_file, output):
    with open(accessions_file, "r") as handle:
        tax2acc = cast(Dict[str, list[str]], json.load(handle))

    # sample accessions
    random.seed(str(output))
    accessions_sample = []
    for i, (tax_id, accessions) in enumerate(tax2acc.items()):
        size = min(len(accessions), 100)
        accessions_sample.extend(random.sample(accessions, size))
        if i >= 100:
            break

    print(f"Sampled {len(accessions_sample)} accessions", file=LOG_HANDLE)
    # remember the RNG state; it is restored after the network fetch below
    rng_state = random.getstate()

    # get protein sequences for sampled accessions
    fastas = []
    chunk_size = 100
    connector = UniProtConnector()
    for chunk_start in range(0, len(accessions_sample), chunk_size):
        chunk = accessions_sample[chunk_start:chunk_start + chunk_size]
        fastas.append(connector.get_fasta(chunk))
    acc2seq = {acc: seq for f in fastas for acc, seq in convert_fasta_str_to_dict(f).items()}

    # restore the RNG state to keep the peptide sampling deterministic, even
    # though fetching the sequences takes a non-deterministic amount of
    # computation
    random.setstate(rng_state)

    # cleave proteins sequences and sample peptides
    sequences = sorted(acc2seq.values())
    peptides_sample = list()
    for seq in sequences:
        peptides = sorted(cleave_protein_sequence(seq))
        peptides_sample.extend(random.sample(list(peptides), min(3, len(peptides))))

    print(f"Sampled {len(peptides_sample)} peptides", file=LOG_HANDLE)

    # write fasta
    with open(output, "w") as handle:
        handle.write("\n".join(peptides_sample) + "\n")


def cleave_protein_sequence(sequence: str, min_length: int = 5):
    return cleave(sequence, rule="trypsin", min_length=min_length)


def get_protein_sequence(accession):
    connector = UniProtConnector()
    fasta = connector.get_fasta([accession])
    return "".join(fasta.split("\n")[1:])


class UniProtConnector:
    url = "https://rest.uniprot.org/uniprotkb/"

    def __init__(self):
        self.session = requests.Session()
        self.uniprot_version = requests.get(self.url).headers.get('X-UniProt-Release')
        print(f"Uniprot Release Number: {self.uniprot_version}", file=LOG_HANDLE)

    def get_fasta(self, uniprot_accessions: Iterable[str]):
        uniprot_accessions = ["accession%3A" + acc for acc in uniprot_accessions]
        url = self.url + f"search?query={'+OR+'.join(uniprot_accessions)}&format=fasta"
        response = self.session.get(url)
        response.raise_for_status()
        fasta = response.text
        return fasta
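
        # Illustration: for accessions ["P12345", "P99999"] the request URL is
        # https://rest.uniprot.org/uniprotkb/search?query=accession%3AP12345+OR+accession%3AP99999&format=fasta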


def convert_fasta_str_to_dict(fasta):
    protein_dict = {}
    fasta = fasta.strip()
    fasta_iter = (x[1] for x in groupby(fasta.split("\n"), lambda line: line[0] == ">"))

    for header in fasta_iter:
        # keep only the UniProt accession, between pipe chars ('|')
        acc = next(header).split("|")[1]

        # join all sequence lines into one
        seq = "".join(s.strip() for s in next(fasta_iter))

        protein_dict[acc] = seq
    return protein_dict


def test_cleave_protein_sequence():
    seq = "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE"
    peps = {
        "GIIWGEDTLMEYLENPK",
        "TGQAPGYSYTAANK",
        "TGPNLHGLFGR",
        "CSQCHTVEK",
        "ADLIAYLK",
        "MIFVGIK",
        "YIPGTK",
        "MGDVEK",
        "IFIMK",
    }
    assert cleave_protein_sequence(seq) == peps


def test_fasta_to_dict():
    fasta = """>sp|P12345|AATM_RABIT Aspartate aminotransferase, mitochondrial OS=Oryctolagus cuniculus OX=9986 GN=GOT2 PE=1 SV=2
MALLHSARVLSGVASAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAYK
>sp|P99999|CYC_HUMAN Cytochrome c OS=Homo sapiens OX=9606 GN=CYCS PE=1 SV=2
MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW
GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE"""
    assert convert_fasta_str_to_dict(fasta) == {
        "P12345": "MALLHSARVLSGVASAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAYK",
        "P99999": "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE",
    }


def test_get_protein_sequence():
    assert (
        get_protein_sequence("P99999")
        == "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE"
    )


def test_sample_peptides(tmpdir):
    sample_peptides(
        Path(__file__).parent.parent.parent / "test/unit/simulate_sample/sample_peptides/data/results/map_taxids_to_uniprot_accessions/tax2accessions.json",
        "/dev/null"
        # tmpdir / "peptides.txt"
    )
    expected_accessions = "P68254 Q9CQV8 P63101 P68510 P61982 P62259 O70456 P62260 P63102 P68511 P35213 P68255 P61983 Q9MB95 Q80HU0 P18558 P18559 P18560 P68744 Q65209 Q65210 P18557 P0C9H2 P0C9K1 P0C9F1 P0C9J9 P0C9I7 P0C9H5 P0C9F9 P0C9J5 P0C9I3 P0C9G7 P0C9F5 P0C9K3 P0C9G3 P0C9H9 P16536 P05080 P21215 Q8GBW6"
    expected_peptides = "IASALEMYATK LNNLLDPHSFDEVGAFR IVDWGDYLEVK QIFLTAIGDQAR GDSVHALCALWK GFPCK EAAENSLVAYK GDYHR YDEMVESMK DSTLIMQLLR IISSIEQK GDYHR SVTEQGAELSNEER DSTLIMQLLR GIVDQSQQAYQEAFEISK TEGAEK QQMAR DNLTLWTSDTQGDEAEAGEGGEN QTIENSQGAYQEAFDISK LGLALNFSVFYYEILNNPELACTLAK YLAEVACGDDR YDDMATCMK NVVGGR YLAEVACGDDR YLAEVATGDDK SAYQEAMDISK VETELR NLLSVAYK AVTELNEPLSNEDR QAFDDAIAELDTLNEDSYK QAFDDAIAELDTLNEDSYK NLLSVAYK GDYYR QYCLYFIIGIAYTDCFICALCK MGGGGDHQQLSIK VLDDMPLIVQNDYISK NVAIFQDYHGFPEFR CVEPGWFR GLEEVGINCLK CMYEAHFR DHPIITQNDYIVNCTVSR LLALLSILIWLSQPALNRPLSIFYMK CSYEK ELEHWCTHGK NHSPILENNYIANCSIYR LLTYGFYLVGCVLVANYVR IGSIPPER SLQTIMYLMVLLVIFFLLSQLMLYR ITIDVTPK DVLLAQSVAVEEAK IGNVVTISYNLEK MLVIFLGILGLLANQVLGLPIQAGGHLCSTDNPPQEELGYWCTYMESCK VNESMPLIIENSYLTSCEVSR WYNQCTYGEGNGHYHVMDCSNPVPHNRPHQLR WYNQCTYSEGNGHYHVMDCSNPVPHNRPHR FCWECAHGICK MLVIFLGILGLLANQVLGLPTQAGGHLR MLVIFLGILGLLANQVSSQLVGQLHPTENPSENELEYWCTYMECCQFCWDCQNGLCVNK LGNTTILENEYVHPCIVSR CMYDLDK IDGSAIYK NEYVK QTTVSNSQQAYQEAFEISK ACSLAK DSTLIMQLLR VISSIEQK IEAELQDICSDVLELLDK LAEQAER ELEAVCQDVLSLLDNYLIK VFYLK LAEQAER YDDMAAAMK NVVGAR TSADGNEK"


if "snakemake" in globals():
    with open(snakemake.log[0], "w") as log_handle:
        LOG_HANDLE = log_handle
        sample_peptides(
            accessions_file=snakemake.input["accessions"], output=snakemake.output[0]
        )

scripts/sample_taxons.py:
import random
import sys


def sample_taxons(uniprot_speclist_file, output_file, log=sys.stderr, count=10):
    """Take a random sample of taxids from the uniprot speclist file.

    Args:
        uniprot_speclist_file: path to uniprot speclist file, get from: https://www.uniprot.org/docs/speclist.txt
        output_file: path to file for writing TaxIDs to, one TaxID per line. Is also used for initializing
            random.seed.
        log: handle for writing the log
        count: number of taxa that should be sampled
    """
    with open(uniprot_speclist_file, "r") as file_handle:
        taxids = read_taxids_from_uniprot_speclist_file(file_handle)
    if count > len(taxids):
        print(
            f"Number of requested taxids ({count}) is larger than number of available taxids ({len(taxids)}).",
            file=log,
        )
        sys.exit(1)
    random.seed(output_file)
    sample_taxids = random.sample(taxids, count)
    with open(output_file, "w") as output_handle:
        for taxid in sample_taxids:
            output_handle.write(f"{taxid}\n")


def read_taxids_from_uniprot_speclist_file(file_handle):
    """Parse all TaxIDs from the "(1) Real organism codes" section of the speclist file.

    :param file_handle: open file handle to the UniProt speclist file,
        downloadable from https://www.uniprot.org/docs/speclist.txt
    :return: list of TaxIDs as integers
    """

    taxids = []
    is_headline_reached = False
    is_taxid_section = False
    for line in file_handle:
        if not is_taxid_section:
            if not is_headline_reached:
                if line.startswith("(1) Real organism codes"):
                    is_headline_reached = True
            else:
                if line.startswith("_____"):
                    is_taxid_section = True
            continue
        else:
            # stop at empty line
            if not line.strip():
                break
            line_up_to_taxid = line[:15]
            if line_up_to_taxid.strip():
                taxid = int(line_up_to_taxid.split(" ")[-1])
                taxids.append(taxid)

    return taxids
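
# Illustration (hypothetical excerpt; real format at https://www.uniprot.org/docs/speclist.txt):
#   AADNV V  648330: N=Aedes albopictus densovirus 1
# the TaxID sits within the first 15 characters, which the parser slices and splits.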


if "snakemake" in globals():
    with open(snakemake.log[0], "w") as log_handle:
        sample_taxons(snakemake.input[0], snakemake.output[0], log=log_handle)
URL: https://github.com/rababerladuseladim/megadudes-evaluation
License: MIT License