This is the template for a new Snakemake workflow. Replace this text with a comprehensive description covering the purpose and domain.
Insert your code into the respective folders, i.e. `scripts`, `rules`, and `envs`. Define the entry point of the workflow in the `Snakefile` and the main configuration in the `config.yaml` file.
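For orientation, a repository created from this template is usually laid out roughly as sketched below. This tree is only an illustration: the exact nesting of folders may differ in this repository, and it lists only the files and folders mentioned elsewhere in this README.

```
├── Snakefile               # entry point of the workflow
├── rules/                  # rule definitions included by the Snakefile
├── scripts/                # scripts called from rules
├── envs/                   # per-rule conda environment definitions
├── config/
│   ├── config.yaml         # workflow execution settings
│   └── samples.tsv         # sample setup
├── env.yaml                # conda environment providing Snakemake itself
├── download_resources.sh   # downloads required resources (Step 3)
└── test/                   # test cases run via GitHub Actions
```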
Authors
- Henning Schiebenhoefer (@rababerladuseladim)
Usage
If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).
Step 1: Obtain a copy of this workflow
- Create a new GitHub repository using this workflow as a template.
- Clone the newly created repository to your local system, into the place where you want to perform the data analysis.
Step 2: Install Snakemake
Install Snakemake using conda:
conda env create -f env.yaml
Install the pre-commit hooks by running:
pre-commit install
For installation details, see the instructions in the Snakemake documentation.
Step 3: Download Resources
Download resources by executing:
./download_resources.sh
Step 4: Configure workflow
Configure the workflow according to your needs by editing the files in the `config/` folder. Adjust `config.yaml` to configure the workflow execution, and `samples.tsv` to specify your sample setup.
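The exact columns of `samples.tsv` are defined by this workflow's rules, so the following is only a hypothetical sketch of a tab-separated sample sheet: one row per sample, with the first column naming the sample and further columns carrying whatever metadata the rules expect (all column contents here are placeholders).

```
sample	<further columns expected by the rules>
sample1	...
sample2	...
```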
Step 5: Execute workflow
Activate the conda environment:
conda activate snakemake
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using `$N` cores, or run it in a cluster environment via
snakemake --use-conda --cluster qsub --jobs 100
or
snakemake --use-conda --drmaa --jobs 100
If you want to fix not only the software stack but also the underlying OS, use
snakemake --use-conda --use-singularity
in combination with any of the modes above. See the Snakemake documentation for further details.
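For example, a local run that pins both the conda software stack and the OS layer via Singularity would be:

```
snakemake --use-conda --use-singularity --cores $N
```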
Step 6: Investigate results
After successful execution, you can create a self-contained interactive HTML report with all results via:
snakemake --report report.html
This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here.
Step 7: Commit changes
Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:
git commit -a
git push
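For example (the commit message is only a placeholder):

```
git commit -am "describe your change"
git push
```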
Updating
- Update the local environment: `conda update -n snakemake --all`
- Update the environment file: `conda env export -n snakemake > env.yaml`
- Update the lock file: `conda list --explicit -n snakemake > spec-file.txt`
- Update the pre-commit hooks: `pre-commit autoupdate`
Testing
Test cases are in the subfolder `test`. They are automatically executed via continuous integration with GitHub Actions.
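The helper scripts below also contain pytest-style `test_*` functions. Assuming pytest is installed and depending on where the scripts live in this repository, a local unit-test run might look like the following; note that some of these tests rely on hard-coded local paths and will not run everywhere.

```
pytest scripts/
```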
Code Snippets
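The snippets below are taken directly from the workflow: first the `shell` and `script` directives of the individual rules, then the Python helper scripts those rules call.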
```
shell: "cat {input} | tail -n +2 | cut -f1 | sed 's/.*/>&\\n&/' > {output} 2>{log}"
```
```
script: "../scripts/plot_diamond_qc.py"
```
```
script: "../scripts/plot_megadudes_qc.py"
```
```
shell: "zcat {input} | diamond makedb --threads {threads} --db {output} > {log} 2>&1"
```
```
shell:
    """
    dudesdb -m up \
        -f {input.uniprot_fastas} \
        -g {input.idmap} \
        -n {input.ncbi_nodes} \
        -a {input.ncbi_names} \
        -o {params.db_base_name} \
        -t {threads} > {log.stdout} 2> {log.stderr}
    """
```
```
shell:
    """
    dudes \
        -c {input.diamond_result} \
        -d {input.dudes_db} \
        -t {threads} \
        -o {params.result_wo_ext}\
        --debug_plots_dir {output.plots} \
        --debug > {log.stdout} 2> {log.stderr}
    """
```
```
script: "../scripts/sample_taxons.py"
```
```
script: "../scripts/map_taxids_to_uniprot_accessions.py"
```
```
script: "../scripts/sample_peptides.py"
```
```
shell: "cat {input} | sed 's/.*/>&\\n&/' > {output} 2>{log}"
```
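The script `map_taxids_to_uniprot_accessions.py` reads a list of TaxIDs plus UniProt's `idmapping_selected` table and writes a JSON file mapping each TaxID to its UniProt accessions; TaxIDs without any mapping are reported in the log.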
```
import json

import pandas as pd


def map_taxids_to_uniprot_accessions(idmapping_selected_file, taxid_file, output_json, log_handle):
    taxids = []
    with open(taxid_file) as infile:
        for line in infile.readlines():
            content = line.strip()
            if content:
                taxids.append(int(content))
    df = pd.read_csv(
        idmapping_selected_file,
        compression="gzip" if idmapping_selected_file.endswith(".gz") else None,
        sep="\t",
        header=None,
        usecols=[0, 12],
        names=["acc", "taxid"],
        converters={12: lambda x: int(x) if int(x) in taxids else -1}
    )
    acc2taxid = df[df["taxid"] != -1]
    tax2acc_map = {}
    missing_taxids = []
    for taxid in taxids:
        if taxid in acc2taxid["taxid"].values:
            tax2acc_map[taxid] = acc2taxid[acc2taxid["taxid"] == taxid]["acc"].to_list()
        else:
            missing_taxids.append(taxid)
    if missing_taxids:
        print(f"Taxids missing in idmapping file: {missing_taxids}", file=log_handle)
    with open(output_json, "w") as outfile:
        json.dump(tax2acc_map, outfile)
        outfile.write("\n")


def main():
    with open(snakemake.log[0], "w") as log_handle:
        map_taxids_to_uniprot_accessions(
            idmapping_selected_file=snakemake.input["idmap"],
            taxid_file=snakemake.input["taxids"],
            output_json=snakemake.output[0],
            log_handle=log_handle
        )


if "snakemake" in globals():
    main()
```
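`plot_diamond_qc.py` draws QC plots for the DIAMOND search: histograms of peptide lengths (input peptides vs. peptides with hits) and of hit e-values.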
```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams["figure.figsize"] = [16 * 0.6, 9 * 0.6]
# context keywords: notebook, talk, paper, poster
sns.set_theme(context="talk", rc={"axes.grid": False})


def qc_plots(diamond_hits_file, input_fasta_file, output):
    df_hits = pd.read_table(
        diamond_hits_file,
        usecols=[0, 4],
        header=None,
        dtype={"query": str, "evalue": float},
        names=["query", "evalue"],
    )
    df_hits["length"] = df_hits["query"].str.len()
    input_peptide_lengths = []
    with open(input_fasta_file) as fh:
        for line in fh:
            if line.startswith(">") or len(line.strip()) == 0:
                continue
            else:
                input_peptide_lengths.append(len(line.strip()))
    df_lengths = pd.concat(
        [
            pd.DataFrame({"length": df_hits.drop_duplicates(subset="query").loc[:, "length"], "source": "hits"}),
            pd.DataFrame({"length": input_peptide_lengths, "source": "input"}),
        ],
        ignore_index=True
    )
    # plot
    f, (ax0, ax1) = plt.subplots(2, 1, squeeze=True)
    sns.histplot(df_lengths, x="length", hue="source", ax=ax0)
    sns.histplot(df_hits, x="evalue", ax=ax1, log_scale=True)
    f.tight_layout()
    f.savefig(output)
    return f


def test_qc_plots(tmpdir):
    input_fasta_f = "/home/hennings/Projects/megadudes-evaluation/results/fastas/sample_kleiner_equal_protein.fasta"
    diamond_hits_f = "/home/hennings/Projects/megadudes-evaluation/results/diamond/sample_kleiner_equal_protein.tsv"
    output_plot = tmpdir / "hist_of_hits.svg"
    f = qc_plots(
        diamond_hits_file=diamond_hits_f,
        input_fasta_file=input_fasta_f,
        output=output_plot,
    )
    f.show()


if "snakemake" in globals():
    qc_plots(
        diamond_hits_file=snakemake.input[0],
        input_fasta_file=snakemake.input[1],
        output=snakemake.output[0],
    )
```
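`plot_megadudes_qc.py` benchmarks megadudes and Unipept results against a ground-truth taxonomy: it counts TP/FP/FN per taxonomic rank, derives sensitivity, precision, F1-score and FDR, and plots the metrics per rank.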
```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

plt.rcParams["figure.figsize"] = [16 * 0.6, 9 * 0.6]
# context keywords: notebook, talk, paper, poster
sns.set_theme(context="talk", rc={"axes.grid": False})

TAX_LEVELS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species", "subspecies"]


def read_ground_truth_file(ground_truth_file):
    df_t_taxids = pd.read_csv(
        ground_truth_file,
        usecols=TAX_LEVELS,
        sep="\t",
        dtype={
            c: pd.Int64Dtype()  # support nan values
            for c in TAX_LEVELS
        },
    )
    return df_t_taxids


def get_value_overlap(hits, ground_truth):
    """
    build series of values from both input series and annotate them with TP, FP, FN

    :param hits:
    :param ground_truth:
    :return: series with
        index: union of values from hits and ground_truth,
        values: TP if index in both, FP if only in hits, FN if only in ground_truth
    """
    left = {t for t in hits if (not pd.isna(t) and (t != "-"))}
    right = {t for t in ground_truth if (not pd.isna(t) and (t != "-"))}
    taxa = left.union(right)
    dct = dict()
    for t in taxa:
        found = t in left
        expected = t in right
        if found and expected:
            dct[t] = "TP"
        elif found:
            dct[t] = "FP"
        elif expected:
            dct[t] = "FN"
        else:
            raise KeyError(f"Taxon '{t}' missing in both Series!")
    return pd.Series(dct)


def get_unipept_hit_counts(f, ground_truth_df_tax_ids, method_name):
    df_uni = pd.read_csv(
        f,
        usecols=[f"{t}_id" for t in TAX_LEVELS],
        dtype=pd.Int64Dtype()
    )
    df_uni.rename(lambda x: x.strip("_id"), axis="columns", inplace=True)
    df = pd.DataFrame(columns=TAX_LEVELS, index=["TP", "FP", "FN"],)
    for t in TAX_LEVELS:
        s = get_value_overlap(df_uni[t], ground_truth_df_tax_ids[t])
        df[t] = s.value_counts()
    df = df.fillna(0)
    df["eval"] = df.index
    df["method"] = method_name
    return df


def get_megadudes_hit_counts(f, ground_truth_df_tax_ids, method_name):
    df_mega = pd.read_csv(f, sep="\t", skiprows=5)
    df_mega = df_mega.rename(lambda c: c.lower().strip("@"), axis=1)
    df = pd.DataFrame(columns=TAX_LEVELS, index=["TP", "FP", "FN"])
    for t in TAX_LEVELS:
        s_mega = (
            df_mega.loc[df_mega["rank"] == t, "taxpath"]
            .map(lambda x: x.split("|")[-1])
            .astype(int)
        )
        s = get_value_overlap(s_mega, ground_truth_df_tax_ids[t])
        s_classification = s.value_counts()
        for k in ["TP", "FP", "FN"]:
            if k not in s_classification:
                s_classification[k] = 0
        df[t] = s_classification
    df = df.fillna(0)
    df["eval"] = df.index
    df["method"] = method_name
    return df


def calc_eval_metrics(df_hits):
    df_hits_long = df_hits.melt(id_vars=["method", "eval"], var_name="taxon_level")
    df_hits_long["value"] = df_hits_long["value"].apply(int)
    grouped = df_hits_long.groupby(["method", "taxon_level"], sort=False)
    df_sens_g = grouped.apply(
        lambda df: df[df["eval"] == "TP"]["value"].values[0] * 100
        / df[df["eval"].isin(["TP", "FN"])]["value"].sum()
    )
    # specificity: TN / (TN + FP)
    # how to access TN?
    # precision: TP / (TP + FP)
    df_precision_g = grouped.apply(
        lambda df: df[df["eval"] == "TP"]["value"].values[0] * 100
        / df[df["eval"].isin(["TP", "FP"])]["value"].sum()
    )
    df_f_one_g = (2 * df_precision_g * df_sens_g) / (df_precision_g + df_sens_g)
    df_fdr_g = grouped.apply(
        lambda df: df[df["eval"] == "FP"]["value"].values[0] * 100
        / df[df["eval"].isin(["TP", "FP"])]["value"].sum()
    )
    df_eval = pd.DataFrame(
        {
            "sensitivity": df_sens_g,
            "precision": df_precision_g,
            "F1-score": df_f_one_g,
            "fdr": df_fdr_g,
        }
    )
    df_eval = df_eval.reset_index()
    return df_eval


def plot_qc(df_plt, output):
    metrics = ["sensitivity", "precision", "F1-score"]
    f, axes = plt.subplots(ncols=3, figsize=(15, 9), sharex="all", sharey="all")
    for metric, ax in zip(metrics, axes):
        sns.lineplot(
            data=df_plt,
            x="taxon_level",
            y=metric,
            hue="method",
            ax=ax,
            # legend only in first plot
            legend=bool(metric == metrics[0]),
            marker="s",
        )
        ax.set_title(metric)
        ax.set_ylim(0, 105)
        plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
        ax.xaxis.label.set_visible(False)
    axes[0].yaxis.label.set_visible(False)
    axes[-1].tick_params(labelright=True)
    f.savefig(output)
    return f


def qc_plots(ground_truth_file, unipept_file_list, megadudes_file_list, output):
    # read ground truth
    df_gt_taxids = read_ground_truth_file(ground_truth_file)
    # get unipept hits
    lst_df_uni = [
        get_unipept_hit_counts(f, df_gt_taxids, method_name=Path(f).stem)
        for f in unipept_file_list
    ]
    # get megadudes hits
    lst_df_mega = [
        get_megadudes_hit_counts(
            f, df_gt_taxids, method_name=("megadudes-" + Path(f).parts[-1])
        )
        for f in megadudes_file_list
    ]
    # build TP/FP/FN dataframe
    df_hits = pd.concat([*lst_df_uni, *lst_df_mega], ignore_index=True)
    # build evaluation dataframe, containing sensitivity, precision, f1-score, fdr
    df_eval = calc_eval_metrics(df_hits)
    # plot
    # drop subspecies, as no method has any true hits on this level
    df_plt = df_eval[df_eval["taxon_level"] != "subspecies"]
    return plot_qc(df_plt, output), df_eval


def test_qc_plots(tmpdir):
    ground_truth = "/home/hennings/Projects/megadudes-evaluation/resources/kleiner_ground_truth-equal_protein-full_taxid_lineage.csv"
    unipept_results = [
        "/home/hennings/Nextcloud/PhD/projects/20210100-taxFDR/01-researchgap/02-tax_analysis/unipept_results-Kleiner_P_msfragger.csv"
    ]
    megadudes_file_list = [
        "/home/hennings/Nextcloud/PhD/projects/20210100-taxFDR/02-megadudes/v0.8/megadudes-result.out",
        "/home/hennings/Nextcloud/PhD/projects/20210100-taxFDR/02-megadudes/v0.9/megadudes-result.out",
    ]
    output_plot = tmpdir / "qc_plot.svg"
    print(output_plot)
    f, df_eval = qc_plots(
        ground_truth_file=ground_truth,
        unipept_file_list=unipept_results,
        megadudes_file_list=megadudes_file_list,
        output=output_plot,
    )
    f.show()


if "snakemake" in globals():
    with open(snakemake.log[0], "w") as log_handle:
        LOG_HANDLE = log_handle
        qc_plots(
            ground_truth_file=snakemake.input.ground_truth,
            unipept_file_list=snakemake.input.unipept_results,
            megadudes_file_list=snakemake.input.megadudes_results,
            output=snakemake.output[0],
        )
```
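`sample_peptides.py` samples UniProt accessions per taxon, fetches the corresponding protein sequences from the UniProt REST API, digests them in silico with trypsin (via pyteomics), and writes a random subset of the resulting peptides; the output file name also seeds the random sampling.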
```
import random
from itertools import groupby
from pathlib import Path
from typing import cast, Dict, Iterable

from pyteomics.parser import cleave
import requests
import json
import sys

LOG_HANDLE = sys.stderr


def sample_peptides(accessions_file, output):
    with open(accessions_file, "r") as handle:
        tax2acc = cast(Dict[str, list[str]], json.load(handle))
    # sample accessions
    random.seed(str(output))
    accessions_sample = []
    for i, (tax_id, accessions) in enumerate(tax2acc.items()):
        size = min(len(accessions), 100)
        accessions_sample.extend(random.sample(accessions, size))
        if i >= 100:
            break
    print(f"Sampled {len(accessions_sample)} accessions", file=LOG_HANDLE)
    seed = random.getstate()
    # get protein sequences for sampled accessions
    fastas = []
    chunk_size = 100
    connector = UniProtConnector()
    for chunk_start in range(0, len(accessions_sample), chunk_size):
        chunk = accessions_sample[chunk_start:chunk_start + chunk_size]
        fastas.append(connector.get_fasta(chunk))
    acc2seq = {acc: seq for f in fastas for acc, seq in convert_fasta_str_to_dict(f).items()}
    # ensure deterministic behaviour although previous fetching of sequences takes a non-deterministic amount of
    # computation
    random.setstate(seed)
    # cleave proteins sequences and sample peptides
    sequences = sorted(acc2seq.values())
    peptides_sample = list()
    for seq in sequences:
        peptides = sorted(cleave_protein_sequence(seq))
        peptides_sample.extend(random.sample(list(peptides), min(3, len(peptides))))
    print(f"Sampled {len(peptides_sample)} peptides", file=LOG_HANDLE)
    # write fasta
    with open(output, "w") as handle:
        handle.write("\n".join(peptides_sample) + "\n")


def cleave_protein_sequence(sequence: str, min_length: int = 5):
    return cleave(sequence, rule="trypsin", min_length=min_length)


def get_protein_sequence(accession):
    connector = UniProtConnector()
    fasta = connector.get_fasta([accession])
    return "".join(fasta.split("\n")[1:])


class UniProtConnector:
    url = "https://rest.uniprot.org/uniprotkb/"

    def __init__(self):
        self.session = requests.Session()
        self.uniprot_version = requests.get(self.url).headers.get('X-UniProt-Release')
        print(f"Uniprot Release Number: {self.uniprot_version}", file=LOG_HANDLE)

    def get_fasta(self, uniprot_accessions: Iterable[str]):
        uniprot_accessions = ["accession%3A" + acc for acc in uniprot_accessions]
        url = self.url + f"search?query={'+OR+'.join(uniprot_accessions)}&format=fasta"
        response = self.session.get(url)
        response.raise_for_status()
        fasta = response.text
        return fasta


def convert_fasta_str_to_dict(fasta):
    protein_dict = {}
    fasta = fasta.strip()
    fasta_iter = (x[1] for x in groupby(fasta.split("\n"), lambda line: line[0] == ">"))
    for header in fasta_iter:
        # keep only uniprot accession, between pipe-chars ('|')
        acc = header.__next__().split("|")[1]
        # join all sequence lines to one.
        seq = "".join(s.strip() for s in fasta_iter.__next__())
        protein_dict[acc] = seq
    return protein_dict


def test_cleave_protein_sequence():
    seq = "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE"
    peps = {
        "GIIWGEDTLMEYLENPK",
        "TGQAPGYSYTAANK",
        "TGPNLHGLFGR",
        "CSQCHTVEK",
        "ADLIAYLK",
        "MIFVGIK",
        "YIPGTK",
        "MGDVEK",
        "IFIMK",
    }
    assert cleave_protein_sequence(seq) == peps


def test_fasta_to_dict():
    fasta = """>sp|P12345|AATM_RABIT Aspartate aminotransferase, mitochondrial OS=Oryctolagus cuniculus OX=9986 GN=GOT2 PE=1 SV=2
MALLHSARVLSGVASAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAYK
>sp|P99999|CYC_HUMAN Cytochrome c OS=Homo sapiens OX=9606 GN=CYCS PE=1 SV=2
MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW
GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE"""
    assert convert_fasta_str_to_dict(fasta) == {
        "P12345": "MALLHSARVLSGVASAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAYK",
        "P99999": "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE",
    }


def test_get_protein_sequence():
    assert (
        get_protein_sequence("P99999")
        == "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE"
    )


def test_sample_peptides(tmpdir):
    sample_peptides(
        Path(__file__).parent.parent.parent
        / "test/unit/simulate_sample/sample_peptides/data/results/map_taxids_to_uniprot_accessions/tax2accessions.json",
        "/dev/null"  # tmpdir / "peptides.txt"
    )
    expected_accestions = "P68254 Q9CQV8 P63101 P68510 P61982 P62259 O70456 P62260 P63102 P68511 P35213 P68255 P61983 Q9MB95 Q80HU0 P18558 P18559 P18560 P68744 Q65209 Q65210 P18557 P0C9H2 P0C9K1 P0C9F1 P0C9J9 P0C9I7 P0C9H5 P0C9F9 P0C9J5 P0C9I3 P0C9G7 P0C9F5 P0C9K3 P0C9G3 P0C9H9 P16536 P05080 P21215 Q8GBW6"
    expected_peptides = "IASALEMYATK LNNLLDPHSFDEVGAFR IVDWGDYLEVK QIFLTAIGDQAR GDSVHALCALWK GFPCK EAAENSLVAYK GDYHR YDEMVESMK DSTLIMQLLR IISSIEQK GDYHR SVTEQGAELSNEER DSTLIMQLLR GIVDQSQQAYQEAFEISK TEGAEK QQMAR DNLTLWTSDTQGDEAEAGEGGEN QTIENSQGAYQEAFDISK LGLALNFSVFYYEILNNPELACTLAK YLAEVACGDDR YDDMATCMK NVVGGR YLAEVACGDDR YLAEVATGDDK SAYQEAMDISK VETELR NLLSVAYK AVTELNEPLSNEDR QAFDDAIAELDTLNEDSYK QAFDDAIAELDTLNEDSYK NLLSVAYK GDYYR QYCLYFIIGIAYTDCFICALCK MGGGGDHQQLSIK VLDDMPLIVQNDYISK NVAIFQDYHGFPEFR CVEPGWFR GLEEVGINCLK CMYEAHFR DHPIITQNDYIVNCTVSR LLALLSILIWLSQPALNRPLSIFYMK CSYEK ELEHWCTHGK NHSPILENNYIANCSIYR LLTYGFYLVGCVLVANYVR IGSIPPER SLQTIMYLMVLLVIFFLLSQLMLYR ITIDVTPK DVLLAQSVAVEEAK IGNVVTISYNLEK MLVIFLGILGLLANQVLGLPIQAGGHLCSTDNPPQEELGYWCTYMESCK VNESMPLIIENSYLTSCEVSR WYNQCTYGEGNGHYHVMDCSNPVPHNRPHQLR WYNQCTYSEGNGHYHVMDCSNPVPHNRPHR FCWECAHGICK MLVIFLGILGLLANQVLGLPTQAGGHLR MLVIFLGILGLLANQVSSQLVGQLHPTENPSENELEYWCTYMECCQFCWDCQNGLCVNK LGNTTILENEYVHPCIVSR CMYDLDK IDGSAIYK NEYVK QTTVSNSQQAYQEAFEISK ACSLAK DSTLIMQLLR VISSIEQK IEAELQDICSDVLELLDK LAEQAER ELEAVCQDVLSLLDNYLIK VFYLK LAEQAER YDDMAAAMK NVVGAR TSADGNEK"


if "snakemake" in globals():
    with open(snakemake.log[0], "w") as log_handle:
        LOG_HANDLE = log_handle
        sample_peptides(
            accessions_file=snakemake.input["accessions"],
            output=snakemake.output[0]
        )
```
```
import random
import sys


def sample_taxons(uniprot_speclist_file, output_file, log=sys.stderr, count=10):
    """Take a random sample of taxids from the uniprot speclist file.

    Args:
        uniprot_speclist_file: path to uniprot speclist file, get from: https://www.uniprot.org/docs/speclist.txt
        output_file: path to file for writing TaxIDs to, one TaxID per line. Is also used for initializing random.seed.
        log: handle for writing the log
        count: number of taxa that should be sampled
    """
    with open(uniprot_speclist_file, "r") as file_handle:
        taxids = read_taxids_from_uniprot_speclist_file(file_handle)
    if count > len(taxids):
        print(
            f"Number of requested taxids ({count}) is larger than number of available taxids ({len(taxids)}).",
            file=log,
        )
        sys.exit(1)
    random.seed(output_file)
    sample_taxids = random.sample(taxids, count)
    with open(output_file, "w") as output_handle:
        for taxid in sample_taxids:
            output_handle.write(f"{taxid}\n")


def read_taxids_from_uniprot_speclist_file(file_handle):
    """
    :param file_handle: open file handle to uniprot speclist file.
        download from: https://www.uniprot.org/docs/speclist.txt
    :return:
    """
    taxids = []
    is_headline_reached = False
    is_taxid_section = False
    for line in file_handle.readlines():
        if not is_taxid_section:
            if not is_headline_reached:
                if line.startswith("(1) Real organism codes"):
                    is_headline_reached = True
            else:
                if line.startswith("_____"):
                    is_taxid_section = True
            continue
        else:
            # stop at empty line
            if not line.strip():
                break
            line_up_to_taxid = line[:15]
            if line_up_to_taxid.strip():
                taxid = int(line_up_to_taxid.split(" ")[-1])
                taxids.append(taxid)
    return taxids


if "snakemake" in globals():
    with open(snakemake.log[0], "w") as log_handle:
        sample_taxons(snakemake.input[0], snakemake.output[0], log=log_handle)
```
Support
- Future updates