Created for testing the metabolomics pipeline through Snakemake


PEP compatible

This is the Snakemake implementation of the pyOpenMS workflow tailored by Eftychia Eva Kontou and Axel Walter.

Workflow overview

The pipeline consists of eight interconnected steps:

  1. File conversion: Simply add your Thermo raw files under the directory data/raw/ and they will be converted to centroid mzML files. If you have Agilent, Bruker, or other vendor files, skip this step (write "FALSE" for rule fileconversion in the config.yaml file - see more under "Configure workflow"), convert them independently using ProteoWizard and add them under the data/mzML/ directory.

  2. Pre-processing: convert raw data to a feature table through a series of algorithms for feature detection, alignment and grouping. This step includes an optional removal of blank/QC samples if defined by the user. An optional "minfrac" step allows for removal of consensus features with too many missing values (see the sketch after this list).

  3. Re-quantification (optional): re-quantify all features with missing values across samples that result from the pre-processing step, for more reliable statistical analysis and data exploration. The optional "minfrac" step here likewise removes consensus features with too many missing values.

  4. Structural and formula predictions (SIRIUS and CSI:FingerID) and annotation of the feature matrix with those predictions (MSI level 3).

  5. GNPSexport: generate all the files necessary to create a FBMN or IIMN job at GNPS.

  6. Spectral matching with an in-house or publicly available library (MGF/MSP/mzML format) and annotation of the feature matrix with matches that have a score above 60 (MSI level 2).

  7. After FBMN or IIMN: integrate SIRIUS and CSI:FingerID predictions into the network (GraphML) and MS/MS spectral library annotations into the feature matrix (MSI level 2, optional).

  8. MS2Query: add another annotation step with a machine learning tool, MS2Query, that searches for exact spectral matches, as well as analogues, using Spec2Vec and MS2Deepscore.
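
To illustrate the "minfrac" idea mentioned in steps 2 and 3, a minimal pandas sketch is shown below. It is not the workflow's actual implementation; the file name, sample-column naming and threshold are assumptions.

import pandas as pd

# Hypothetical consensus feature matrix: rows are consensus features and the
# sample intensity columns contain 0 (or NaN) for missing values.
matrix = pd.read_csv("FeatureMatrix.tsv", sep="\t")
sample_cols = [c for c in matrix.columns if c.endswith(".mzML")]  # assumed column naming

min_frac = 0.5  # keep features detected in at least 50% of the samples
detected_fraction = (matrix[sample_cols] > 0).sum(axis=1) / len(sample_cols)
matrix[detected_fraction >= min_frac].to_csv("FeatureMatrix_minfrac.tsv", sep="\t", index=False)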

See README file for details.

Overview

(Workflow DAG image)

Usage

Step 1: Clone the workflow

Clone this repository to your local system, into the place where you want to perform the data analysis.

git clone https://github.com/biosustain/snakemake_UmetaFlow.git

Make sure to have the right access / SSH key. If not, follow these steps:

Step (i): https://docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent

Step (ii): https://docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account

Step 2: Install all dependencies

Homebrew and wget dependencies:

For both systems

 /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Press enter (RETURN) to continue

For Linux(!) only

Follow the next instructions to add Linuxbrew to your PATH and to your bash shell profile script, either ~/.profile on Debian/Ubuntu or ~/.bash_profile on CentOS/Fedora/RedHat (https://github.com/Linuxbrew/brew).

 test -d ~/.linuxbrew && eval $(~/.linuxbrew/bin/brew shellenv)
 test -d /home/linuxbrew/.linuxbrew && eval $(/home/linuxbrew/.linuxbrew/bin/brew shellenv)
 test -r ~/.bash_profile && echo "eval \$($(brew --prefix)/bin/brew shellenv)" >> ~/.bash_profile
 echo "eval \$($(brew --prefix)/bin/brew shellenv)" >> ~/.profile

For both systems

 brew install wget

Conda, Mamba and Snakemake dependencies:

For both systems

Install conda for any system. Installing Snakemake using Mamba is advised. Install Mamba into any other Conda-based Python distribution with:

 conda install -n base -c conda-forge mamba

Then install Snakemake with:

 mamba create -c conda-forge -c bioconda -n snakemake snakemake

SIRIUS executable:

Download the latest SIRIUS executable manually from here until it becomes available as a conda-forge installation. Choose the headless zipped file compatible with your operating system (Linux, macOS or Windows) and unzip it under the directory "resources/". Make sure to register using your university email and password. Tip: avoid SNAPSHOTS unless temporarily necessary.

Tip: download a version >5.6.3 and make sure you register with your institution's email. Example (for Linux):

(cd resources/ && wget https://github.com/boecker-lab/sirius/releases/download/v5.7.2/sirius-5.7.2-linux64.zip && unzip *.zip)

Build OpenMS:

For both systems (challenging step!)

Build OpenMS on Linux or macOS until the 3.0 release is published.

For Linux(!) only

Then add the binaries to your path (Linux):

 export PATH=$PATH:/path/to/openms-build/bin/
 source ~/.bashrc

For MacOS(!) only

Then add the binaries to your path (MacOS) by opening one of these files in a text editor:

 /etc/profile
 ~/.bash_profile
 ~/.bash_login (if .bash_profile does not exist)
 ~/.profile (if .bash_login does not exist)

and adding the path to the binaries at the very end (path-dependent):

 export PATH=$PATH:/path/to/openms-build/bin/

Step 3: Configure workflow

Configure the workflow according to your metabolomics data and instrument method by editing the files in the config/ folder.

Adjust the config.yaml to:

  • Configure the workflow execution (write TRUE / FALSE to run or skip specific rules of the workflow)

  • Adjust the parameters in the configuration file for your dataset as explained in the commented section of the yaml file (e.g. positive/negative ionisation, etc.)

Complete the dataset.tsv table to specify the samples (files) that will be processed. Suggestion: after adding all your files to the data/raw/ or data/mzML/ directory, use the Jupyter notebook Create_dataset_tsv to avoid errors in the sample names, or simply run the helper script (a hypothetical sketch of such a helper follows the configuration examples below):

python data_files.py
  • config/dataset.tsv example:

    sample_name    comment
    ISP2_blank     blank media
    NBC_00162      pyracrimicin
    MDNA_WGS_14    epemicins_A_B

If there are blanks in the file list, then add them to the config/blanks.tsv file.

  • config/blanks.tsv example:

    sample_name    comment
    ISP2_blank     blank media

  • config/samples.tsv example:

    sample_name    comment
    NBC_00162      pyracrimicin
    MDNA_WGS_14    epemicins_A_B
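
As a rough illustration of what a helper like data_files.py might do (this sketch is an assumption, not the repository's actual script), it could collect the sample names from data/raw/ or data/mzML/ and write the two-column config/dataset.tsv shown above:

import os
import pandas as pd

# Hypothetical helper: derive sample names from the raw/mzML file names.
raw = [f[:-4] for f in os.listdir("data/raw") if f.endswith(".raw")] if os.path.isdir("data/raw") else []
mzml = [f[:-5] for f in os.listdir("data/mzML") if f.endswith(".mzML")] if os.path.isdir("data/mzML") else []
samples = sorted(set(raw) | set(mzml))

# Write the two-column table expected by config/dataset.tsv.
pd.DataFrame({"sample_name": samples, "comment": ""}).to_csv("config/dataset.tsv", sep="\t", index=False)
print(f"Wrote {len(samples)} samples to config/dataset.tsv")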

Step 4: Execute workflow

Activate the conda environment:

conda activate snakemake

Get example input data (only for testing the workflow with the example dataset)

(cd data && wget -O Commercial_std_raw.zip "https://zenodo.org/record/6948449/files/Commercial_std_raw.zip?download=1" && unzip *.zip -d raw)

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N --keep-going --quiet --default-resources mem_mb='{config.system.memory}'

See the Snakemake documentation for further details.

Step 5: Investigate results

All the results are in .TSV format and can be opened simply with Excel or using pandas DataFrames. All the files under results/interim can be ignored and eventually discarded.
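
For example, a result table can be loaded into a pandas DataFrame like this (the file path is illustrative; point it to any .TSV under results/):

import pandas as pd

# Load a tab-separated results table (illustrative path).
df = pd.read_csv("results/FeatureMatrix.tsv", sep="\t")
print(df.shape)
print(df.head())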

Developer Notes

All the workflow outputs are silenced for performance enhancement through the flag -no_progress or >/dev/null in each rule, or --quiet for the snakemake command (see "Execute workflow" above); these can be removed if desired. Nevertheless, the error outputs, if any, are written to the specified log files.

Config & Schemas

  • Config & schemas define the input formatting and are important for generating wildcards. The idea of using samples and units came from here.

Rules

  • Snakefile: the main entry point of the pipeline, which defines the final outputs to be generated and the rules being used.

  • common.smk: a rule file that generates the variables used (sample names) and other helpers.

  • The main rules (*.smk): the bash code that has been chopped into modular units with defined inputs and outputs. Snakemake then chains these rules together to generate the required jobs. This should be intuitive and makes adding or changing steps in the pipeline easier (see the sketch below).
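
As a rough illustration of such a modular rule (the rule name, paths, environment file and tool invocation below are hypothetical, not copied from this repository):

# Hypothetical example of a modular Snakemake rule (*.smk).
rule example_feature_detection:
    input:
        mzml="data/mzML/{sample}.mzML",
    output:
        featurexml="results/interim/{sample}.featureXML",
    log:
        "logs/example_feature_detection_{sample}.log",
    conda:
        "../envs/openms.yaml"
    shell:
        """
        FeatureFinderMetabo -in {input.mzml} -out {output.featurexml} -no_progress > /dev/null 2>> {log}
        """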

Environments

  • Conda environments are defined as .yaml files in workflow/envs

  • Note that not all dependencies are compatible with or available as conda libraries. Once installed, the virtual environments are stored in .snakemake/conda with unique hashes. ALE and pilon are examples where the environment needs to be modified or dependencies need to be installed.

  • It might be better to utilise containers (e.g. Docker) and cloud execution for "hard to install" dependencies

  • Custom dependencies and databases are stored in the resources/ folder.

  • Handling dependencies with conda packages is one of Snakemake's drawbacks and a reason why Nextflow might be preferable. Nevertheless, the Pythonic language of Snakemake enables newcomers to learn it and develop their own pipelines faster.

Test Data (only for testing the workflow with the example dataset)

  • The current test data are built from known metabolite producer strains or standard samples that have been analyzed with a Thermo Orbitrap IDX instrument. The presence of the metabolites and their fragmentation patterns have been manually confirmed using TOPPView.

Citations

Kontou, E.E., Walter, A., Alka, O. et al. UmetaFlow: an untargeted metabolomics workflow for high-throughput data processing and analysis. J Cheminform 15, 52 (2023). https://doi.org/10.1186/s13321-023-00724-w

Pfeuffer J, Sachsenberg T, Alka O, et al. OpenMS – A platform for reproducible analysis of mass spectrometry data. J Biotechnol. 2017;261:142-148. doi:10.1016/j.jbiotec.2017.05.016

Dührkop K, Fleischauer M, Ludwig M, et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat Methods. 2019;16(4):299-302. doi:10.1038/s41592-019-0344-8

Dührkop K, Shen H, Meusel M, Rousu J, Böcker S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc Natl Acad Sci. 2015;112(41):12580-12585. doi:10.1073/pnas.1509788112

Nothias LF, Petras D, Schmid R, et al. Feature-based molecular networking in the GNPS analysis environment. Nat Methods. 2020;17(9):905-908. doi:10.1038/s41592-020-0933-6

Schmid R, Petras D, Nothias LF, et al. Ion identity molecular networking for mass spectrometry-based metabolomics in the GNPS environment. Nat Commun. 2021;12(1):3832. doi:10.1038/s41467-021-23953-9

Mölder F, Jablonski KP, Letcher B, et al. Sustainable data analysis with Snakemake. Published online January 18, 2021. doi:10.12688/f1000research.29032.1

de Jonge, N.F., Louwen, J.J.R., Chekmeneva, E. et al. MS2Query: reliable and scalable MS2 mass spectra-based analogue search. Nat Commun 14, 1752 (2023). doi:10.1038/s41467-023-37446-4

Code Snippets

shell:
    """
    python workflow/scripts/FBMN_SIRIUS.py {input.input_matrix} {input.input_mgf} {input.input_graphml} {output.output_graphml} > /dev/null 2>> {log}
    """
shell:
    """
    python workflow/scripts/FBMN_SIRIUS.py {input.input_matrix} {input.input_mgf} {input.input_graphml} {output.output_graphml} > /dev/null 2>> {log}
    """
shell:
    """
    python workflow/scripts/GNPS.py {input.lib} {input.featurematrix} {input.mgf_path} {output.gnps} > /dev/null 2>> {log}
    """
shell:
    """
    echo "No GNPS metabolite annotation file was found" > {output} 2>> {log}
    """
import pandas as pd
import networkx as nx
import glob
import os
import sys
from pyteomics import mgf, auxiliary

def integration(input_matrix, input_mgf, input_graphml, output_graphml):
    # input matrix
    Matrix= pd.read_csv(input_matrix, sep="\t")
    Matrix["id"]= Matrix["id"].astype(str)
    Matrix["feature_ids"]= Matrix["feature_ids"].values.tolist()
    #input mgf file
    file= mgf.MGF(source=input_mgf, use_header=True, convert_arrays=2, read_charges=True, read_ions=False, dtype=None, encoding=None)
    parameters=[]
    for spectrum in file:
        parameters.append(spectrum['params'])
    mgf_file= pd.DataFrame(parameters)
    mgf_file["feature_id"]= mgf_file["feature_id"].str.replace(r"e_", "")
    #add scan number from mgf file by matching feature_IDs
    Matrix.insert(0, "SCANS", "")
    for i, id in zip(Matrix.index, Matrix["id"]):
        hits = []
        for scans, feature_id in zip(mgf_file["scans"], mgf_file["feature_id"]): 
            if feature_id==id:
                hit = f"{scans}"
                if hit not in hits:
                    hits.append(hit)
        Matrix.at[i, "SCANS"] = " ## ".join(hits)
    #introduce the SIRIUS predictions
    input_formulas= glob.glob(os.path.join("results", "SiriusCSI", "formulas_*.tsv"))
    DF_SIRIUS = pd.DataFrame()
    list_of_df=[]
    for tsv in input_formulas:
        df = pd.read_csv(tsv, sep="\t", index_col="Unnamed: 0")
        # keep only the top-ranked SIRIUS formula prediction per feature
        df["opt_global_rank"] = pd.to_numeric(df["opt_global_rank"])
        df = df.loc[df["opt_global_rank"] == 1]
        df=df.reset_index()
        list_of_df.append(df)
    DF_SIRIUS= pd.concat(list_of_df,ignore_index=True)
    DF_SIRIUS= DF_SIRIUS.drop(columns="index")
    DF_SIRIUS["opt_global_featureId"]= DF_SIRIUS["opt_global_featureId"].str.replace(r"id_", "")
    #add the scan number to the SIRIUS predictions   
    DF_SIRIUS.insert(0, "SCANS", "")
    for i, Pred_id in zip(DF_SIRIUS.index, DF_SIRIUS["opt_global_featureId"]):
        hits = []
        for scans, feature_id in zip(Matrix["SCANS"], Matrix["feature_ids"]): 
            if Pred_id in feature_id:
                hit = f"{scans}"
                if hit not in hits:
                    hits.append(hit)
        DF_SIRIUS.at[i, "SCANS"] = " ## ".join(hits)
    #introduce the CSI predictions
    input_structures= glob.glob(os.path.join("results", "SiriusCSI", "structures_*.tsv"))
    DF_CSI = pd.DataFrame()
    list_of_df=[]
    for tsv in input_structures:
        df = pd.read_csv(tsv, sep="\t", index_col="Unnamed: 0")
        # keep only the top-ranked CSI:FingerID structure prediction per feature
        df["opt_global_rank"] = pd.to_numeric(df["opt_global_rank"])
        df = df.loc[df["opt_global_rank"] == 1]
        df=df.reset_index()
        list_of_df.append(df)
    DF_CSI= pd.concat(list_of_df,ignore_index=True)
    DF_CSI= DF_CSI.drop(columns="index")
    DF_CSI["opt_global_featureId"]= DF_CSI["opt_global_featureId"].str.replace(r"id_", "")
    #add the scan number to the CSI predictions   
    DF_CSI.insert(0, "SCANS", "")

    for i, Pred_id in zip(DF_CSI.index, DF_CSI["opt_global_featureId"]):
        hits = []
        for scans, feature_id in zip(Matrix["SCANS"], Matrix["feature_ids"]): 
            if Pred_id in feature_id:
                hit = f"{scans}"
                if hit not in hits:
                    hits.append(hit)
        DF_CSI.at[i, "SCANS"] = " ## ".join(hits)
    #read the graphml file downloaded from GNPS FBMN and add the SIRIUS and CSI information
    G = nx.read_graphml(input_graphml)
    for result in DF_SIRIUS.to_dict(orient="records"):
        scan = str(result["SCANS"])
        if scan in G:
            G.nodes[scan]["sirius:molecularFormula"] = result["chemical_formula"]
            G.nodes[scan]["sirius:adduct"] = result["opt_global_adduct"]
            G.nodes[scan]["sirius:TreeScore"] = result["TreeScore"]
            G.nodes[scan]["sirius:IsotopeScore"] = result["IsotopeScore"]
            G.nodes[scan]["sirius:explainedPeaks"] = result["opt_global_explainedPeaks"]
            G.nodes[scan]["sirius:explainedIntensity"] = result["opt_global_explainedIntensity"]

    for result in DF_CSI.to_dict(orient="records"):
        scan = str(result["SCANS"])
        if scan in G:
            G.nodes[scan]["csifingerid:smiles"] = result["smiles"]
            G.nodes[scan]["csifingerid:Confidence_Score"] = result["best_search_engine_score[1]"]
            G.nodes[scan]["csifingerid:dbflags"] = result["opt_global_dbflags"]
            G.nodes[scan]["csifingerid:description"] = result["description"]

    nx.write_graphml(G, output_graphml)

if __name__ == "__main__":
    integration(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
import numpy as np
import pandas as pd
import sys
from pyteomics import mgf, auxiliary

def GNPS_annotations(lib, featurematrix, mgf_path, gnps):
    #import GNPS MSMS library matches
    df= pd.read_csv(lib, sep="\t")
    df.drop(df.index[df['IonMode'] == "negative"], inplace=True)
    df.drop(df.index[df['MZErrorPPM'] > 10.0], inplace=True)
    GNPS = df.drop_duplicates(subset="Compound_Name", keep="first").copy()
    GNPS["#Scan#"] = GNPS["#Scan#"].astype(str)
    #Import annonated feature matrix
    Matrix= pd.read_csv(featurematrix, sep="\t")
    Matrix["id"]= Matrix["id"].astype(str)
    Matrix["feature_ids"]= Matrix["feature_ids"].values.tolist()
    #Import MGF file with SCAN numbers
    file= mgf.MGF(source=mgf_path, use_header=True, convert_arrays=2, read_charges=True, read_ions=False, dtype=None, encoding=None)
    parameters=[]
    for spectrum in file:
        parameters.append(spectrum['params'])
    mgf_file= pd.DataFrame(parameters)
    mgf_file["feature_id"]= mgf_file["feature_id"].str.replace(r"e_", "")
    #Add SCAN numbers to the feature matrix
    Matrix.insert(0, "SCANS", "")
    for i, id in zip(Matrix.index, Matrix["id"]):
        hits = []
        for scan, feature_id in zip(mgf_file["scans"], mgf_file["feature_id"]): 
            if feature_id==id:
                hit = f"{scan}"
                if hit not in hits:
                    hits.append(hit)
        Matrix.at[i, "SCANS"] = " ## ".join(hits)

    Matrix.insert(0, "GNPS", "")

    for i, scan in zip(Matrix.index, Matrix["SCANS"]):
        hits = []
        for name, scan_number, in zip(GNPS["Compound_Name"], GNPS["#Scan#"]):
            if scan==scan_number:
                hit = f"{name}"
                if hit not in hits:
                    hits.append(hit)
        Matrix.at[i, "GNPS"] = " ## ".join(hits)

    Matrix.to_csv(gnps, sep="\t", index = False)
    return Matrix

if __name__ == "__main__":
    GNPS_annotations(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/NBChub/snakemake-UmetaFlow
Name: snakemake-umetaflow
Version: 1
Copyright: Public Domain
License: Apache License 2.0
