M-PARTY: Plastic-Degrading Enzyme Analysis in Metagenomic Databases


Version 1.0.0


M-PARTY is a free-to-use, open-source, user-friendly CLI (early release) workflow and database for the detection of plastic-degrading enzymes in metagenomic samples, through structural annotation using Hidden Markov Models.


Index

  1. Introduction

  2. Installation

    1. GitHub

    2. Bioconda

  3. Usage

    1. Annotation

    2. Database Construction

    3. Validation

    4. Full M-PARTY Execution

    5. Metagenomic Analysis

  4. Output

  5. Additional arguments

Introduction

M-PARTY is a free-to-use, open-source, user-friendly CLI workflow and database for the detection of plastic-degrading enzymes in metagenomic samples, through structural annotation using Hidden Markov Models, which allows the user to freely interact with the tool's built-in databases and backbone.

The basic steps of the M-PARTY annotation workflow, in its first stages, are:

  1. Acceptance of any number of protein sequences in a single FASTA file as the query; alternatively, a KEGG ID representing the protein sequences involved in a certain reaction, or an InterPro ID or a set of protein IDs of interest from the latter database;

  2. Execution of hmmsearch from the HMMER package, using as database the HMMs built from sequences previously known to show some level of PE degradation; and of KMA, to map and search raw metagenomes for genes of interest;

  3. Acceptance of three different kinds of input files:

    • Protein datasets

    • Assembled metagenomes

    • Raw metagenomes

  4. A quality benchmark to distinguish good from bad hits of the queries against the models;

  5. Three output files: a FASTA file with the protein sequences returned as hits from the search; a report in text format (if requested by the user) with plainly stated information about the input and the already-built (HMM) data, run and processing parameters, and conclusions; and an easy-to-read report table in xlsx format, with all the important data about the annotation results, in particular:

    • Sequence IDs

    • HMM IDs (Degraded plastic + number)

    • Bit scores

    • E-values

  6. A validation workflow for the results.
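The report columns in step 5 map directly onto hmmsearch's tabular output. As an illustration (not M-PARTY's actual parser), here is a minimal sketch of pulling those fields out of one data line of an HMMER3 `--tblout` file, assuming the standard column order (target name, target accession, query name, query accession, full-sequence e-value, bit score, ...):

```python
def parse_tblout_line(line):
    """Extract the fields M-PARTY reports from one hmmsearch --tblout
    data line: sequence ID, HMM ID, e-value and bit score."""
    fields = line.split()
    return {
        "sequence_id": fields[0],   # target name (the input sequence)
        "hmm_id": fields[2],        # query name (the model that hit it)
        "e_value": float(fields[4]),
        "bit_score": float(fields[5]),
    }
```

Lines starting with `#` in a real tblout file are comments and would be skipped before calling this helper.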


Installation

GitHub Cloning

M-PARTY is available for Linux platforms through cloning of its GitHub repository, using the following lines in a git bash terminal inside the desired (empty) folder:

cd path/to/desired/dir
git clone https://github.com/ozefreitas/M-PARTY.git

I highly recommend users create an appropriate conda environment with the required dependencies so that M-PARTY executes smoothly:

cd workflow/envs/
conda env create -n <name of env> -f mparty.yaml
conda activate <name of env>
cd ../..

Cloning through GitHub is only recommended as a last resort, as it is deprecated in favour of the Bioconda distribution.


Bioconda

M-PARTY is available as a conda package from Bioconda. Due to the tool's recent name change, the package still keeps the old name. Simply open an Anaconda prompt or a command-line interface with an Anaconda or Miniconda distribution installed and run:

conda install -c conda-forge -c bioconda m-party

and you will be good to go.

If something goes wrong, I suggest you first create a conda environment with:

conda create -n <name of env> -c conda-forge -c bioconda m-party

to sidestep compatibility issues that may otherwise occur.


Usage

Annotation workflow

The main and most basic use of M-PARTY is annotation with Hidden Markov Models (these must previously have been created by the user):

m-party.py -i path/to/input_file -o output_folder -rt --output_type excel --hmm_db_name <db_name> --verbose

where the -i input file must be in FASTA format and contain only (for the time being) amino-acid sequences; otherwise, the program will exit. The -o output folder can be a pre-existing folder or any name for a folder that will be created in any case. The -rt flag instructs the tool to include the text-format report in the output, for easier interpretation of the annotation results and drawing of conclusions. --output_type should be set to "excel" in these earlier versions, as other output formats for the table report will be coded incrementally. --hmm_db_name is mandatory and gives a name to each run; folders with that name will be saved as databases.
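Since M-PARTY exits when the -i file is not amino-acidic, it can help to pre-check the FASTA before a run. A minimal, hypothetical sketch (not part of M-PARTY) using the heuristic that a sequence made only of A/C/G/T/U/N characters is probably nucleic:

```python
NUCLEIC = set("ACGTUN")

def looks_proteic(fasta_text):
    """Return True if every record contains characters outside the
    nucleotide alphabet, i.e. is plausibly an amino-acid sequence."""
    seqs, current = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):        # header starts a new record
            if current:
                seqs.append("".join(current))
                current = []
        else:
            current.append(line.strip().upper())
    if current:
        seqs.append("".join(current))
    return all(set(s) - NUCLEIC for s in seqs)
```

Note that a short peptide made only of those six residues would be misclassified; this is a heuristic, not a guarantee.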


Database Construction

M-PARTY does not ship with any pre-built database, so all HMMs must be generated from scratch from a given set of proteins/nucleotides. M-PARTY accepts this input via three distinct methods:

  1. A FASTA file with sequences with known functions from the user;

  2. A KEGG Orthology ID(s) (KO) or E.C. number(s);

  3. An InterPro ID or Protein ID(s)

With previous knowledge of a given reaction or protein family, the user can input an ID that represents a set of sequences involved in that reaction, which the tool will automatically search for and download.
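For the KEGG route, the download essentially boils down to two KEGG REST calls: `link`, to list the genes behind a KO or EC number, and `get` with `aaseq` (or `ntseq`), to fetch each gene's sequence. A sketch of building those URLs (the helper names are mine, not M-PARTY's):

```python
KEGG_REST = "https://rest.kegg.jp"

def kegg_link_url(identifier):
    """URL listing the genes linked to a KO (e.g. 'K21104')
    or an EC number (e.g. '3.1.1.101')."""
    prefix = "ec" if identifier[0].isdigit() else "ko"
    return f"{KEGG_REST}/link/genes/{prefix}:{identifier}"

def kegg_seq_url(gene, seq_type="aaseq"):
    """URL fetching a gene's amino-acid ('aaseq') or
    nucleotide ('ntseq') sequence in FASTA format."""
    return f"{KEGG_REST}/get/{gene}/{seq_type}"
```

The responses are plain text, so they can be fetched with any HTTP client and written straight to a FASTA file.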


FASTA file

If you possess a FASTA file with sequences of interest for your study, to be searched afterwards:

m-party.py -w database_construction --input_seqs_db_const path/to/interest_sequences --hmm_db_name <db_name>

KEGG

If you want to build an HMM database from a reaction represented by a certain KO:

m-party.py -w database_construction --kegg <KO> --hmm_db_name <db_name>

or for a certain E.C. number (EC):

m-party.py -w database_construction --kegg <EC> --hmm_db_name <db_name>

By default, M-PARTY downloads the amino-acid sequences of the enzymes found for each ID. If you wish to build the models from nucleic sequences, just add --input_type_db_const and set it to "nucleic" (in both cases, a user FASTA file or KEGG IDs):

m-party.py -w database_construction --input_seqs_db_const path/to/interest_sequences OR --kegg <EC/KO> --input_type_db_const nucleic --hmm_db_name <db_name> 

InterPro

As for the InterPro retriever, just change the --kegg argument to --interpro:

m-party.py -w database_construction --interpro <IPR> --hmm_db_name <db_name>

or

m-party.py -w database_construction --interpro <PID> --hmm_db_name <db_name>

for the case of an InterPro ID (IPR) or a Protein ID (PID), respectively. Only amino-acid sequences are available in this database, so it is not possible to set --input_type_db_const to "nucleic".

InterPro is not a fully curated database, so most of its entries are unreviewed. To counter this, you can add the --curated flag to the previous command.
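Under the hood, a --curated retrieval amounts to asking the InterPro API for reviewed (Swiss-Prot) proteins only. A hypothetical sketch of the URL shapes involved; the endpoint layout here is my assumption about the InterPro REST API, not code taken from M-PARTY:

```python
INTERPRO_API = "https://www.ebi.ac.uk/interpro/api"

def interpro_proteins_url(ipr_id, curated=False):
    """URL listing the UniProt proteins matched by an InterPro entry;
    with curated=True, restrict to reviewed (Swiss-Prot) proteins."""
    source = "reviewed" if curated else "UniProt"
    return f"{INTERPRO_API}/protein/{source}/entry/InterPro/{ipr_id}/"
```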

Note: All commands must include the --hmm_db_name <db_name> argument! Otherwise, M-PARTY will instantly raise a ValueError.


Validation

M-PARTY also provides a validation workflow, as some models may produce false positives and give deceiving results. Given the name under which the HMMs are stored:

m-party.py --hmm_validation --hmm_db_name <db_name>

Since the 'leave-one-out' cross-validation method is used, a dataset of negative sequences must be given, and each dataset will differ between runs depending on the content of the HMMs. So just add --negative_db with a FASTA dataset of sequences that you know to be different from those of interest:

m-party.py --hmm_validation --hmm_db_name <db_name> --negative_db path/to/negative_dataset
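The 'leave-one-out' scheme itself is simple: each sequence behind a model is held out once, the model is rebuilt from the remainder, and the held-out sequence plus the negative dataset are then scored against it. A minimal, illustrative sketch of generating the splits (not M-PARTY's implementation):

```python
def leave_one_out(sequences):
    """Yield (held_out, training_set) pairs: each sequence is removed
    once, and a model would be rebuilt from the remaining sequences,
    then tested on the held-out one plus the negative dataset."""
    for i, held_out in enumerate(sequences):
        yield held_out, sequences[:i] + sequences[i + 1:]
```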

If you want, you can run the annotation workflow directly on a set of proteins of your choosing, performing the validation beforehand, but only if you have already run the database-construction workflow so that the models are present:

m-party -i path/to/input_file -o path/to/output_folder -rt --output_type excel --hmm_db_name <db_name> --hmm_validation 

Full M-PARTY Execution

Of course, all of this can be done with a single command, from the construction of the models to the annotation of the unknown sequence file:

m-party.py -w both -i path/to/input_file (--input_seqs_db_const path/to/interest_sequences OR --kegg <EC/KO> OR --interpro <IPR>) -o path/to/output_folder -rt --output_type excel --hmm_db_name <db_name> --verbose 

Metagenomic Analysis

M-PARTY also supports the search for genes in metagenome samples:

m-party.py -w database_construction --it metagenome --kegg <KO> --input_type_db_const nucleic --hmm_db_name <db_name>

I recommend first running the database-construction workflow, in order to download or input the genes of interest. To avoid building the models at this point, and so wasting time, --it is set to metagenome.

After this, just provide the metagenome file, state again that its content is in fact a metagenome, give the same <db_name>, and run:

m-party.py -i path/to/metagenome -o path/to/output_folder -it metagenome --hmm_db_name <db_name>

Warning: This method is only viable for nucleotide sequences, so --input_type_db_const nucleic is obligatory!

Another way is to simply run both workflows at the same time:

m-party.py -w both -it metagenome -i path/to/metagenome -o path/to/output_folder --kegg <KO> --input_type_db_const nucleic --hmm_db_name <db_name> --verbose

Note that instead of --kegg you can use the --input_seqs_db_const argument with a FASTA file of nucleotide sequences.


Database Expansion (in development)

Sometimes the amount of data for the sequences of interest is not enough to make a robust model. M-PARTY is therefore expected to implement a workflow that will allow the user to expand the initial dataset, in order to build a more diverse one. This methodology is based on XXXXXXXXXXXXX:

Output

M-PARTY produces three distinct outputs: the report table, the text report, and aligned. In earlier versions, the report table is only available in Excel format, though tsv and csv formats will follow. The text report is a user-friendly, easy-to-understand summary of the annotation run performed by M-PARTY, and includes a series of useful pieces of information for the user, depending on the given arguments. Lastly, aligned is a FASTA file with all the sequences that matched one or more models (this will be refined as model benchmarking and validation are introduced into M-PARTY).

An optional output file is the config file, which records all parameters used in each M-PARTY run and can be useful for tracing back errors. If you are having trouble generating the expected content in the other output files, you can add the --display_config flag to every command you execute.
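When debugging with --display_config, you may want to compare parameters across runs. A small illustrative helper (not part of M-PARTY) that reads the flat `key: value` lines of a simple YAML config into a dict, without requiring PyYAML:

```python
def read_flat_config(text):
    """Parse flat 'key: value' lines of a simple YAML config into a
    dict; ignores comments. Nested keys are out of scope here."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" in line:
            key, value = line.split(":", 1)
            params[key.strip()] = value.strip()
    return params
```

Two parsed configs can then be diffed with a plain dict comparison to spot which parameter changed between runs.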


Results Validation (in development)

Additional arguments

M-PARTY is under continuous development, so the "validation" workflow still needs some changes, as does the expansion argument, which has yet to be reviewed and validated. I therefore highly recommend you follow the steps in the Usage section.

usage: m-party.py [-h] [-i INPUT] [--input_seqs_db_const INPUT_SEQS_DB_CONST]
 [-db DATABASE] [--hmm_db_name HMM_DB_NAME] [-it INPUT_TYPE]
 [--input_type_db_const INPUT_TYPE_DB_CONST] [--consensus]
 [-o OUTPUT] [--output_type OUTPUT_TYPE] [-rt]
 [--hmms_output_type HMMS_OUTPUT_TYPE] [--hmm_validation]
 [-p] [--negative_db NEGATIVE_DB] [-s SNAKEFILE] [-ex]
 [--kegg KEGG [KEGG ...]]
 [--interpro INTERPRO [INTERPRO ...]] [--curated]
 [-t THREADS] [--align_method ALIGN_METHOD]
 [--aligner ALIGNER] [-hm HMM_MODELS] [--concat_hmm_models]
 [--unlock] [-w WORKFLOW] [-c CONFIG_FILE] [--overwrite]
 [--verbose] [--display_config] [-v]
M-PARTY's main script
optional arguments:
 -h, --help show this help message and exit
 -i INPUT, --input INPUT
 input FASTA file containing a list of protein
 sequences to be analysed
 --input_seqs_db_const INPUT_SEQS_DB_CONST
 input a FASTA file with a set of sequences from which
 the user wants to create the HMM database from
 scratch.
 -db DATABASE, --database DATABASE
 FASTA database to run against the also user inputted
 sequences. DIAMOND is performed in order to expand the
 data and build the models. M-PARTY has no in-built
 database for this matter. If flag is given, download
 of the default database will start and model built
 from that. Defaults to UniProt DataBase.
 --hmm_db_name HMM_DB_NAME
 name to be assigned to the hmm database to be created.
 Its recomended to give a name that that describes the
 family or other characteristic of the given sequences.
 Be carefull as what name to use, as this will define
 what HMMs will be used for the search
 -it INPUT_TYPE, --input_type INPUT_TYPE
 specifies the nature of the sequences in the input
 file between 'protein', 'nucleic' or 'metagenome'.
 Defaults to 'protein'
 --input_type_db_const INPUT_TYPE_DB_CONST
 specifies the nature of the input sequences for the
 database construction between 'nucleic' and 'protein'.
 Defaults to 'protein'.
 --consensus call to build consensus sequences when building the
 database, in order to run KMA against raw metagenomes
 -o OUTPUT, --output OUTPUT
 name for the output directory. Defaults to
 'MPARTY_results'
 --output_type OUTPUT_TYPE
 choose report table outpt format from 'tsv', 'csv' or
 'excel'. Defaults to 'tsv'
 -rt, --report_text decides whether to produce or not a friendly report in
 txt format with easy to read information
 --hmms_output_type HMMS_OUTPUT_TYPE
 chose output type of hmmsearch run from 'out', 'tsv'
 or 'pfam' format. Defaults to 'tsv'
 --hmm_validation decides whether to perform models validation and
 filtration with the 'leave-one-out' cross validation
 methods. Call to set to True. Defaults to False
 -p, --produce_inter_tables
 call if user wants to save intermediate tables as
 parseale .csv files (tables from hmmsearch results
 processing)
 --negative_db NEGATIVE_DB
 path to a user defined negative control database.
 Default use of human gut microbiome
 -s SNAKEFILE, --snakefile SNAKEFILE
 user defined snakemake workflow Snakefile. Defaults to
 '/workflow/Snakefile
 -ex, --expansion Decides wheter to expand the interest dataset.
 Defaults to False.
 --kegg KEGG [KEGG ...]
 input KEGG ID(s) to download respective sequences, in
 order to build a pHMM based on those
 --interpro INTERPRO [INTERPRO ...]
 input InterPro ID(s) to download the respective
 sequences, in order to build a pHMM based on those
 --curated call to only retrieve reviewed sequences from InterPro
 -t THREADS, --threads THREADS
 number of threads for Snakemake to use. Defaults to
 max number of available logical CPUs.
 --align_method ALIGN_METHOD
 chose the alignment method for the initial sequences
 database expansion, between 'diamond', 'blast' and
 'upimapi'. Defaults to 'upimapi'
 --aligner ALIGNER chose the aligner program to perform the multiple
 sequence alignment for the models between 'tcoffee'
 and 'muscle'. Defaults to 'tcoffee'.
 -hm HMM_MODELS, --hmm_models HMM_MODELS
 path to a directory containing HMM models previously
 created by the user. By default M-PARTY uses the
 built-in HMMs from database in
 'resources/Data/HMMs/After_tcoffee_UPI/'
 --concat_hmm_models call to not concatenate HMM models into a single file.
 Defaults to True
 --unlock could be required after forced workflow termination
 -w WORKFLOW, --workflow WORKFLOW
 defines the workflow to follow, between "annotation",
 "database_construction" and "both". Latter keyword
 makes the database construction first and posterior
 annotation. Defaults to "annotation"
 -c CONFIG_FILE, --config_file CONFIG_FILE
 user defined config file. Only recommended for
 advanced users. Defaults to 'config.yaml'. If given,
 overrides config file construction from input
 --overwrite Call to overwrite inputted files. Defaults to False
 --verbose Call so M-PARTY display more messaging
 --display_config declare to output the written config file together
 with results. Useful in case of debug
 -v, --version show program's version number and exit

Code Snippets

shell:
	"t_coffee {input} -output clustalw_aln -outfile {output} -type PROTEIN -n_core 4"

shell:
	"hmmbuild {output} {input}"

shell:
	"cat {input} > {output}"
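The last rule above is plain concatenation, which is also what --concat_hmm_models controls. For illustration, a Python equivalent of `cat {input} > {output}` joining individual .hmm files into one database file:

```python
from pathlib import Path

def concat_hmms(model_paths, out_path):
    """Concatenate individual HMM files into a single database file,
    mirroring the Snakemake rule's `cat {input} > {output}`."""
    with open(out_path, "w") as out:
        for path in model_paths:
            out.write(Path(path).read_text())
    return out_path
```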

Maintainers: public
URL: https://github.com/ozefreitas/M-PARTY
Name: m-party
Version: v0.2.2
Copyright: Public Domain
License: MIT License
