Python Protein conformational ensembles generation, building on PDBe-KB

public public 1yr ago Version: Version 1 0 bookmarks

Protein Conformational ensembles generation

Building on PDBe-KB to chart and characterize the conformation landscape of native proteins

This tutorial aims to illustrate the process of generating protein conformational ensembles from** 3D structures **and analysing its molecular flexibility , step by step, using the BioExcel Building Blocks library (biobb) .

Conformational landscape of native proteins

Proteins are dynamic systems that adopt multiple conformational states , a property essential for many biological processes (e.g. binding other proteins, nucleic acids, small molecule ligands, or switching between functionaly active and inactive states). Characterizing the different conformational states of proteins and the transitions between them is therefore critical for gaining insight into their biological function and can help explain the effects of genetic variants in health and disease and the action of drugs.

Structural biology has become increasingly efficient in sampling the different conformational states of proteins. The PDB has currently archived more than 170,000 individual structures , but over two thirds of these structures represent multiple conformations of the same or related protein, observed in different crystal forms, when interacting with other proteins or other macromolecules, or upon binding small molecule ligands. Charting this conformational diversity across the PDB can therefore be employed to build a useful approximation of the conformational landscape of native proteins.

A number of resources and tools describing and characterizing various often complementary aspects of protein conformational diversity in known structures have been developed, notably by groups in Europe. These tools include algorithms with varying degree of sophistication, for aligning the 3D structures of individual protein chains or domains, of protein assemblies, and evaluating their degree of structural similarity . Using such tools one can align structures pairwise , compute the corresponding similarity matrix , and identify ensembles of structures/conformations with a defined similarity level that tend to recur in different PDB entries, an operation typically performed using clustering methods. Such workflows are at the basis of resources such as CATH, Contemplate, or PDBflex that offer access to conformational ensembles comprised of similar conformations clustered according to various criteria. Other types of tools focus on differences between protein conformations , identifying regions of proteins that undergo large collective displacements in different PDB entries, those that act as hinges or linkers , or regions that are inherently flexible .

To build a meaningful approximation of the conformational landscape of native proteins, the conformational ensembles (and the differences between them), identified on the basis of structural similarity/dissimilarity measures alone, need to be biophysically characterized . This may be approached at two different levels .

  • At the biological level , it is important to link observed conformational ensembles , to their functional roles by evaluating the correspondence with protein family classifications based on sequence information and functional annotations in public databases e.g. Uniprot, PDKe-Knowledge Base (KB). These links should provide valuable mechanistic insights into how the conformational and dynamic properties of proteins are exploited by evolution to regulate their biological function .

  • At the physical level one needs to introduce energetic consideration to evaluate the likelihood that the identified conformational ensembles represent conformational states that the protein (or domain under study) samples in isolation. Such evaluation is notoriously challenging and can only be roughly approximated by using computational methods to evaluate the extent to which the observed conformational ensembles can be reproduced by algorithms that simulate the dynamic behavior of protein systems. These algorithms include the computationally expensive classical molecular dynamics (MD) simulations to sample local thermal fluctuations but also faster more approximate methods such as Elastic Network Models and Normal Node Analysis (NMA) to model low energy collective motions . Alternatively, enhanced sampling molecular dynamics can be used to model complex types of conformational changes but at a very high computational cost.

The ELIXIR 3D-Bioinfo Implementation Study Building on PDBe-KB to chart and characterize the conformation landscape of native proteins focuses on:

  1. Mapping the conformational diversity of proteins and their homologs across the PDB.
  2. Characterize the different flexibility properties of protein regions, and link this information to sequence and functional annotation.
  3. Benchmark computational methods that can predict a biophysical description of protein motions.

This notebook is part of the third objective, where a list of computational resources that are able to predict protein flexibility and conformational ensembles have been collected, evaluated, and integrated in reproducible and interoperable workflows using the BioExcel Building Blocks library . Note that the list is not meant to be exhaustive, it is built following the expertise of the implementation study partners.

Code Snippets

  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
import time
import argparse
import zipfile
from pathlib import Path, PurePath
from biobb_common.configuration import settings
from biobb_common.tools import file_utils as fu
from biobb_structure_utils.utils.extract_model import extract_model
from biobb_structure_utils.utils.extract_chain import extract_chain
from biobb_analysis.ambertools.cpptraj_mask import cpptraj_mask
from biobb_flexdyn.flexdyn.concoord_dist import concoord_dist
from biobb_flexdyn.flexdyn.concoord_disco import concoord_disco
from biobb_analysis.ambertools.cpptraj_rms import cpptraj_rms
from biobb_analysis.ambertools.cpptraj_convert import cpptraj_convert
from biobb_flexdyn.flexdyn.prody_anm import prody_anm
from biobb_flexserv.flexserv.bd_run import bd_run
from biobb_flexserv.flexserv.dmd_run import dmd_run
from biobb_flexserv.flexserv.nma_run import nma_run
from biobb_flexdyn.flexdyn.nolb_nma import nolb_nma
from biobb_flexdyn.flexdyn.imod_imode import imod_imode
from biobb_flexdyn.flexdyn.imod_imc import imod_imc
from biobb_gromacs.gromacs.make_ndx import make_ndx
from biobb_gromacs.gromacs.trjcat import trjcat
from biobb_analysis.gromacs.gmx_cluster import gmx_cluster
from biobb_flexserv.pcasuite.pcz_zip import pcz_zip
from biobb_flexserv.pcasuite.pcz_info import pcz_info
from biobb_flexserv.pcasuite.pcz_evecs import pcz_evecs
from biobb_flexserv.pcasuite.pcz_animate import pcz_animate
from biobb_flexserv.pcasuite.pcz_bfactor import pcz_bfactor
from biobb_flexserv.pcasuite.pcz_hinges import pcz_hinges
from biobb_flexserv.pcasuite.pcz_stiffness import pcz_stiffness
from biobb_flexserv.pcasuite.pcz_collectivity import pcz_collectivity


def main(config, system=None):
    start_time = time.time()
    conf = settings.ConfReader(config, system)
    global_log, _ = fu.get_logs(path=conf.get_working_dir_path(), light_format=True)
    global_prop = conf.get_prop_dic(global_log=global_log)
    global_paths = conf.get_paths_dic()

    global_log.info("step0_extract_model: Selecting a model")
    extract_model(**global_paths["step0_extract_model"], properties=global_prop["step0_extract_model"])

    global_log.info("step1_extract_chain: Selecting a monomer")
    extract_chain(**global_paths["step1_extract_chain"], properties=global_prop["step1_extract_chain"])

    global_log.info("step2_cpptraj_mask: Generating the reduced (coarse-grained) structure (backbone)")
    cpptraj_mask(**global_paths["step2_cpptraj_mask"], properties=global_prop["step2_cpptraj_mask"])

    global_log.info("step3_cpptraj_mask: Generating the reduced (coarse-grained) structure (alpha carbons)")
    cpptraj_mask(**global_paths["step3_cpptraj_mask"], properties=global_prop["step3_cpptraj_mask"])

    global_log.info("step4_concoord_dist: CONCOORD dist")
    concoord_dist(**global_paths["step4_concoord_dist"], properties=global_prop["step4_concoord_dist"])

    global_log.info("step5_concoord_disco: CONCOORD disco")
    concoord_disco(**global_paths["step5_concoord_disco"], properties=global_prop["step5_concoord_disco"])

    global_log.info("step6_cpptraj_rms: RMSd distribution of the generated ensemble (CONCOORD)")
    cpptraj_rms(**global_paths["step6_cpptraj_rms"], properties=global_prop["step6_cpptraj_rms"])

    global_log.info("step7_cpptraj_convert: Converting the generated ensemble (CONCOORD)")
    cpptraj_convert(**global_paths["step7_cpptraj_convert"], properties=global_prop["step7_cpptraj_convert"])

    global_log.info("step8_prody_anm: PRODY Anisotropic network model")
    prody_anm(**global_paths["step8_prody_anm"], properties=global_prop["step8_prody_anm"])

    global_log.info("step9_cpptraj_rms: RMSd distribution of the generated ensemble (PRODY)")
    cpptraj_rms(**global_paths["step9_cpptraj_rms"], properties=global_prop["step9_cpptraj_rms"])

    global_log.info("step10_cpptraj_convert: Converting the generated ensemble (PRODY)")
    cpptraj_convert(**global_paths["step10_cpptraj_convert"], properties=global_prop["step10_cpptraj_convert"])

    global_log.info("step11_bd_run: Brownian Dynamics (BD)")
    bd_run(**global_paths["step11_bd_run"], properties=global_prop["step11_bd_run"])

    global_log.info("step12_cpptraj_rms: Fitting and converting the generated ensemble (BD)")
    cpptraj_rms(**global_paths["step12_cpptraj_rms"], properties=global_prop["step12_cpptraj_rms"])

    global_log.info("step13_dmd_run: Discrete Molecular Dynamics (DMD)")
    dmd_run(**global_paths["step13_dmd_run"], properties=global_prop["step13_dmd_run"])

    global_log.info("step14_cpptraj_rms: RMSd distribution of the generated ensemble (DMD)")
    cpptraj_rms(**global_paths["step14_cpptraj_rms"], properties=global_prop["step14_cpptraj_rms"])

    global_log.info("step15_nma_run: Normal Mode Analysis (NMA)")
    nma_run(**global_paths["step15_nma_run"], properties=global_prop["step15_nma_run"])

    global_log.info("step16_cpptraj_rms: RMSd distribution of the generated ensemble (NMA)")
    cpptraj_rms(**global_paths["step16_cpptraj_rms"], properties=global_prop["step16_cpptraj_rms"])

    global_log.info("step17_cpptraj_convert: Converting the generated ensemble (NMA)")
    cpptraj_convert(**global_paths["step17_cpptraj_convert"], properties=global_prop["step17_cpptraj_convert"])

    global_log.info("step18_nolb_nma: NOn-Linear rigid Block NMA approach (NOLB)")
    nolb_nma(**global_paths["step18_nolb_nma"], properties=global_prop["step18_nolb_nma"])

    global_log.info("step19_cpptraj_rms: RMSd distribution of the generated ensemble (NOLB)")
    cpptraj_rms(**global_paths["step19_cpptraj_rms"], properties=global_prop["step19_cpptraj_rms"])

    global_log.info("step20_cpptraj_convert: Converting the generated ensemble (NOLB)")
    cpptraj_convert(**global_paths["step20_cpptraj_convert"], properties=global_prop["step20_cpptraj_convert"])

    global_log.info("step21_imod_imode: iMOD imode")
    imod_imode(**global_paths["step21_imod_imode"], properties=global_prop["step21_imod_imode"])

    global_log.info("step22_imod_imc: iMOD imc")
    imod_imc(**global_paths["step22_imod_imc"], properties=global_prop["step22_imod_imc"])

    global_log.info("step23_cpptraj_rms: RMSd distribution of the generated ensemble (iMOD)")
    cpptraj_rms(**global_paths["step23_cpptraj_rms"], properties=global_prop["step23_cpptraj_rms"])

    global_log.info("step24_cpptraj_convert: Converting the generated ensemble (iMOD)")
    cpptraj_convert(**global_paths["step24_cpptraj_convert"], properties=global_prop["step24_cpptraj_convert"])

    global_log.info("Compressing trajectories")
    traj_zip = "structure_concat_traj.zip"
    traj_list = [global_paths["step7_cpptraj_convert"]["output_cpptraj_path"],
                 global_paths["step10_cpptraj_convert"]["output_cpptraj_path"],
                 global_paths["step24_cpptraj_convert"]["output_cpptraj_path"],
                 global_paths["step12_cpptraj_rms"]["output_traj_path"],
                 global_paths["step17_cpptraj_convert"]["output_cpptraj_path"]]
    with zipfile.ZipFile(traj_zip, 'w') as myzip:
        for file in traj_list:
            myzip.write(file, PurePath(file).name, compress_type=zipfile.ZIP_DEFLATED)

    paths = global_paths["step25_trjcat"]
    paths["input_trj_zip_path"] = PurePath(Path().absolute()).joinpath(traj_zip)
    global_log.info("step25_trjcat: Building the meta-trajectory")
    trjcat(**paths, properties=global_prop["step25_trjcat"])

    global_log.info("step26_make_ndx: Clustering the meta-trajectory: make index")
    make_ndx(**global_paths["step26_make_ndx"], properties=global_prop["step26_make_ndx"])

    global_log.info("step27_gmx_cluster: Clustering the meta-trajectory: clustering")
    gmx_cluster(**global_paths["step27_gmx_cluster"], properties=global_prop["step27_gmx_cluster"])

    global_log.info("step28_cpptraj_rms: Fitting and converting the generated ensemble (meta-trajectory)")
    cpptraj_rms(**global_paths["step28_cpptraj_rms"], properties=global_prop["step28_cpptraj_rms"])

    global_log.info("step29_pcz_zip: Computing the Principal Component Analysis (PCA) - ensemble")
    pcz_zip(**global_paths["step29_pcz_zip"], properties=global_prop["step29_pcz_zip"])

    global_log.info("step30_pcz_zip: Computing the Principal Component Analysis (PCA) - ensemble gaussian")
    pcz_zip(**global_paths["step30_pcz_zip"], properties=global_prop["step30_pcz_zip"])

    global_log.info("step31_pcz_info: Analysing the PCA report")
    pcz_info(**global_paths["step31_pcz_info"], properties=global_prop["step31_pcz_info"])

    global_log.info("step32_pcz_evecs: PCA Eigenvectors & Eigenvalues")
    pcz_evecs(**global_paths["step32_pcz_evecs"], properties=global_prop["step32_pcz_evecs"])

    global_log.info("step33_pcz_animate: Animate Principal Components")
    pcz_animate(**global_paths["step33_pcz_animate"], properties=global_prop["step33_pcz_animate"])

    global_log.info("step34_cpptraj_convert: Converting the generated ensemble (PCA)")
    cpptraj_convert(**global_paths["step34_cpptraj_convert"], properties=global_prop["step34_cpptraj_convert"])

    global_log.info("step35_pcz_bfactor: B-Factor x Principal Components")
    pcz_bfactor(**global_paths["step35_pcz_bfactor"], properties=global_prop["step35_pcz_bfactor"])

    global_log.info("step36_pcz_hinges: Hinge Points Prediction - Bfactor_slope")
    pcz_hinges(**global_paths["step36_pcz_hinges"], properties=global_prop["step36_pcz_hinges"])

    global_log.info("step37_pcz_hinges: Hinge Points Prediction - Dynamic_domain")
    pcz_hinges(**global_paths["step37_pcz_hinges"], properties=global_prop["step37_pcz_hinges"])

    global_log.info("step38_pcz_hinges: Hinge Points Prediction - Force_constant")
    pcz_hinges(**global_paths["step38_pcz_hinges"], properties=global_prop["step38_pcz_hinges"])

    global_log.info("step39_pcz_stiffness: Apparent Stiffness")
    pcz_stiffness(**global_paths["step39_pcz_stiffness"], properties=global_prop["step39_pcz_stiffness"])

    global_log.info("step40_pcz_collectivity: Collectivity Index")
    pcz_collectivity(**global_paths["step40_pcz_collectivity"], properties=global_prop["step40_pcz_collectivity"])

    elapsed_time = time.time() - start_time
    global_log.info('')
    global_log.info('')
    global_log.info('Execution successful: ')
    global_log.info('  Workflow_path: %s' % conf.get_working_dir_path())
    global_log.info('  Config File: %s' % config)
    if system:
        global_log.info('  System: %s' % system)
    global_log.info('')
    global_log.info('Elapsed time: %.1f minutes' % (elapsed_time/60))
    global_log.info('')


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Based on the official Gromacs tutorial")
    parser.add_argument('--config', required=True)
    parser.add_argument('--system', required=False)
    args = parser.parse_args()
    main(args.config, args.system)

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/bioexcel/biobb_workflows/tree/master/biobb_wf_flexdyn/python
Name: python-protein-conformational-ensembles-generation
Version: Version 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...