/BioEn

BioEn - Bayesian Inference Of ENsembles

Primary LanguagePythonOtherNOASSERTION

BioEn - Bayesian Inference Of ENsembles

/img/bioen_logo_480.png

Authors
César Allande, Jürgen Köfinger, Katrin Reichel, Klaus Reuter, Lukas S. Stelzl
Year
2018
Licence
GPLv3
Copyright
© 2018 César Allande, Gerhard Hummer, Jürgen Köfinger, Katrin Reichel, Klaus Reuter, Lukas S. Stelzl

References

  • [Hummer2015] Hummer G. and Koefinger J., Bayesian Ensemble Refinement by Replica Simulations and Reweighting. J. Chem. Phys. 143(24):12B634_1 (2015). https://doi.org/10.1063/1.4937786
  • [Rozycki2011] Rozycki B., Kim Y. C., Hummer G., SAXS Ensemble Refinement of ESCRT-III Chmp3 Conformational Transitions Structure 19 109–116 (2011). https://doi.org/10.1016/j.str.2010.10.006
  • [Reichel2018] Reichel K., Stelzl L. S., Köfinger J., Hummer G., Precision DEER Distances from Spin-Label Reweighting, J. Phys. Chem. Lett. 9 19 5748-5752 (2018). https://doi.org/10.1021/acs.jpclett.8b02439
  • [Köfinger2019] Koefinger J., Stelzl L. S. Reuter K., Allande C., Reichel K., Hummer G., Efficient Ensemble Refinement by Reweighting J. Chem. Theory Comput. Article ASAP https://doi.org/10.1021/acs.jctc.8b01231

Description

BioEn integrates a broad range of experimental data to refine ensembles of structures. For a detailed description of the procedures and the algorithm, we refer to [Hummer2015].

/img/bioen.png

Overview of the BioEn software

The BioEn software consists of two Python packages:

  • optimize provides algorithms to solve the optimization problem underlying ensemble refinement by reweighting. This package can be used independently of the 'analyze' package.
  • analyze uses the 'optimize' package to integrate a wide range of experimental data and simulations in a user friendly way.

Help

Please, if you have an issue with the software, open an issue on the github repository https://github.com/bio-phys/bioen/issues.

If you have any questions or suggestions, please contact bioen@biophys.mpg.de.

Dependencies and Software Requirements

  • Python 2.7 or Python 3
  • Python packages: numpy, scipy, Cython, pyyaml, MDAnalysis, pandas
  • GCC (>= 4.9)
  • GSL (>= 2.1)
  • LIBLBFGS (>= 1.10)

Copies of GSL and LIBLBFGS are included in the source distribution of BioEn. To run Jupyter notebooks *.ipynb in ./examples/ you need additionally

Installation

Dependencies

In addition to the multidimensional minimizers from SciPy, BioEN supports the minimizers provided by the GSL library and by the LIBLBFGS library which do increase the performance significantly. To obtain these libraries, the following options are possible, where option 1 is recommended for most users:

  1. In the directory 'third-party', copies of the source codes of GSL and LIBLBFGS including a script to build and install them are provided. After having run the build script './install_dependencies.sh', both the libraries are placed into the '~/.local' directory where BioEN's 'setup.py' will find and use them automatically.
  2. Alternatively, you may install GSL and LIBLBFGS to a default location using the package manager of your operating system, where 'setup.py' will find them typically, as well.
  3. Finally, on some HPC systems, GSL and LIBLBFGS may already be provided via environment modules. In this case, load the respective modules before installing BioEN.

Note that the environment variables 'GSL_HOME' and 'LIBLBFGS_HOME' are evaluated by 'setup.py'. If set, they are assumed to contain the path to a valid installation of GSL or LIBLBFGS which will then be used. Setting these variables is only necessary if the libraries were installed to a non standard location, e.g. with Homebrew on the Mac.

Installation

Once the dependencies are available (see above), install the package as follows:

BIOEN_OPENMP=1 python setup.py install [--user]

BIOEN_OPENMP set to 1 enables OpenMP parallelism. On OSX use BIOEN_OPENMP=0.

You can use the optional '--user' flag for a local installation which does not require root privileges. When you install BioEn locally, please make sure that '$HOME/.local/bin' is on your path. You can add the folder to the path, e.g., by adding 'export PATH=$HOME/.local/bin:$PATH' to your '.bashrc' file. In a conda-environment, install the package without the '--user' flag.

Usage

We want to integrate a diverse set of experimental data with simulated observables. Therefore, we implemented three types of chi-square calculations to use different kinds of experimental data:

  • Generic data (chi-square optimization without nuisance parameters)
  • DEER/PELDOR data (includes the modulation depth as a nuisance parameter)
  • SAXS/WAXS/SANS data (includes the scaling parameter and constant offset as a nuisance parameter)

BioEn can also be used to obtain precision DEER distances from spin-label ensemble refinement [Reichel2018], for which we provide an example.

(1) Generic data

The term generic data refers to experimental data, where measurements provide single data points including noise (e.g. NOE, PREs, chemical shifts, J-couplings, distances, chemical cross-links etc). To use generic data, the bioen options should contain --experiments generic. In the experimental data file (e.g. ./test/generic/data/exp-generic.dat), the ID (first column) of a data point (second column) and its noise (third column) has to be provided. The ID refers than to the file from the simulated data (e.g. ./test/generic/data/sim-noe_1-generic.dat), in which each line is the simulated data point from a single ensemble member (e.g., simualted data extracted from a trajectory of a MD simulation).

The full list of options for generic data is:

--sim_path
--sim_prefix
--sim_suffix
--exp_path
--exp_prefix
--exp_suffix
--data_IDs
--data_weight
--input_pkl
--output_pkl

Please take note of the options --sim_path, --sim_prefix, --sim_sufffix, --exp_path, --exp_prefix, and --exp_suffix. These are useful to define the path to and names of the files. Defaults are provided.

(2) Experimental data from DEER/PELDOR measurements

For the reweighting with experimental data including a nuisance parameter (here: modulation depth), the structure of the input files is extended and more information is needed. To use DEER data, the bioen options should contain --experiments deer. In the case of DEER data, we can either perform reweighting over an ensemble of conformations with averaged spin-label rotamer states or over an ensemble of spin-label rotamer states with a single protein conformation.

If an ensemble of conformations is investigated, provide for each label pair (e.g. 319-259) a single file of the experimental data (e.g., ./test/deer/data/exp-319-259-deer.dat) and ensemble member (e.g., ./test/deer/data/conf0-319-259-deer.dat). The experimental data file contains:

#time   #raw        #polyfit
0.0     0.9886054   1.0
0.008   0.97737117  0.99091340848
0.016   1.0         0.988879614369
0.024   0.97842962  0.984631477624
0.032   0.98185696  0.983339482409

The simulated data file (e.g. conf0-319-259-deer.dat) contains:

#time   #simulated_data
0.0     1.0
0.008   0.99984697806
0.016   0.999388027044
0.024   0.998623491217
0.032   0.997553943855

Using DEER data in BioEn, the models file (models-deer.dat) is of particular interest: listed numbers (model IDs) in this file have to be the same as the deer file names (conf0-319-259-deer.dat, conf1-319-259-deer.dat, conf2-319-259-deer.dat and so on).

If an ensemble of spin-label rotamer states is investigated, we recommend to use the Jupyter notebook deer_spin_label_reweighting.ipynb in ./examples/DEER/rotamer-refinement/single_trace/. Here, the user can define the protein structure and a own rotamer library (or use the default). By executing the cells in the notebook, data preparation, BioEn run, and analysis can be performed in a smooth procedure. The analysis of the BioEn data include also the L-curve analysis. More details on the method are provided in [Reichel2018].

For both cases, refinement over an ensemble of protein conformations or over spin-label rotamer states, the modulation depth as the nuisance parameter is relevant. With the option --deer_modulation_depth, an initial guess ("<path_to_file>/modulation-depth.dat") can be provided or an initial optimization ("initial-optimization") can be performed for each spin-label pair. As indicated above, the modulation depth is needed to calculate the consistency of the simulated data with the experimental data correctly. To achieve this, we have to iteratively optimize the weights of the ensemble members and the modulation depth. For all cases tested with DEER data, 10 iterations seems to be sufficient until the optimization converges. To do so, we recommend to set the option --number_of_iterations to 10 or higher.

The full list of options for DEER data is:

--deer_sim_path
--deer_sim_prefix
--deer_sim_suffix
--deer_exp_path
--deer_exp_prefix
--deer_exp_suffix
--deer_labels
--deer_noise
--deer_modulation_depth
--deer_input_pkl
--deer_input_hd5
--deer_output_pkl
--deer_input_sim_pkl
--deer_input_sim_hd5

Please take note of the options --deer-sim_path, --deer_sim_prefix, --deer_sim_suffix, --deer_exp_path, --deer_exp_prefix, and --deer_exp_suffix. These options are useful to define the names of the simulated and experimental files. In addition, please define the spin-label pairs with --deer_labels (e.g.; "319-259,370-259"), which is also part of the experimental and simulated data file names (see above).

(3) Experimental data from SAXS/WAXS measurements

BioEn can be used with scattering data like SAXS or WAXS, for which we provide also the optimization of the nuisance parameter (here: coefficient). To use scattering data, the bioen options should contain --experiments scattering. The input data is handled in a similar way as the DEER data, but just for a single scattering curve and not different label-pairs. The standard file format for experimental data (e.g. lyz-exp.dat) is:

#   q                 I(q)      error/noise
4.138455E-02        5.904029    1.555333E-01
4.371607E-02        5.652469    1.527037E-01
4.604759E-02        5.533381    1.521723E-01
4.837912E-02        5.547052    1.474577E-01
5.071064E-02        5.296281    1.436712E-01

The simulated data file (e.g. lyz0-sim-saxs.dat) contains:

#   q               I(q)
4.138454e-02    2.906550e+06
4.371607e-02    2.865970e+06
4.604758e-02    2.823741e+06
4.837911e-02    2.779957e+06
5.071064e-02    2.734716e+06

To handle different data input, we recommend to use the ipython notebook ./examples/scattering/scattering_reweighting.ipynb.

The full list of options for scattering data is:

--scattering_sim_path
--scattering_sim_prefix.
--scattering_sim_suffix
--scattering_exp_pPath
--scattering_exp_prefix
--scattering_exp_suffix
--scattering_noise
--scattering_coefficient
--scattering_data_weight
--scattering_input_pkl
--scattering_input_hd5
--scattering_input_sim_pkl
--scattering_input_sim_hd5
--scattering_output_pkl

Please take note of the options --scattering_sim_prefix, --scattering_sim_sufffix, --scattering_exp_prefix, and --scattering_exp_suffix. These options are useful to define the names of the files of experimental and simulated.

As indicated above, a nuisance parameter (here: coefficient) is needed to calculate the consistency of the simulated data with the experimental data correctly. To achieve this, we have to iteratively optimize the weights of the ensemble members and the coefficient. For all cases tested with scattering data, 10 iterations seems to be sufficient until the optimization converges. To do so, we recommend to set the option --number_of_iterations to 10 or higher.

(4) Experimental data from Circular dichroism (CD) measurements

BioEn can be used with CD data. To use CD data, the bioen options should contain --experiments cd. The standard file format for experimental data (e.g. exp-cd.dat) is:

#wavelength   cd_raw          cd_poly_fit
190           -5.808250e+03   -6.356057524681091309e+03
191           -8.324000e+03   -8.437500596046447754e+03
192           -1.228125e+04   -1.166270971298217773e+04
193           -1.553750e+04   -1.528861761093139648e+04
194           -1.938975e+04   -1.879757702350616455e+04

The simulated data file (e.g. conf1-cd.dat) contains:

#wavelength   cd
190           1.522400000000000000e+04
191           1.838200000000000000e+04
192           2.215800000000000000e+04
193           2.556100000000000000e+04
194           2.796400000000000000e+04

The full list of options for scattering data is:

--cd_sim_path
--cd_sim_prefix.
--cd_sim_suffix
--cd_exp_pPath
--cd_exp_prefix
--cd_exp_suffix
--cd_noise
--cd_data_weight
--cd_input_pkl
--cd_input_hd5
--cd_input_sim_pkl
--cd_input_sim_hd5
--cd_output_pkl

Please take note of the options --cd_sim_prefix, --cd_sim_sufffix, --cd_exp_prefix, and --cd_exp_suffix. These options are useful to define the names of the files of experimental and simulated.

Other options and settings

The initial and reference weights can be set with --reference_weights and --initial_weights. For both options, one can either choose uniform (uniformly distributed weights; default), random (randomly distributed weights), or provide a file as input.

As described in [Hummer2015], we have to balance the consistency with the experimental data (chi-square) with the changes in the weights (relative entropy) by the confidence parameter theta. We can achieve this aim by the maximum-entropy principle and as such avoid over-fitting. To decide for the correct confidence parameter theta for a specific set of data, usually a theta-series is applied. This means, that for each theta an independent ensemble refinement run is performed. Subsequent L-curve analysis (relative entropy vs. chi-square) leads us to the optimal weight distribution. Please note, that the choice of the confidence parameter depends on the system and data. In the BioEn software package, one can choose --theta by defining a single value (e.g., 10.0) or a theta-series, which can be provided as a list (e.g., 100.0,10.0,1.0) or a list in a file (e.g., <path_to_file>/thetas.dat).

To check the BioEn results quickly, a simple plot can be generated, that compares experimental data and ensemble averaged simulated data for the used confidence values. Therefore, the following three options have to be set: --simple_plot, --simple_plot_input and --simple_plot_output. The file name of the output pkl file has to be provided for --simple_plot_input. The data in this pkl file is visualized and saved in a pdf file, which can be specified with --simple_plot_output.

Misc options

The option --output_pkl_input_data can be used to generate a pkl file of all settings, parameters and weights from the previous BioEn run. This file can then be used afterwards with --input_pkl to restart the BioEn calculation.

Minimal example

The minimal amount of input parameters are:

  • number of ensemble members (--number_of_models)
  • list of models (--models_list)
  • type of experiments (--experiments)
  • input experimental and simulated data

In case you have data from NMR measurements (e.g. NOEs), a typical invocation would look like this:

bioen \
    --number_of_models 10 \
    --models_list <path-to-data>/models-generic.dat \
    --experiments generic \
    --theta 0.01 \
    --sim_path <path-to-data> \
    --exp_path <path-to-data> \
    --data_ids all

We provide example test scripts run_bioen*.sh in ./test/generic/, ./test/deer/, and ./test/scattering/ to run BioEn with the three mentioned types of data.

Default settings

The default setting for reweighting is log-weights for the procedure and bfgs2 for the optimization algorithm.

Output

Three BioEn output files are generated by default, for which you can choose the file names or leave it with the default naming.

  1. The most useful BioEn output file is in pickle (pkl) format. Choose the name of this file with the option --output_pkl. The default file name is bioen_result.pkl. This pkl file contains all relevant information from the weight optimization including experimental data, ensemble averaged data, (reference, initial, and optimized) weights, consistency of simulated data with experimental data (chi-squared), relative entropy, etc. For a complete analysis of your BioEn calculations, this file is essential.
  2. The second file contains a list of weights in text file format. The name can be choosen with --output_weights. The default name is bioen_result_weights.dat. But careful, it generates this file only for the smallest confidence value theta.
  3. The third files contains for each ensemble member the corresponding weight. This file is similar to the second file, however, it includes also the IDs of each ensemble member and is as such in a tabular form. The name of the file can be chosen by --output_models_weights with the default file name bioen_result_models_weights.dat. Also here, this file is generated from the smallest confidence value theta.

Misc information

We recommend to have a close look at the files in the folders ./test/generic/, ./test/deer/, and ./test/scatter/. These files can be used to understand and transfer the own scientific questions to BioEn. Lines including # are in general ignored.

For further options and more information, type:

bioen --help

FAQs

Q: All my optimization yield "fmin_final = 0.0". What is going on?

A: This could indicate that the path to fast libraries was not properly set before installing the package.