This is a machine-learning based protein-ligand scoring function.
Create environment
make Makefile create_environment
conda activate DXGB
make Makefile requirements
Install xgboost, rdkit and obabel, these packages can be easily installed using anaconda
conda install -c conda-forge xgboost=0.80.0
conda install -c rdkit rdkit=2019.03.1
conda install -c openbabel openbabel
Install source code
python setup.py install
To calculate Vina and SASA features, you should install mgltools, msms and a modified version of AutoDock Vina. To obatin deltaVinaRF predicted scores, you should also install R and its randomForest library.
Download MGLTools and MSMS from http://mgltools.scripps.edu/downloads, choose right version for your platform (Linux or Mac).
tar -xvzf mgltools_x86_64Linux2_1.5.6.tar.gz
cd mgltools_x86_64Linux2_1.5.6/
./install.sh
Remember use correct mgltoos name (Linux or Mac) when install mgltools.
mkdir msms
tar -xvzf msms_i86_64Linux2_2.6.1.tar.gz -C msms
cd msms
cp msms.x86_64Linux2.2.6.1 msms
In msms folder, there is a script pdb_to_xyzr. Change the line numfile = "./atmtypenumbers" to be numfile = "YourPATHofddeltaVinaXGB/DXGB/atmtypenumbers". atmtypenumbers file we used can be found in deltaVinaXGB/DXGB directory
Test pdb_to_xyzr
pdb_to_xyzr 1crn.pdb > 1crn.xyzr
If it doesn't work, try
./pdb_to_xyzr 1crn.pdb > 1crn.xyzr
If error (nawk: command not found) appears, change nawk to awk in pdb_to_xyzr (line 31)
Download modified Vina from https://github.com/chengwang88/vina4dv
Extract files from zip file.
If directory name is vina4dv_master, change it into vina4dv.
Download R from https://cran.r-project.org/ and install randomForest in R by
install.packages('randomForest')
If you have the dependencies installed already. Several environment variables need to be set in .bashrc (Linux) or .bash_profile (macOS) file in your home directory. An example is given below. You can modify the path based on your case. In this example, all softwares are installed under /home/jl7003 directory.
# path for MSMS
export PATH=$PATH:/home/jl7003/msms/
# set mgltool variable (if mac, should change mgltools_x86_64Linux2_1.5.6 into your downloaded mac version)
export PATH=$PATH:/home/jl7003/mgltools_x86_64Linux2_1.5.6/bin/
export MGL=/home/jl7003/mgltools_x86_64Linux2_1.5.6/
export MGLPY=$MGL/bin/python
export MGLUTIL=$MGL/MGLToolsPckgs/AutoDockTools/Utilities24/
# set vina dir (if mac, should use /mac/release/)
export VINADIR=/home/jl7003/vina4dv/build/linux/release/
# set DXGB dir (needed if run deltaVinaRF20 score)
export DXGB=/home/jl7003/deltaVinaXGB/DXGB
After set all environment variables, open a new window to enable all setups.
Dockerfile is built based on all the instructions above. It offers a ready-to-go environment including all the dependencies.
Build the docker image:
cd deltaVinaXGB
docker build -t delta-vina-xgb .
Run the docker image:
docker run -it delta-vina-xgb bash
Now you are in an isolated environment where you can use all the functionalities of this repo.
Before calculating features, three structure inputfiles are needed:
pdbid_ligand.mol2/sdf --> ligand structure file
pdbid_ligand.pdb --> (optional, needed to provid if you are not sure about your mol2/sdf file quality; this pdb file can be used to calculate all features excluding ligand stability features, and you still can get score) ligand structure file
pdbid_protein.pdb --> protein structure file
pdbid_protein_all.pdb --> protein with water molecules structure file
Examples for structure files can be found in Test_2al5 directory.
Note: these three files are needed to run predictions. All files should include hydrogens. If protein files with water molecules are not available, just copy the original protein.pdb to protein_all.pdb.
If features have been already calculated, only features files are needed:
Input.csv --> Input feature file
Example for Input.csv can be found in Test directory.
Activate DXGB environment
conda activate DXGB
All scripts are in DXGB directory.
cd DXGB
Check all options and defaults
python run_DXGB.py --help
The script can be run for one complex by
python run_DXGB.py --runfeatures --datadir ../Test_2al5 --pdbid 2al5 --average
--runfeatures is feature calculation, default is to calculate all features
--datadir is for structure files datadir
--pdbid is for structure pdbid, can be other type of index
--average is to calculate average scores from 10 models
Or it can also be run by providing a list of protein-ligand complex with input features as in Input.csv
python run_DXGB.py --datadir ../Test --average
Default is to predict scores for provided structures. If you want to get scores with explicit water molecules, and optimized ligands:
python run_DXGB.py --runfeatures --datadir ../Test_2al5 --pdbid 2al5 --water rbw --opt rbwo --average
--water is for consideration of water, rbw is to consider both receptor-bound water and bridging water.
(Other options: rw: only consider receptor-bound water; bw: only consider bridging water, using bw.)
--opt is for optimization, rbwo is to optimize ligand in no water, bridging water, and receptor-bound water environments.
(Other options: o: optimize ligand without water; rwo: optimize ligand with no water and with receptor-bound water; bwo: optimize ligand with no water and bridging water.)
Reminder optimization is based on protein structure, thus, you should optimization option that matches your water option.
The calculated features will be saved in Input.csv
file
The predicted scores for different structures of Vina, and deltaVinaXGB will be saved in outfile (default is score.csv
) in datadir.
If you want to get deltaVinaRF scores as well, add --runrf.
Note:
- Linux is the recommended system for our package. Our package also supports Mac OSX.
- Ligand structure should includes both atom and bond information, such as mol2 and sdf. Be careful when using mol2 file as input format, some atom types are not recognized in RDKit (O.co2 for O in C-PO32- group).
- Using different version of RDKit, the ligand stability features can be slightly different.
- If the mol2/sdf file can not be processed by RDKit, we will skip the ligand stability feature calculation and used default value (dE:-300, RMSD:300).
- Abbrevations: RW --> receptor water; BW --> bridging water
- Lu, J. N.; Hou, X. B.; Wang, C.; Zhang, Y. K., Incorporating Explicit Water Molecules and Ligand Conformation Stability in Machine-Learning Scoring Functions. J. Chem. Inf. Model. 2019, https://pubs.acs.org/doi/abs/10.1021/acs.jcim.9b00645
- Wang, C.; Zhang, Y. K., Improving Scoring-Docking-Screening Powers of Protein-Ligand Scoring Functions using Random Forest. J. Comput. Chem. 2017, 38, 169-177. https://doi.org/10.1002/jcc.24667