Harmonizing QSAR machine learning-based models and docking approaches for the identification of novel HDAC2 inhibitors

by Dao Quang Tung, Do Thi Mai Dung, Nguyen Thanh Cong, Dao Ngoc Nam Hai, Daniel Baecker, Phan Thi Phuong Dung, Nguyen Hai Nam, Nguyen Ngoc An*

*Correspondence: ngocan@vnu.edu.vn (N.N.A)

This repository contains the code and data supporting our paper, which has been submitted for publication in Link.

Model data

If you want to run further tests, please use the data in https://github.com/Lelvels/qsar_ml_find_hdac2_inhibitor/blob/main/data_for_modeling/train_test_data/HDAC2_train_test_data_final.xlsx.

Dependencies and implementation

You will need a working Python environment to run the code. The recommended way to set up your environment is through the Anaconda Python distribution, which provides the conda package manager. Anaconda can be installed in your user directory and does not interfere with the system Python installation. The required dependencies are specified in the *.yml files in the env folder. We recommend running the commands below in a Linux terminal.

Run the following commands in the repository folder (where the env/*.yml files are located) to create separate environments and install all required dependencies in them:

conda env create -f my-rdkit-env.yml -n your_env_name
conda env create -f tmap-env.yml -n your_other_env_name

Then verify that the new environments were created correctly:

conda env list
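Beyond listing the environments, it can help to confirm that the key packages are importable from inside the activated environment. The snippet below is a minimal sketch using only the Python standard library; the package names in `required` are assumptions for illustration and should be adjusted to match the contents of the env/*.yml files.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed package names (not taken from the repository); edit as needed.
    required = ["rdkit", "pandas", "sklearn"]
    missing = missing_modules(required)
    print("missing modules:", ", ".join(missing) if missing else "none")
```

Run it with the environment activated (conda activate your_env_name); an empty result suggests the listed packages were installed.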

Our screening dataset is stored in a PostgreSQL database; installation instructions are available on the official PostgreSQL website. The screening dataset is available in this link (8 GB after decompression). After importing the database, make a copy of the file env/env.example, rename the copy to .env, and fill in the database URL in that file:

DATABASE_URL = postgresql://<host_url>/<database_name>
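To check that the .env entry is well formed before running anything against the database, a small parser can be used. This is a minimal sketch with the standard library only, not the loader used by the code in src; the helper names `read_env_value` and `describe_database_url`, and the host and database name in the example, are hypothetical.

```python
from urllib.parse import urlsplit


def read_env_value(text, key):
    """Return the value assigned to `key` in .env-style text, or None if absent."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip('"').strip("'")
    return None


def describe_database_url(url):
    """Split a database URL into the parts a reader usually wants to check."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    # Hypothetical values for illustration; replace with your own .env contents.
    example = "DATABASE_URL = postgresql://localhost:5432/my_screening_db\n"
    url = read_env_value(example, "DATABASE_URL")
    print(describe_database_url(url))
```

In practice, the text would come from reading the .env file itself, for example with pathlib.Path(".env").read_text().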

Project structure

The source code is available in the src folder.
The results of our work are in the results folder.

Dataset location

The training, test, and validation data are available in the data_for_modeling/train_test_data folder.
The screening dataset is available in the data_for_modeling/screening_dataset folder. If you want the raw data from the database, it is available in this link (8 GB after decompression).