PharmaBench

Source code for the paper "PharmaBench: Enhancing ADMET benchmarks with large language models"

We present PharmaBench, comprising a total of 156,618 raw entries, compiled from 14,401 bioassays. Researchers and scientists who would like to contribute data should contact the database's maintainers.

Abstract

We propose a novel data mining approach that incorporates a multi-agent system based on Large Language Models (LLMs) to effectively identify experimental conditions within the descriptions of 14,401 bioassays. Additionally, we have proposed a data processing workflow to integrate data from various sources, totaling 156,618 raw entries. Through this workflow, we have constructed PharmaBench, a comprehensive benchmark set for ADMET properties, which includes eleven ADMET datasets and 52,482 entries. This benchmark set is designed to serve as an open-source dataset for the development of deep learning and machine learning models relevant to drug discovery projects. It is particularly aimed at research in transfer learning, pre-training, and explainable AI.

[Figure: workflow]

Data Records

| Category | Property Name | Entries After Data Processing | Final Entries for AI Modelling | Unit | Task Type |
| --- | --- | --- | --- | --- | --- |
| Physicochemical | LogD | 14,141 | 13,068 | - | regression |
| | Water Solubility | 14,818 | 11,701 | log10nM | regression |
| Absorption | BBB | 12,486 | 8,301 | - | classification |
| Distribution | PPB | 1,310 | 1,262 | % | regression |
| Metabolism | CYP 2C9 | 4,507 | 999 | Log10uM | regression |
| | CYP 2D6 | | 1,214 | Log10uM | regression |
| | CYP 3A4 | | 1,980 | Log10uM | regression |
| Clearance | HLMC | 5,252 | 2,286 | Log10(mL.min-1.g-1) | regression |
| | RLMC | | 1,129 | Log10(mL.min-1.g-1) | regression |
| | MLMC | | 1,403 | Log10(mL.min-1.g-1) | regression |
| Toxicity | AMES | 24,780 | 9,139 | - | classification |
| Total | | 77,294 | 52,482 | | |
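
As a quick sanity check, the Total row can be reproduced from the per-dataset counts. The sketch below copies the numbers from the table; the variable names are illustrative, and it assumes that the single "Entries After Data Processing" count listed for Metabolism and for Clearance covers all three datasets in that group, which is the reading under which the totals add up.

```python
# Final entries per dataset, copied from the Data Records table.
final_entries = {
    "LogD": 13_068,
    "Water Solubility": 11_701,
    "BBB": 8_301,
    "PPB": 1_262,
    "CYP 2C9": 999,
    "CYP 2D6": 1_214,
    "CYP 3A4": 1_980,
    "HLMC": 2_286,
    "RLMC": 1_129,
    "MLMC": 1_403,
    "AMES": 9_139,
}

# Entries after data processing, one count per row group in the table.
# Assumption: the CYP count and the clearance count each cover three datasets.
processed_entries = {
    "LogD": 14_141,
    "Water Solubility": 14_818,
    "BBB": 12_486,
    "PPB": 1_310,
    "CYP (2C9, 2D6, 3A4)": 4_507,
    "Clearance (HLMC, RLMC, MLMC)": 5_252,
    "AMES": 24_780,
}

total_final = sum(final_entries.values())
total_processed = sum(processed_entries.values())
print(len(final_entries), total_processed, total_final)  # 11 77294 52482
```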

Usage

PharmaBench contains eleven ADMET datasets. Each dataset provides standardized SMILES representations of the compounds as model inputs and the experimental values as prediction targets. For a fair comparison, use the scaffold_train_test_label and random_train_test_label columns as the train-test splits.

The final datasets are stored under data/final_datasets/ and can be loaded with pandas:

import pandas as pd

# load the AMES classification dataset
data = pd.read_csv("data/final_datasets/ames_cls_final_data.csv")

# scaffold-based split
scaffold_training = data[data['scaffold_train_test_label'] == 'train']
scaffold_test = data[data['scaffold_train_test_label'] == 'test']

# random split
random_training = data[data['random_train_test_label'] == 'train']
random_test = data[data['random_train_test_label'] == 'test']
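
The snippet above filters on the strings 'train' and 'test', while the column descriptions in this README mention 1 and 0. A minimal sketch of a helper that accepts either encoding is shown here; `normalize_split_label` is a hypothetical name, not part of the repository, and which encoding the CSV files actually use is not assumed.

```python
def normalize_split_label(value) -> str:
    """Map a raw split label to 'train' or 'test'.

    Accepts the strings 'train'/'test' and the numeric codes 1/0
    (as numbers or as text); anything else raises ValueError.
    """
    text = str(value).strip().lower()
    if text in {"train", "1", "1.0"}:
        return "train"
    if text in {"test", "0", "0.0"}:
        return "test"
    raise ValueError(f"unexpected split label: {value!r}")


# Made-up labels covering both encodings.
raw_labels = ["train", "test", 1, 0, 1.0, "0.0"]
normalized = [normalize_split_label(v) for v in raw_labels]
print(normalized)
```

With pandas, the helper can be applied to a whole column, for example `data['random_train_test_label'].map(normalize_split_label)`, before filtering on 'train' or 'test'. Raising on unknown values keeps unexpected labels from being silently dropped from both splits.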

The column descriptions are:

| Column Name | Description | Data Type |
| --- | --- | --- |
| Smiles_unify | Standardized SMILES representation of the compound, based on the standardization methods described under Data Standardization | String |
| value | Experimental value for the dataset: a regression value or a classification label | Float |
| property | Name of the ADMET property measured in the experiment | String |
| scaffold_train_test_label | Training label based on scaffold splitting, where 1 represents the training data and 0 represents the testing data | Float |
| random_train_test_label | Training label based on random splitting, where 1 represents the training data and 0 represents the testing data | Float |
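
Before training on a file, it can help to confirm that the columns listed above are present. The following is a small sketch using only the standard library; `missing_columns` and the sample CSV text are illustrative and not taken from the repository.

```python
import csv
import io

# Column names as listed in the column description table.
EXPECTED_COLUMNS = [
    "Smiles_unify",
    "value",
    "property",
    "scaffold_train_test_label",
    "random_train_test_label",
]


def missing_columns(header) -> list:
    """Return the expected column names that are absent from the header."""
    return [name for name in EXPECTED_COLUMNS if name not in header]


# Made-up CSV text that lacks the 'property' column on purpose.
sample_csv = (
    "Smiles_unify,value,scaffold_train_test_label,random_train_test_label\n"
    "CCO,0.5,train,test\n"
)
header = next(csv.reader(io.StringIO(sample_csv)))
missing = missing_columns(header)
print(missing)  # ['property']
```

The same function works on a pandas frame by passing `data.columns` as the header.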

Setting Up the Working Environment

The dependencies are listed in requirements.txt. To create a conda environment from this file, run:

conda create --name <env> --file requirements.txt

Citation

The materials and data under this repo are distributed under the CC0 Licence.