mims-harvard/TDC

BBB permeability - New dataset

devanshamin opened this issue · 7 comments

Describe the problem
TDC currently offers the BBB_martins dataset for blood-brain barrier (BBB) permeability, which contains only 2030 compounds. A much larger dataset, the Blood-Brain Barrier Database (B3DB), contains 7807 compounds.

Describe the solution you'd like
Include the dataset in the Single-instance Prediction Problem (ADME) and in the ADMET Benchmark Group.

from tdc.single_pred import ADME
data = ADME(name="B3DB")

Additional context
B3DB - https://github.com/theochem/B3DB

Hi Devansh! Thanks for the pointer! This definitely sounds relevant! Would you like to contribute to TDC? Let us know, thanks!

I will work on this

@kexinhuang12345 Hi Kexin! I am interested in adding the BBB dataset to TDC. So far the steps I identified are:

  1. Add a bbb.py file to the tdc/single_pred folder. On closer inspection, BBB belongs to ADME, so no new file is needed in this folder.
  2. Add the appropriate re-export to tdc/single_pred/__init__.py. For the same reason, this step is not necessary.
  3. Download the data and send it to you to be stored in Dataverse.
  4. Insert the names of the classification and regression versions of the B3DB dataset at line 119 of tdc/metadata.py:
adme_dataset_names = [
    # ...
    "clearance_microsome_az",
    "b3db_classification", # Added
    "b3db_regression", # Added
]
  5. Add both names to the name2type mapping at line 627:
name2type = {
    # ...
    "bbb_adenot": "tab",
    "b3db_classification": "tab", # Added
    "b3db_regression": "tab", # Added
    "bbb_martins": "tab",
    # ...
}
  6. I am unsure how to generate the ID to put in name2id at line 740. Is it obtained by adding the dataset to the data server?
  7. Same question, but for name2stats at line 907.
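
The metadata steps above can be double-checked with a small script. This is only a minimal sketch: the lists and dictionaries below are illustrative stand-ins for the structures in tdc/metadata.py (not copies of them), and `missing_registrations` is a hypothetical helper, not part of TDC. It reports which registries still lack an entry for each new dataset name.

```python
# Illustrative stand-ins for the structures in tdc/metadata.py.
# Their contents here are assumptions made for the example only.
adme_dataset_names = [
    "clearance_microsome_az",
    "b3db_classification",
    "b3db_regression",
]
name2type = {
    "bbb_adenot": "tab",
    "b3db_classification": "tab",
    "b3db_regression": "tab",
    "bbb_martins": "tab",
}
name2id = {}     # left empty: the IDs are not known yet (step 6)
name2stats = {}  # left empty: the statistics are not known yet (step 7)


def missing_registrations(names, registries):
    """Return {registry_label: [names absent from that registry]}.

    Hypothetical helper; registries with nothing missing are omitted.
    """
    missing = {}
    for label, registry in registries.items():
        absent = [name for name in names if name not in registry]
        if absent:
            missing[label] = absent
    return missing


new_names = ["b3db_classification", "b3db_regression"]
report = missing_registrations(
    new_names,
    {
        "adme_dataset_names": adme_dataset_names,
        "name2type": name2type,
        "name2id": name2id,
        "name2stats": name2stats,
    },
)
print(report)
```

With the stand-in data above, only name2id and name2stats are reported, which matches the two open questions in steps 6 and 7.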

I am new to the package so any guidance or recommendations would be appreciated.

Looking forward to your response!

Hi @kexinhuang12345, we had a conversation back in February 2022 about adding this dataset to TDC, so I am following up here. I'm working with @inakineitor and we would be happy to help get this dataset included (unless @marc-gav has made progress). We can also open a new issue if needed.

Iñaki – Kexin had previously pointed me to the contribution guide.

Sorry for the late reply - was traveling - this sounds awesome! I think the questions can be answered via the contribution guide pointed out by Ayush. Let me know if you still bump into any questions!

Hi Kexin, no worries! All steps are now completed except for name2stats, described as a "mapping from dataset names to statistics." How should the statistics IDs be generated?
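
One possible reading, offered as an assumption that this thread does not confirm, is that each name2stats entry is simply the number of rows in the dataset file. Under that assumption, a minimal sketch for computing the value from a tab-separated file could look like the following. The `count_rows` helper, the column names, and the three sample rows are all illustrative and are not taken from B3DB or from TDC.

```python
import csv
import io


def count_rows(tsv_file):
    """Count data rows (excluding the header) in a tab-separated file object.

    Hypothetical helper; assumes the first line is a header row.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return sum(1 for _ in reader)


# Tiny in-memory stand-in for a dataset file (made-up rows, not real B3DB data).
sample = io.StringIO(
    "Drug_ID\tDrug\tY\n"
    "mol_1\tCCO\t1\n"
    "mol_2\tc1ccccc1\t0\n"
    "mol_3\tCC(=O)O\t1\n"
)

n_rows = count_rows(sample)

# ASSUMPTION: a name2stats entry is a plain row count per dataset name.
name2stats_entry = {"b3db_classification": n_rows}
print(name2stats_entry)
```

For a real file, `count_rows` would be called on a handle opened with `open(path, newline="")`. Whether the resulting number is what name2stats expects should be confirmed against the contribution guide.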