mims-harvard/TDC

Oracle issue - Yuchen

Closed this issue · 16 comments

Describe the bug

Dear TDC Team,

I hope this message finds you well. I am writing to report some technical issues I encountered while utilizing the oracle provided by TDC. Below are the details of the problems:

Problem 1: I have encountered an error after downloading the Oracle with the name "JNK3". The error message is as follows:
ValueError: node array from the pickle has an incompatible dtype:

  • expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['<i8', '<i8', '<i8', '<f8', '<f8', '<i8', '<f8', 'u1'], 'offsets': [0, 8, 16, 24, 32, 40, 48, 56], 'itemsize': 64}
  • got: [('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')]

The same problem occurs with the Oracle named "GSK3", but only when I pass in a list of SMILES strings. Passing a single SMILES string to GSK3 does not trigger the error.

Problem 2: As mentioned above, passing a single SMILES string to the Oracle "GSK3" does not result in an error. However, I have tried multiple active molecules targeting GSK3beta from ChEMBL, and the output value from the oracle is consistently 0. This suggests there might be an issue with the "GSK3" oracle that requires your attention.

I hope you can address these issues promptly. Please let me know if you need any further information or details from my end.

Best regards,

Yuchen

@abearab ^ you can have a look at this

I'm having the same issue with GSK3B. Moreover, the behavior differs depending on whether I evaluate a list of SMILES strings or a single one. If I evaluate a single SMILES string, I get 0.0; if I evaluate a list, I get the error @amva13 is getting for JNK3.

I wonder whether something changed in sklearn's random forests and their formatting. That being said, I'm using sklearn==1.3.0, which is the version inside this project's requirements.txt.

The culprit for the discrepancy between lists/individual SMILES is the try-except block in L656 of the implementation of oracles.

In other words, the loading of the oracle is failing silently, and thus the oracle returns the default value.
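To illustrate the silent failure described above, here is a minimal sketch. It is not the TDC implementation: `load_model`, `score_silently`, `score_loudly`, and `DEFAULT_SCORE` are hypothetical names, and the loader simply raises an error to stand in for the incompatible-dtype failure.

```python
DEFAULT_SCORE = 0.0  # hypothetical fallback value returned when anything fails


def load_model():
    """Hypothetical loader that fails the way the reported unpickling does."""
    raise ValueError("node array from the pickle has an incompatible dtype")


def score_silently(smiles: str) -> float:
    # A broad except hides the loading error: the caller only ever sees 0.0.
    try:
        model = load_model()
        return model(smiles)
    except Exception:
        return DEFAULT_SCORE


def score_loudly(smiles: str) -> float:
    # Re-raising with context keeps the original error visible to the caller.
    try:
        model = load_model()
    except Exception as exc:
        raise RuntimeError("could not load the oracle model") from exc
    return model(smiles)
```

With the first variant, a broken model file is indistinguishable from a molecule that genuinely scores 0.0; the second variant surfaces the underlying `ValueError` as the cause of the `RuntimeError`.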

So we could try to solve two problems:

  1. Calling oracles on smile_str and [smile_str] should have the same behavior.
  2. Fixing the loading of the oracles for GSK3B and JNK3.
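For point 1, one possible approach is to normalize the input so that a single string and a one-element list go through the same code path. This is only a sketch under that assumption; `as_smiles_list`, `evaluate`, and `score_one` are hypothetical names, not part of the TDC API.

```python
from typing import Callable, List, Sequence, Tuple, Union


def as_smiles_list(smiles: Union[str, Sequence[str]]) -> Tuple[List[str], bool]:
    """Return the input as a list, plus a flag saying whether it was a single string."""
    if isinstance(smiles, str):
        return [smiles], True
    return list(smiles), False


def evaluate(
    score_one: Callable[[str], float],
    smiles: Union[str, Sequence[str]],
) -> Union[float, List[float]]:
    # Both input shapes are scored by the same loop, so they fail or succeed together.
    batch, was_single = as_smiles_list(smiles)
    scores = [score_one(s) for s in batch]
    return scores[0] if was_single else scores
```

Because there is only one scoring path, any loading error would show up for `smile_str` and `[smile_str]` alike.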

I'm happy to volunteer on any of those!

Hi @miguelgondu , thanks for the find! For clarity, changing the try-except block would only reveal the real error, not fix it. What version of the package are you using? Could you try 0.4.1 ?

Hi @amva13,

Yes! Changing the try-except block only reveals the error. Fixing it would involve checking what changed with the pkl files/their loading, I imagine.

I've tried with both 0.4.1 and 0.4.6. Both have the same issue.

Ok. This was to confirm that the error is not due to recent release changes. I will be personally inspecting this error starting now. Here is one thing I'd try while I'm looking into it: there might be something to your claim about sklearn==1.3.0 causing a breaking change.

I would try building package 0.4.1 in a virtual environment (e.g. conda). 0.4.1 does not specify versions in requirements.txt and this might fix the behavior.

This error is indeed caused by a mismatch between the format of the pickled object and the format expected by scikit-learn. This is in part due to a version upgrade in scikit-learn.

See reverse issue here
yzhao062/pyod#519

I'm evaluating some fixes and will push a new version of the package ASAP.

EDIT: Downgrading scikit-learn fixes the dtype issue but does not solve the underlying problem.

Hi @miguelgondu I believe I've solved it. Would you mind sharing some of the input SMILES strings which produced a 0.0 value for these oracles for you?

Hi @amva13, I used the one in the docs: 'CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1' should have a GSK3B score of 0.03 (at least according to the minimal example provided here)

Hi @miguelgondu, I just pushed the fix and will be releasing the new package now. I'll let you know when you can install it.

Thanks! Looking forward.

Just FYI: I'm getting a warning on Thiothixene_Rediscovery that is similar in spirit to this issue:

InconsistentVersionWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.23.0 when using version 1.3.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
  https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations

Got it. Thanks for pointing this out. The best solution is to re-pickle these models with a more modern scikit-learn (or invoke the models with a different method entirely to avoid the dependency issues altogether). For now the downgrade seems to work, though that particular classifier came from version 0.23.0, so that's not great. I'll flag this as a longer-term issue to look at.

@miguelgondu it's all fixed. You can install 0.4.7 for the working version.
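As a rough sketch of how a version mismatch could be caught at load time rather than surfacing as a cryptic dtype error, the library version can be stored next to the pickled model and checked before use. This is an illustration only: `dump_with_version` and `load_checked` are hypothetical helpers, and the version strings are passed in by hand here, where in practice they would presumably come from `sklearn.__version__`.

```python
import pickle


def dump_with_version(model, library_version: str) -> bytes:
    """Pickle the model together with the library version that produced it."""
    return pickle.dumps({"library_version": library_version, "model": model})


def load_checked(blob: bytes, current_version: str):
    """Unpickle and refuse to return a model saved under a different version."""
    payload = pickle.loads(blob)
    saved_version = payload["library_version"]
    if saved_version != current_version:
        raise RuntimeError(
            f"model was pickled with version {saved_version}, "
            f"but version {current_version} is installed"
        )
    return payload["model"]
```

An exact-match check is deliberately strict; a real implementation might accept a range of compatible versions instead.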

example:
https://colab.research.google.com/drive/17mGlLaVkfA2-0sqhbZlQ4cUI0JnFBpRq?usp=sharing

Hi @amva13 , thanks for the fix!

Checking with the other oracles in that specific version, something seems to break in deco hop. In the first example of the documentation (the same one I provided above) I went from getting 0.5338... to getting 0.0. Weird!

The rest of the oracles seem to work as expected, except for the ones in the issue I raised recently (#244).

Thanks again for the hard work.

Acknowledged; issue opened.