mims-harvard/TDC

No module named sklearn.svm.classes

jannisborn opened this issue ยท 9 comments

Describe the bug

TDC is incompatible with recent scikit-learn versions (>=0.24.0). This is not pinned in the requirements.

Functions like load_drd2_model raise:

>             clf_model = pickle.load(f)
E             ModuleNotFoundError: No module named 'sklearn.svm.classes'

To Reproduce
Try to use the drd2 oracle with a recent scikit-learn version

Expected behavior
The call should pass

Environment:

  • OS: MacOS
  • Python version: 3.7.12
  • TDC version: 0.3.6
  • Any other relevant information:

Additional context
I think the fix should involve a change in the backend logic that supports sklearn 1.X. It's not a good solution to just pin this repository to a very outdated version of sklearn (sklearn 0.X). It's not considered a stable version.

The error currently occurs quite silently since the pickled.load loads an object that needs this module. The minimal fix could be to have an if-else based on the version parsing of sklearn and then load a different object for recent versions where sklearn.svm.classes was moved to another location.

yup. i can't load the GSK3B oracle as well due to the same issue.

it's really bad to pin the scikit-learn to an old version. there are many new features introduced in the later versions that are useful or even essential and it's going to break a lot of packages in any fairly complex machine learning project. this will hinder the adoption of PyTDC (and consequently any newly published research that relies on PyTDC) especially in non-academic settings where it can actually have impact on drug discovery. I have to emphasise non-academic (ie industrial) settings because this is where people care more about reliability of software, ease of maintainence and so on, which is sadly ignored by academia most of the time. That would be such a shame given how useful PyTDC is.

Completely agree with you @linminhtoo!
We actually tried to emulate sklearn.svm.classes in newer sklearn versions with tricks like:

sys.modules["sklearn.svm.classes"] = sklearn.svm

After this hack, the pickle.load() command runs fine, however if one actually tries to perform inference the code fails again, because not only the package submdule was moved but also the module itself was changed.
Hence, I dont see an easy way to make this code working with newer sklearn versions.

By far the easiest would be if the developers (@kexinhuang12345) could provide a model that was trained and saved with a newer sklearn version. I'm happy to prepare a PR where PyTDC dynamically loads the old or the new model dependent on the sklearn version. But before we know whether such a model could be provided, I dont see a point in preparing the PR.
Hope this will resolve soon

Hi @jannisborn @linminhtoo thanks for the great point! Yes, pinning scikit-learn version would cause many issues. We will look into the feasibility of generating this new model and get back to you soon!

Hi all, I sketched a draft PR here: #168

Hey Jannis! Thanks for making the PR! @futianfan has actually copied the sklearn parameter to the new version and uploaded a new sklearn instance for drd2, jnk3, gsk3b to the server. #167 describes a quick fix (still under testing by @yuanqidu to see if both versions generate the same results)

Do you think this PR solves the issue? Based on a quick comparison between #167 and #168, I think your code is definitely more systematic, do you want to work on integrating both PRs? let us know! if not, we can also incorporate on our end! Thanks!

Ah sorry, that's on me, I should have synced with you before drafting the PR

Let me try to integrate both things, it looks straightforward

We really appreciate your contribution, anyway.

Fixed it at #168

thanks for the fix! @kexinhuang12345