operatorequals/httpimport

New version 0.9.2 breaks loading of joblib pickled model

AlexMikhalev opened this issue · 5 comments

Firstly, thank you for the project - I think it's amazing and should be advertised more widely. It's part of my secret sauce for deploying AI/ML models into distributed systems.
I used version 0.7.2 to load a pickled pre-trained Automata (Trie structure)

import httpimport
with httpimport.remote_repo(['utils'], "https://raw.githubusercontent.com/applied-knowledge-systems/the-pattern-automata/main/automata/"):
    import utils
from utils import loadAutomata, find_matches

With version 0.9.2 the same code fails with a cryptic stack trace:

[!] 'nt' not found in HTTP repository. Moving to next Finder.
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
  File "/home/alex/anaconda3/envs/thepattern_3.7/lib/python3.7/site-packages/httpimport.py", line 232, in load_module
    exec(module_src, mod.__dict__)
  File "<string>", line 36, in <module>
  File "<string>", line 5, in loadAutomata
  File "/home/alex/anaconda3/envs/thepattern_3.7/lib/python3.7/site-packages/joblib/__init__.py", line 113, in <module>
    from .memory import Memory, MemorizedResult, register_store_backend
  File "/home/alex/anaconda3/envs/thepattern_3.7/lib/python3.7/site-packages/joblib/memory.py", line 15, in <module>
    import pathlib
  File "/home/alex/anaconda3/envs/thepattern_3.7/lib/python3.7/pathlib.py", line 4, in <module>
    import ntpath
  File "/home/alex/anaconda3/envs/thepattern_3.7/lib/python3.7/ntpath.py", line 257, in <module>
    from nt import _getvolumepathname
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 640, in _load_backward_compatible
KeyError: 'nt'

I went down the path of debugging joblib, but loading the model directly works:

import ahocorasick
import joblib
Automata = joblib.load("automata_fresh_semantic.pkl.lzma")

Full code for loading: https://github.com/applied-knowledge-systems/the-pattern-automata/blob/main/automata/utils.py
The code for creating the automata is simple and hosted in the same repo: https://github.com/applied-knowledge-systems/the-pattern-automata

I am reverting to 0.7.2, but any suggestions for upgrading are welcome.

I am aware that some issue was introduced in 0.9.2. I am looking into it; the unit tests pass on both versions, so there must be a gap in test coverage!

Thanks a lot for your verbose report!

Hello again @AlexMikhalev!

So, I have drilled down to the issue you are having and I prepared a brief explanation of why it is happening and a way to use v0.9.3:

Prior to version 0.9.0, most httpimport functions required an argument, either a str or a list, indicating which packages/modules were expected to be loadable from the given URL. So using ['utils'] as your first argument indicated that (only) the utils module can be loaded from your URL (and maybe you can see where this is going by now).
Every import statement in the clause was then checked against this list (or str) in the Finder function, and if the name was in the list (or matched the str), the loader was called and only THEN did the actual HTTP/S call happen, trying to fetch the content of the module.
This was a deliberate design decision taken way back around 2017, as this module started as a Python stager for Python-based malware (I was working as a Security Engineer / Red Teamer / Penetration Tester back then and I happened to need such a tool). This meant that httpimport had to generate as little traffic as possible and double-check whether a request really needed to be made. The encrypted .zip loading feature was also a decision taken back when httpimport was effectively "malware".
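To make the pre-0.9.0 behaviour described above concrete, here is a minimal sketch of an allowlist-checking finder. It is not httpimport's actual code: the class name AllowlistFinder, the FAKE_REPO dictionary, the fake_fetch helper and the module name allowed_mod are all hypothetical, and the HTTP request is simulated with a dictionary lookup so the example runs offline.

```python
import importlib.abc
import importlib.util
import sys

# Hypothetical stand-in for the remote repository: module name -> source.
FAKE_REPO = {"allowed_mod": "VALUE = 42\n"}
fetched = []  # records every simulated HTTP request


def fake_fetch(name):
    """Simulated HTTP GET; a real importer would download the source here."""
    fetched.append(name)
    return FAKE_REPO[name]


class AllowlistFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Illustrative only: serves just the names given up front."""

    def __init__(self, allowed):
        # Accept either a single name (str) or a list of names.
        self.allowed = [allowed] if isinstance(allowed, str) else list(allowed)

    def find_spec(self, fullname, path=None, target=None):
        # Finder phase: a purely local check, no network traffic.
        if fullname not in self.allowed:
            return None  # let the next finder handle it
        return importlib.util.spec_from_loader(fullname, self)

    def exec_module(self, module):
        # Loader phase: only now is the (simulated) request made.
        exec(fake_fetch(module.__name__), module.__dict__)


finder = AllowlistFinder(["allowed_mod"])
sys.meta_path.insert(0, finder)
try:
    import allowed_mod  # on the allowlist: fetched from the fake repository
    import json         # not on the allowlist: no request is made for it
finally:
    sys.meta_path.remove(finder)

print(allowed_mod.VALUE, fetched)  # 42 ['allowed_mod']
```

Because the allowlist check happens in find_spec, any name outside the list is declined before a request is attempted, which is the low-traffic property the original design was after.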

With the 0.9.0 release I figured that, since httpimport is now used by Data Analysts and a very different kind of beast than malware devs, it does not need to be so restrictive about the traffic it generates. So I removed the str or list argument that indicated what can be loaded, and httpimport now tries ALL modules named in an import statement inside its clause (by "clause" I mean the indented lines under a remote_repo call). Yet the code that tries the module is still placed in the Importer's Loader method, which assumes that the module is certainly there (as the Finder phase has finished successfully) and fails if it is not. E.g. the nt module is not available to be loaded from the URL, so the importer fails.
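The failure mode above can be reproduced in miniature. The sketch below is an illustration under stated assumptions, not httpimport's implementation: ClaimEverythingFinder, ProbingFinder, RepoLoader, guarded_import and FAKE_REPO are hypothetical names, and the repository is an in-memory dictionary. It also assumes that defensive code (such as the standard library's optional import of nt) protects itself with an `except ImportError` guard, which only helps when the finder declines the module rather than letting the loader raise something else, here an analogous KeyError.

```python
import importlib
import importlib.abc
import importlib.util
import sys

# Hypothetical stand-in for the remote repository: module name -> source.
FAKE_REPO = {"probed_utils": "def greet():\n    return 'hello'\n"}


class RepoLoader(importlib.abc.Loader):
    def __init__(self, repo):
        self.repo = repo

    def exec_module(self, module):
        # Raises KeyError if the repository cannot serve the module.
        exec(self.repo[module.__name__], module.__dict__)


class ClaimEverythingFinder(importlib.abc.MetaPathFinder):
    """Mimics the failure mode: every name is handed to the loader."""

    def __init__(self, repo):
        self.loader = RepoLoader(repo)

    def find_spec(self, fullname, path=None, target=None):
        return importlib.util.spec_from_loader(fullname, self.loader)


class ProbingFinder(ClaimEverythingFinder):
    """Checks availability first and declines names it cannot serve."""

    def find_spec(self, fullname, path=None, target=None):
        if fullname not in self.loader.repo:
            return None  # the import system moves on to the next finder
        return super().find_spec(fullname, path, target)


def guarded_import(finder, name):
    """Import `name` behind an ImportError guard and report the outcome."""
    sys.meta_path.insert(0, finder)
    try:
        try:
            importlib.import_module(name)
            return "imported"
        except ImportError:
            return "ImportError handled"
    except KeyError:
        return "KeyError escaped the guard"
    finally:
        sys.meta_path.remove(finder)
        sys.modules.pop(name, None)


missing = "no_such_module_for_this_sketch"
print(guarded_import(ClaimEverythingFinder(FAKE_REPO), missing))  # KeyError escaped the guard
print(guarded_import(ProbingFinder(FAKE_REPO), missing))          # ImportError handled
print(guarded_import(ProbingFinder(FAKE_REPO), "probed_utils"))   # imported
```

The design point is that returning None from find_spec is the import system's way of saying "not mine", so a missing module surfaces as a normal ImportError that callers already know how to handle.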

A proper fix for this bug more or less needs a redesign, which could finally get a real v1.0.0 version released. But right now I am a bit busy having a life. Yet, I want to do that, and eventually I will!

TL;DR

If you want to use v0.9.3, you can load your utils module explicitly as below:

>>> utils = httpimport.load("utils", "https://raw.githubusercontent.com/applied-knowledge-systems/the-pattern-automata/main/automata/")
>>> dir(utils)
['Automata', '__builtins__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'find_matches', 'loadAutomata']

Tada! Loaded!


If you have any questions/suggestions/remarks on the matter, please answer below! Thanks again for pointing out this old design issue that came as a regression bug!

@operatorequals thank you for the update. I had thought that your design would make security architects scream, and I am glad your story did not disappoint me there. There is no need for a redesign - I will use imports from private GitHub repos with authentication, so your current design suits me better, and I will be rewriting my side of the pipelines.

Great! Feel free to open a PR should you create changes that you find generic and/or useful in the module!

Way easier than I thought! Already passes unit tests (and should work for your case as well).
https://github.com/operatorequals/httpimport/blob/rewrite/httpimport2.py#L23