RDST parallelism KeyError
baraline opened this issue · 17 comments
During some, but not all, runs (e.g. on the FordA / FordB datasets), the RDST Ensemble classifier fails with the following error:
joblib.externals.loky.process_executor._RemoteTraceback:
Traceback (most recent call last):
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 469, in save
data_name = overloads[key]
KeyError: ((array(float64, 3d, C), Tuple(array(float64, 2d, C), array(int64, 1d, C), array(int64, 1d, C), array(float64, 1d, C), array(bool, 1d, C)), type(CPUDispatcher(<function manhattan at 0x7fce883fe0d0>)), bool), ('x86_64-unknown-linux-gnu', 'cascadelake', '+64bit,+adx,+aes,-amx-bf16,-amx-int8,-amx-tile,+avx,+avx2,-avx512bf16,-avx512bitalg,+avx512bw,+avx512cd,+avx512dq,-avx512er,+avx512f,-avx512ifma,-avx512pf,-avx512vbmi,-avx512vbmi2,+avx512vl,+avx512vnni,-avx512vp2intersect,-avx512vpopcntdq,+bmi,+bmi2,-cldemote,+clflushopt,+clwb,-clzero,+cmov,+cx16,+cx8,-enqcmd,+f16c,+fma,-fma4,+fsgsbase,+fxsr,-gfni,+invpcid,-lwp,+lzcnt,+mmx,+movbe,-movdir64b,-movdiri,-mwaitx,+pclmul,-pconfig,+pku,+popcnt,-prefetchwt1,+prfchw,-ptwrite,-rdpid,+rdrnd,+rdseed,-rtm,+sahf,-serialize,-sgx,-sha,-shstk,+sse,+sse2,+sse3,+sse4.1,+sse4.2,-sse4a,+ssse3,-tbm,-tsxldtrk,-vaes,-vpclmulqdq,-waitpkg,-wbnoinvd,-xop,+xsave,+xsavec,+xsaveopt,+xsaves'), ('00e465fe82fb9c04ee9ece12d3d459a0d4fe0a0d451df090bccee8dc666d02b2', 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 436, in _process_worker
r = call_item()
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 288, in __call__
return self.fn(*self.args, **self.kwargs)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
return [func(*args, **kwargs)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
return [func(*args, **kwargs)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/sklearn/utils/fixes.py", line 117, in __call__
return self.function(*args, **kwargs)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/convst-0.2.1-py3.8.egg/convst/classifiers/rdst_ensemble.py", line 56, in _parallel_fit
return model.fit(X, y)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/sklearn/pipeline.py", line 378, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/sklearn/pipeline.py", line 336, in _fit
X, fitted_transformer = fit_transform_one_cached(
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/memory.py", line 349, in __call__
return self.func(*args, **kwargs)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/sklearn/pipeline.py", line 870, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/sklearn/base.py", line 870, in fit_transform
return self.fit(X, y, **fit_params).transform(X)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/convst-0.2.1-py3.8.egg/convst/transformers/rdst.py", line 270, in transform
X_new = self.transformer(
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/dispatcher.py", line 487, in _compile_for_args
raise e
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/dispatcher.py", line 420, in _compile_for_args
return_val = self.compile(tuple(argtypes))
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/dispatcher.py", line 972, in compile
self._cache.save_overload(sig, cres)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 652, in save_overload
self._save_overload(sig, data)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 662, in _save_overload
self._cache_file.save(key, data)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 478, in save
self._save_index(overloads)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 522, in _save_index
data = self._dump(data)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 550, in _dump
return dumps(obj)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/serialize.py", line 57, in dumps
p.dump(obj)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/cloudpickle/cloudpickle_fast.py", line 568, in dump
return Pickler.dump(self, obj)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/types/functions.py", line 486, in __getnewargs__
raise ReferenceError("underlying object has vanished")
ReferenceError: underlying object has vanished
This issue is caused by a problem with the dependencies and/or package versions.
A new installation in a clean Conda environment with Python 3.8.13 fixed the issue. The issue will remain open until the dependencies causing it are identified.
I am having this issue for most of my datasets with no threading; deleting the numba cache seems to fix it for a few runs, but it breaks again once the cache files are written.
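For reference, numba's on-disk cache lives in the `__pycache__` directories next to each cached module, as `*.nbi` index and `*.nbc` data files; clearing it manually looks roughly like this (the helper below is illustrative, not a convst API):

```python
# Minimal sketch: wipe numba's on-disk cache for an installed package.
# Numba writes cached overloads as *.nbi (index) and *.nbc (data) files
# inside the __pycache__ directory next to each cached module.
import pathlib

def clear_numba_cache(package_root):  # illustrative helper, not a convst API
    root = pathlib.Path(package_root).expanduser()
    for pattern in ("*.nbi", "*.nbc"):
        for cached in root.rglob(pattern):
            cached.unlink()

# clear_numba_cache("~/.conda/envs/est-eval/lib/python3.7/site-packages/convst")
```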
Another similar numba error (after removing the numba parallel option in an attempt to fix the first one, so this may be completely on me):
Traceback (most recent call last):
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/errors.py", line 823, in new_error_context
yield
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 265, in lower_block
self.lower_inst(inst)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 439, in lower_inst
val = self.lower_assign(ty, inst)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 626, in lower_assign
return self.lower_expr(ty, value)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 1368, in lower_expr
res = self.context.special_ops[expr.op](self, expr)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/np/ufunc/array_exprs.py", line 360, in _lower_array_expr
code_obj = compile(ast_module, expr_filename, 'exec')
TypeError: non-numeric type in Num
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/gpfs/home/pfm15hbu/esteval/tsml_estimator_evaluation/experiments/classification_experiments.py", line 105, in <module>
run_experiment(sys.argv)
File "/gpfs/home/pfm15hbu/esteval/tsml_estimator_evaluation/experiments/classification_experiments.py", line 75, in run_experiment
overwrite=overwrite,
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/sktime/benchmarking/experiments.py", line 536, in load_and_run_classification_experiment
test_file=build_test,
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/sktime/benchmarking/experiments.py", line 321, in run_classification_experiment
classifier.fit(X_train, y_train)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/sktime/classification/base.py", line 191, in fit
self._fit(X, y)
File "/gpfs/home/pfm15hbu/esteval/tsml_estimator_evaluation/sktime_estimators/classification/rdst.py", line 15, in _fit
self.clf.fit(X, y)
File "/gpfs/home/pfm15hbu/.conda/envs/est-eval/lib/python3.7/site-packages/convst/classifiers/rdst_ridge.py", line 160, in fit
self.classifier = self.classifier.fit(self.transformer.transform(X), y)
File "/gpfs/home/pfm15hbu/.conda/envs/est-eval/lib/python3.7/site-packages/convst/transformers/rdst.py", line 280, in transform
self.phase_invariance
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/dispatcher.py", line 487, in _compile_for_args
raise e
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/dispatcher.py", line 420, in _compile_for_args
return_val = self.compile(tuple(argtypes))
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/dispatcher.py", line 965, in compile
cres = self._compiler.compile(args, return_type)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/dispatcher.py", line 125, in compile
status, retval = self._compile_cached(args, return_type)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/dispatcher.py", line 139, in _compile_cached
retval = self._compile_core(args, return_type)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/dispatcher.py", line 157, in _compile_core
pipeline_class=self.pipeline_class)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler.py", line 716, in compile_extra
return pipeline.compile_extra(func)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler.py", line 452, in compile_extra
return self._compile_bytecode()
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler.py", line 520, in _compile_bytecode
return self._compile_core()
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler.py", line 499, in _compile_core
raise e
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler.py", line 486, in _compile_core
pm.run(self.state)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler_machinery.py", line 368, in run
raise patched_exception
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler_machinery.py", line 356, in run
self._runPass(idx, pass_inst, state)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
return func(*args, **kwargs)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler_machinery.py", line 311, in _runPass
mutated |= check(pss.run_pass, internal_state)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler_machinery.py", line 273, in check
mangled = func(compiler_state)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/typed_passes.py", line 394, in run_pass
lower.lower()
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 168, in lower
self.lower_normal_function(self.fndesc)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 222, in lower_normal_function
entry_block_tail = self.lower_function_body()
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 251, in lower_function_body
self.lower_block(block)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 265, in lower_block
self.lower_inst(inst)
File "/gpfs/home/pfm15hbu/.conda/envs/est-eval/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/errors.py", line 837, in new_error_context
raise newerr.with_traceback(tb)
numba.core.errors.LoweringError: Failed in nopython mode pipeline (step: native lowering)
non-numeric type in Num

File ".conda/envs/est-eval/lib/python3.7/site-packages/convst/transformers/_univariate_same_length.py", line 300:
def U_SL_apply_all_shapelets(
    <source elided>
    _idx_no_norm = _idx_shp[where(normalize[_idx_shp] == False)[0]]
    ^

During: lowering "$360compare_op.41 = arrayexpr(expr=(<built-in function eq>, [Var($356binary_subscr.39, _univariate_same_length.py:300), const(bool, False)]), ty=array(bool, 1d, C))" at /gpfs/home/pfm15hbu/.conda/envs/est-eval/lib/python3.7/site-packages/convst/transformers/_univariate_same_length.py (300)
Do you still get the KeyError even with a new environment? Creating a new one seems to have fixed the issue for me.
I will look into the error without numba parallel; hopefully that is the source of the problem.
@MatthewMiddlehurst I cannot reproduce either the KeyError or the LoweringError on my end, with or without the parallel keyword and/or n_jobs > 1, on the following machines:
System:
- python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
- OS: Ubuntu 20.04.5 LTS
- Kernel: Linux-5.15.0-53-generic-x86_64-with-glibc2.17
Python dependencies:
- pip: 21.2.4
- setuptools: 61.2.0
- sklearn: 1.1.1
- sktime: 0.13.0
- statsmodels: 0.13.2
- numpy: 1.21.6
- scipy: 1.8.1
- joblib: 1.1.0
- numba: 0.56.0
and this one (on which the experiments are run):
System:
- python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
- OS: Ubuntu 20.04.4 LTS
- Kernel: Linux-5.14.0-1044-generic-x86_64-with-glibc2.31
Python dependencies:
- pip: 22.2.2
- setuptools: 61.2.0
- sklearn: 1.1.3
- sktime: 0.14.0
- statsmodels: 0.13.5
- numpy: 1.22.4
- scipy: 1.8.1
- joblib: 1.2.0
- numba: 0.56.4
Could you please provide example code along with the versions you are currently using?
There is indeed something off with numba, see #34. Does the problem happen on your end with the non-Ensemble version?
I am running it on our computing cluster, so the setup may be a bit odd.
Python 3.7.13
OS: CentOS Linux 7 (Core)
Kernel: 3.10.0-1160.45.1.el7.x86_64
Python dependencies:
conda==22.11.0
convst==0.2.4
joblib==1.1.0
numba==0.56.4
numpy==1.21.6
pip==22.2.2
setuptools==63.4.1
scikit-learn==1.0.2
scipy==1.21.6
sktime==0.14.1
statsmodels==0.13.5
I am just running the ridge version using a simple wrapper for the sktime interface.
https://github.com/time-series-machine-learning/tsml-estimator-evaluation/blob/main/tsml_estimator_evaluation/sktime_estimators/classification/rdst.py
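For context, a minimal sketch of what that wrapper looks like, inferred from the traceback above (the linked file is the authoritative version, and the class name below is an assumption):

```python
# Minimal sketch of an sktime wrapper around the ridge version, inferred
# from the traceback above; see the linked repository for the real code.
from sktime.classification.base import BaseClassifier
from convst.classifiers import R_DST_Ridge  # name assumed from convst 0.2.x

class RDST(BaseClassifier):
    def _fit(self, X, y):
        self.clf = R_DST_Ridge()
        self.clf.fit(X, y)
        return self

    def _predict(self, X):
        return self.clf.predict(X)
```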
There are more dependencies installed than the ones listed above, but none of the other sktime numba code seems to have an issue in the same environment.
The workflow runs many individual jobs over many distributed cores, so it may not be typical. Again, the first few runs seemed to be fine, but once the functions are cached, errors start to appear.
I see. As no errors are thrown when the tests are run on Python 3.7, I doubt it would change anything, but could you by any chance try to run it on Python 3.8+ or use pickle5? (see https://numba.readthedocs.io/en/stable/developer/caching.html) From what I understand, it may impact how caching is handled.
Nevertheless, there is definitely something wrong with the Ensemble version, even on Python 3.11: #34 shows a high standard deviation in timings for RDST Ensemble, which could indicate that functions are being compiled and cached again after other models have been run.
I suspect it has something to do with the combination of multiple joblib processes using numba parallel, although I followed the instructions from https://numba.readthedocs.io/en/stable/user/threading-layer.html#example-of-limiting-the-number-of-threads (a minimal sketch of that pattern is shown below). The fact that it is spread over multiple machines may also be part of the issue; I have never tested it in this context.
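A minimal sketch of that pattern, assuming a joblib process backend (the function and variable names are illustrative):

```python
# Sketch of the numba-docs pattern: cap the numba thread pool inside each
# joblib worker so that (number of workers) x (numba threads) does not
# oversubscribe the machine. Function and variable names are illustrative.
from joblib import Parallel, delayed
import numba

def _fit_one(model, X, y, threads_per_worker):
    # must not exceed the NUMBA_NUM_THREADS the process started with
    numba.set_num_threads(threads_per_worker)
    return model.fit(X, y)

# e.g. 4 workers, each restricted to 2 numba threads:
# fitted = Parallel(n_jobs=4)(delayed(_fit_one)(m, X, y, 2) for m in models)
```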
This will require a bit of time to fix, I'm afraid. If it helps, I can provide results generated on my end.
Yeah, no issue, I can try giving it a run with more datasets on my own machine and with your suggestions on the cluster. Currently, I am just running it without numba on the cluster (I won't report any timing results from these, as that would be unfair).
Thanks, if you do find any strange differences in accuracy results, I would also appreciate feedback. I will update progress on this issue when I find the source of the problem.
The issue may be caused by what is described as cache invalidation: numba functions from the _commons.py file are loaded by multiple independent processes, causing a recompilation and, further down the line, a cache explosion.
See https://numba.discourse.group/t/cache-behaviour/1520.
@MatthewMiddlehurst Version 0.2.5 fixed the issues I was noticing on my side, at the cost of ~20% more run-time for RDST Ensemble, at least until I learn how to properly manage a thread pool with numba and joblib threads.
Hopefully that also fixes the issues on your side; I would appreciate an update when you have some time.
Just a quick update: I ran my setup using Python 3.10 and the newest update and still had similar errors. It could possibly be a conflict with another dependency, as I am running it through a larger package. This is probably more HPC-specific, as I'm running >200 builds at once, but it's weird that I haven't seen this with any of my other numba code, which also caches.
No issues running after I hacked it to remove caching from the transform parts. I am unsure how this has impacted performance, but it still seemed to finish everything rather quickly (much faster than no numba, at least). As a temporary fix, maybe allow setting a global variable to disable caching on these functions? I have not tried it, so it may not be possible, but I don't see why it wouldn't work 🙂 (rough sketch below).
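Something along these lines, as a rough sketch (the flag and function below are made up for illustration, not convst's actual code):

```python
# Rough sketch of the suggested global switch: a module-level flag read when
# the decorator runs at import time, so caching can be disabled for HPC runs.
from numba import njit

USE_NUMBA_CACHE = True  # flip to False before the decorated modules are imported

@njit(cache=USE_NUMBA_CACHE, fastmath=True)
def manhattan(x, y):
    s = 0.0
    for i in range(x.shape[0]):
        s += abs(x[i] - y[i])
    return s
```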
Thanks for the update! I indeed suspect it has something to do with the HPC cluster, as I cannot reproduce anything on my end, but it is a bit worrying that it only happens with this numba code...
I think the global variable approach is the best one; if it is feasible, I'll look into it and close this issue when it's done.
I would be curious as to why the problem happens, but I don't have an HPC cluster available as of now :/ Will update if I can manage to get to the bottom of it.
Hi Matthew,
Sorry for the delay, I have been quite busy with the postdoc. So, I've tried some different solutions, and the one that worked with minimal complexity was to add variables defined in convst/__init__.py to modify the compilation args of the numba functions across the whole module. They should be set at the start of your script, before any compilation. These changes will be available in the new 0.2.6 version.
You can see how to do this in this example: https://github.com/baraline/convst/blob/main/examples/Changing_numba_options.py
Alternatively, if this does not work for your setup, you can simply modify the value of the parameter in the __init__.py file to affect the entire module, which is still a hack, but does not require you to change the value across all the files. Hope this fixes the issue on your side!
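A rough sketch of the intended usage (the flag name below is illustrative; the real names are defined in convst/__init__.py and shown in the linked example):

```python
# Rough sketch based on the linked example; the flag name is illustrative,
# the real ones live in convst/__init__.py.
import convst

convst.__USE_NUMBA_CACHE__ = False  # e.g. disable on-disk caching for HPC runs

# Import the estimators only after setting the flags, so the numba
# functions are compiled with the chosen options.
from convst.classifiers import R_DST_Ridge  # class name assumed, see above

clf = R_DST_Ridge(n_jobs=1)
```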
Thanks for looking into it @baraline! We managed to run everything using numba without the caching, and it all looks faithful to the reported results, very impressive work! The changes should make future runs easier.
Thanks for your feedback @MatthewMiddlehurst! Don't hesitate to re-open this issue if the fix does not work.
Additionally, if you have run RDST/RDST Ensemble on multivariate datasets, I fixed a bug in version 0.2.7 that caused some shapelets to not be generated correctly when n_jobs > 1, which improved the performance on some multivariate datasets on my side (see #44).