RDST parallelism KeyError
baraline opened this issue · 17 comments
During some, but not all, runs (e.g. on the FordA / FordB datasets), the RDST Ensemble classifier fails with the following error:
joblib.externals.loky.process_executor._RemoteTraceback:
Traceback (most recent call last):
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 469, in save
data_name = overloads[key]
KeyError: ((array(float64, 3d, C), Tuple(array(float64, 2d, C), array(int64, 1d, C), array(int64, 1d, C), array(float64, 1d, C), array(bool, 1d, C)), type(CPUDispatcher(<function manhattan at 0x7fce883fe0d0>)), bool), ('x86_64-unknown-linux-gnu', 'cascadelake', '+64bit,+adx,+aes,-amx-bf16,-amx-int8,-amx-tile,+avx,+avx2,-avx512bf16,-avx512bitalg,+avx512bw,+avx512cd,+avx512dq,-avx512er,+avx512f,-avx512ifma,-avx512pf,-avx512vbmi,-avx512vbmi2,+avx512vl,+avx512vnni,-avx512vp2intersect,-avx512vpopcntdq,+bmi,+bmi2,-cldemote,+clflushopt,+clwb,-clzero,+cmov,+cx16,+cx8,-enqcmd,+f16c,+fma,-fma4,+fsgsbase,+fxsr,-gfni,+invpcid,-lwp,+lzcnt,+mmx,+movbe,-movdir64b,-movdiri,-mwaitx,+pclmul,-pconfig,+pku,+popcnt,-prefetchwt1,+prfchw,-ptwrite,-rdpid,+rdrnd,+rdseed,-rtm,+sahf,-serialize,-sgx,-sha,-shstk,+sse,+sse2,+sse3,+sse4.1,+sse4.2,-sse4a,+ssse3,-tbm,-tsxldtrk,-vaes,-vpclmulqdq,-waitpkg,-wbnoinvd,-xop,+xsave,+xsavec,+xsaveopt,+xsaves'), ('00e465fe82fb9c04ee9ece12d3d459a0d4fe0a0d451df090bccee8dc666d02b2', 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 436, in _process_worker
r = call_item()
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 288, in __call__
return self.fn(*self.args, **self.kwargs)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
return [func(*args, **kwargs)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
return [func(*args, **kwargs)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/sklearn/utils/fixes.py", line 117, in __call__
return self.function(*args, **kwargs)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/convst-0.2.1-py3.8.egg/convst/classifiers/rdst_ensemble.py", line 56, in _parallel_fit
return model.fit(X, y)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/sklearn/pipeline.py", line 378, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/sklearn/pipeline.py", line 336, in _fit
X, fitted_transformer = fit_transform_one_cached(
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/joblib/memory.py", line 349, in __call__
return self.func(*args, **kwargs)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/sklearn/pipeline.py", line 870, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/sklearn/base.py", line 870, in fit_transform
return self.fit(X, y, **fit_params).transform(X)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/convst-0.2.1-py3.8.egg/convst/transformers/rdst.py", line 270, in transform
X_new = self.transformer(
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/dispatcher.py", line 487, in _compile_for_args
raise e
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/dispatcher.py", line 420, in _compile_for_args
return_val = self.compile(tuple(argtypes))
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/dispatcher.py", line 972, in compile
self._cache.save_overload(sig, cres)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 652, in save_overload
self._save_overload(sig, data)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 662, in _save_overload
self._cache_file.save(key, data)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 478, in save
self._save_index(overloads)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 522, in _save_index
data = self._dump(data)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/caching.py", line 550, in _dump
return dumps(obj)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/serialize.py", line 57, in dumps
p.dump(obj)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/cloudpickle/cloudpickle_fast.py", line 568, in dump
return Pickler.dump(self, obj)
File "/home/prof/guillaume/anaconda3/envs/convst/lib/python3.8/site-packages/numba/core/types/functions.py", line 486, in __getnewargs__
raise ReferenceError("underlying object has vanished")
ReferenceError: underlying object has vanished
This issue is caused by a problem with the dependencies and/or package versions.
A new installation in a clean Conda environment with Python 3.8.13 fixed the issue. The issue will remain open until the dependencies causing it are identified.
I am having this issue for most of my datasets with no threading; deleting the numba cache seems to fix it for a few runs, but it breaks again once the cache files are written.
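For reference, numba's on-disk cache lives in the `__pycache__` directories next to each cached module, as `*.nbi` index and `*.nbc` data files; clearing it manually looks roughly like this (the helper below is illustrative, not a convst API):

```python
# Minimal sketch: wipe numba's on-disk cache for an installed package.
# Numba writes cached overloads as *.nbi (index) and *.nbc (data) files
# inside the __pycache__ directory next to each cached module.
import pathlib

def clear_numba_cache(package_root):  # illustrative helper, not a convst API
    root = pathlib.Path(package_root).expanduser()
    for pattern in ("*.nbi", "*.nbc"):
        for cached in root.rglob(pattern):
            cached.unlink()

# clear_numba_cache("~/.conda/envs/est-eval/lib/python3.7/site-packages/convst")
```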
Another similar numba error (after removing the numba parallel option in an attempt to fix the first one, so this may be completely on me):
Traceback (most recent call last):
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/errors.py", line 823, in new_error_context
yield
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 265, in lower_block
self.lower_inst(inst)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 439, in lower_inst
val = self.lower_assign(ty, inst)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 626, in lower_assign
return self.lower_expr(ty, value)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 1368, in lower_expr
res = self.context.special_ops[expr.op](self, expr)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/np/ufunc/array_exprs.py", line 360, in _lower_array_expr
code_obj = compile(ast_module, expr_filename, 'exec')
TypeError: non-numeric type in Num
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/gpfs/home/pfm15hbu/esteval/tsml_estimator_evaluation/experiments/classification_experiments.py", line 105, in <module>
run_experiment(sys.argv)
File "/gpfs/home/pfm15hbu/esteval/tsml_estimator_evaluation/experiments/classification_experiments.py", line 75, in run_experiment
overwrite=overwrite,
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/sktime/benchmarking/experiments.py", line 536, in load_and_run_classification_experiment
test_file=build_test,
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/sktime/benchmarking/experiments.py", line 321, in run_classification_experiment
classifier.fit(X_train, y_train)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/sktime/classification/base.py", line 191, in fit
self._fit(X, y)
File "/gpfs/home/pfm15hbu/esteval/tsml_estimator_evaluation/sktime_estimators/classification/rdst.py", line 15, in _fit
self.clf.fit(X, y)
File "/gpfs/home/pfm15hbu/.conda/envs/est-eval/lib/python3.7/site-packages/convst/classifiers/rdst_ridge.py", line 160, in fit
self.classifier = self.classifier.fit(self.transformer.transform(X), y)
File "/gpfs/home/pfm15hbu/.conda/envs/est-eval/lib/python3.7/site-packages/convst/transformers/rdst.py", line 280, in transform
self.phase_invariance
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/dispatcher.py", line 487, in _compile_for_args
raise e
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/dispatcher.py", line 420, in _compile_for_args
return_val = self.compile(tuple(argtypes))
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/dispatcher.py", line 965, in compile
cres = self._compiler.compile(args, return_type)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/dispatcher.py", line 125, in compile
status, retval = self._compile_cached(args, return_type)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/dispatcher.py", line 139, in _compile_cached
retval = self._compile_core(args, return_type)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/dispatcher.py", line 157, in _compile_core
pipeline_class=self.pipeline_class)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler.py", line 716, in compile_extra
return pipeline.compile_extra(func)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler.py", line 452, in compile_extra
return self._compile_bytecode()
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler.py", line 520, in _compile_bytecode
return self._compile_core()
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler.py", line 499, in _compile_core
raise e
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler.py", line 486, in _compile_core
pm.run(self.state)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler_machinery.py", line 368, in run
raise patched_exception
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler_machinery.py", line 356, in run
self._runPass(idx, pass_inst, state)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
return func(*args, **kwargs)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler_machinery.py", line 311, in _runPass
mutated |= check(pss.run_pass, internal_state)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/compiler_machinery.py", line 273, in check
mangled = func(compiler_state)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/typed_passes.py", line 394, in run_pass
lower.lower()
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 168, in lower
self.lower_normal_function(self.fndesc)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 222, in lower_normal_function
entry_block_tail = self.lower_function_body()
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 251, in lower_function_body
self.lower_block(block)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/lowering.py", line 265, in lower_block
self.lower_inst(inst)
File "/gpfs/home/pfm15hbu/.conda/envs/est-eval/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/gpfs/home/pfm15hbu/.local/lib/python3.7/site-packages/numba/core/errors.py", line 837, in new_error_context
raise newerr.with_traceback(tb)
numba.core.errors.LoweringError: Failed in nopython mode pipeline (step: native lowering)
non-numeric type in Num

File ".conda/envs/est-eval/lib/python3.7/site-packages/convst/transformers/_univariate_same_length.py", line 300:
def U_SL_apply_all_shapelets(
    <source elided>
    _idx_no_norm = _idx_shp[where(normalize[_idx_shp] == False)[0]]
    ^

During: lowering "$360compare_op.41 = arrayexpr(expr=(<built-in function eq>, [Var($356binary_subscr.39, _univariate_same_length.py:300), const(bool, False)]), ty=array(bool, 1d, C))" at /gpfs/home/pfm15hbu/.conda/envs/est-eval/lib/python3.7/site-packages/convst/transformers/_univariate_same_length.py (300)
Do you still get the KeyError even with a new environment? Creating a new one seems to have fixed the issue for me.
I will look into the error without numba parallel; hopefully that is the source of the problem.
@MatthewMiddlehurst I cannot reproduce either the KeyError or the LoweringError on my end, with or without the parallel keyword and/or n_jobs > 1, on the following machines:
System:
- python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
- OS: Ubuntu 20.04.5 LTS
- Kernel: Linux-5.15.0-53-generic-x86_64-with-glibc2.17
Python dependencies:
- pip: 21.2.4
- setuptools: 61.2.0
- sklearn: 1.1.1
- sktime: 0.13.0
- statsmodels: 0.13.2
- numpy: 1.21.6
- scipy: 1.8.1
- joblib: 1.1.0
- numba: 0.56.0
and this one (on which the experiments are run):
System:
- python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
- OS: Ubuntu 20.04.4 LTS
- Kernel: Linux-5.14.0-1044-generic-x86_64-with-glibc2.31
Python dependencies:
- pip: 22.2.2
- setuptools: 61.2.0
- sklearn: 1.1.3
- sktime: 0.14.0
- statsmodels: 0.13.5
- numpy: 1.22.4
- scipy: 1.8.1
- joblib: 1.2.0
- numba: 0.56.4
Could you please provide example code along with the versions you are currently using?
There is indeed something off with numba, see #34. Does the problem happen on your end with the non-Ensemble version?
I am running it on our computing cluster, so the setup may be a bit odd.
Python 3.7.13
OS: CentOS Linux 7 (Core)
Kernel: 3.10.0-1160.45.1.el7.x86_64
Python dependencies:
conda==22.11.0
convst==0.2.4
joblib==1.1.0
numba==0.56.4
numpy==1.21.6
pip==22.2.2
setuptools==63.4.1
scikit-learn==1.0.2
scipy==1.21.6
sktime==0.14.1
statsmodels==0.13.5
I am just running the ridge version using a simple wrapper for the sktime interface.
https://github.com/time-series-machine-learning/tsml-estimator-evaluation/blob/main/tsml_estimator_evaluation/sktime_estimators/classification/rdst.py
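For context, a minimal sketch of what that wrapper looks like, inferred from the traceback above (the linked file is the authoritative version, and the class name below is an assumption):

```python
# Minimal sketch of an sktime wrapper around the ridge version, inferred
# from the traceback above; see the linked repository for the real code.
from sktime.classification.base import BaseClassifier
from convst.classifiers import R_DST_Ridge  # name assumed from convst 0.2.x

class RDST(BaseClassifier):
    def _fit(self, X, y):
        self.clf = R_DST_Ridge()
        self.clf.fit(X, y)
        return self

    def _predict(self, X):
        return self.clf.predict(X)
```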
There are more dependencies installed than the ones listed above, but none of the other sktime numba code seems to have an issue in the same environment.
The workflow runs many individual jobs over many distributed cores, so it may not be typical. Again, the first few runs seemed to be fine, but once the functions are cached, errors start to appear.
I see. As no errors are thrown when the tests are run on Python 3.7, I doubt it would change anything, but could you by any chance try to run it on Python 3.8+ or use pickle5? (see https://numba.readthedocs.io/en/stable/developer/caching.html) From what I understand, it may impact how caching is handled.
Nevertheless, there is definitely something wrong with the Ensemble version, even on Python 3.11: #34 shows a high standard deviation in timings for RDST Ensemble, which could indicate that functions are being compiled and cached again after other models have been run.
I suspect it has something to do with the combination of multiple joblib processes using numba parallel, although I followed the instructions from https://numba.readthedocs.io/en/stable/user/threading-layer.html#example-of-limiting-the-number-of-threads (a minimal sketch of that pattern is shown below). The fact that it is spread over multiple machines may also be part of the issue; I have never tested it in this context.
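A minimal sketch of that pattern, assuming a joblib process backend (the function and variable names are illustrative):

```python
# Sketch of the numba-docs pattern: cap the numba thread pool inside each
# joblib worker so that (number of workers) x (numba threads) does not
# oversubscribe the machine. Function and variable names are illustrative.
from joblib import Parallel, delayed
import numba

def _fit_one(model, X, y, threads_per_worker):
    # must not exceed the NUMBA_NUM_THREADS the process started with
    numba.set_num_threads(threads_per_worker)
    return model.fit(X, y)

# e.g. 4 workers, each restricted to 2 numba threads:
# fitted = Parallel(n_jobs=4)(delayed(_fit_one)(m, X, y, 2) for m in models)
```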
This will require a bit of time to fix, I'm afraid. If it helps, I can provide results generated on my end.
Yeah, no issue, I can try giving it a run with more datasets on my own machine and with your suggestions on the cluster. Currently, I am just running it without numba on the cluster (I won't report any timing results from these, as that would be unfair).
Thanks, if you do find any strange differences in accuracy results, I would also appreciate feedback. I will update progress on this issue when I find the source of the problem.
The issue may be caused by what is described as cache invalidation: numba functions from the _commons.py file are loaded by multiple independent processes, causing a recompilation and, further down the line, a cache explosion.
See https://numba.discourse.group/t/cache-behaviour/1520.
@MatthewMiddlehurst Version 0.2.5 fixed the issues I was noticing on my side, at the cost of ~20% more run-time for RDST Ensemble, at least until I learn how to properly manage a thread pool with numba and joblib threads.
Hopefully that also fixes the issues on your side; I would appreciate an update when you have some time.
Just a quick update: I ran my setup using Python 3.10 and the newest update and still had similar errors. It could possibly be a conflict with another dependency, as I am running it through a larger package. This is probably more HPC-specific, as I'm running >200 builds at once, but it's weird that I haven't seen this with any of my other numba code, which also caches.
No issues running after I hacked it to remove caching from the transform parts. I am unsure how this has impacted performance, but it still seemed to finish everything rather quickly (much faster than no numba, at least). As a temporary fix, maybe allow setting a global variable to disable caching on these functions? I have not tried it, so it may not be possible, but I don't see why it wouldn't work 🙂 (rough sketch below).
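Something along these lines, as a rough sketch (the flag and function below are made up for illustration, not convst's actual code):

```python
# Rough sketch of the suggested global switch: a module-level flag read when
# the decorator runs at import time, so caching can be disabled for HPC runs.
from numba import njit

USE_NUMBA_CACHE = True  # flip to False before the decorated modules are imported

@njit(cache=USE_NUMBA_CACHE, fastmath=True)
def manhattan(x, y):
    s = 0.0
    for i in range(x.shape[0]):
        s += abs(x[i] - y[i])
    return s
```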
Thanks for the update! I indeed suspect it has something to do with the HPC cluster, as I cannot reproduce anything on my end, but it is a bit worrying that it only happens with this numba code...
I think the global variable approach is the best one; if it is feasible, I'll look into it and close this issue when it's done.
I would be curious as to why the problem happens, but I don't have an HPC cluster available as of now :/ Will update if I can manage to get to the bottom of it.
Hi Matthew,
Sorry for the delay, I have been quite busy with the postdoc. So, I've tried some different solutions, and the one that worked with minimal complexity was to add variables defined in convst/__init__.py to modify the compilation args of the numba functions across the whole module. They should be set at the start of your script, before any compilation. These changes will be available in the new 0.2.6 version.
You can see how to do this in this example: https://github.com/baraline/convst/blob/main/examples/Changing_numba_options.py
Alternatively, if this does not work for your setup, you can simply modify the value of the parameter in the __init__.py file to affect the entire module, which is still a hack, but does not require you to change the value across all the files. Hope this fixes the issue on your side!
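A rough sketch of the intended usage (the flag name below is illustrative; the real names are defined in convst/__init__.py and shown in the linked example):

```python
# Rough sketch based on the linked example; the flag name is illustrative,
# the real ones live in convst/__init__.py.
import convst

convst.__USE_NUMBA_CACHE__ = False  # e.g. disable on-disk caching for HPC runs

# Import the estimators only after setting the flags, so the numba
# functions are compiled with the chosen options.
from convst.classifiers import R_DST_Ridge  # class name assumed, see above

clf = R_DST_Ridge(n_jobs=1)
```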
Thanks for looking into it @baraline! We managed to run everything using numba without the caching, and it all looks faithful to the reported results, very impressive work! The changes should make future runs easier.
Thanks for your feedback @MatthewMiddlehurst! Don't hesitate to re-open this issue if the fix does not work.
Additionally, if you have run RDST/RDST Ensemble on multivariate datasets, I fixed a bug in version 0.2.7 that caused some shapelets to not be generated correctly when n_jobs > 1, which improved the performance on some multivariate datasets on my side (see #44).