KNN on Max Series seems slower than CUDA-based implementation on comparable devices?
fcharras opened this issue · 4 comments
The initial report contained an error; please read through the first comment for a better explanation.
import numpy as np
from sklearn.neighbors import NearestNeighbors
import sklearn
device = "cpu"
# device = "gpu:0"
from sklearnex import patch_sklearn
patch_sklearn()
sklearn.set_config(target_offload=f"{device}")
seed = 123
rng = np.random.default_rng(seed)
n_samples = 10_000_000
dim = 100
n_queries = 10_000
k = 100
data = rng.random((n_samples, dim), dtype=np.float32)
query = rng.random((n_queries, dim), dtype=np.float32)
knn = NearestNeighbors(n_neighbors=k, algorithm="brute")
knn.fit(data)
%time knn.kneighbors(X=query)
This shows the following results:
- if device=cpu:
CPU times: user 25min 40s, sys: 18 s, total: 25min 58s
Wall time: 14.1 s
- if device=gpu (Max Series on Intel beta cloud):
CPU times: user 25min 42s, sys: 21.7 s, total: 26min 4s
Wall time: 14.1 s
but one could expect a significant speedup on GPU.
Comparing with the cuml implementation on an A100 (in fact inherited from the open-source FAISS implementation):
import numpy as np
from cuml.neighbors import NearestNeighbors
import cupy
seed = 123
rng = np.random.default_rng(seed)
n_samples = 10_000_000
dim = 100
n_queries = 10_000
k = 100
data = rng.random((n_samples, dim), dtype=np.float32)
query = rng.random((n_queries, dim), dtype=np.float32)
data = cupy.asarray(data)
query = cupy.asarray(query)
knn = NearestNeighbors(n_neighbors=k, algorithm="brute")
knn.fit(data)
%time knn.kneighbors(X=query)
it takes about 3 seconds:
CPU times: user 2.71 s, sys: 8.49 ms, total: 2.72 s
Wall time: 2.73 s
Environment:
scikit-learn-intelex + dpcpp_cpp_rt installed with conda, running on a Max Series GPU on the Intel beta cloud.
There is actually an error in my initial snippet: it imports the NearestNeighbors estimator before calling patch_sklearn. It should read:
import numpy as np
import sklearn
device = "cpu"
# device = "gpu:0"
from sklearnex import patch_sklearn, config_context
patch_sklearn()
from sklearn.neighbors import NearestNeighbors
seed = 123
rng = np.random.default_rng(seed)
n_samples = 10_000_000
dim = 100
n_queries = 10_000
k = 100
data = rng.random((n_samples, dim), dtype=np.float32)
query = rng.random((n_queries, dim), dtype=np.float32)
with config_context(target_offload=f"{device}"):
    knn = NearestNeighbors(n_neighbors=k, algorithm="brute")
    knn.fit(data)
    %time knn.kneighbors(X=query)
This significantly improves the wall time on CPU:
CPU times: user 6min 21s, sys: 4.6 s, total: 6min 26s
Wall time: 3.53 s
(NB: the CPU it runs on provides 254 cores, which is a lot; users usually have easier access to mid-range GPUs than to workstation CPUs with 64+ cores.)
But still no luck running it on GPU; I now get the following error:
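A toy illustration of why the import order matters. It assumes that patching works by rebinding attributes on the library's modules, which this thread does not confirm; `fake_neighbors`, `StockEstimator`, `AcceleratedEstimator` and `patch` are hypothetical names used only for the sketch, not part of sklearn or sklearnex.

```python
import types

# Hypothetical stand-in for a library module; not the real sklearn.neighbors.
fake_neighbors = types.ModuleType("fake_neighbors")


class StockEstimator:
    backend = "stock"


class AcceleratedEstimator:
    backend = "accelerated"


fake_neighbors.NearestNeighbors = StockEstimator

# Equivalent of `from fake_neighbors import NearestNeighbors` done too early:
# the local name is bound to the stock class and stays bound to it.
early_import = fake_neighbors.NearestNeighbors


def patch(module):
    """Toy patch: rebind the module attribute to the accelerated class."""
    module.NearestNeighbors = AcceleratedEstimator


patch(fake_neighbors)

# The same import performed after patching picks up the replacement.
late_import = fake_neighbors.NearestNeighbors

print(early_import.backend)
print(late_import.backend)
```

Under that assumption, a name imported before the patch keeps pointing at the original class, which would match the behaviour seen with the first snippet.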
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
INFO:sklearnex: sklearn.utils.validation._assert_all_finite: running accelerated version on CPU
INFO:sklearnex: sklearn.neighbors.NearestNeighbors.fit: running accelerated version on CPU
INFO:sklearnex: sklearn.utils.validation._assert_all_finite: running accelerated version on CPU
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[1], line 27
25 with config_context(target_offload=f"{device}"):
26 knn = NearestNeighbors(n_neighbors=k, algorithm="brute")
---> 27 knn.fit(data)
28 get_ipython().run_line_magic('time', 'knn.kneighbors(X=query)')
File ~/mambaforge/envs/sklex/lib/python3.10/site-packages/sklearnex/neighbors/knn_unsupervised.py:91, in NearestNeighbors.fit(self, X, y)
89 def fit(self, X, y=None):
90 self._fit_validation(X, y)
---> 91 dispatch(self, 'fit', {
92 'onedal': self.__class__._onedal_fit,
93 'sklearn': sklearn_NearestNeighbors.fit,
94 }, X, None)
95 return self
File ~/mambaforge/envs/sklex/lib/python3.10/site-packages/sklearnex/_device_offload.py:161, in dispatch(obj, method_name, branches, *args, **kwargs)
158 backend, q, cpu_fallback = _get_backend(obj, q, method_name, *hostargs)
160 if backend == 'onedal':
--> 161 return branches[backend](obj, *hostargs, **hostkwargs, queue=q)
162 if backend == 'sklearn':
163 return branches[backend](obj, *hostargs, **hostkwargs)
File ~/mambaforge/envs/sklex/lib/python3.10/site-packages/sklearnex/neighbors/knn_unsupervised.py:144, in NearestNeighbors._onedal_fit(self, X, y, queue)
142 self._onedal_estimator.effective_metric_ = self.effective_metric_
143 self._onedal_estimator.effective_metric_params_ = self.effective_metric_params_
--> 144 self._onedal_estimator.fit(X, y, queue=queue)
146 self._save_attributes()
File ~/mambaforge/envs/sklex/lib/python3.10/site-packages/onedal/neighbors/neighbors.py:722, in NearestNeighbors.fit(self, X, y, queue)
721 def fit(self, X, y, queue=None):
--> 722 return super()._fit(X, y, queue=queue)
File ~/mambaforge/envs/sklex/lib/python3.10/site-packages/onedal/neighbors/neighbors.py:248, in NeighborsBase._fit(self, X, y, queue)
246 if _is_classifier(self) or (_is_regressor(self) and gpu_device):
247 _fit_y = self._validate_targets(self._y, X.dtype).reshape((-1, 1))
--> 248 result = self._onedal_fit(X, _fit_y, queue)
250 if y is not None and _is_regressor(self):
251 self._y = y if self._shape is None else y.reshape(self._shape)
File ~/mambaforge/envs/sklex/lib/python3.10/site-packages/onedal/neighbors/neighbors.py:690, in NearestNeighbors._onedal_fit(self, X, y, queue)
686 train_alg = kdtree_knn_classification_training
688 return train_alg(**params).compute(X, y).model
--> 690 policy = self._get_policy(queue, X, y)
691 X, y = _convert_to_supported(policy, X, y)
692 params = self._get_onedal_params(X, y)
File ~/mambaforge/envs/sklex/lib/python3.10/site-packages/onedal/neighbors/neighbors.py:48, in NeighborsCommonBase._get_policy(self, queue, *data)
47 def _get_policy(self, queue, *data):
---> 48 return _get_policy(queue, *data)
File ~/mambaforge/envs/sklex/lib/python3.10/site-packages/onedal/common/_policy.py:33, in _get_policy(queue, *data)
31 return _DataParallelInteropPolicy(data_queue)
32 return _DataParallelInteropPolicy(queue)
---> 33 assert data_queue is None and queue is None
34 return _HostInteropPolicy()
AssertionError:
I thought about converting the data to an on-device usm_ndarray beforehand:
import numpy as np
import sklearn
import dpctl.tensor as dpt
# device = "cpu"
device = "gpu"
from sklearnex import patch_sklearn, config_context
patch_sklearn()
from sklearn.neighbors import NearestNeighbors
seed = 123
rng = np.random.default_rng(seed)
n_samples = 10_000_000
dim = 100
n_queries = 10_000
k = 100
data = rng.random((n_samples, dim), dtype=np.float32)
query = rng.random((n_queries, dim), dtype=np.float32)
data = dpt.asarray(data)
query = dpt.asarray(query)
with config_context(target_offload=f"{device}"):
    knn = NearestNeighbors(n_neighbors=k, algorithm="brute")
    knn.fit(data)
    %time knn.kneighbors(X=query)
but then the computation just hangs and outputs nothing.
So I found out I had a version mismatch in the conda dependency tree when not everything is installed from the -c intel channel. Fixing it does not change the performance I got on CPU:
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
CPU times: user 6min 19s, sys: 4.03 s, total: 6min 23s
Wall time: 3.5 s
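A minimal sketch for listing the installed versions of the packages involved, as a first check for this kind of mismatch. The distribution names in `PACKAGES` are assumptions and may differ from what the environment actually contains; the helper `installed_versions` is hypothetical. It does not show which channel a package came from, so `conda list` remains the reference for that.

```python
from importlib import metadata

# Distribution names are assumptions; adjust to what `conda list` shows.
PACKAGES = ["scikit-learn", "scikit-learn-intelex", "daal4py", "dpctl", "numpy"]


def installed_versions(names):
    """Return {name: version string, or None if the distribution is not found}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:>22}: {version or 'not installed'}")
```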
And now here is the result on the GPU Max Series:
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
CPU times: user 10.4 s, sys: 4.01 s, total: 14.4 s
Wall time: 14.5 s
This time it seems to work and to be properly dispatched to the GPU. It is about 5 times slower than the cuml backend on an NVIDIA A100 (see the report in the OP). The peak performance one can reach on the Intel Max Series is unknown, but the gap still feels larger than it should be, judging by the respective GPU specs.
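The figures reported above can be put side by side with a short calculation. The wall and user times are copied from the `%time` outputs in this thread. The FLOP count is an assumption: it treats the brute-force search as dominated by a matrix-product-like distance computation costing roughly 2 * n_samples * n_queries * dim operations and ignores the top-k selection, so the throughput numbers are rough estimates only.

```python
# Wall and user CPU times in seconds, copied from the reports above.
runs = {
    "sklearnex CPU": {"wall": 3.5, "user": 6 * 60 + 19},
    "sklearnex Max Series GPU": {"wall": 14.5, "user": 10.4},
    "cuml A100": {"wall": 2.73, "user": 2.71},
}

n_samples, n_queries, dim = 10_000_000, 10_000, 100

# Assumption: cost dominated by a GEMM-like distance computation,
# about 2 * n_samples * n_queries * dim floating-point operations.
flops = 2 * n_samples * n_queries * dim

baseline = runs["cuml A100"]["wall"]
for name, t in runs.items():
    slowdown = t["wall"] / baseline          # wall time relative to the A100 run
    busy_threads = t["user"] / t["wall"]     # average number of busy host threads
    tflops = flops / t["wall"] / 1e12        # effective throughput under the assumption
    print(f"{name:>26}: {slowdown:4.1f}x A100 wall time, "
          f"~{busy_threads:5.1f} busy host threads, "
          f"~{tflops:4.1f} TFLOP/s effective")
```

The ratio of user time to wall time is also a quick hint about where the work ran: it is above 100 for the CPU run and below 1 for the two GPU runs.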
@fcharras thank you for the report. Let me reproduce and investigate the issue.
Hi @fcharras, thank you for providing these results. We have reproduced the experiments and will create an internal feature request to identify ways to speed up this computation for more comparable results.