kNN causes an error when the input has nominal data
e10e3 opened this issue · 3 comments
Versions
River version: 0.21.1
Python version: 3.12.4
Operating system: macOS 14.5
Describe the bug
When a kNN model does a prediction, it computes the distance of its input with previous data points.
If some of the feature are nominal (i.e. not numbers), kNN will cause an error because it tries to perform a subtraction on unsupported data types.
One would expect kNN to be resilient to nominal data, since there exist ways to derive a distance between non-numeric features. A simple one is to give a distance of 0 when they are equal and 1 if they are different.
Code to reproduce
from river import neighbors
m = neighbors.KNNClassifier()
m.learn_one({"a": "river"}, 0)
m.predict_one({"a": "river"})
Output
Traceback (most recent call last):
File "/path/to/river/knn-strings.py", line 6, in <module>
m.predict_one({"a": "river"})
File "/path/to/river/river/base/classifier.py", line 66, in predict_one
y_pred = self.predict_proba_one(x, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/river/river/neighbors/knn_classifier.py", line 149, in predict_proba_one
nearest = self._nn.search((x, None), n_neighbors=self.n_neighbors, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/river/river/neighbors/ann/swinn.py", line 431, in search
return self._linear_scan(item, n_neighbors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/river/river/neighbors/ann/swinn.py", line 339, in _linear_scan
points = [(p.item, self.dist_func(item, p.item)) for p in self]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/river/river/neighbors/base.py", line 33, in __call__
return self.distance_function(a[0], b[0])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/river/river/utils/math.py", line 165, in minkowski_distance
return sum((abs(a.get(k, 0.0) - b.get(k, 0.0))) ** p for k in {*a.keys(), *b.keys()}) ** (1 / p)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/river/river/utils/math.py", line 165, in <genexpr>
return sum((abs(a.get(k, 0.0) - b.get(k, 0.0))) ** p for k in {*a.keys(), *b.keys()}) ** (1 / p)
~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'str' and 'str'
Hi @e10e3, I see your point with the error. However, following the river convention, this type of situation should be dealt with on the application side. So, it is not a bug, but the intended behavior.
I am also against making kNN slow in cases where the data is numeric only.
Note that linear models, decisions trees and other models are also prone to fail if the user passes numeric and non-numeric data directly.
As possible solutions, the user can use a pipeline with One-Hot Encoding or supplying a custom distance metric as you propose in the related PR.
I agree with your point.
I guess I won't be the last (nor the first) to propose such change, is this convention explained in a document for future reference?