nan in result
alashkov83 opened this issue · 6 comments
In my program, your DBCV code returns NaN in some cases (sklearn's Calinski-Harabasz and Silhouette indices work fine on the same data: 3-dimensional, about 200-1000 points).
I figured it out: for data where the number of labels is 1 (a single cluster, no noise), DBCV returns NaN. sklearn's Calinski-Harabasz and Silhouette indices instead raise `ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)`.
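A minimal sketch of the single-cluster case (assuming the `DBCV(X, labels, dist_function=...)` signature from this repository's README; the data here is made up):

```python
import numpy as np
from scipy.spatial.distance import euclidean
from sklearn.metrics import silhouette_score
from DBCV import DBCV

X = np.random.rand(200, 3)         # 3-dimensional data, as in the report
labels = np.zeros(200, dtype=int)  # a single cluster, no noise

print(DBCV(X, labels, dist_function=euclidean))  # nan
silhouette_score(X, labels)  # ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
```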
At first I thought the problem was only in the labels, so I added this check:
```python
if len(set(labels)) < 2 or len(set(labels)) > len(labels) - 1:
    raise ValueError("Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)")
```
Part of the NaN values was replaced by exceptions. But not all! Some NaNs remained.
In the terminal I got:

```
test_of_nan.py:53: RuntimeWarning: divide by zero encountered in double_scalars
  core_dist = (numerator / (n_neighbors)) ** (-1 / n_features)
test_of_nan.py:198: RuntimeWarning: invalid value encountered in double_scalars
  cluster_validity = numerator / denominator
```
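The first warning points at the root cause: by that line the numerator has already collapsed to 0.0, and raising zero to a negative power is what numpy reports as a division by zero. A tiny reproduction with hypothetical values:

```python
import numpy as np

# Hypothetical values: once numerator underflows (or sums an empty array)
# to 0.0, the negative exponent in the core-distance formula divides by zero.
numerator, n_neighbors, n_features = np.float64(0.0), 5, 3
core_dist = (numerator / n_neighbors) ** (-1 / n_features)
# RuntimeWarning: divide by zero encountered in double_scalars
print(core_dist)  # inf
```

That `inf` then propagates through the validity computation, which is presumably where the second warning's NaN appears.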
The data for verification and the verification script can be found in the attachment.
I changed the code in this function:
```python
import numpy as np
from scipy.spatial.distance import cdist

def _core_dist(point, neighbors):
    """
    Computes the core distance of a point.
    Core distance is the inverse density of an object.

    Args:
        point (np.array): array of dimensions (n_features,)
            point to compute core distance of
        neighbors (np.ndarray): array of dimensions (n_neighbors, n_features):
            array of all other points in object class

    Returns: core_dist (float)
        inverse density of point
    """
    n_features = np.shape(point)[0]
    n_neighbors = np.shape(neighbors)[1]

    distance_vector = cdist(point.reshape(1, -1), neighbors)
    print(distance_vector)  # debug output
    distance_vector = distance_vector[distance_vector != 0]  # here was the problem: in some cases this leaves an empty array
    if len(distance_vector) != 0:
        numerator = ((1 / distance_vector) ** n_features).sum()
        core_dist = (numerator / n_neighbors) ** (-1 / n_features)
    else:
        core_dist = 0.0
    return core_dist
```
But I'm not sure this code is correct! `core_dist = 0` implies `density = inf`.
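For what it's worth, here is a hypothetical illustration of how `distance_vector` can end up empty, for example when a cluster contains exact duplicate points:

```python
import numpy as np
from scipy.spatial.distance import cdist

point = np.array([1.0, 2.0, 3.0])
neighbors = np.tile(point, (4, 1))          # four exact duplicates of point
d = cdist(point.reshape(1, -1), neighbors)  # all distances are 0
d = d[d != 0]                               # the filter leaves an empty array
print(d.size, (1 / d).sum())                # 0 0.0 -> numerator becomes 0.0
```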
I have faced the same issue after scaling my features with `StandardScaler()` before computing the DBCV score. The problem was that the range of values remaining in `distance_vector` after `distance_vector = distance_vector[distance_vector != 0]` was too large. Consequently, when computing `numerator = ((1 / distance_vector) ** n_features).sum()`, the value was too small and was rounded down to 0.0 by numpy. I managed to solve this by converting the `distance_vector` variable to `np.float128` first.
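A minimal sketch of that workaround applied to `_core_dist` (assuming a numpy build where `np.float128` is available, e.g. Linux or macOS; native Windows builds do not ship it):

```python
import numpy as np
from scipy.spatial.distance import cdist

def _core_dist_f128(point, neighbors):
    """Core distance computed in extended precision to avoid underflow to 0.0."""
    n_features = np.shape(point)[0]
    n_neighbors = np.shape(neighbors)[1]  # kept as in the code quoted above

    distance_vector = cdist(point.reshape(1, -1), neighbors)
    # Cast to extended precision BEFORE the power/sum, so that
    # (1 / distance_vector) ** n_features does not round to 0.0.
    distance_vector = distance_vector[distance_vector != 0].astype(np.float128)
    numerator = ((1 / distance_vector) ** n_features).sum()
    return float((numerator / n_neighbors) ** (-1 / n_features))
```

Note that this only addresses the underflow case; the empty `distance_vector` case discussed above still needs an explicit guard.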