christopherjenness/DBCV

nan in result

alashkov83 opened this issue · 6 comments

In my program your DBCV code returns NaN in some cases (sklearn's Calinski-Harabasz and Silhouette indices work fine with this data: 3-dimensional, about 200-1000 points).

I figured it out: for data where the number of labels is 1 (one cluster, no noise), DBCV returns NaN, while sklearn's Calinski-Harabasz and Silhouette indices raise ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

I thought the problem was only with the labels, so I added this check:

if len(set(labels)) < 2 or len(set(labels)) > len(labels) - 1:
    raise ValueError("Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)")

With that, some of the NaN values were replaced by exceptions.
But not all! Some NaNs remained.
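For reference, the label guard above can be packaged and exercised like this (a sketch; the helper name check_number_of_labels is hypothetical, and noise handling is assumed to already be encoded in the labels):

```python
def check_number_of_labels(labels):
    # Valid label counts for a cluster-validity index: at least 2 clusters,
    # at most n_samples - 1 (each cluster must have at least one point,
    # and not every point may be its own cluster).
    n_labels = len(set(labels))
    n_samples = len(labels)
    if not 2 <= n_labels <= n_samples - 1:
        raise ValueError(
            "Number of labels is %d. Valid values are 2 to n_samples - 1 "
            "(inclusive)" % n_labels)

check_number_of_labels([0, 0, 1, 1, 1])   # two clusters: passes silently

try:
    check_number_of_labels([0, 0, 0, 0])  # one cluster: the NaN case
except ValueError as e:
    print(e)
```

This mirrors the error message sklearn raises, instead of letting the computation run on and produce NaN.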
In terminal I got:

test_of_nan.py:53: RuntimeWarning: divide by zero encountered in double_scalars
  core_dist = (numerator / (n_neighbors)) ** (-1 / n_features)
test_of_nan.py:198: RuntimeWarning: invalid value encountered in double_scalars
  cluster_validity = numerator / denominator
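The first warning can be reproduced without the library at all: if every neighbor coincides with the point (exact duplicates, e.g. after rounding or scaling), filtering out zero distances leaves an empty array, the numerator sums to 0.0, and 0.0 raised to a negative power is inf, which later turns into NaN. A minimal sketch with made-up data (numpy distances stand in for scipy's cdist):

```python
import numpy as np

# Hypothetical data: a point whose five "neighbors" are exact duplicates of it.
point = np.array([1.0, 2.0, 3.0])
neighbors = np.tile(point, (5, 1))
n_features = point.shape[0]

dist = np.linalg.norm(neighbors - point, axis=1)  # all zeros
dist = dist[dist != 0]                            # empty array!

numerator = ((1.0 / dist) ** n_features).sum()    # sum over empty array -> 0.0
with np.errstate(divide="ignore"):                # this is the reported warning
    core_dist = (numerator / len(neighbors)) ** (-1.0 / n_features)
print(core_dist)  # inf, which later propagates into numerator / denominator as NaN
```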

The data for verification and the verification script can be found in the attachment.

nan_error.zip

I changed the code in the function:

import numpy as np
from scipy.spatial.distance import cdist


def _core_dist(point, neighbors):
    """
    Computes the core distance of a point.
    Core distance is the inverse density of an object.
    Args:
        point (np.array): array of dimensions (n_features,)
            point to compute core distance of
        neighbors (np.ndarray): array of dimensions (n_neighbors, n_features):
            array of all other points in object class
    Returns: core_dist (float)
        inverse density of point
    """
    n_features = np.shape(point)[0]
    n_neighbors = np.shape(neighbors)[0]  # axis 0 is n_neighbors per the docstring

    distance_vector = cdist(point.reshape(1, -1), neighbors)
    # This line was the problem: filtering out zero distances can leave
    # an empty array when all neighbors coincide with the point.
    distance_vector = distance_vector[distance_vector != 0]
    if len(distance_vector) != 0:
        numerator = ((1 / distance_vector) ** n_features).sum()
        core_dist = (numerator / n_neighbors) ** (-1 / n_features)
    else:
        core_dist = 0.0
    return core_dist

But I'm not sure this code is correct! core_dist = 0 implies density = inf.

I have faced the same issue after scaling my features with StandardScaler() before computing the DBCV score. The problem was that the range of values in distance_vector = distance_vector[distance_vector != 0] was too large. Consequently, when computing numerator = ((1 / distance_vector) ** n_features).sum(), the value was too small and was rounded to 0.0 by numpy. I managed to solve this by converting the distance_vector variable to np.float128() first.
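Note that np.float128 is not available on every platform (e.g. on Windows numpy has no 128-bit float), so a more portable workaround is to do the same computation in log space, where (1/d)**n_features cannot underflow. A sketch, assuming the filtered distance_vector is used for the count as well (the helper name is hypothetical):

```python
import numpy as np

def core_dist_logspace(distance_vector, n_features):
    # log of each term (1/d)**n_features
    log_terms = -n_features * np.log(distance_vector)
    # log-sum-exp: subtract the max so np.exp never underflows to all zeros
    m = log_terms.max()
    log_numerator = m + np.log(np.exp(log_terms - m).sum())
    # core_dist = (numerator / n) ** (-1 / n_features), computed in log space
    log_mean = log_numerator - np.log(len(distance_vector))
    return np.exp((-1.0 / n_features) * log_mean)
```

For moderate distances this agrees with the direct formula, but it stays finite even when every (1/d)**n_features would round to 0.0 in float64.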