christopherjenness/DBCV

nan in result

alashkov83 opened this issue · 6 comments

In my program your DBCV code returns NaN in some cases (sklearn's Calinski-Harabasz and Silhouette indices work fine with this data: 3-dimensional, about 200-1000 points).

I figured it out: for data where the number of labels is 1 (one cluster, no noise), DBCV returns NaN, while sklearn's Calinski-Harabasz and Silhouette indices raise ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

I thought the problem was only with the labels, so I added this check:

if len(set(labels)) < 2 or len(set(labels)) > len(labels) - 1:
    raise ValueError("Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)")

With that, some of the NaN values were replaced by exceptions.
But not all! Some NaNs remained.
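For reference, the label guard above can be packaged and exercised like this (a sketch; the helper name check_number_of_labels is hypothetical, and noise handling is assumed to already be encoded in the labels):

```python
def check_number_of_labels(labels):
    # Valid label counts for a cluster-validity index: at least 2 clusters,
    # at most n_samples - 1 (each cluster must have at least one point,
    # and not every point may be its own cluster).
    n_labels = len(set(labels))
    n_samples = len(labels)
    if not 2 <= n_labels <= n_samples - 1:
        raise ValueError(
            "Number of labels is %d. Valid values are 2 to n_samples - 1 "
            "(inclusive)" % n_labels)

check_number_of_labels([0, 0, 1, 1, 1])   # two clusters: passes silently

try:
    check_number_of_labels([0, 0, 0, 0])  # one cluster: the NaN case
except ValueError as e:
    print(e)
```

This mirrors the error message sklearn raises, instead of letting the computation run on and produce NaN.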
In terminal I got:

test_of_nan.py:53: RuntimeWarning: divide by zero encountered in double_scalars
  core_dist = (numerator / (n_neighbors)) ** (-1 / n_features)
test_of_nan.py:198: RuntimeWarning: invalid value encountered in double_scalars
  cluster_validity = numerator / denominator
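The first warning can be reproduced without the library at all: if every neighbor coincides with the point (exact duplicates, e.g. after rounding or scaling), filtering out zero distances leaves an empty array, the numerator sums to 0.0, and 0.0 raised to a negative power is inf, which later turns into NaN. A minimal sketch with made-up data (numpy distances stand in for scipy's cdist):

```python
import numpy as np

# Hypothetical data: a point whose five "neighbors" are exact duplicates of it.
point = np.array([1.0, 2.0, 3.0])
neighbors = np.tile(point, (5, 1))
n_features = point.shape[0]

dist = np.linalg.norm(neighbors - point, axis=1)  # all zeros
dist = dist[dist != 0]                            # empty array!

numerator = ((1.0 / dist) ** n_features).sum()    # sum over empty array -> 0.0
with np.errstate(divide="ignore"):                # this is the reported warning
    core_dist = (numerator / len(neighbors)) ** (-1.0 / n_features)
print(core_dist)  # inf, which later propagates into numerator / denominator as NaN
```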

The data for verification and the verification script can be found in the attachment.

nan_error.zip

I changed the code in the function:

import numpy as np
from scipy.spatial.distance import cdist


def _core_dist(point, neighbors):
    """
    Computes the core distance of a point.
    Core distance is the inverse density of an object.
    Args:
        point (np.array): array of dimensions (n_features,)
            point to compute core distance of
        neighbors (np.ndarray): array of dimensions (n_neighbors, n_features):
            array of all other points in object class
    Returns: core_dist (float)
        inverse density of point
    """
    n_features = np.shape(point)[0]
    n_neighbors = np.shape(neighbors)[0]  # axis 0 is n_neighbors per the docstring

    distance_vector = cdist(point.reshape(1, -1), neighbors)
    # This line was the problem: filtering out zero distances can leave
    # an empty array when all neighbors coincide with the point.
    distance_vector = distance_vector[distance_vector != 0]
    if len(distance_vector) != 0:
        numerator = ((1 / distance_vector) ** n_features).sum()
        core_dist = (numerator / n_neighbors) ** (-1 / n_features)
    else:
        core_dist = 0.0
    return core_dist

But I'm not sure this code is correct! core_dist = 0 implies density = inf.

I have faced the same issue after scaling my features with StandardScaler() before computing the DBCV score. The problem was that the range of values in distance_vector = distance_vector[distance_vector != 0] was too large. Consequently, when computing numerator = ((1 / distance_vector) ** n_features).sum(), the value was too small and was rounded to 0.0 by numpy. I managed to solve this by converting the distance_vector variable to np.float128() first.
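Note that np.float128 is not available on every platform (e.g. on Windows numpy has no 128-bit float), so a more portable workaround is to do the same computation in log space, where (1/d)**n_features cannot underflow. A sketch, assuming the filtered distance_vector is used for the count as well (the helper name is hypothetical):

```python
import numpy as np

def core_dist_logspace(distance_vector, n_features):
    # log of each term (1/d)**n_features
    log_terms = -n_features * np.log(distance_vector)
    # log-sum-exp: subtract the max so np.exp never underflows to all zeros
    m = log_terms.max()
    log_numerator = m + np.log(np.exp(log_terms - m).sum())
    # core_dist = (numerator / n) ** (-1 / n_features), computed in log space
    log_mean = log_numerator - np.log(len(distance_vector))
    return np.exp((-1.0 / n_features) * log_mean)
```

For moderate distances this agrees with the direct formula, but it stays finite even when every (1/d)**n_features would round to 0.0 in float64.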