Text-Clustering

Python Program for Text Clustering using Bisecting k-means

Primary Language: Jupyter Notebook

Normalized Mutual Information (NMI) Score: 0.6934

Approach:

  1. The input data, 8580 text records in sparse format, is first read into a CSR matrix.
  2. The CSR matrix is scaled by IDF, each row is normalized by its L2 norm, and the result is converted to a dense ndarray.
  3. The array is then split into the desired number of clusters using the bisecting k-means approach.
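
The steps above can be sketched roughly as follows. This is a minimal illustration using only NumPy on a small dense count matrix, not the notebook's actual code. The function names (`idf_scale_and_normalize`, `two_means`, `bisecting_kmeans`), the IDF variant `log(N / df)`, the rule of always splitting the cluster with the largest sum of squared errors, and the toy data are all assumptions made for the example.

```python
import numpy as np

def idf_scale_and_normalize(counts):
    """Scale term counts by IDF, then L2-normalize each row."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    doc_freq = np.count_nonzero(counts, axis=0)     # documents containing each term
    idf = np.log(n_docs / np.maximum(doc_freq, 1))  # assumed IDF variant: log(N / df)
    scaled = counts * idf
    norms = np.linalg.norm(scaled, axis=1, keepdims=True)
    return scaled / np.where(norms == 0, 1.0, norms)  # leave all-zero rows as zeros

def two_means(points, rng, n_iter=20):
    """Plain 2-means; returns a boolean mask selecting one of the two halves."""
    centers = points[rng.choice(len(points), size=2, replace=False)]
    mask = np.arange(len(points)) < len(points) // 2  # fallback split for degenerate data
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        closer_to_first = dists[:, 0] <= dists[:, 1]
        if closer_to_first.all() or not closer_to_first.any():
            break  # one side would be empty, keep the last valid split
        mask = closer_to_first
        centers = np.stack([points[mask].mean(axis=0), points[~mask].mean(axis=0)])
    return mask

def bisecting_kmeans(points, k, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist.

    Assumes k does not exceed the number of distinct rows in `points`.
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    clusters = [np.arange(len(points))]  # each cluster is an array of row indices
    while len(clusters) < k:
        sse = [np.sum((points[idx] - points[idx].mean(axis=0)) ** 2) for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))
        mask = two_means(points[idx], rng)
        clusters += [idx[mask], idx[~mask]]
    labels = np.empty(len(points), dtype=int)
    for label, idx in enumerate(clusters):
        labels[idx] = label
    return labels

# Toy term-count matrix (hypothetical data): 6 documents, 6 terms.
counts = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [0, 0, 3, 1, 0, 0],
    [0, 0, 1, 3, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 2, 1],
])
features = idf_scale_and_normalize(counts)
print(bisecting_kmeans(features, k=3))
```

Because each 2-means split starts from randomly chosen centers, different seeds can give different partitions. Splitting the largest cluster by size instead of by SSE is another common choice.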

The Calinski-Harabasz score (Caliński, T., & Harabasz, J. (1974). “A dendrite method for cluster analysis”. Communications in Statistics-theory and Methods 3: 1-27.) was computed on the resulting clusters for values of k from 3 to 21 in steps of 2 on the given dataset.
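
As a rough sketch of how such a sweep might look, the snippet below computes the score directly from its definition, the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. The names `calinski_harabasz` and `sweep_k` are illustrative, and `cluster_fn` is a hypothetical stand-in for whatever clustering routine is used; the random data and round-robin labels exist only to make the example run. scikit-learn also provides this metric as `sklearn.metrics.calinski_harabasz_score`.

```python
import numpy as np

def calinski_harabasz(points, labels):
    """Between-cluster dispersion over within-cluster dispersion, each per degree of freedom."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    n = len(points)
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    overall_mean = points.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in cluster_ids:
        members = points[labels == c]
        center = members.mean(axis=0)
        between += len(members) * np.sum((center - overall_mean) ** 2)
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))

def sweep_k(points, cluster_fn, k_values=range(3, 22, 2)):
    """Score the clustering returned by cluster_fn(points, k) for each k (3, 5, ..., 21)."""
    return {k: calinski_harabasz(points, cluster_fn(points, k)) for k in k_values}

# Hypothetical stand-in clusterer: assigns rows to clusters round-robin.
# It is only a placeholder so the sweep can be run end to end.
def round_robin(points, k):
    return np.arange(len(points)) % k

rng = np.random.default_rng(0)
data = rng.normal(size=(60, 5))  # made-up data, not the 8580-record dataset
for k, score in sweep_k(data, round_robin).items():
    print(k, round(float(score), 3))
```

A higher score indicates denser, better separated clusters, so the k with the largest value is the usual candidate.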

The score is plotted on the y-axis against the values of k on the x-axis.

[Plot: Calinski-Harabasz score versus k]