sophryu99/TIL

Clustering: algorithm and related library usage (scikit-learn)

Clustering with scikit-learn

sklearn.cluster
Clustering algorithms in scikit-learn: documentation

KMeans

The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia. This algorithm requires the number of clusters to be specified.
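As a rough illustration of the point that the number of clusters has to be given up front, here is a minimal sketch of fitting `KMeans` on a hypothetical toy array. The data, `n_init=10`, and `random_state=0` are assumptions chosen only to make the run reproducible; they are not from the original notes.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of 2-D points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# n_clusters must be chosen before fitting; n_init and random_state are
# set explicitly here so repeated runs give the same result.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = model.labels_              # cluster index assigned to each sample
centroids = model.cluster_centers_  # one mean vector per cluster
inertia = model.inertia_            # within-cluster sum of squares
```

After fitting, `labels_`, `cluster_centers_`, and `inertia_` hold the assignments, the centroids, and the criterion being minimized.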

Basic idea: the algorithm divides a set of samples into disjoint clusters, each described by the mean of the samples in that cluster. These means are called 'centroids'.

Steps

  1. Choose the initial centroids by selecting arbitrary samples from the dataset.
  2. Assign each sample to its nearest centroid, then create new centroids by taking the mean of all samples assigned to each previous centroid.
  3. Compute the difference between the old and the new centroids, and repeat steps 2 and 3 until that difference is less than a threshold.
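The three steps above can be sketched in plain NumPy. This is a simplified illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the parameters `tol`, `max_iter`, and `seed`, and the toy data are all hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, tol=1e-4, max_iter=100, seed=0):
    """Simplified k-means loop (hypothetical helper, for illustration)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2a: assign each sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2b: each new centroid is the mean of its assigned samples
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 3: stop once the centroids move less than the threshold.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Hypothetical toy data: two pairs of points far apart on a line.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
centroids, labels = kmeans_sketch(X, k=2)
```

On this toy data the loop settles on one centroid per pair of points. On real data the result can depend on the initial centroids, which is one reason library implementations run several initializations.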

The K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion:

$$\sum_{i=0}^{n} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)$$

where the $x_i$ are the samples and $C$ is the set of centroids $\mu_j$.
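To make the within-cluster sum-of-squares criterion concrete, the sketch below computes it by hand with NumPy for a given set of centroids. The helper name `inertia` and the toy arrays are hypothetical; for a fitted scikit-learn model the corresponding value is exposed as `inertia_`.

```python
import numpy as np

def inertia(X, centroids):
    """Sum of squared distances from each sample to its nearest centroid."""
    # Squared Euclidean distance from every sample to every centroid.
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Each sample contributes the squared distance to its closest centroid.
    return sq_dists.min(axis=1).sum()

# Hypothetical toy data and centroids.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
centroids = np.array([[0.5, 0.0], [10.5, 0.0]])

value = inertia(X, centroids)  # four samples, each 0.5 from a centroid: 4 * 0.25
```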

The elbow method

  • Choose a range of K
  • Run K-means for every K in your range
  • After each run, calculate the sum of squared errors
  • Also calculate the change in slope between consecutive sums
  • The 'elbow', or turning point, is the run with the largest change in slope
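One possible way to locate the turning point numerically is to take differences of the sum-of-squared-errors curve twice and pick the largest value. The sketch below assumes evenly spaced values of K and uses made-up error values purely for illustration; they are not results from any real dataset.

```python
import numpy as np

# Hypothetical sum-of-squared-errors values for K = 1..6 (illustrative only).
ks = np.array([1, 2, 3, 4, 5, 6])
sse = np.array([1000.0, 600.0, 200.0, 170.0, 150.0, 140.0])

slopes = np.diff(sse)            # change in SSE between consecutive K
slope_changes = np.diff(slopes)  # change in slope between consecutive runs

# slope_changes[i] describes the bend at ks[i + 1].
elbow_k = int(ks[np.argmax(slope_changes) + 1])
```

With these numbers the curve drops steeply up to K = 3 and flattens afterwards, so `elbow_k` comes out as 3. On noisy curves this heuristic can be unstable, so it is worth checking the plot as well.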

Clustering vs Classification

Clustering

  • Ideally, captures generating distributions

  • Practically, is an exploration of the structure of your dataset

  • Data points are fixed

  • Cluster centers are searched for

Classification

  • Labels are fixed

  • Transformations of the data points are searched for

Elbow method pseudocode

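# Finding the ideal k
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial.distance import cdist
from sklearn import cluster

# X is assumed to be an (n_samples, n_features) array defined earlier.
distortions = []
inertias = []
mapping1 = {}  # k -> distortion
mapping2 = {}  # k -> inertia
K = range(4, 10)

for k in K:
    # Build and fit the model (a single fit call is enough)
    kmeanModel = cluster.KMeans(n_clusters=k).fit(X)

    # Distortion: mean Euclidean distance from each sample to its nearest centroid
    distortion = np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'),
                        axis=1).mean()
    distortions.append(distortion)
    # Inertia: sum of squared distances from each sample to its nearest centroid
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = distortion
    mapping2[k] = kmeanModel.inertia_

# Plot distortion against k and look for the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('The Elbow Method using Distortion')
plt.show()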