Clustering: algorithms and related library usage (scikit-learn)

Question

Opened this issue 3 years ago · 3 comments

Clustering with scikit-learn

sklearn.cluster
Clustering algorithms in scikit-learn: documentation

Answer 1 · 2021-07-28T05:50:14.000Z

KMeans

The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia. This algorithm requires the number of clusters to be specified.
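As a minimal sketch of what "the number of clusters must be specified" looks like in practice, the snippet below fits `sklearn.cluster.KMeans` on a tiny made-up array. The data values, `n_init=10` and `random_state=0` are illustrative assumptions, not settings from the original post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six made-up 2-D samples forming two loose groups (illustrative data only).
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])

# n_clusters has to be chosen up front; KMeans does not infer it.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster index assigned to each sample
print(kmeans.cluster_centers_)  # one centroid per cluster
print(kmeans.inertia_)          # within-cluster sum of squared distances
```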

Basic idea: divide a set of samples into disjoint clusters, each described by the mean of the samples in that cluster. These means are called 'centroids'.

Steps

Choose the initial centroids by selecting arbitrary samples from the dataset.
Assign each sample to its nearest centroid, then create new centroids by taking the mean of all samples assigned to each previous centroid.
Compute the difference between the old and the new centroids, and repeat the assignment and update steps until that difference is less than a threshold.
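The steps above can be sketched in plain NumPy as follows. This is an illustration of the basic loop only, not scikit-learn's implementation; the function name `kmeans_sketch`, the parameters `tol`, `max_iter` and `seed`, and the sample data are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, tol=1e-4, max_iter=100, seed=0):
    """Basic k-means loop following the three steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2a: assign each sample to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2b: each new centroid is the mean of the samples assigned to it
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 3: stop once the centroids move less than the threshold.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Made-up data: two pairs of nearby points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, labels = kmeans_sketch(X, k=2)
print(centroids, labels)
```

Which cluster gets index 0 depends on the random initial choice, so only the grouping is meaningful, not the label values themselves.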

The K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion:

calculating inertia
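For reference, the scikit-learn documentation writes this criterion as below, where the x_i are the n samples and C is the set of centroids μ_j (the symbol names follow that documentation and do not appear elsewhere in this thread):

```latex
\sum_{i=0}^{n} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)
```

Each sample contributes its squared distance to the nearest centroid, so a lower inertia means tighter clusters.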

The elbow method

Choose a range of K
Run K-means for every K in your range
After each run, calculate the sum of squared errors (the inertia)
Also calculate the change in slope between each pair of consecutive sums
The 'elbow' (or 'turning point') is the K at which the slope changes the most
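A minimal sketch of the slope-change step, assuming the sums of squared errors have already been collected for each K. The helper name `find_elbow` is hypothetical and the SSE values are made up for illustration; the function needs at least three runs to have a slope change to compare.

```python
def find_elbow(ks, sse):
    """Return the k at which the slope of the SSE curve changes the most."""
    # Slope between each pair of consecutive runs.
    slopes = [(sse[i + 1] - sse[i]) / (ks[i + 1] - ks[i])
              for i in range(len(sse) - 1)]
    # Change in slope at each interior k; changes[i] belongs to ks[i + 1].
    changes = [slopes[i + 1] - slopes[i] for i in range(len(slopes) - 1)]
    best = max(range(len(changes)), key=lambda i: changes[i])
    return ks[best + 1]

ks = [1, 2, 3, 4, 5, 6]
sse = [1000.0, 600.0, 200.0, 170.0, 150.0, 140.0]  # made-up values
elbow_k = find_elbow(ks, sse)
print(elbow_k)
```

On real data the curve is often smoother than this, so it is worth checking the chosen K against the plot rather than relying on the number alone.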

Answer 2 · 2021-09-02T02:36:03.000Z

Clustering vs Classification

Clustering

Ideally, captures generating distributions
Practically, is an exploration of the structure of your dataset
Data points are fixed
Cluster centers are searched for

Classification

Labels are fixed
Transformations of the data points are searched for
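In scikit-learn terms, the contrast shows up in what is passed to `fit`. The sketch below uses made-up data, and `LogisticRegression` is chosen only as an example classifier; it is not mentioned in the original answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Made-up, well-separated 2-D data.
X = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
              [10.0, 10.0], [10.5, 9.8], [9.9, 10.4]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, used only by the classifier

# Clustering: only the data points are given; the cluster centers are searched for.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)

# Classification: the labels are fixed; a mapping from data points to labels is fitted.
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))
```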

Answer 3 · 2021-10-26T22:12:00.000Z

Elbow method pseudocode

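# Finding the ideal k with the elbow method
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn import cluster

# X is assumed to be an (n_samples, n_features) array defined elsewhere.
distortions = []
inertias = []
mapping1 = {}  # k -> distortion
mapping2 = {}  # k -> inertia
K = range(4, 10)

for k in K:
    # Build and fit the model; fit() returns the fitted estimator,
    # so a second call to fit() is not needed.
    kmeanModel = cluster.KMeans(n_clusters=k).fit(X)

    # Distortion: mean Euclidean distance from each sample to its nearest centroid
    distortion = np.min(cdist(X, kmeanModel.cluster_centers_,
                              'euclidean'), axis=1).sum() / X.shape[0]
    distortions.append(distortion)
    # Inertia: sum of squared distances from each sample to its nearest centroid
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = distortion
    mapping2[k] = kmeanModel.inertia_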

plt.plot(K, distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('The Elbow Method using Distortion')
plt.show()