Clustering: algorithm and related library uses (scikit-learn)
Clustering with scikit-learn
sklearn.cluster
Clustering algorithms in scikit-learn: documentation
KMeans
The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia. This algorithm requires the number of clusters to be specified.
Basic idea: K-means divides a set of samples into disjoint clusters, each described by the mean of the samples in that cluster. These means are called 'centroids'.
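As a rough illustration of the description above, here is a minimal sketch of fitting `KMeans` on a tiny dataset. The data points are made up for this example, and `n_init=10` and `random_state=0` are assumptions chosen only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (made up for illustration): two well-separated groups in 2-D.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])

# The number of clusters has to be given up front.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_               # cluster index assigned to each sample
centroids = kmeans.cluster_centers_   # one mean vector per cluster
inertia = kmeans.inertia_             # within-cluster sum of squares
```

The fitted attributes `labels_`, `cluster_centers_` and `inertia_` correspond to the cluster assignments, the centroids and the inertia criterion mentioned in these notes.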
Steps
- Choose the initial centroids by selecting arbitrary samples from the dataset.
- Assign each sample to its nearest centroid, then create new centroids by taking the mean of all samples assigned to each previous centroid.
- Compute the difference between the old and the new centroids, and repeat the previous step until that difference is less than a threshold.
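The steps above can be sketched in plain NumPy. This is not scikit-learn's implementation, only an illustrative version; the function name `kmeans_sketch`, its parameters (`tol`, `max_iter`, `seed`) and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, tol=1e-4, max_iter=100, seed=0):
    """Illustrative K-means loop; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2a: assign each sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2b: recompute each centroid as the mean of its samples
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 3: stop once the centroids have moved less than tol.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Toy data (made up for illustration): two well-separated groups in 2-D.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
centroids, labels = kmeans_sketch(X, k=2)
```

On data this cleanly separated the loop settles after a few iterations; on real data the result can depend on the initial centroids, which is one reason library implementations usually run several initializations.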
The K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion:
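$$\sum_{i=1}^{n} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)$$
where the $x_i$ are the $n$ samples and $C$ is the set of centroids $\mu_j$.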
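To make the within-cluster sum-of-squares criterion concrete, the sketch below computes it by hand for a fixed set of centroids. The helper name `inertia` and the toy data are assumptions made for this example.

```python
import numpy as np

def inertia(X, centroids):
    """Sum over samples of the squared distance to the nearest centroid."""
    # Squared Euclidean distance from every sample to every centroid.
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Keep the nearest centroid for each sample, then sum over samples.
    return sq_dists.min(axis=1).sum()

# Toy data: each sample sits exactly 1 unit from its nearest centroid.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
print(inertia(X, centroids))  # → 4.0
```

For a fitted scikit-learn model, the same quantity is exposed as the `inertia_` attribute.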
The elbow method
- Choose a range of K values
- Run K-means for every K in the range
- After each run, calculate the sum of squared errors (the inertia)
- Also calculate the change in slope between each pair of consecutive sums
- The 'elbow' (or 'turning point') is the run with the largest change in slope
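The last two steps can be sketched as a small helper that takes the sum of squared errors from each run and returns the K with the largest change in slope. The function name `find_elbow` and the error values are made up for illustration, and this simple rule is only a heuristic that is usually checked against a plot.

```python
import numpy as np

def find_elbow(ks, sse):
    """Return the k at which the slope of the SSE curve changes the most."""
    ks = np.asarray(ks)
    sse = np.asarray(sse, dtype=float)
    # Slope of the curve between each pair of consecutive runs.
    slopes = np.diff(sse) / np.diff(ks)
    # Change in slope at each interior k; slope_change[i] belongs to ks[i + 1],
    # so the first and last k can never be selected.
    slope_change = np.diff(slopes)
    return int(ks[np.argmax(slope_change) + 1])

# Made-up SSE values: a steep drop up to k=3, then a nearly flat curve.
ks = [1, 2, 3, 4, 5, 6]
sse = [1000, 500, 100, 80, 65, 55]
print(find_elbow(ks, sse))  # → 3
```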
Clustering vs Classification
Clustering
- Ideally, captures generating distributions
- Practically, is an exploration of the structure of your dataset
- Data points are fixed
- Cluster centers are searched for
Classification
- Labels are fixed
- Transformations of the data points are searched for
Elbow method pseudocode
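# Finding the ideal k
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial.distance import cdist
from sklearn import cluster

# X is assumed to be an (n_samples, n_features) array defined elsewhere.
distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(4, 10)
for k in K:
    # Build and fit the model for this k
    kmeanModel = cluster.KMeans(n_clusters=k).fit(X)
    # Distortion: mean Euclidean distance from each sample to its nearest centroid
    distortion = np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1).mean()
    distortions.append(distortion)
    # Inertia: sum of squared distances from each sample to its nearest centroid
    inertias.append(kmeanModel.inertia_)
    mapping1[k] = distortion
    mapping2[k] = kmeanModel.inertia_

# Plot distortion against k and look for the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('The Elbow Method using Distortion')
plt.show()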