KMeans Clustering Using Python
Click here to download the code.
Brief Intro to KMeans Clustering
-
K-Means clustering is an iterative algorithm widely used in data analysis for finding similarity groups, called clusters, in a dataset.
-
It is an unsupervised learning technique.
-
It attempts to group individuals in a population by similarity, without being driven by a specific purpose.
-
Since the data has no prescribed labels and no class values denoting an a priori grouping of the instances, in this repo let's look at the famous centroid-based clustering algorithm, K-means, in the simplest way.
Steps to implement KMeans Clustering
-
Here we implement k-means clustering using scikit-learn.
-
To run the k-means algorithm, you first randomly initialize K points called the cluster centroids; here we use three, because we want to group the data into three clusters.
-
K-means moves the centroids to the average of the points in a cluster. In other words, the algorithm calculates the average of all the points in a cluster and moves the centroid to that average location.
-
This process is repeated until the clusters no longer change (or until some other stopping condition is met). The initial centroids are either chosen randomly or supplied by the user as specific starting points, to help find good results.
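The assign-and-update loop described above can be sketched in plain NumPy. This is a minimal illustration of the standard algorithm (Lloyd's algorithm), not the scikit-learn implementation used in this repo; the `kmeans` function name and its parameters are chosen just for this sketch.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal NumPy sketch of the k-means (Lloyd's) loop."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (keeping the old centroid if a cluster became empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable; stop early
        centroids = new_centroids
    return centroids, labels
```

The early exit implements the "no change in the clusters" stopping condition; `n_iters` caps the work if convergence is slow.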
-
K-means is used for exploratory data mining, so you must examine the clustering results anyway to determine which clusters make sense. The value of k can be decreased if some clusters are too small, and increased if the clusters are too broad.
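One common way to guide that choice of k is the elbow method: fit the model for several values of k and watch the inertia (within-cluster sum of squares), looking for the point where it stops dropping sharply. A rough sketch, assuming the same blobs dataset as the example in this repo:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Inertia for a range of k values; the "elbow" where inertia stops
# dropping sharply suggests a reasonable number of clusters.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)}
for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Since the data was generated with three centers, the drop from k=2 to k=3 is large and the curve flattens afterwards.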
Example
- Importing all dependencies.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
- Creating blobs dataset.
X, y_true = make_blobs(n_samples=300, centers=3,
                       cluster_std=0.50, random_state=0)
- Scatter plot between two dimensions.
plt.scatter(X[:, 0], X[:, 1], s=20);
plt.show()
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
- Scatter plot after k-means clustering, with centroids shown as black circles.
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
plt.show()
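Because k-means learns explicit centroids, the fitted model can also label new, unseen points: `predict` assigns each one to its nearest centroid. A short sketch (the `new_points` values are arbitrary examples):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# New, unseen points are assigned to the nearest learned centroid.
new_points = np.array([[0.0, 4.0], [2.0, 1.0]])
labels = kmeans.predict(new_points)
# Each label indexes into kmeans.cluster_centers_.
nearest_centroids = kmeans.cluster_centers_[labels]
```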
When does K-Means fail?
-
Yes, the K-Means algorithm fails for non-linearly separable datasets.
-
The learning algorithm requires a priori specification of the number of cluster centers.
-
It uses exclusive assignment: each point belongs to exactly one cluster, so if two clusters overlap heavily, k-means cannot resolve that there are two of them.
-
It is applicable only when a mean is defined, i.e. it fails for categorical data.
-
Unable to handle noisy data and outliers.
-
Let us consider the dataset below:
-
Looking at the figure, we can see that there are two clusters, but the algorithm does not handle them well.
-
The plot below shows the result of running k-means on this dataset:
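A concrete way to reproduce this kind of failure is scikit-learn's `make_moons` dataset: two interleaving half-circles that clearly form two clusters, yet cannot be separated by the straight boundaries k-means draws around its centroids. The agreement score here is just an illustrative check, not a standard metric.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: visibly two clusters, but not
# separable by nearest-centroid (straight) boundaries.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Fraction of points whose k-means label matches the true moon,
# taking the better of the two possible label orderings.
agreement = max(np.mean(labels == y_true), np.mean(labels != y_true))
print(f"agreement: {agreement:.2f}")
```

Density- or graph-based methods such as DBSCAN or spectral clustering handle shapes like this much better than k-means.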
License:
- This project is licensed under the MIT License - see the LICENSE.md file for details