K-Means Algorithm

This code demonstrates how to perform K-means clustering using the scikit-learn (sklearn) library in Python. K-means is an unsupervised learning algorithm that partitions the points in a dataset into K clusters, each represented by its centroid. The algorithm works as follows:

  1. Pick K points from the data set as the initial centroids, either at random or by taking the first K points.
  2. Compute the Euclidean distance from each point in the data set to each of the K cluster centroids.
  3. Assign each data point to its closest centroid using the distances found in the previous step.
  4. Compute each new centroid as the average of the points assigned to that cluster.
  5. Repeat steps 2 to 4 for a fixed number of iterations or until the centroids stop changing.
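
The five steps above can be sketched directly in NumPy. This is a minimal illustration and not the implementation sklearn uses; the function name kmeans, its parameters, and the synthetic two-blob data are assumptions made for the example.

```python
import numpy as np


def kmeans(points, k, max_iters=100, seed=0):
    """Plain K-means on an (n_samples, n_features) array (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points at random as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: Euclidean distance from every point to every centroid, shape (n, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to its closest centroid.
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid (a choice made for this sketch).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Synthetic data (an assumption for the demo): two well-separated 2-D blobs.
rng = np.random.default_rng(42)
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans(data, k=2)
print(centroids)
```

Because the result depends on the initial centroids, practical implementations usually run several initializations and keep the best one.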

The code imports the required libraries and reads a dataset of age and income values from income.csv. It then performs K-means clustering on the data and plots the resulting clusters. It also draws an elbow plot, which shows the sum of squared errors for a range of K values, to help choose the number of clusters.

Libraries Used

  • sklearn.cluster for performing K-means clustering
  • pandas for reading in and manipulating data
  • sklearn.preprocessing for scaling the data using MinMaxScaler
  • matplotlib.pyplot for creating plots
  • warnings for suppressing FutureWarning messages

Steps

  1. Import the required libraries
  2. Read in the dataset
  3. Plot the original data
  4. Perform K-means clustering on the data
  5. Plot the resulting clusters
  6. Scale the data using MinMaxScaler
  7. Perform K-means clustering on the scaled data
  8. Plot the resulting clusters
  9. Create an elbow plot to determine the optimal number of clusters
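
The workflow above might look roughly like the following sketch. It uses synthetic data in place of income.csv, and the column names "Age" and "Income", the choice of three clusters, and the helper plot_results are assumptions for illustration and are not taken from the original code.

```python
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

warnings.filterwarnings("ignore", category=FutureWarning)

# Synthetic stand-in for income.csv (hypothetical column names and values).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": np.concatenate([rng.normal(27, 1.5, 30),
                           rng.normal(35, 1.5, 30),
                           rng.normal(42, 1.5, 30)]),
    "Income": np.concatenate([rng.normal(50_000, 4_000, 30),
                              rng.normal(150_000, 4_000, 30),
                              rng.normal(70_000, 4_000, 30)]),
})
features = ["Age", "Income"]

# Cluster the raw data. Income spans a far larger numeric range than Age,
# so it dominates the Euclidean distance here.
raw_model = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster_raw"] = raw_model.fit_predict(df[features])

# Scale both features to [0, 1] and cluster again.
scaled = MinMaxScaler().fit_transform(df[features])
scaled_model = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster_scaled"] = scaled_model.fit_predict(scaled)

# Elbow data: sum of squared errors (inertia_) for K = 1..9.
k_values = list(range(1, 10))
sse = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled).inertia_
    for k in k_values
]


def plot_results():
    """Draw the scaled clusters and the elbow plot side by side."""
    fig, (ax_clusters, ax_elbow) = plt.subplots(1, 2, figsize=(10, 4))
    ax_clusters.scatter(df["Age"], df["Income"], c=df["cluster_scaled"])
    ax_clusters.set_xlabel("Age")
    ax_clusters.set_ylabel("Income")
    ax_elbow.plot(k_values, sse, marker="o")
    ax_elbow.set_xlabel("K")
    ax_elbow.set_ylabel("Sum of squared errors")
    plt.show()


if __name__ == "__main__":
    plot_results()
```

The "elbow" is the value of K after which the sum of squared errors stops dropping sharply; for data like this it would be expected near K = 3.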

Usage

To run this code, you need to have the required libraries installed. You can install scikit-learn, pandas, and matplotlib using pip or any other package manager; warnings is part of the Python standard library and needs no installation. Once the libraries are installed, you can run the code in a Python environment of your choice.

The income.csv file contains the data used in this code, and you can replace it with your own data to perform K-means clustering on your dataset.
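
When swapping in your own data, a small loading helper can check that the columns you want to cluster on exist and are numeric. The sketch below is one possible approach; the function load_numeric_features, the column names, and the inline sample CSV are all hypothetical and do not describe the actual layout of income.csv.

```python
import io

import pandas as pd


def load_numeric_features(source, columns):
    """Read a CSV and return the chosen columns as floats, dropping incomplete rows."""
    df = pd.read_csv(source)
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise ValueError(f"Columns not found in CSV: {missing}")
    # Non-numeric entries become NaN and are dropped along with empty cells.
    return df[columns].apply(pd.to_numeric, errors="coerce").dropna().astype(float)


# In-memory sample standing in for a file path such as "income.csv".
sample_csv = io.StringIO(
    "Name,Age,Income\n"
    "A,27,70000\n"
    "B,29,90000\n"
    "C,41,\n"
    "D,38,150000\n"
)
features = load_numeric_features(sample_csv, ["Age", "Income"])
print(features)
```

The returned DataFrame can be passed straight to MinMaxScaler and KMeans; pass a file path instead of the StringIO object to read from disk.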

Example Output