Overview
This project explores the application of K-Means clustering to the MNIST handwritten digits dataset. We implement K-Means from scratch and evaluate the impact of different feature extraction techniques (raw pixels, PCA, HOG) on clustering performance.
Prerequisites
- Python 3.x
- Libraries:
  - scikit-learn
  - OpenCV (cv2)
  - matplotlib
  - NumPy
  - seaborn (optional, for better visualizations)
Note: The notebook installs the required libraries in its first cell.
MNIST Dataset: A copy is already included in the dataset folder of the repository; the notebook downloads it automatically when needed.
Notebook Structure
- K-Means Clustering Theory:
  - Brief explanation of K-Means clustering concepts.
  - Intuition behind the algorithm.
- Implementation from Scratch:
  - Code for a custom K-Means implementation.
- Dataset Introduction (MNIST):
  - Description of the MNIST dataset.
  - Sample visualizations of the images.
- Feature Extraction:
  - Rationale for using feature extraction before clustering.
  - Implementation and comparison of:
    - Raw pixel data
    - Principal Component Analysis (PCA)
    - Histogram of Oriented Gradients (HOG)
- Clustering and Evaluation:
  - K-Means clustering applied with each feature representation.
  - Visualization of clusters.
  - Evaluation using metrics like silhouette score and confusion matrices.
- Results and Conclusions:
  - Summary of key findings.
  - Insights into the effectiveness of different feature extraction techniques.
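As a rough illustration of the from-scratch implementation listed above, the following is a minimal NumPy sketch of the standard K-Means loop (Lloyd's algorithm). It is not the notebook's code: the names `kmeans` and `squared_distances`, the parameters, and the synthetic 2-D data are assumptions made for this example, and centroids are initialized from randomly chosen samples, which may differ from what the notebook does.

```python
import numpy as np


def squared_distances(X, centroids):
    """Squared Euclidean distance from every row of X to every centroid."""
    # Uses ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2 to avoid a large 3-D array.
    x_sq = (X ** 2).sum(axis=1)[:, None]
    c_sq = (centroids ** 2).sum(axis=1)[None, :]
    return x_sq - 2.0 * X @ centroids.T + c_sq


def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Hypothetical helper: returns (centroids, labels) for data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Initialization (assumption): pick k distinct samples as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = squared_distances(X, centroids).argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:  # stop once the centroids have effectively stopped moving
            break
    # Final assignment against the returned centroids.
    labels = squared_distances(X, centroids).argmin(axis=1)
    return centroids, labels


# Synthetic stand-in data (not MNIST): three Gaussian blobs in 2-D.
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

centroids, labels = kmeans(X, k=3)
print(centroids.shape, labels.shape)
```

For MNIST, `X` would instead be the flattened pixel array of shape (n_samples, 784), or the PCA or HOG features, with `k=10`. Because K-Means only finds a local optimum, results depend on the initial centroids, so running with several seeds and keeping the best run is a common precaution.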
How to Run
Execute the notebook cells in order. The notebook installs the required libraries and downloads the dataset automatically when needed.
References
- K-Means Clustering Intuition: A Simple Explanation of K-Means Clustering: https://www.analyticsvidhya.com/blog/2020/10/a-simple-explanation-of-k-means-clustering/
- HOG: Medium Article - HOG Feature Descriptor: https://medium.com/@dnemutlu/hog-feature-descriptor-263313c3b40d