- Overview
- Prerequisites
- Getting Started
- Dataset Insight
- Data Exploration
- Exploratory Data Analysis
- Model Testing
Welcome to the Glass Classification Project! This project aims to classify different types of glass based on their distinct attributes. Using a dataset available on Kaggle, our goal is to thoroughly analyze the data, overcome challenges, build precise models, and ultimately identify the most effective model for accurately classifying glass samples.

Feel free to navigate through the sections and delve into the details of the Glass Classification Project!
Before running the code, make sure you have the right dependencies installed.
- Clone the repository:

```bash
git clone https://github.com/matlaczj/Glass-Classification-Project
cd Glass-Classification-Project
```
- Follow the steps outlined in the project documentation to replicate the analysis and results. The analysis notebook is located at `src\code.ipynb`.
The dataset comprises 214 glass samples, each characterized by 9 attributes and categorized into 7 classes. Note that one class (Class 4) has no instances, effectively resulting in 6 classes. These classes represent diverse types of glass objects, spanning from building and vehicle windows to containers, tableware, and headlamps.
- 1: Building windows float processed
- 2: Building windows non-float processed
- 3: Vehicle windows float processed
- 4: Vehicle windows non-float processed (none in this database)
- 5: Containers
- 6: Tableware
- 7: Headlamps
Upon exploration, it was discovered that the original dataset contains no missing values, ensuring the reliability and completeness of our analysis. The dataset includes key attributes such as Refractive Index (RI), Sodium (Na), Magnesium (Mg), Aluminum (Al), Silicon (Si), Potassium (K), Calcium (Ca), Barium (Ba), and Iron (Fe).
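The completeness check above can be sketched as follows. A tiny stand-in DataFrame is used here so the snippet runs without the file; in the project you would load the Kaggle export instead (the file name `glass.csv` and the label column `Type` are assumptions, not confirmed by the repository):

```python
import pandas as pd

# In the project: df = pd.read_csv("glass.csv")
# Stand-in frame with a few of the attributes and the class label.
df = pd.DataFrame({
    "RI": [1.52, 1.51, 1.52],
    "Na": [13.6, 13.9, 13.5],
    "Type": [1, 2, 7],
})

# Total number of missing cells across all columns (should be 0).
missing = df.isna().sum().sum()

# Per-class sample counts; in the real data, Type 4 never appears.
counts = df["Type"].value_counts().sort_index()
print(missing, dict(counts))
```

The same two lines (`isna().sum().sum()` and `value_counts()`) on the full dataset confirm the absence of missing values and the empty Class 4.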
- Box Plots: A closer look at box plots revealed that the attribute 'Si' varies on a markedly different scale compared to the others, providing valuable insight into the dataset's characteristics.
- Correlation Matrix: The correlation matrix highlighted relationships between different attributes, guiding the selection of features for model training.
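The two EDA views above can be sketched like this. Synthetic data stands in for the real samples so the snippet is self-contained; the column names follow the attribute list in the dataset description, and the plotting calls are one common matplotlib approach, not the repository's exact code:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = ["RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe"]
df = pd.DataFrame(rng.normal(size=(214, 9)), columns=cols)  # stand-in data

# Box plots: on the real data, 'Si' sits on a visibly different scale.
df.plot(kind="box", figsize=(10, 4))
plt.title("Attribute distributions")

# Correlation matrix: guides which features to keep for model training.
corr = df.corr()
plt.matshow(corr)
plt.xticks(range(len(cols)), cols)
plt.yticks(range(len(cols)), cols)
plt.colorbar()
```

On strongly scale-divergent attributes like 'Si', it can also help to standardize before plotting so the other boxes remain readable.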
The project employs a range of supervised learning models, including the k-Nearest Neighbors (kNN) classifier. To address class imbalance, StratifiedKFold is utilized for cross-validation, and the Synthetic Minority Over-sampling Technique (SMOTE) is applied to balance the training set.
The following table provides an overview of the performance metrics for each tested model:
| Model | Learn Error | Test Error | Test Precision | Test Recall | Test F1 Score |
|---|---|---|---|---|---|
| k-Nearest Neighbors | 0.000 | 0.276 | 0.731 | 0.751 | 0.721 |
| Decision Tree | 0.017 | 0.257 | 0.738 | 0.734 | 0.720 |
| Gaussian Naive Bayes | 0.323 | 0.625 | 0.531 | 0.578 | 0.485 |
| Nearest Centroid | 0.343 | 0.566 | 0.488 | 0.562 | 0.476 |
The k-Nearest Neighbors (kNN) classifier with k=1 emerged as the most effective model, combining the highest test F1 score with zero training error and strong generalization.