
Multi-label wine classification ML project trained using Kaggle wine quality dataset :bar_chart:



In this project I practiced training several multi-label wine quality classifiers using the one-vs-all method.
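As a rough illustration of the one-vs-all idea (not the notebook's actual code), each quality class gets its own binary target, and the final prediction is the class whose binary classifier scores highest. The function names and the scores below are hypothetical.

```python
def one_vs_all_targets(labels):
    """Map each class to a binary target vector (1 = this class, 0 = the rest)."""
    return {c: [1 if y == c else 0 for y in labels] for c in sorted(set(labels))}


def predict_one_vs_all(scores_by_class):
    """Pick the class whose binary classifier gives the highest score."""
    return max(scores_by_class, key=scores_by_class.get)


# Toy labels, not taken from the real dataset
targets = one_vs_all_targets([5, 6, 5, 7])

# Hypothetical scores from three binary classifiers for one wine sample
predicted = predict_one_vs_all({5: 0.2, 6: 0.7, 7: 0.1})
```

Here `targets[5]` marks the samples of quality 5 against all others, and one binary model would be trained per entry in `targets`.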

The workflow covers EDA (exploratory analysis and data visualization), data preprocessing (feature selection with a chi-square test, oversampling minority classes with synthetic data, and feature scaling), and training several classification models (logistic regression, linear support vector machine (SVM), kernel SVM, and K-NN).

Feel free to click into the .ipynb notebook for detailed analysis.


The dataset is extremely skewed: minority classes (i.e. wine quality scores) such as '3' and '8' account for less than 1% of the total population. We can see this by plotting a histogram of the 'quality' column.

[Figure: quality_count]
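The same skew can be checked numerically by computing each class's share of the samples. This is a minimal sketch on made-up labels; `class_shares` and `toy_quality` are hypothetical names, and the counts are not the real dataset's.

```python
from collections import Counter


def class_shares(labels):
    """Return {label: fraction of all samples}, ordered by label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}


# Made-up labels that only mimic a skewed distribution
toy_quality = [5] * 50 + [6] * 40 + [7] * 8 + [3] * 1 + [8] * 1

shares = class_shares(toy_quality)
for label, share in shares.items():
    print(f"quality {label}: {share:.1%}")
```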

A heatmap gives a clearer view of the correlations between features:

[Figure: corr_heat]

Next, visualize the relationship between each feature and wine quality. Notice that features like "pH", "chlorides", and "residual sugar" have almost no impact on classifying wine quality.

[Figure: feature_bar]


  • Feature selection using chi-square test
  • Drop irrelevant features
  • Split dataset
  • Apply SMOTE to oversample the minority classes by generating synthetic training data using K-NN. Note that we do not oversample the testing data.
  • Feature scaling
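To show what the SMOTE step does conceptually, here is a library-free sketch of its core idea: pick a minority sample, pick one of its k nearest minority neighbours, and create a new point somewhere on the line between them. This is a simplified illustration, not the notebook's code; `smote_oversample` and the toy points are hypothetical, and libraries such as imbalanced-learn provide a full implementation. It should only ever be fed the training split.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to base
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority-class rows from a hypothetical training split
train_minority = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.3, 0.3)]
new_points = smote_oversample(train_minority, n_new=6, k=2)
```

Because every synthetic point is an interpolation between two real minority samples, it always stays inside the region those samples already cover.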


Because of the skewed nature of the dataset, F1-score is used as the performance metric. After applying the synthetic minority oversampling technique (SMOTE), the K-NN model's weighted-average F1-score rose notably from 0.52 to 0.67, and its accuracy went from 51% to 65%. The other models (logistic regression, linear SVM, and kernel SVM) did not improve as expected.
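For reference, the weighted-average F1-score is the per-class F1 averaged with each class weighted by its number of true samples. The sketch below computes it by hand on toy predictions; `weighted_f1` and the label lists are hypothetical and unrelated to the scores reported above.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is > 0 since n > 0
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += f1 * n / total
    return score


# Toy labels and predictions, not real model output
y_true = [5, 5, 5, 6, 6, 7]
y_pred = [5, 5, 6, 6, 6, 5]
toy_score = weighted_f1(y_true, y_pred)
```

On this toy data the rare class '7' is never predicted correctly, so its F1 of 0 pulls the weighted score below the plain accuracy, which is why F1 is the more informative metric on skewed data.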

Logistic Regression


Linear SVM & Kernel SVM


K-NN (Rapid Prototype)


K-NN (Final)
