Predicting Wine Quality From Physical and Chemical Properties

Using chemical and physical properties of various wines to predict wine tasting scores

Project Takeaway

The goal of this project is to better understand which chemical and physical characteristics make a good wine. Using measurements from 1600 red wines, each paired with its median professional tasting score, we'll use machine learning to predict wine scores and identify which characteristics lead to better ones.

  • Four models were tested: XGBoost, Random Forest, Logistic Regression, and Naive Bayes
  • XGBoost outperformed other models due to its superior handling of complex variable interactions and its flexibility in fine-tuning
  • XGBoost achieved a test accuracy of 0.85, F1 Score of 0.86, and an ROC AUC of 0.91
  • Higher alcohol content was the best indicator of increased wine quality
  • Sulphate content and acidity were the next best indicators of increased wine quality; both may contribute to a wine's aroma and flavor

Description of the Data

The data was downloaded from the UCI Machine Learning Repository. It consists of a single table with 1600 rows, each containing data on a specific Portuguese red wine variant. Along with the wine's quality score (the median score from three professional wine tasters), the table has eleven other columns of measured physicochemical properties.
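For orientation, here's a minimal sketch of loading the table with pandas, assuming a local copy of the UCI file (which is semicolon-delimited):

```python
import pandas as pd

# The UCI red wine file is semicolon-delimited; the path assumes
# a local download of winequality-red.csv.
wine = pd.read_csv("winequality-red.csv", sep=";")

print(wine.shape)                                   # rows x 12 columns
print(wine["quality"].value_counts().sort_index())  # distribution of scores
```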

Methods Overview

  • Exploratory Data Analysis
    • Checking Data Quality
    • ANOVA for Variable Selection
  • Model Building
    • Logistic Regression
    • Naive Bayes
    • Random Forest
    • XGBoost
  • Model Parameter Tuning
    • Grid Search for Random Forest
    • Iterative Randomized Search for XGBoost
  • Model Evaluation
    • Accuracy
    • F1 Score
    • ROC Curve / Confusion Matrix
    • Identifying the Most Important Variables for Wine Quality Prediction

Project Summary


Exploratory Data Analysis

First, let's take a look at the dataset.
We have:

  • 11 predictor variables
  • 1 outcome variable (quality)


To make this classification problem a little easier, we will bin wines into 'good' and 'bad' as follows:
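A minimal sketch of the binning. The cutoff of 6 is an assumption; it is the value that splits this dataset into roughly balanced classes, consistent with the test-set counts reported below:

```python
# Bin the 0-10 quality score into a binary label. The cutoff of 6 is an
# assumption here; it yields roughly balanced 'good'/'bad' classes.
wine["good"] = (wine["quality"] >= 6).astype(int)

print(wine["good"].value_counts(normalize=True))  # class balance check
```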

Model Building

  • To train our models, the data was split into 80% training / 20% testing
  • Using each model's default implementation from scikit-learn (and the xgboost package for XGBoost), here were the results (see the baseline sketch below):
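A minimal baseline sketch, assuming scikit-learn and the xgboost package, and reusing the wine table from the loading sketch above:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X = wine.drop(columns=["quality", "good"])
y = wine["good"]

# 80/20 split, stratified so both sets keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}

# Fit each model with default settings and report test accuracy
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```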


From this point on in the analysis, we will focus on the two top-performing models: Random Forest and XGBoost.

How Random Forests Work

Given the interdependencies among the predictor variables, tree-based methods may yield the best results. Random Forests, and more complex models like XGBoost, can leverage indirect correlations with the outcome variable, enhancing the model's performance.

Once a Random Forest is created, there are a few model parameters we can tweak to improve its performance, for example:
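These are the Random Forest parameters most often worth tuning; the values below are illustrative placeholders, not the notebook's final settings:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,     # number of trees in the forest
    max_depth=10,         # cap tree depth to limit overfitting
    min_samples_leaf=2,   # minimum samples required at each leaf
    max_features="sqrt",  # number of features considered at each split
    random_state=42,
)
rf.fit(X_train, y_train)  # reuses the split from the baseline sketch
```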

Model Parameter Tuning

RF and XGBoost performed the best, so let's take both models, tune them, and see if we can improve them further.
A grid search was performed on the RF model. This approach is more time- and resource-intensive, but it is exhaustive, so we know we are getting the best result within the search grid.
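A sketch of what that grid search could look like with scikit-learn's GridSearchCV; the grid itself is illustrative, not the notebook's exact search space:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Illustrative grid: every combination below is tried exhaustively
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 2, 4],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,          # 5-fold cross-validation on the training set
    scoring="f1",
    n_jobs=-1,     # use all available cores
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```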

XGBoost Results

A randomized search was performed on the XGBoost model. This approach is not exhaustive, but it is much faster. To help ensure good results, the randomized search was run 10 times and the best result was kept.
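A sketch of that repeated search, assuming scikit-learn's RandomizedSearchCV and an illustrative parameter space:

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Illustrative distributions; the notebook's actual ranges may differ
param_dist = {
    "n_estimators": randint(100, 600),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}

# Run the randomized search 10 times and keep the best model found
best_score, best_model = -1.0, None
for seed in range(10):
    search = RandomizedSearchCV(
        XGBClassifier(eval_metric="logloss", random_state=42),
        param_dist,
        n_iter=25,
        cv=5,
        scoring="f1",
        random_state=seed,  # a different draw of candidates each run
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_
```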

Model Evaluation

Now, with the best model selected and its parameters tuned, we can evaluate how it performed.

  • True Positives (TP): 144
  • False Positives (FP): 25
  • False Negatives (FN): 27

The model has an AUC of 0.91, which suggests a strong ability to distinguish between the two classes.
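These figures can be reproduced along the following lines, assuming the tuned model from the search above:

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, roc_auc_score
)

y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]  # probability of 'good'

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("ROC AUC: ", roc_auc_score(y_test, y_prob))
print(confusion_matrix(y_test, y_pred))  # [[TN, FP], [FN, TP]]
```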

Overall, the model appears to be performing well, with good predictive power and a balance of error types that lean towards correct predictions.

Determining Important Variables


Both alcohol and sulphates are positively correlated with wine score.

The scatter plots and feature importance graph together suggest that while many factors contribute to wine quality, alcohol content is potentially the most predictive, followed by sulphates.

  • Features related to acidity, such as volatile acidity and fixed acidity, also play a significant role in the model's predictions.
  • The least important features suggest that, within this dataset and model, aspects like pH and chlorides are not key drivers in predicting wine quality.
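A minimal sketch of reading that importance ranking off the fitted model:

```python
import pandas as pd

# Rank features by the fitted XGBoost model's importance scores
importances = pd.Series(
    best_model.feature_importances_, index=X_train.columns
).sort_values(ascending=False)
print(importances)  # alcohol and sulphates should rank near the top
```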