Modeling wine preferences by data mining from physicochemical properties

Linear Regression and Random Forest

In this work we apply data mining techniques to a dataset of wines and predict wine quality from raw physicochemical measurements.

Wine can be classified by human experts or by physicochemical laboratory tests, such as determining pH, alcohol content, or density.

But there is another way to assess a wine: predicting its expert quality score directly from those measurements. The dataset contains the following variables.

Input variables (based on physicochemical tests):

1 - fixed acidity

2 - volatile acidity

3 - citric acid

4 - residual sugar

5 - chlorides

6 - free sulfur dioxide

7 - total sulfur dioxide

8 - density

9 - pH

10 - sulphates

11 - alcohol

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor

import numpy as np
import pandas as pd
import statsmodels.stats.api as sm

%pylab inline
Populating the interactive namespace from numpy and matplotlib
# Read data
wine = pd.read_csv('wine_data.csv', sep='\t')
wine.head()
Type Fixed acidity (g/l) Volatile acidity (g/l) Citric acid (g/l) Residual sugar (g/l) Chlorides (g/l) Free sulfur dioxide (mg/l) Total sulfur dioxide (mg/l) Density (g/cm3) pH Sulphates (g/l) Alcohol (%) quality
0 red 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5
1 red 7.8 0.88 0.00 2.6 0.098 25 67 0.9968 3.20 0.68 9.8 5
2 red 7.8 0.76 0.04 2.3 0.092 15 54 0.9970 3.26 0.65 9.8 5
3 red 11.2 0.28 0.56 1.9 0.075 17 60 0.9980 3.16 0.58 9.8 6
4 red 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5

We see 11 features describing the chemical composition of the wines. Here is the distribution of the expert quality scores in the sample:

plt.figure(figsize=(8,6))
stat = wine.groupby('quality')['quality'].agg(lambda x : float(len(x))/wine.shape[0])
stat.plot(kind='bar', fontsize=14, width=0.9, color="green")
plt.xticks(rotation=0)
plt.ylabel('Proportion', fontsize=14)
plt.xlabel('Quality', fontsize=14)

[Figure: bar chart of the share of wines at each quality score]

Type and Quality of Wine

y = wine[['Type', 'quality']]
y.head()
Type quality
0 red 5
1 red 5
2 red 5
3 red 6
4 red 5
wine.groupby('Type')['Type'].count()
Type
red      1599
white    4898
Name: Type, dtype: int64

We hold out 25% of the data as a test set for evaluating the quality predictions.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(wine.loc[:, wine.columns != 'quality'], 
                                                    wine['quality'], 
                                                    test_size=0.25)
# Encode the categorical Type column numerically: red -> -1, white -> 1.
X_train['Type'] = X_train['Type'].apply(lambda x: -1 if x == 'red' else 1)
X_test['Type'] = X_test['Type'].apply(lambda x: -1 if x == 'red' else 1)

If we have no further information about a wine, our best guess for its score is the mean quality over the training set:

np.mean(y_train)
5.8169129720853858

If we predict this value for every wine in the training set, we get the root-mean-squared error (RMSE) below.
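For predictions $\hat{y}_i$ of true scores $y_i$, the RMSE is

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}.$$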

sqrt(mean_squared_error([np.mean(y_train)]*len(y_train), y_train))
0.87670621485856171

and on the test set:

sqrt(mean_squared_error([np.mean(y_train)]*len(y_test), y_test))
0.8625590794770277

On the test set the error is larger, since the mean was estimated from the training data; this is expected. To refine the prediction, we can predict each wine's score by the mean score of wines of the same type in the training sample.
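As a sanity check, these per-type means can be computed directly; a minimal sketch using the already-encoded Type column:

# Mean training-set quality for each wine type (-1 = red, 1 = white).
print(y_train.groupby(X_train['Type']).mean())

A linear regression on the Type indicator alone reproduces these per-type means exactly as its fitted values: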

regressor = LinearRegression()
regressor.fit(X_train[['Type']], y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
y_train_predictions = regressor.predict(X_train[['Type']])
y_test_predictions = regressor.predict(X_test[['Type']])

Prediction errors have decreased slightly:

sqrt(mean_squared_error(y_train_predictions, y_train))
0.87064224152662018
sqrt(mean_squared_error(y_test_predictions, y_test))
0.85579069866207058

Here are the true quality scores and the per-type mean predictions on the test set:

pyplot.figure(figsize=(10,10))
pyplot.scatter(y_test, y_test_predictions, color="blue", alpha=0.1)
pyplot.xlim(2,10)
pyplot.ylim(2,10)
plot(range(11), color='black')  # y = x reference line
grid()
plt.xlabel('Quality')
plt.ylabel('Estimated quality')

[Figure: true vs. predicted quality (per-type means) on the test set]

def jitter(v):
    # Add uniform noise in [-0.5, 0.5) so the integer quality scores do not overplot.
    return v + np.random.uniform(low=-0.5, 
                                 high=0.5, 
                                 size=len(v))

pyplot.figure(figsize=(15, 35))
for i in range(1, 12):
    pyplot.subplot(6, 2, i)
    pyplot.scatter(jitter(wine['quality']), wine.iloc[:, i], 
                   color=wine["Type"], 
                   edgecolors="green")
    pyplot.xlabel('Quality')
    pyplot.ylabel(str(wine.columns[i]))

[Figure: each feature vs. jittered quality score, colored by wine type]

Linear Regression

To begin with, let us fit a linear regression model on all the features.

lm = LinearRegression()
lm.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

The prediction errors are substantially reduced:

sqrt(mean_squared_error(lm.predict(X_train), y_train))
0.73840949206442807
sqrt(mean_squared_error(lm.predict(X_test), y_test))
0.71457788648123488
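To see how the linear model weighs the individual features, we can inspect the fitted coefficients; a quick sketch (the features are not standardized, so the raw magnitudes are not directly comparable):

# One coefficient per column of the training matrix.
print(pd.Series(lm.coef_, index=X_train.columns))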

The true quality scores of the wines and the linear model's predictions:

plt.figure(figsize=(16,8))
plt.subplot(121)
pyplot.scatter(y_train, lm.predict(X_train), color="red", alpha=0.1)
pyplot.xlim(2,10)
pyplot.ylim(2,10)
plot(range(11), color='black')
grid()
pyplot.title('Train set', fontsize=16)
pyplot.xlabel('Quality')
pyplot.ylabel('Estimated quality')

plt.subplot(122)
pyplot.scatter(y_test, lm.predict(X_test), color="red", alpha=0.1)
pyplot.xlim(2,10)
pyplot.ylim(2,10)
plot(range(11), color='black')
grid()
pyplot.title('Test set', fontsize=16)
pyplot.xlabel('Quality')
pyplot.ylabel('Estimated quality')

[Figure: true vs. predicted quality for the linear model, train and test sets]

We compute the coefficient of determination, the proportion of the response variance explained by the model.
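With predictions $\hat{y}_i$, true scores $y_i$, and mean score $\bar{y}$,

$$R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}.$$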

lm.score(X_test, y_test)
0.31365659929719758

Random Forest

We train a random forest on the training set:

rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)
rf.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=3,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

The quality has improved, although the model overfits: the error on the training set is much lower than on the test set.

sqrt(mean_squared_error(rf.predict(X_train), y_train))
0.35262460471153395
sqrt(mean_squared_error(rf.predict(X_test), y_test))
0.60694168473930055
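The gap between the train and test errors is a sign of overfitting. A cross-validated error estimate on the training set gives a more honest picture and can guide the choice of min_samples_leaf; a minimal sketch:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated RMSE of the same forest configuration.
cv_mse = cross_val_score(RandomForestRegressor(n_estimators=100, min_samples_leaf=3),
                         X_train, y_train,
                         scoring='neg_mean_squared_error', cv=5)
print(np.sqrt(-cv_mse).mean())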

The true quality scores of the wines and the random forest's predictions:

plt.figure(figsize=(16,7))
plt.subplot(121)
pyplot.scatter(y_train, rf.predict(X_train), color="red", alpha=0.1)
pyplot.xlim(2,10)
pyplot.ylim(2,10)
plot(range(11), color='black')
grid()
pyplot.title('Train set', fontsize=16)
pyplot.xlabel('Quality')
pyplot.ylabel('Estimated quality')

plt.subplot(122)
pyplot.scatter(y_test, rf.predict(X_test), color="red", alpha=0.1)
pyplot.xlim(2,10)
pyplot.ylim(2,10)
plot(range(11), color='black')
grid()
pyplot.title('Test set', fontsize=16)
pyplot.xlabel('Quality')
pyplot.ylabel('Estimated quality')

[Figure: true vs. predicted quality for the random forest, train and test sets]

The coefficient of determination for the random forest:

rf.score(X_test, y_test)
0.50485060884990918

We compare the errors of linear regression and the random forest on the test set:

plt.figure(figsize=(8,6))
plt.hist(abs(y_test - lm.predict(X_test)) - abs(y_test - rf.predict(X_test)), bins=16, density=True)
plt.xlabel('Difference of absolute errors')

[Figure: histogram of the differences of absolute errors, linear model minus random forest]

The difference between the mean absolute errors is statistically significant:

tmeans = sm.CompareMeans(sm.DescrStatsW(abs(y_test - lm.predict(X_test))), 
                         sm.DescrStatsW(abs(y_test - rf.predict(X_test))))

tmeans.ttest_ind(alternative='two-sided', usevar='pooled', value=0)[1]
6.0766386407976045e-13

The 95% confidence interval for the mean difference of absolute errors:

tmeans.tconfint_diff(alpha=0.05, alternative='two-sided', usevar='pooled')
(0.079701943166600675, 0.13903686377836694)
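Since both models are scored on the same test wines, the absolute errors come in pairs; a paired t-test is a natural cross-check of the same conclusion (a sketch using scipy):

from scipy import stats

# Paired t-test on the per-wine absolute errors of the two models.
t, p = stats.ttest_rel(abs(y_test - lm.predict(X_test)),
                       abs(y_test - rf.predict(X_test)))
print(p)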

That is, by using the random forest instead of linear regression, we predict the expert evaluation more accurately. Let us see which features have the greatest predictive power:

importances = pd.DataFrame(list(zip(X_train.columns, rf.feature_importances_)))
importances.columns = ['feature name', 'importance']
importances.sort_values(by='importance', ascending=False)

The alcohol content has the greatest influence on the expert evaluation of wine quality.
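Impurity-based importances can favor correlated features, so a common cross-check is permutation importance on the held-out set: shuffle one column at a time and measure the drop in score. A sketch, assuming scikit-learn 0.22+ for sklearn.inspection:

from sklearn.inspection import permutation_importance

# Mean drop in test-set R^2 when each column is randomly permuted.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print(pd.Series(perm.importances_mean, index=X_test.columns)
        .sort_values(ascending=False))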

Conclusion

In this work we solved a regression problem: predicting the expert quality score of a wine from its physicochemical measurements. We compared two algorithms, linear regression and random forest. The random forest predicts the expert evaluation noticeably more accurately (test RMSE 0.61 versus 0.71, R² 0.50 versus 0.31), and the difference in errors is statistically significant. Among the features, alcohol content has the greatest influence on the expert evaluation of wine quality. Overall, the resulting model yields reasonably good predictions.

Reference

Citation request: this dataset is publicly available for research; the details are described in [Cortez et al., 2009]. Please include the following citation if you use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

Available at: http://dx.doi.org/10.1016/j.dss.2009.05.016 (Elsevier); BibTeX: http://www3.dsi.uminho.pt/pcortez/dss09.bib