The aim of this project is to identify a suitable model for predicting house prices from a set of significant predictor variables, using a supervised learning technique.
The data-set is available on Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
- The train data-set has 1460 samples, 80 features and 1 target variable.
- The test data-set has 1459 samples and 80 features.
- The target variable is the sale price.
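For reference, a minimal sketch of loading and checking the data with pandas; the local file names 'train.csv' and 'test.csv' are assumed, matching the usual Kaggle download:

```python
import pandas as pd

# File names are an assumption; the Kaggle CSVs are typically named this way.
train = pd.read_csv('train.csv')   # 1460 samples, features + 'SalePrice'
test = pd.read_csv('test.csv')     # 1459 samples, features only

print(train.shape, test.shape)
print(train['SalePrice'].describe())
```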
The heat-map shows the correlation of each variable with every other variable. A darker colour means the two variables are strongly correlated, while a lighter colour means they have almost no relationship.
The heat-map above shows that the provided data suffers from multicollinearity, for example 'TotRmsAbvGrd' with 'GrLivArea', 'TotalBsmtSF' with '1stFlrSF', 'GarageCars' with 'GarageArea', and more: one independent variable has a strong relationship with another independent variable, which affects the model's performance.
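A sketch of how such a heat-map and the pairwise correlations can be produced, assuming the training data is loaded in a DataFrame named `train`:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = train.select_dtypes(include='number').corr()  # numeric features only

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='viridis', square=True)
plt.title('Correlation heat-map')
plt.show()

# The pairs called out above are easy to confirm numerically:
print(corr.loc['TotRmsAbvGrd', 'GrLivArea'])
print(corr.loc['TotalBsmtSF', '1stFlrSF'])
print(corr.loc['GarageCars', 'GarageArea'])
```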
Plotting 'SalePrice' on the y-axis against 'GrLivArea', 'TotalBsmtSF' and '1stFlrSF' on the x-axis, these features appear to have outliers. Since 'TotalBsmtSF' and '1stFlrSF' are strongly correlated, only 'TotalBsmtSF' has been taken for further analysis, along with 'GrLivArea'.
From the two images above, we can clearly see that several data points (red circles) lie an abnormal distance from the other values in the sample. This can cause heteroscedasticity, where the residuals form a cone-like pattern and the standard errors become biased. The cone-like pattern (green lines) is shown below:
There are many techniques for dealing with outliers, but here the outliers mentioned above have simply been removed to keep the process simple.
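A sketch of this simple removal step, assuming `train` holds the training data; the cut-off values below are illustrative assumptions, not the exact thresholds used here:

```python
import matplotlib.pyplot as plt

# Scatter plots of the target against the two retained features.
train.plot.scatter(x='GrLivArea', y='SalePrice')
train.plot.scatter(x='TotalBsmtSF', y='SalePrice')
plt.show()

# Drop the points that sit far away from the rest of the sample.
train = train[~((train['GrLivArea'] > 4000) & (train['SalePrice'] < 300000))]
train = train[train['TotalBsmtSF'] < 6000]
```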
Weak features whose correlation coefficient with 'SalePrice' is less than 0.2, i.e. those with almost no relationship to the target, have been removed.
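A sketch of this filtering step, assuming `train` holds the training data and that the 0.2 threshold is applied to the absolute correlation with 'SalePrice':

```python
# Correlation of every numeric feature with the target.
corr_with_target = train.select_dtypes(include='number').corr()['SalePrice']

# Keep only features whose absolute correlation with 'SalePrice' is >= 0.2.
weak_features = corr_with_target[corr_with_target.abs() < 0.2].index
train = train.drop(columns=list(weak_features))
```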
Removing every feature with missing data is not good practice, because some of those features might be important. However, filling in the missing data is tedious work that depends on domain knowledge and experience. To keep the process simple, all features with missing data have been removed except 'Electrical', which has only one missing value; that value is filled in with the most common value of 'Electrical'.
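A sketch of this missing-data handling, assuming `train` holds the training data:

```python
# Count missing values per column.
missing = train.isnull().sum()
to_drop = missing[missing > 0].index.drop('Electrical')

# Drop every feature with missing data except 'Electrical' ...
train = train.drop(columns=list(to_drop))

# ... and fill its single missing entry with the most common value.
train['Electrical'] = train['Electrical'].fillna(train['Electrical'].mode()[0])
```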
The normality graphs for 'SalePrice', 'GrLivArea' and 'TotalBsmtSF' (click on the pictures to zoom in).
This step transforms the data towards a normal distribution using the log transformation. A probability plot is applied: if the data points lie on the diagonal line, the feature is more likely to be normally distributed. The skewness and kurtosis also indicate whether a feature is normal, left-skewed or right-skewed, and whether it has heavier or lighter tails (a sketch of this step follows the rule-of-thumb list below).
According to the rule of thumb:
- Reference: https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics
- Skewness
- If the skewness is between -0.5 and 0.5, the data are fairly symmetrical
- If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed
- If the skewness is less than -1 or greater than 1, the data are highly skewed
- Kurtosis
- If the kurtosis is close to 0, then a normal distribution is often assumed. These are called mesokurtic distributions.
- If the kurtosis is less than zero, then the distribution has lighter tails and is called a platykurtic distribution.
- If the kurtosis is greater than zero, then the distribution has heavier tails and is called a leptokurtic distribution.
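A sketch of the normality check and log transformation described above, assuming `train` holds the cleaned training data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

for col in ['SalePrice', 'GrLivArea', 'TotalBsmtSF']:
    print(col, 'skewness:', train[col].skew(), 'kurtosis:', train[col].kurt())
    stats.probplot(train[col], plot=plt)   # points on the diagonal ~ normal
    plt.title(col)
    plt.show()

# Right-skewed features are pulled towards normality with a log transform.
train['SalePrice'] = np.log(train['SalePrice'])
train['GrLivArea'] = np.log(train['GrLivArea'])
# 'TotalBsmtSF' contains zeros, so log1p (log(1 + x)) is the safer choice.
train['TotalBsmtSF'] = np.log1p(train['TotalBsmtSF'])
```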
- The ordinal variables have been label encoded by converting each string category into an ordered number.
- All year-type variables have been converted into year intervals. For instance: original = 1995 -> year interval = 2018 - 1995 = 23
- Lastly, the remaining categorical variables have been transformed into dummy values (a combined sketch of these steps is shown after this list).
- The test data-set goes through the same process as the training data-set (e.g. data transformation, removal of weak features, etc.).
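A combined sketch of these preparation steps, assuming `train` holds the cleaned training data; the 'ExterQual' mapping and the choice of year columns are illustrative assumptions:

```python
import pandas as pd

# 1. Label-encode an ordinal variable as ordered integers
#    (the mapping for 'ExterQual' is just one example).
quality_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
train['ExterQual'] = train['ExterQual'].map(quality_map)

# 2. Convert year-type variables into intervals relative to 2018.
for col in ['YearBuilt', 'YearRemodAdd', 'YrSold']:
    train[col] = 2018 - train[col]

# 3. One-hot encode the remaining categorical variables into dummy values.
train = pd.get_dummies(train)
```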
The XGBoost model will be used in this project. XGBoost stands for eXtreme Gradient Boosting. XGBoost is fast and dominates structured or tabular datasets on classification and regression predictive modeling problems.
- A good explanation is available here: https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
The XGBoost hyper-parameters have been tuned with scikit-learn's 'RandomizedSearchCV', which samples parameter settings at random to find good hyper-parameters with less execution time. After fitting the training data to the XGBoost model, the generated results are shown below:
- Note: 'Test' shown in the results is actually validation data, and 'r2' is R-squared.
The results show that the 'Test r2' is slightly lower than the 'Train r2', which means the model is slightly over-fitting.
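A hedged sketch of the tuning and evaluation step, assuming the prepared training data is in `train`; the parameter grid, split size and random seeds are illustrative assumptions, not the exact settings used here:

```python
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

X = train.drop(columns=['SalePrice'])
y = train['SalePrice']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

# Randomly sample hyper-parameter combinations instead of a full grid search.
param_distributions = {
    'n_estimators': [300, 500, 1000],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.8, 1.0],
}
search = RandomizedSearchCV(XGBRegressor(), param_distributions,
                            n_iter=20, cv=5, random_state=42)
search.fit(X_train, y_train)

best = search.best_estimator_
print('Train r2:', r2_score(y_train, best.predict(X_train)))
print('Test r2 :', r2_score(y_val, best.predict(X_val)))  # 'Test' = validation split
```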
A visual check of whether the standardized residuals follow a normal distribution: the standardized residuals do not appear close to normally distributed.
- Reference: http://docs.statwing.com/interpreting-residual-plots-to-improve-your-regression/#y-unbalanced-header
The data points are not evenly distributed vertically, so the model has room for improvement.
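A sketch of this residual check, reusing `best`, `X_val` and `y_val` from the tuning sketch above; an even vertical spread around zero is what a well-behaved model would show:

```python
import matplotlib.pyplot as plt

predictions = best.predict(X_val)
residuals = y_val - predictions
standardized = (residuals - residuals.mean()) / residuals.std()

plt.scatter(predictions, standardized, alpha=0.5)
plt.axhline(0, color='red')
plt.xlabel('Predicted SalePrice (log scale)')
plt.ylabel('Standardized residual')
plt.show()
```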
After comparing the results, I still need to put more effort into improving the model. The results suggest that some valuable information in the dataset has not yet been discovered. I probably need to review all the missing data and outliers, and spend more time on data analysis and the multicollinearity issue.
Google Colab
- Python 3
- xgboost 0.7.post4
- sklearn 0.19.2