- Predict house sale prices for Washington homes using regression machine learning models
- Learning goal: prepare data for machine learning, implement machine learning algorithms in Python, and analyze their results
- Clean data (a code sketch follows these steps)
- Remove unnecessary columns
- Correct misentered data
- Correct column data types
- Remove predictor outliers
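A minimal sketch of these cleaning steps with pandas; the file name, column names, and the 1.5*IQR outlier rule are assumptions for illustration, not details taken from the project.

```python
import pandas as pd

# Assumed file and column names; adjust to the actual dataset
df = pd.read_csv("housing.csv")

# Remove unnecessary columns (placeholder names)
df = df.drop(columns=["id", "country"], errors="ignore")

# Correct misentered data, e.g. treat a sale price of 0 as invalid
df = df[df["price"] > 0]

# Correct column data types, e.g. parse the sale date
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Remove predictor outliers with a simple 1.5*IQR rule on living area
q1, q3 = df["sqft_living"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["sqft_living"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```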
- Prepare data for machine learning (a code sketch follows these steps)
- Binary encoding of categorical variables
- Split data into predictors and response
- Normalize predictor data so that all features are on a comparable scale and no feature dominates simply because of its units
- Split data into train, validation, and test sets
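A minimal sketch of this preparation, continuing from the cleaned data above; the column names, the 60/20/20 split ratio, and min-max scaling are illustrative assumptions rather than the project's exact choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("housing_clean.csv")  # assumed output of the cleaning step

# Binary (one-hot) encoding of categorical variables (assumed column names)
df = pd.get_dummies(df, columns=["city", "waterfront", "view"])

# Split data into predictors and response
X = df.drop(columns=["price", "date"], errors="ignore")
y = df["price"]

# Train / validation / test split (60 / 20 / 20 here, for illustration)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Scale features; fit only on the training set to avoid leakage
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
```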
- Create boxplots for sale price grouped by each categorical variable (a plotting sketch follows the EDA items below)
- Differences between the boxplots indicate that price varies across the levels of that categorical variable
- Create scatterplots for sale price against each quantitative variable
- Examine correlation between sale price and each quantitative variable
- Create time series plots for price and time data
- Plot the first, second, and third quartiles of price for each year
- Examine whether there is a trend or change in variation
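A sketch of these plots with seaborn and matplotlib; the column names ('view', 'sqft_living', 'date', 'price') are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("housing.csv")  # assumed file name
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Boxplot of sale price grouped by one categorical variable
sns.boxplot(data=df, x="view", y="price")
plt.title("Sale price by view rating")
plt.show()

# Scatterplot of sale price against one quantitative variable, plus its correlation
print("corr(price, sqft_living):", df["price"].corr(df["sqft_living"]))
sns.scatterplot(data=df, x="sqft_living", y="price")
plt.show()

# Time series of the yearly price quartiles
quartiles = df.groupby(df["date"].dt.year)["price"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.plot(marker="o", title="Yearly price quartiles")
plt.ylabel("price")
plt.show()
```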
- Implement K-Nearest Neighbors with the scikit-learn package (sketch below)
- Use random search and grid search to tune hyperparameters by maximizing the negative mean squared error
- Results
- Mean Absolute Error: 131426.5878
- Mean Squared Error: 58914690993.4774
- Create graphs showing the relationship between predicted price, actual price, and the absolute difference between them
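A sketch of the KNN workflow, reusing the split variables from the data-preparation sketch above; the hyperparameter ranges are illustrative assumptions.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Coarse random search over a wide range of neighbor counts
param_dist = {"n_neighbors": list(range(1, 51)), "weights": ["uniform", "distance"]}
random_search = RandomizedSearchCV(
    KNeighborsRegressor(), param_dist, n_iter=20,
    scoring="neg_mean_squared_error", cv=5, random_state=42,
)
random_search.fit(X_train, y_train)

# Finer grid search around the best random-search value
best_k = random_search.best_params_["n_neighbors"]
param_grid = {"n_neighbors": list(range(max(1, best_k - 3), best_k + 4)),
              "weights": ["uniform", "distance"]}
grid_search = GridSearchCV(KNeighborsRegressor(), param_grid,
                           scoring="neg_mean_squared_error", cv=5)
grid_search.fit(X_train, y_train)

# Evaluate the tuned model on the validation set
knn = grid_search.best_estimator_
pred = knn.predict(X_val)
print("MAE:", mean_absolute_error(y_val, pred))
print("MSE:", mean_squared_error(y_val, pred))
```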
- Implement Random Forest with the scikit-learn package (sketch below)
- Use random search and grid search to tune hyperparameters by maximizing the negative mean squared error
- Results
- Mean Absolute Error: 117519.3716
- Mean Squared Error: 44596532085.3743
- Examine feature importance
- The most important features were 'sqftAbove' and 'bathroom'
- Unimportant features included 'location'
- Create graphs showing the relationship between predicted price, actual price, and the absolute difference between them
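A sketch of the Random Forest workflow; the parameter grid is an illustrative assumption, and `feature_names` stands in for the predictor column names kept from before scaling.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

param_dist = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42), param_dist, n_iter=10,
    scoring="neg_mean_squared_error", cv=5, random_state=42,
)
search.fit(X_train, y_train)

rf = search.best_estimator_
pred = rf.predict(X_val)
print("MAE:", mean_absolute_error(y_val, pred))
print("MSE:", mean_squared_error(y_val, pred))

# Rank features by impurity-based importance
feature_names = X.columns  # predictor names kept from before scaling
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```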
- Implement Support Vector Regression with the scikit-learn package (sketch below)
- Use random search and grid search to tune hyperparameters by maximizing the negative mean squared error
- Results
- Mean Absolute Error: 203642.6582
- Mean Squared Error: 142434180920.8759
- Create graphs showing the relationship between predicted price, actual price, and the absolute difference between them
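A sketch of the SVR workflow; the kernel choice and parameter ranges are assumptions for illustration.

```python
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

param_grid = {
    "kernel": ["rbf"],
    "C": [1e2, 1e3, 1e4, 1e5],
    "gamma": ["scale", 0.01, 0.1],
    "epsilon": [0.1, 1.0, 10.0],
}
search = GridSearchCV(SVR(), param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y_train)

svr = search.best_estimator_
pred = svr.predict(X_val)
print("MAE:", mean_absolute_error(y_val, pred))
print("MSE:", mean_squared_error(y_val, pred))
```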
- The Random Forest model is the best at predicting housing prices since it has the smallest mean absolute error and the smallest mean squared error
- Support Vector Regression is the worst at predicting housing prices since it has the largest mean absolute error and the largest mean squared error
- Additionally, all predictions are below $600,000, so the model is not trained to recognize expensive homes
- For all models, grouping the absolute prediction differences by category shows the following (a sketch of this analysis follows these items)
- Homes sold in East Urban have the greatest absolute prediction difference
- Homes sold with a waterfront have the greatest absolute prediction difference
- Homes sold with a view rating of 4 have the greatest absolute prediction difference
- Homes sold with 7 bathrooms have the greatest absolute prediction difference
- Predicting prices of homes with these attributes may be less accurate, so further investigation is recommended before accepting the predicted value as fact
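A sketch of this per-category error analysis; `best_model` stands for any of the fitted models above, and `X_test_raw` is an assumed copy of the test predictors kept before scaling so the categorical columns are still readable.

```python
# Absolute prediction differences on the test set
abs_diff = (y_test - best_model.predict(X_test)).abs()

# Mean absolute difference per level of one categorical variable, e.g. view rating
per_view = abs_diff.groupby(X_test_raw["view"]).mean().sort_values(ascending=False)
print(per_view)
```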
- The models may improve if more outliers are removed during data cleaning
- The greatest differences between predicted and actual prices occurred for homes selling at high prices
- None of the models appear to be overfit, since each behaves similarly on the training, validation, and test sets (a simple check is sketched below)
- Because the models are not overfit, all features can be retained
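A sketch of that check: compare the same error metrics across the three splits, using the variables assumed in the earlier sketches; a large gap between training error and held-out error would indicate overfitting.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

for name, X_set, y_set in [("train", X_train, y_train),
                           ("validation", X_val, y_val),
                           ("test", X_test, y_test)]:
    pred = best_model.predict(X_set)
    print(f"{name}: MAE={mean_absolute_error(y_set, pred):.0f}, "
          f"MSE={mean_squared_error(y_set, pred):.0f}")
```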