- Define the business objective
- Get a high-level understanding of the data
- Create the training and test sets using a proper sampling method, e.g., random vs. stratified
- Correlation analysis (pair-wise and attribute combinations)
- Data cleaning (missing data, outliers, data errors)
- Data transformation via pipelines (categorical text to numbers using one-hot encoding, feature scaling via normalization/standardization, feature combinations)
- Train and cross-validate different models and select the most promising one (Linear Regression, Decision Tree, and Random Forest were tried in this tutorial)
- Fine-tune the model by trying different combinations of hyperparameters
- Evaluate the best estimator on the test set
- Launch, monitor, and refresh the model and system
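
The steps above can be sketched end to end with scikit-learn. This is a minimal illustration, not the tutorial's actual code: the data is synthetic, the column names (`income`, `rooms`, `region`, `price`) are hypothetical stand-ins for the housing attributes, and the income bins, hyperparameter grid, and choice to tune the random forest are assumptions made to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical columns, not the real housing set).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "income": rng.lognormal(1.0, 0.5, n),
    "rooms": rng.integers(2, 9, n).astype(float),
    "region": rng.choice(["coast", "inland", "bay"], n),
})
df["price"] = 50 * df["income"] + 10 * df["rooms"] + rng.normal(0, 5, n)
df.loc[rng.choice(n, 20, replace=False), "rooms"] = np.nan  # simulate missing data

# Stratified split: bin a key attribute so both sets keep its distribution.
income_bin = pd.cut(df["income"], bins=[0, 2, 3, 4, np.inf], labels=False)
train, test = train_test_split(df, test_size=0.2, stratify=income_bin, random_state=42)
X_train, y_train = train.drop(columns="price"), train["price"]
X_test, y_test = test.drop(columns="price"), test["price"]

# Pair-wise correlation of numeric attributes with the target (training set only).
corr_with_price = train[["income", "rooms", "price"]].corr()["price"]

# Cleaning and transformation in one pipeline: impute, scale, one-hot encode.
num_cols, cat_cols = ["income", "rooms"], ["region"]
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Cross-validate the three model families and compare mean RMSE.
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=30, random_state=42),
}
cv_rmse = {}
for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    scores = cross_val_score(pipe, X_train, y_train,
                             scoring="neg_mean_squared_error", cv=5)
    cv_rmse[name] = float(np.sqrt(-scores).mean())

# Fine-tune one candidate (the random forest, chosen here for illustration).
search = GridSearchCV(
    Pipeline([("prep", preprocess), ("model", RandomForestRegressor(random_state=42))]),
    param_grid={"model__n_estimators": [10, 30], "model__max_features": [2, 4]},
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the best estimator once, on the held-out test set.
test_pred = search.best_estimator_.predict(X_test)
final_rmse = float(np.sqrt(mean_squared_error(y_test, test_pred)))

print(corr_with_price)
print(cv_rmse)
print(search.best_params_, final_rmse)
```

Keeping imputation, scaling, and encoding inside the pipeline means each cross-validation fold fits them on its own training portion only, so no information leaks from the validation fold or the test set.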
ArmanJR/California-Housing-Prices
A clone of https://www.kaggle.com/armanjr/california-housing-prices