- Data preparation
- Exploratory Data Analysis
- Check packaging
- Look at top and bottom of dataset
- Check your Ns
- Look at center and spread
- Make comparisons (Correlation between columns?)
- Data Cleaning
- Missing data?
- Noisy data?
- Handling outliers?
- Data Transformation
- Normalizing continuous data
- Dummy coding categorical data
- Data reduction
- Removing irrelevant columns?
- Model Training
- Use cross-validation for multiple train/dev splits
- How do we divide into train/dev splits?
- 80/20?
- Take random samples?
- Use different models
- Bias-variance tradeoff
- Model evaluation
- Test the accuracy of models
- Metrics (R², MSE, RMSE, MAE, mAE)
- Choose the best performing model
- Result
- Use the chosen model to predict results on the test set
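A rough sketch of the data preparation steps from the outline, in case it helps to see them end to end. It is written in plain Python (standard library only) as an assumption, not taken from the R notebook. The toy records, column names, and helper functions are all hypothetical, and median imputation and IQR clipping are just one possible answer to the "Missing data?" and "Handling outliers?" questions.

```python
import statistics

# Hypothetical toy records; all column names and values are made up.
rows = [
    {"id": 1, "age": 23.0, "city": "Oslo", "income": 31.0},
    {"id": 2, "age": 35.0, "city": "Bergen", "income": 42.0},
    {"id": 3, "age": None, "city": "Oslo", "income": 38.0},     # missing value
    {"id": 4, "age": 51.0, "city": "Bergen", "income": 400.0},  # likely outlier
    {"id": 5, "age": 44.0, "city": "Oslo", "income": 47.0},
]

def column(rows, name):
    """Return the non-missing values of one column."""
    return [r[name] for r in rows if r[name] is not None]

# Exploratory checks: top and bottom, Ns, center and spread.
print("first row:", rows[0])
print("last row:", rows[-1])
print("N rows:", len(rows))
print("missing age:", sum(r["age"] is None for r in rows))
print("income median:", statistics.median(column(rows, "income")))
print("income stdev:", statistics.stdev(column(rows, "income")))

def impute_median(rows, name):
    """Cleaning: replace missing values with the column median."""
    med = statistics.median(column(rows, name))
    return [{**r, name: med if r[name] is None else r[name]} for r in rows]

def clip_outliers(rows, name, k=1.5):
    """Outliers: clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(column(rows, name), n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [{**r, name: min(max(r[name], low), high)} for r in rows]

def zscore(rows, name):
    """Normalization: rescale a continuous column to mean 0, stdev 1."""
    mean = statistics.mean(column(rows, name))
    sd = statistics.stdev(column(rows, name))
    return [{**r, name: (r[name] - mean) / sd} for r in rows]

def dummy_code(rows, name):
    """Dummy coding: one 0/1 column per level, dropping the first level."""
    levels = sorted({r[name] for r in rows})
    coded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != name}
        for level in levels[1:]:
            new[f"{name}_{level}"] = 1 if r[name] == level else 0
        coded.append(new)
    return coded

def drop_columns(rows, names):
    """Data reduction: remove columns judged irrelevant."""
    return [{k: v for k, v in r.items() if k not in names} for r in rows]

imputed = impute_median(rows, "age")
clipped = clip_outliers(imputed, "income")
normalized = zscore(zscore(clipped, "age"), "income")
clean = drop_columns(dummy_code(normalized, "city"), {"id"})
print(clean[0])
```

Each helper returns a new list instead of modifying `rows`, so the raw data stays available for comparison after every step.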
I made a Google Colab notebook (in R) that we can use: https://colab.research.google.com/drive/1fknUKXEmA2UA8Z62jIXtp1mXsptpRqFr?usp=sharing
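Separately from the notebook, here is a minimal sketch of the training, evaluation, and result steps, again in plain Python as an assumption and not copied from the Colab. It assumes an 80/20 random split for the held-out test set and 5-fold cross-validation on the remaining train/dev data. The synthetic data, the two candidate models (a mean baseline and a one-variable least-squares line), and all function names are hypothetical.

```python
import math
import random
import statistics

random.seed(0)
# Hypothetical synthetic data: y is roughly 2*x + 1 plus Gaussian noise.
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 1)) for x in range(50)]

# Random sampling: shuffle, then hold out 20% as the final test set.
random.shuffle(data)
cut = int(0.8 * len(data))
train_dev, test = data[:cut], data[cut:]

def fit_mean(pairs):
    """High-bias baseline: always predict the mean of y."""
    mean_y = statistics.mean(y for _, y in pairs)
    return lambda x: mean_y

def fit_linear(pairs):
    """Simple least-squares line y = intercept + slope * x."""
    mx = statistics.mean(x for x, _ in pairs)
    my = statistics.mean(y for _, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def metrics(model, pairs):
    """Compute MSE, RMSE, MAE and R2 of a fitted model on (x, y) pairs."""
    errors = [y - model(x) for x, y in pairs]
    mse = statistics.mean(e * e for e in errors)
    mean_y = statistics.mean(y for _, y in pairs)
    ss_tot = sum((y - mean_y) ** 2 for _, y in pairs)
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": statistics.mean(abs(e) for e in errors),
        "R2": 1.0 - sum(e * e for e in errors) / ss_tot,
    }

def cross_val_rmse(fit, pairs, k=5):
    """Mean dev-set RMSE over k train/dev splits."""
    folds = [pairs[i::k] for i in range(k)]
    scores = []
    for i, dev in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(metrics(fit(train), dev)["RMSE"])
    return statistics.mean(scores)

# Model training and evaluation: compare candidates by cross-validated RMSE.
candidates = {"mean": fit_mean, "linear": fit_linear}
cv_scores = {name: cross_val_rmse(fit, train_dev) for name, fit in candidates.items()}
best_name = min(cv_scores, key=cv_scores.get)
print("CV RMSE per model:", cv_scores)
print("chosen model:", best_name)

# Result: refit the chosen model on all train/dev data, then score the test set once.
final_model = candidates[best_name](train_dev)
test_metrics = metrics(final_model, test)
print("test metrics:", test_metrics)
```

The test set is only touched once, after the model has been chosen, so the reported test metrics are not biased by the model selection step.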