
Possible steps to take

  • Data preparation
    • Exploratory Data Analysis
      • Check packaging
      • Look at top and bottom of dataset
      • Check your Ns (row counts overall and per group)
      • Look at center and spread
      • Make comparisons (Correlation between columns?)
    • Data Cleaning
      • Missing data?
      • Noisy data?
      • Handling outliers?
    • Data Transformation
      • Normalizing continuous data
      • Dummy coding categorical data
    • Data reduction
      • Removing irrelevant columns?
  • Model Training
    • Use cross-validation for multiple train/dev splits
      • How do we divide into train/dev splits?
        • 80/20?
        • Take random samples?
    • Use different models
    • Bias-variance tradeoff
  • Model evaluation
    • Measure each model's prediction error on the dev splits
      • Metrics (R², MSE, RMSE, MAE, mAE)
    • Choose the best performing model
  • Result
    • Use the chosen model to make predictions on the test set
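
To make the steps above concrete, here is a rough end-to-end sketch in base R. Everything specific in it is an assumption for illustration only: the built-in mtcars data stands in for our real dataset, mpg is a placeholder target, the two lm() formulas are placeholder "different models", and the 80/20 split and k = 5 folds are just one possible answer to the open questions in the list. mAE is left out because the outline does not say what it stands for.

```r
# Stand-in data: mtcars ships with base R; swap in the real dataset.
raw <- mtcars[, c("mpg", "wt", "hp", "cyl")]
raw$cyl <- factor(raw$cyl)                  # treat cylinder count as categorical

# Exploratory checks
str(raw)                                    # packaging: dimensions and column types
print(head(raw)); print(tail(raw))          # top and bottom of the dataset
print(table(raw$cyl))                       # Ns per category
print(summary(raw))                         # center and spread
print(cor(raw$mpg, raw$wt))                 # one pairwise comparison

# Cleaning: count missing values, then drop incomplete rows (simplest option)
print(colSums(is.na(raw)))
raw <- raw[complete.cases(raw), ]

# Transformation: dummy-code the factor (first level becomes the baseline)
dummies <- model.matrix(~ cyl, data = raw)[, -1, drop = FALSE]
dat <- data.frame(raw[, c("mpg", "wt", "hp")], dummies)

# Hold out a random 20% test set (assumed split, not a decision)
set.seed(42)
n <- nrow(dat)
test_idx <- sample(n, size = round(0.2 * n))
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# Normalize continuous predictors using training-set mean and sd only
num_cols <- c("wt", "hp")
mu  <- colMeans(train[, num_cols])
sdv <- apply(train[, num_cols], 2, sd)
normalize <- function(d) {
  d[, num_cols] <- as.data.frame(scale(d[, num_cols], center = mu, scale = sdv))
  d
}
train <- normalize(train)
test  <- normalize(test)

# Regression metrics from the list (mAE omitted, see note above)
metrics <- function(actual, pred) {
  err <- actual - pred
  c(R2   = 1 - sum(err^2) / sum((actual - mean(actual))^2),
    MSE  = mean(err^2),
    RMSE = sqrt(mean(err^2)),
    MAE  = mean(abs(err)))
}

# k-fold cross-validation on the training data: each fold is the dev split once
k <- 5
fold <- sample(rep(1:k, length.out = nrow(train)))
formulas <- list(small = mpg ~ wt,
                 full  = mpg ~ wt + hp + cyl6 + cyl8)
cv_rmse <- sapply(formulas, function(f) {
  mean(sapply(1:k, function(i) {
    fit <- lm(f, data = train[fold != i, ])
    dev <- train[fold == i, ]
    metrics(dev$mpg, predict(fit, newdata = dev))["RMSE"]
  }))
})

# Pick the model with the lowest mean dev RMSE, refit on all training rows,
# then score it once on the held-out test set
best <- names(which.min(cv_rmse))
final_fit <- lm(formulas[[best]], data = train)
test_metrics <- metrics(test$mpg, predict(final_fit, newdata = test))
print(cv_rmse)
print(test_metrics)
```

Comparing a small and a larger model by dev error is one simple way to look at the bias-variance tradeoff: the larger model fits the training folds better but can do worse on the dev folds. With data this small a fold may contain no rows for one category, in which case lm() warns about a rank-deficient fit; that should not happen often on a realistically sized dataset.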

Shared working space

I made a Google Colab notebook (in R) that we can use: