higor-gomes93/project-V-kaggle-competition

W7-Kaggle_competition

Description

Find the best machine learning model and params for a given dataset.

Instructions

Find the Kaggle competition with your cohort name, i.e. diamonds-databcn0722, https://www.kaggle.com/competitions/diamonds-databcn0722/overview

train.csv

1. Processing/cleaning the dataset: this should be later modularized in functions.
1. Train a model (fit & predict) with the data in train.csv. This file DOES contain a y (price).
  - Do train, test, split on train.csv if necessary.
  - Choose the best model regarding the metrics. In this case, the lowest RMSE (error).
  2.1. Export the model: we don't want to invest time/RAM resources on training the model again in the future.

test.csv

1. Apply the same cleaning to test.csv. This files does NOT contain a y (no price column).
1. We'll apply the already trained model from step 2 to the test.csv file. With this we'll generate a new column with the predicted values.

my_submission.csv

1. Generate a submission.csv file with only two columns: the ID of the diamond & the predicted price (y).

In other words: use train.csv to generate and save a model. Use test.csv to predict new values. Then generate a DF with ID & predicted.

Deliverables

Jupyter notebooks where you show the process you followed to get to your submissions.
A slide (.ppt, ipynb, etc) with a summary of the metrics you obtained and the rationale behind it.
- Why do those params work better than others?

Tips

Check often for the df.shape & len of the things your working with
Do make sure you ONLY have two columns. When saving the file, you might save an "extra column" (index). So make sure you don't include it. There should only be two columns: id & price
Take advantage of the daily submissions. Try at least one today!