End-to-end ML pipeline for estimating the typical price of a given property in NYC based on the prices of similar properties.
Link to GitHub: https://github.com/BM2304/nd0821-c2-build-model-workflow-starter
Link to W&B: https://wandb.ai/bm23/nyc_airbnb
Make sure to have conda installed and ready, then create a new environment using the environment.yml
file provided in the root of the repository and activate it:
> conda env create -f environment.yml
> conda activate nyc_airbnb_dev
Let's make sure we are logged in to Weights & Biases. Get your API key from W&B by going to https://wandb.ai/authorize and clicking on the + icon (copy to clipboard), then paste your key into this command:
> wandb login [your API key]
You should see a message similar to:
wandb: Appending key for api.wandb.ai to your netrc file: /home/[your username]/.netrc
As usual, the parameters controlling the pipeline are defined in the config.yaml file located in the root of the starter kit.
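For illustration, a config.yaml for this pipeline might contain entries like the following; apart from the keys referenced elsewhere in this README (etl.sample, etl.min_price, modeling.random_forest.n_estimators), the exact values and keys shown here are assumptions:

```yaml
main:
  project_name: nyc_airbnb
  steps: all            # or a comma-separated subset, e.g. download,basic_cleaning
etl:
  sample: sample1.csv
  min_price: 10
  max_price: 350
modeling:
  random_forest:
    n_estimators: 100
```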
In order to run the pipeline when you are developing, you need to be in the root of the starter kit, then you can execute as usual:
> mlflow run .
This will run the entire pipeline.
When developing it is useful to be able to run one step at a time. Say you want to run only the download step. The main.py script is written so that the steps are defined at the top of the file, in the _steps list, and can be selected by using the steps parameter on the command line:
> mlflow run . -P steps=download
If you want to run the download and the basic_cleaning steps, you can similarly do:
> mlflow run . -P steps=download,basic_cleaning
You can override any other parameter in the configuration file using the Hydra syntax, by providing it as a hydra_options parameter. For example, say that we want to set modeling.random_forest.n_estimators to 10 and etl.min_price to 50:
> mlflow run . \
-P steps=download,basic_cleaning \
-P hydra_options="modeling.random_forest.n_estimators=10 etl.min_price=50"
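Conceptually, each Hydra override is a dotted path into the nested configuration plus a new value. A minimal sketch of that mechanism (this is an illustration, not Hydra's actual implementation):

```python
def apply_override(cfg: dict, override: str) -> None:
    """Apply a single 'a.b.c=value' override to a nested config dict in place."""
    path, raw = override.split("=", 1)
    *parents, leaf = path.split(".")
    node = cfg
    for key in parents:
        node = node[key]
    # Crude type coercion: integers stay numeric, everything else stays a string
    try:
        value = int(raw)
    except ValueError:
        value = raw
    node[leaf] = value

config = {
    "modeling": {"random_forest": {"n_estimators": 100}},
    "etl": {"min_price": 10},
}
# Same overrides as in the mlflow command above
for ov in ["modeling.random_forest.n_estimators=10", "etl.min_price=50"]:
    apply_override(config, ov)
print(config)
```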
These are the steps of the pipeline, each defined in its own MLproject file:

- download: downloads the data (get_data)
- basic_cleaning: removes outliers and null values
- data_check: runs data checks like distribution and expected columns
- data_split: segregates (splits) the data
- train_random_forest: trains a random forest and uploads the fitted model
- test_regression_model: tests a trained model on new test data
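The main script launches the steps above one by one, passing each the parameters it needs from the configuration. A rough sketch of that orchestration; the step directories and parameter names here are illustrative assumptions, and a stub stands in for mlflow.run() so the sketch is self-contained:

```python
# Sketch of the orchestration loop in main.py. The real pipeline launches
# each step as an MLflow project via mlflow.run(); run_step is a stand-in.

def run_step(uri, parameters=None):
    """Stand-in for mlflow.run(uri, parameters=...)."""
    print(f"running {uri} with {parameters or {}}")
    return uri

def go(active_steps, config):
    launched = []
    if "download" in active_steps:
        launched.append(run_step(
            "components/get_data",
            parameters={"sample": config["etl"]["sample"]}))
    if "basic_cleaning" in active_steps:
        launched.append(run_step(
            "src/basic_cleaning",
            parameters={"min_price": config["etl"]["min_price"],
                        "max_price": config["etl"]["max_price"]}))
    # ...data_check, data_split, train_random_forest, test_regression_model
    return launched

config = {"etl": {"sample": "sample1.csv", "min_price": 10, "max_price": 350}}
go(["download", "basic_cleaning"], config)
```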
You can also run the pipeline directly from the GitHub repository, selecting a release version and overriding parameters, for example to use a different data sample:
> mlflow run https://github.com/BM2304/nd0821-c2-build-model-workflow-starter.git \
-v [the version you want to use, like 1.0.1] \
-P hydra_options="etl.sample='sample2.csv'"