Predict house prices in the Kaggle competition
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Container images are based on google/cloud-sdk:226.0.0.
The Dockerfile copies the relevant files to /airflow/dags.
To build: docker build -t houseprice:1.0.0 .
This assumes everything runs in a production environment.
Therefore, you should run
docker run -it -p 8080:8080 -p 5000:5000 houseprice:1.0.0 /bin/bash
For reference, see:
- https://github.com/GoogleCloudPlatform/cloud-sdk-docker/
- https://hub.docker.com/r/google/cloud-sdk
- http://docs.docker.jp/engine/reference/builder.html#volume
Google Cloud authentication is necessary. Set it up as follows:
- copy your service account credential (JSON key) to
/airflow/dags/.gcp/your-credential.json
export GOOGLE_APPLICATION_CREDENTIALS=/airflow/dags/.gcp/your-credential.json
gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}
rm -r .gcp
export PROJECT_ID="your-project-id" REGION="your-region" BUCKET_NAME="your-bucket-name" MODEL_NAME="your-model-name"
gcloud config set core/project ${PROJECT_ID}
gcloud config set compute/region ${REGION}
Basic configuration is done.
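Before kicking off the pipeline it can help to verify that all the variables exported above are actually set; this is only a minimal sketch (the `check_config` helper is hypothetical, not part of the project):

```python
import os

# Variables the pipeline relies on (from the export/gcloud steps above).
REQUIRED_VARS = [
    "GOOGLE_APPLICATION_CREDENTIALS",
    "PROJECT_ID",
    "REGION",
    "BUCKET_NAME",
    "MODEL_NAME",
]

def check_config(env=os.environ):
    """Return the names of required variables that are missing or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Example: an empty environment is missing everything.
print(check_config({}))
```

Run it right after the export line; an empty list means the basic configuration is in place.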
By default:
- `src/inspect_data.py` loads `data/raw/new.csv` and checks its dtypes.
- `src/alignment.py` loads `data/raw/old.csv`, extracts the necessary features, aligns them, and saves `data/prepared/alignmented_data.csv`.
- `src/split.py` loads `data/prepared/alignmented_data.csv` and splits it into `data/prepared/data_train.csv` and `data/prepared/data_valid.csv`.
- `src/train.py` loads `data/prepared/data_train.csv` and trains a linear regression model, which is saved to `data/pickles/model.pkl` or `data/pickles/${DATE}/model.pkl`. This code also uses MLflow to track the coefficient of determination (R^2), the root mean squared log error, and the linear model called `lm` in the `mlruns` directory.
  - `calculate_metrics` in `src/evaluate.py` is called by `src/train.py` to save the model's performance metrics to `data/profile/metrics.json`.
- `src/predict.py` loads `data/pickles/model.pkl` and `data/raw/test.csv`, and then makes a prediction file `data/prepared/my_submission.csv`.
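The two metrics tracked above can be sketched as follows. This is an illustrative reimplementation of `calculate_metrics` in plain Python (the actual `src/evaluate.py` may differ; the example prices are made up):

```python
import json
import math

def calculate_metrics(y_true, y_pred):
    """Coefficient of determination (R^2) and root mean squared log error."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmsle = math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / n
    )
    return {"r2": r2, "rmsle": rmsle}

# Illustrative house prices; the project writes data/profile/metrics.json.
metrics = calculate_metrics([100000, 200000], [110000, 190000])
with open("metrics.json", "w") as f:
    json.dump(metrics, f)
```

RMSLE is computed on log1p-transformed prices, which is also the error measure used by the competition's leaderboard.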
These steps are orchestrated by an Airflow DAG defined in `src/dags.py`.
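A `src/dags.py` along these lines could chain the steps with `BashOperator`s; this is only a sketch under assumed task ids, schedule, and script paths, not the project's actual DAG:

```python
from datetime import datetime

from airflow import DAG
# Airflow 1.x import path, matching the era of google/cloud-sdk:226.0.0.
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    "houseprice",                      # assumed DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",        # assumed schedule
)

# One task per pipeline script; bash_command paths follow the image layout.
alignment = BashOperator(task_id="alignment",
                         bash_command="python /airflow/dags/src/alignment.py", dag=dag)
split = BashOperator(task_id="split",
                     bash_command="python /airflow/dags/src/split.py", dag=dag)
train = BashOperator(task_id="train",
                     bash_command="python /airflow/dags/src/train.py", dag=dag)
predict = BashOperator(task_id="predict",
                       bash_command="python /airflow/dags/src/predict.py", dag=dag)

alignment >> split >> train >> predict
```

With the SequentialExecutor, each task runs one at a time in this order once the scheduler is started.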
The applications are assumed to run inside a container.
Expose ports 5000 and 8080 for MLflow and Airflow respectively.
MLflow tutorial is https://mlflow.org/docs/latest/quickstart.html
Airflow tutorial is https://airflow.apache.org/tutorial.html
The commands below are written in misc/start_server,
so you can simply run ./misc/start_server
Run the following; Airflow will then be available on localhost:8080
airflow initdb
airflow webserver -p 8080 > /airflow/logs/webserver.log 2>&1 &
To enable scheduled runs, you also have to run
airflow scheduler > /airflow/logs/scheduler/scheduler.log 2>&1 &
This is necessary for the SequentialExecutor.
Run the following; MLflow will then be available on localhost:5000
mlflow ui -h 0.0.0.0 > /airflow/logs/mlflow.log 2>&1 &