The main purpose of this project is to focus on the engineering side rather than the modelling side. We will create efficient data pipelines and adhere to coding best practices using different tools, languages, and technologies such as Python, Scala, Spark, Docker, and CI/CD tools.
Note: This repo will be used for testing different technologies.
The dataset for this project is taken from Kaggle. It is a simple customer-churn dataset containing both numeric and categorical features. The task is binary classification: the target variable is True/1 if a customer has left the company and False/0 otherwise.
For the pipeline to work, save the CSV file from Kaggle to the "data" directory (src/python/src/data/) and rename it to "telco_churn.csv".
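As a quick sanity check that the file is in place, you can load it with pandas (a minimal sketch; pandas is assumed to be available, and the column names depend on the Kaggle export):

```python
import pandas as pd

# Path and filename expected by the pipeline (see above).
df = pd.read_csv("src/python/src/data/telco_churn.csv")

# Inspect the numeric and categorical feature types and preview the rows.
print(df.dtypes)
print(df.head())
```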
To run the project, clone this repo and run the docker/docker-compose-shell.sh script. The script can run the train phase, the predict phase, or both: pass "train" to run only the train phase, "predict" for only the predict phase, and either "both" or no argument at all to run both.
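For example:

```bash
./docker/docker-compose-shell.sh train    # train phase only
./docker/docker-compose-shell.sh predict  # predict phase only
./docker/docker-compose-shell.sh both     # both phases
./docker/docker-compose-shell.sh          # both phases (default)
```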
Clone the repo:

```bash
git clone https://github.com/SteliosGian/churn-engineering.git
```

Run the script:

```bash
./docker/docker-compose-shell.sh
```

Make sure the script is executable:

```bash
chmod +x docker/docker-compose-shell.sh
```

or run it with bash directly:

```bash
bash docker/docker-compose-shell.sh
```
The project starts a local MLflow server in the background, which you can access at http://127.0.0.1:5000/. With MLflow, you can track custom metrics and hyperparameters, as well as log artifacts such as plots and models.
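As an illustration of this kind of tracking (a minimal sketch, not the repo's actual training code; the experiment, parameter, metric, and artifact names are hypothetical):

```python
import mlflow

# Point the client at the local MLflow server started by the project.
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("churn")  # hypothetical experiment name

with mlflow.start_run():
    # Track hyperparameters and custom metrics (illustrative names/values).
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", 0.85)
    # Log artifacts such as plots from local files (hypothetical path).
    mlflow.log_artifact("plots/confusion_matrix.png")
```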
Docker must be installed to run the project this way. Otherwise, the project can be executed by running the Python scripts (train.py/predict.py) individually.
Spark is not needed for this project because the amount of data is small. However, a small Spark pipeline written in Scala is available on the "spark" branch.
- Docker ☑
- Shell scripts ☑
- TravisCI ☑
- MLflow ☑
- Spark ☑
- API