The purpose of this example is to highlight the utility of Skafos, Metis Machine's data science operationalization and delivery platform. In this example, we will:
- Build and train a model predicting cell phone churn with data on a public S3 bucket
- Save this model to a private S3 bucket
- Score new customers using this model and save these scores
- Access these scores via an API and S3
The figure below provides a functional architecture for this process.
- Sign up for a Skafos account
- Install `skafos` on your machine
- Authenticate your account via the `skafos auth` command
- A working knowledge of how to use git
The source data for this example is available in a public S3 bucket provided by Metis Machine. In the steps below, we will describe how to access it. No code modifications are required to access the input data.
This data has been slightly modified from its source, which is freely available and can be found here or here.
In the following step-by-step guide, we will walk you through how to use the code in this repository to run a job on Skafos. Following completion of this tutorial, you should be able to:
- Run the existing code and access its output on S3.
- Replace the provided data and model with your own data and model.
- Fork the churn-model-demo from GitHub. This code is freely available as part of the Skafos organization. Note that the README is a copy of these instructions.
- Clone the forked repo to your machine, and add an upstream remote to connect to the original repo, if desired.
Each Skafos project needs its own project token and a unique `metis.config.yml` file. The example `metis.config.yml.example` provided in this repo is identical in structure to what you will need, but its project token and job ids are tied to another Skafos account and organization. Creating your own `metis.config.yml` file is simple and described below.
From the top level of this project's working directory, type `skafos init` on the command line. This will generate a new `metis.config.yml` file that is tied to your Skafos account and organization.
Open up this config file and edit the first job to match the example .yml file included in the repo. Specifically, set the following fields:

```
language: python
name: build-churn-model
entrypoint: build-churn-model.py
```

Note: Do not edit the project token or job_ids in the .yml file; otherwise, Skafos will not recognize and run your job.
In the example `metis.config.yml` file, you'll note that there are two jobs: one to build a model, and one to score new users. You will need to add a second job to your Skafos project via the following command on the command line:

```
skafos create job score-new-users --project <insert-your-project-token-here>
```

This will output a job_id on the command line. Copy this job id into your `metis.config.yml` file, again using the example yaml file as a template, and include the following:
```
language: python
name: score-new-users
entrypoint: score-new-users.py
dependencies: [<job-id for build-churn-model.py>]
```
This dependency will ensure that new users are not scored until the churn model has been built. If `build-churn-model.py` does not complete, then `score-new-users.py` will not run.
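Putting the two jobs together, the resulting `metis.config.yml` might look like the sketch below. This is only an illustration of the shape: follow the `metis.config.yml.example` included in the repo for the exact layout, and note that the token and job id values shown here are placeholders for the values generated for your own account.

```yaml
project_token: <your-project-token>   # generated by `skafos init`; do not edit
jobs:
  - job_id: <build-job-id>            # generated by `skafos init`; do not edit
    language: python
    name: build-churn-model
    entrypoint: build-churn-model.py
  - job_id: <score-job-id>            # from `skafos create job score-new-users ...`
    language: python
    name: score-new-users
    entrypoint: score-new-users.py
    dependencies: [<build-job-id>]    # run only after the model has been built
```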
Now that your `metis.config.yml` file has all the necessary components, add it to the repo, commit, and push.
In Steps 3 and 4 above, you initialized a Skafos project so you can run the cloned repo in Skafos. Now, you will need to add the Skafos app to your github repository.
To do this, navigate to the Settings page for your organization, click on Installed GitHub Apps to add the Skafos app to this repository. Alternatively, if this repo is not part of an organization, navigate to your Settings page, click on Applications, and install the Skafos app.
In `common/data.py`, the AWS information used to retrieve input data and store output models and data is provided. The input S3 bucket and file names do not need to be modified; however, you will need to update the location of the output models and scores in the code, as well as the specified keyspace.
To make these changes, do the following:
- Create a private S3 bucket to save your output models and scores. This bucket will replace the existing value for `S3_PRIVATE_BUCKET` in the code.
- Provide Skafos with your `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` via the command line. `skafos env AWS_ACCESS_KEY_ID --set <key>` and `skafos env AWS_SECRET_ACCESS_KEY --set <key>` will do this.
- Update the `KEYSPACE` to be the `project_token` that was generated with the `metis.config.yml` file.
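The edits above can be sketched as follows. The constant names `S3_PRIVATE_BUCKET` and `KEYSPACE` come from the repo's `common/data.py`, but the bucket name, token value, and the `output_key` helper below are hypothetical placeholders for illustration only:

```python
# Sketch of the values to change in common/data.py.
# The constant names match the repo; the values and the helper are
# hypothetical -- substitute your own bucket and project token.

# Output location: replace with the private S3 bucket you created.
S3_PRIVATE_BUCKET = "my-churn-demo-output"

# Replace with the project_token generated alongside metis.config.yml.
KEYSPACE = "your_project_token_here"

def output_key(artifact_name: str) -> str:
    """Build an S3 key for an output artifact, namespaced by keyspace.
    (Illustrative helper, not part of the repo.)"""
    return f"{KEYSPACE}/{artifact_name}"

print(output_key("churn_model.pkl"))
```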
In Step 7, you made several changes to `common/data.py`. These changes now need to be pushed to GitHub. Once pushed, the Skafos app will pick them up and run both the training and scoring jobs.
Navigate to dashboard.metismachine.io to monitor the status of the job you just pushed. Additional documentation about how to use the dashboard can be found here.
Once your job has completed, you can verify that the predictive model itself (in the form of a `.pkl` file) and the scored users (in a `.csv` file) are in the private S3 bucket you specified in Step 7.
In addition to the data that has been output to S3, this code uses the Skafos SDK to store scored users in a Cassandra database. Specifically, the `save_scores` function writes scored users to a table.
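The exact SDK call is specific to Skafos, but the shape of the data that a function like `save_scores` writes can be sketched as plain Python. The column names below are illustrative assumptions, not the repo's actual table schema:

```python
# Hypothetical sketch: shape scored users into row dicts before handing
# them to the Skafos SDK for storage. Column names are assumptions.

def shape_scores(customer_ids, churn_probs):
    """Pair each customer id with its churn probability as a row dict."""
    return [
        {"customer_id": cid, "churn_probability": round(float(p), 4)}
        for cid, p in zip(customer_ids, churn_probs)
    ]

rows = shape_scores(["c-001", "c-002"], [0.87, 0.12])
# Each row dict would then be written to the scores table by save_scores.
```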
The scored users in Cassandra can be easily accessed via an API call. From the root project directory, type `skafos fetch --table model_scores` on the command line. This will return both a list of scores and a cURL command that can be incorporated into applications in the usual fashion to retrieve this data.
Now that you have successfully built a predictive model on Skafos and scored new data, you can adapt this code to build your own models.