model-pipeline

A template repo for running/deploying machine learning models

Primary language: Python · License: MIT

Assignment

The answers to the questions are included in this README. While working on the workflow, I have tried to keep things simple. The overall design of the architecture will depend on the existing tech stack and the complexity of the system. If there are multiple models that are interdependent and run on different schedules, it would make sense to use an orchestration tool and a model metadata storage service.

In the repo, I have also included a CI/CD pipeline setup. The pipeline is triggered on every push to the main branch: it builds a Docker image, pushes it to the image repository, and then deploys the application service to the Kubernetes cluster. The code included is for a very simple app, but it can be adapted depending on scale and complexity. For example, you may have several environments such as testing, staging and production. To deploy the service to these environments, we just need to add an extra job step in the deploy.yaml file. It can even be configured so that code merged into a release/* branch is deployed to the staging environment, code merged into the develop branch is deployed to the test environment, and so on.

  • Create a new repo with all the bells & whistles to handle modern ML model development, deployment and serving. You can use any cookiecutter template or build your own.
    • I have created a cookiecutter template, available here
  • Decorator for logging and time profiling
    • The decorators are available at ./utils/decorators.py
    • There are two decorators for logging purposes (a minimal sketch of what they might look like follows below):
      • log_info(): logs the output of the function
      • log_time(): logs the time taken by the function
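
For illustration, a minimal sketch of the two decorators, assuming Python's standard logging module (the actual implementations live in ./utils/decorators.py and may differ):

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def log_info(func):
    """Log the return value of the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        logger.info("%s returned: %r", func.__name__, result)
        return result
    return wrapper


def log_time(func):
    """Log the wall-clock time taken by the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        logger.info("%s took %.3f s", func.__name__, time.perf_counter() - start)
        return result
    return wrapper
```
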
  • Shell script to identify all the added notebook files, run them, export them as HTMLs, and then strip them of their output
    • The shell script is available at ./scripts/notebook_to_html.bash
  • Pre-commit hook that will run the above bash script
    • The instructions for setting it up as a pre-commit hook are mentioned within the file; a rough Python sketch of the same notebook steps is included below for reference.
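
For illustration only, a rough Python equivalent of what the bash script does, assuming nbformat and nbconvert are available (the repo itself uses the bash script above):

```python
import subprocess

import nbformat
from nbconvert import HTMLExporter
from nbconvert.preprocessors import ClearOutputPreprocessor, ExecutePreprocessor

# Notebooks newly added in the current commit (staged files with status "A").
added = subprocess.run(
    ["git", "diff", "--cached", "--name-only", "--diff-filter=A"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

for path in (p for p in added if p.endswith(".ipynb")):
    nb = nbformat.read(path, as_version=4)

    # Run the notebook, then export the executed version as HTML.
    ExecutePreprocessor(timeout=600).preprocess(nb, {"metadata": {"path": "."}})
    html, _ = HTMLExporter().from_notebook_node(nb)
    with open(path.replace(".ipynb", ".html"), "w") as f:
        f.write(html)

    # Strip outputs so the committed notebook stays clean.
    ClearOutputPreprocessor().preprocess(nb, {})
    nbformat.write(nb, path)
```
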
  • Implement workflow for both the models' batch training. The workflow should be runnable both on local machine and cloud dev. and prod environments with a simple config change
    • I have created a script file ./scripts/run_workflow.bash
    • The script takes two input parameters, the environment and the model name, and runs the pipeline accordingly (a minimal illustration of the per-environment config idea follows below).
    • This script is triggered by a GitHub Actions workflow on the required schedules, daily and every two hours. The respective code is in ./.github/workflows/batch_prediction.yml
    • To keep things simple, I have used GitHub Actions, but in a larger system with more variability we should go with an orchestration tool like Airflow.
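
To illustrate the "simple config change" idea in Python (paths and names here are hypothetical; the repo's actual logic lives in the bash script and workflow files):

```python
import argparse

# Hypothetical per-environment settings; real values would come from the repo's config.
CONFIGS = {
    "local": {"data_path": "./data", "model_registry": "./models"},
    "dev": {"data_path": "s3://dev-bucket/data", "model_registry": "s3://dev-bucket/models"},
    "prod": {"data_path": "s3://prod-bucket/data", "model_registry": "s3://prod-bucket/models"},
}


def main() -> None:
    parser = argparse.ArgumentParser(description="Run the batch training workflow.")
    parser.add_argument("--env", choices=sorted(CONFIGS), default="local")
    parser.add_argument("--model", required=True, help="Name of the model to train.")
    args = parser.parse_args()

    config = CONFIGS[args.env]
    print(f"Training {args.model} in {args.env} with data from {config['data_path']}")
    # train(args.model, config)  # the actual training steps would be called here


if __name__ == "__main__":
    main()
```
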
  • Write test cases or data contracts for data output of every step in the workflow and include them in the workflow
    • The tests are written in ./tests/test_shapes.py. The tests are simple and straightforward, inspired by the cleaning files shared.
    • They check whether the data output of each step has the expected shape.
    • These tests also run as part of the pipeline, after the feature engineering job.
    • I have also created a decorator to simplify these checks; it can be used in the normal workflow without writing assert statements in every function (a sketch of the idea follows below).
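
A sketch of that decorator idea, with hypothetical names (the actual checks live in ./tests/test_shapes.py and ./utils/decorators.py):

```python
import functools

import pandas as pd


def expect_shape(n_cols: int, min_rows: int = 1):
    """Data contract: the decorated step must return a DataFrame with the expected shape."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            df = func(*args, **kwargs)
            assert isinstance(df, pd.DataFrame), f"{func.__name__} must return a DataFrame"
            assert df.shape[1] == n_cols, (
                f"{func.__name__}: expected {n_cols} columns, got {df.shape[1]}"
            )
            assert len(df) >= min_rows, f"{func.__name__}: expected at least {min_rows} rows"
            return df
        return wrapper
    return decorator


@expect_shape(n_cols=3)
def build_features(raw: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical feature engineering step.
    return raw[["user_id", "feature_a", "feature_b"]]
```
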
  • Draw architecture diagram of how you can serve these models via an API as described in the target workflow above. Include the stream processing pipeline part
    • The architecture diagram is placed in ./docs/prediction-architecture.drawio.png
  • How will you prepare features in real-time given just the user id?
    • In the architecture diagram shared, I have chosen the following approach (a sketch of the feature-generation lambda follows this list):
      • The events generated by the application are streamed via Kafka or Kinesis and stored as raw values in the data warehouse.
      • The generated events trigger a lambda function for feature generation.
      • The generated features are then stored in the feature store, where they can be reused for batch predictions.
      • The generated features then trigger another lambda function, which generates the predictions.
      • The lambdas can be configured to trigger on each event or at an interval of a few minutes.
      • These predictions are then stored in a database. The database could be a key/value store like Redis/DynamoDB, or a MySQL database.
      • When the application queries for predictions, the prediction service looks up the predictions for the given client id in the prediction database.
      • If there are no predictions in the database, the prediction service falls back to the batch predictions; if those are not available either, it falls back to a baseline prediction.
      • This approach prioritizes latency over cost: storing all the predictions in a key/value store can be expensive, but it ensures that predictions are served quickly.
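
A minimal sketch of the feature-generation lambda, assuming an SQS-style event and a DynamoDB feature table (names are hypothetical):

```python
import json

import boto3

dynamodb = boto3.resource("dynamodb")
feature_table = dynamodb.Table("user-features")  # hypothetical feature store table


def handler(event, context):
    """Compute features for each incoming user event and write them to the feature store."""
    for record in event.get("Records", []):
        payload = json.loads(record["body"])  # assuming an SQS-style record body
        user_id = payload["user_id"]

        # Hypothetical feature computation from the raw event.
        features = {
            "user_id": user_id,
            "event_count_1h": payload.get("event_count_1h", 0),
            "last_event_type": payload.get("event_type", "unknown"),
        }

        # Stored in the feature store so batch predictions can reuse the same features.
        feature_table.put_item(Item=features)

    return {"statusCode": 200}
```
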
  • Mention tools you would use and the reasons for your choice
    • The tools used are:
      • Kafka/Kinesis/SQS: for event streaming and processing.
      • Lambda functions: for computing features and predictions in real time.
      • Feature store: for storing the features. These can be stored in AWS Athena or a database, but databases get expensive over time.
      • Database: for storing the predictions. If we want to prioritize latency, we can store the predictions in a cache service like Redis; otherwise databases like MySQL and PostgreSQL also offer decent query times.
      • Prediction service: for serving predictions I have chosen the following frameworks/tools (a minimal sketch of the endpoint follows at the end of this list):
        • FastAPI: for developing and running the API server. It is asynchronous and provides a good developer experience with type hints and auto-generated documentation.
        • Kubernetes cluster: the API server will be deployed to a Kubernetes cluster and run on pods that can be scaled up and down depending on load.
        • Sumo Logic/CloudWatch: for monitoring the API servers, the logs will be streamed to a log analytics service like CloudWatch. Through these log analytics services, we can build a real-time monitoring and alerting system that notifies on any failures.
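
A minimal sketch of the prediction endpoint with the fallback chain described above, assuming a Redis cache and hypothetical key names:

```python
import json

import redis
from fastapi import FastAPI

app = FastAPI(title="Prediction Service")
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

BASELINE_PREDICTION = {"score": 0.5, "source": "baseline"}  # assumed default


@app.get("/predictions/{client_id}")
def get_prediction(client_id: str) -> dict:
    """Return real-time predictions if present, else batch predictions, else a baseline."""
    for source in ("realtime", "batch"):
        value = cache.get(f"prediction:{source}:{client_id}")  # hypothetical key scheme
        if value is not None:
            return {"client_id": client_id, "source": source, "prediction": json.loads(value)}
    return {"client_id": client_id, **BASELINE_PREDICTION}
```
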