
In this project, I implemented, evaluated, operated, monitored, and evolved a recommendation service for a movie streaming scenario.

Primary Language: Python



Overall architecture

The following diagram shows the architecture of our deployed movie recommendation system.

⚙️ Software & Tools

Continuous Integration

1. A pipeline for movie recommendation


  • Data storage

    • Apache Kafka is a distributed event store and stream-processing platform
    • Collect three kinds of records from the Kafka log
      • Data (movies watched by a user) --> for model (re)training and for online evaluation
      • Rate (ratings given by a user) --> for model (re)training and for online evaluation
      • Request --> for online evaluation
    • This pipeline, once run, continues to run until it is intentionally stopped.
    • After online evaluation, expired data is automatically deleted.
  • Data preprocessing

    • Preprocess the stored raw data
    • Generate a compressed sparse row (CSR) matrix
    • Split it into train/validation sets
  • Model (re)training

    Matrix Factorization (MF)

    • SVD

    • SVD++

  • Offline evaluation

    • RMSE (root mean squared error) as the metric for offline evaluation
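The preprocessing and offline evaluation steps above can be sketched in plain Python. This is a minimal illustration, not the project's code: the function names (`build_csr`, `split_train_validation`, `rmse`), the `(user, movie, rating)` triple layout, and the 20% validation fraction are all assumptions, and the split is done on the rating triples for simplicity. A real pipeline would more likely rely on a library such as SciPy for the sparse matrix.

```python
import math
import random


def build_csr(ratings, n_users, n_movies):
    """Build CSR arrays (data, indices, indptr) from (user, movie, rating) triples.

    Row i of the matrix holds the ratings of user i; its non-zero entries are
    data[indptr[i]:indptr[i + 1]], located in columns indices[indptr[i]:indptr[i + 1]].
    """
    rows = [[] for _ in range(n_users)]
    for user, movie, rating in ratings:
        if not (0 <= user < n_users and 0 <= movie < n_movies):
            raise ValueError(f"index out of range: user={user}, movie={movie}")
        rows[user].append((movie, float(rating)))

    data, indices, indptr = [], [], [0]
    for row in rows:
        for movie, rating in sorted(row):  # keep column indices ordered per row
            indices.append(movie)
            data.append(rating)
        indptr.append(len(data))
    return data, indices, indptr


def split_train_validation(ratings, validation_fraction=0.2, seed=0):
    """Shuffle the rating triples and split them into (train, validation)."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def rmse(predicted, actual):
    """Root mean squared error between two equally long rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))


# Hypothetical sample: user 0 rated movies 1 and 0, user 2 rated movie 1.
sample = [(0, 1, 4), (0, 0, 5), (2, 1, 3)]
data, indices, indptr = build_csr(sample, n_users=3, n_movies=2)
```

For the sample above, user 1 has no ratings, so `indptr` repeats the same offset for that row, which is how CSR represents an empty row.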

2. Code integrity checks with unit tests

  • The unit tests are integrated into the Jenkins pipeline and run automatically.
  • The results are available as a coverage report on Jenkins.
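As an illustration of the kind of unit test such a pipeline could run, here is a minimal sketch using Python's standard `unittest` module. The `rmse` helper, the test names, and `run_tests` are hypothetical and stand in for the project's real code; how the coverage report is collected on Jenkins is not described here, so that part is left out.

```python
import math
import unittest


def rmse(predicted, actual):
    """Hypothetical helper under test: root mean squared error."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))


class RmseTest(unittest.TestCase):
    def test_perfect_prediction_is_zero(self):
        self.assertEqual(rmse([3.0, 4.0], [3.0, 4.0]), 0.0)

    def test_known_value(self):
        # Errors are 1 and 2, so the mean squared error is (1 + 4) / 2 = 2.5.
        self.assertAlmostEqual(rmse([1.0, 3.0], [2.0, 5.0]), math.sqrt(2.5))

    def test_empty_input_raises(self):
        with self.assertRaises(ValueError):
            rmse([], [])


def run_tests():
    """Run the test case programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RmseTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

A CI step could call `run_tests()` and fail the build when `wasSuccessful()` returns `False`.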

3. Automatic integration pipeline with Jenkins

  • Continuous integration
    • Jenkins
      • Unit test 1 to 5 --> model management & offline evaluation (model) --> online evaluation
    • Using Blue Ocean plugin
      • Provides a more visual pipeline dashboard than the classic Jenkins UI
      • A commit to the master branch on GitHub --> the entire pipeline runs automatically
      • Saving the pipeline after building it --> the pipeline's Jenkinsfile is committed to the master branch on GitHub
    • Using freestyle project
      • Runs automatically at a fixed interval
      • Configured with the "Build periodically" option

Continuous Deployment

1. Containerization with Rancher

  • Rancher
    • A complete container management platform that includes everything necessary for container management during the production process
  • Deployment components
    • Our system manages two recommendation models as separate deployments in one cluster
    • Each deployment consists of two pods, one a replica of the other, so that tasks are distributed between them

2. Automatic Continuous Deployment with Jenkins

  • Automatic Continuous Deployment with Jenkins
    • Extending our integration pipeline to model deployment
    • We use Jenkins to send the deployment signal to Rancher
    • Whenever a commit is pushed to GitHub, the pipeline is executed:
      • Continuous Integration: data fetching, data preprocessing, model retraining
      • Continuous Deployment: build Docker images, push the images to the Docker repository
      • Model deployment: pull the Docker images for the retrained models and redeploy them through Rancher

  • Zero downtime for model redeployment

    - The new deployment also has two pods, one a replica of the other
    - After one new pod is deployed, one existing pod is terminated
    - After the second new pod is deployed, the remaining existing pod is also terminated --> zero downtime while deploying the retrained models
  • All of these steps are managed by the Rancher platform
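The replacement order described above can be modeled with a small simulation. This is only a sketch of the sequence, not of how Rancher implements it: the function name `rolling_update` and the pod names are hypothetical. It shows why the number of serving pods never drops below two.

```python
def rolling_update(old_pods, new_pods):
    """Replace old pods one at a time, always starting a new pod first.

    Returns a list of snapshots, each a tuple of the pods serving at that step.
    """
    serving = list(old_pods)
    remaining_old = list(old_pods)
    history = [tuple(serving)]
    for new_pod in new_pods:
        serving.append(new_pod)  # bring the new pod up before removing anything
        history.append(tuple(serving))
        if remaining_old:
            serving.remove(remaining_old.pop(0))  # then terminate one old pod
            history.append(tuple(serving))
    return history


# Hypothetical pod names for one model's deployment.
history = rolling_update(["model-v1-a", "model-v1-b"], ["model-v2-a", "model-v2-b"])
minimum_serving = min(len(snapshot) for snapshot in history)
```

Because a new pod is always added before an old one is removed, `minimum_serving` stays at the original replica count for the whole update.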

3. Monitoring

  • Monitoring infrastructure
    • Prometheus, Grafana and Node Exporter to monitor our infrastructure
      • Memory usage
      • CPU usage
      • Request latency in Flask
      • Model quality

- Alerts are sent to our Slack #alert channel
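How the alerts are generated is not described here, so the following is only a sketch of one possible approach: compare a few metrics against thresholds and post a message to a Slack incoming webhook. The metric names, the threshold values, and the helper names are all assumptions; the only Slack-specific detail used is that an incoming webhook accepts a JSON body with a `text` field.

```python
import json
import urllib.request

# Hypothetical thresholds; the real alert rules are not given in this README.
THRESHOLDS = {
    "memory_usage_percent": 90.0,
    "cpu_usage_percent": 85.0,
    "flask_latency_seconds": 0.5,
}


def find_breaches(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics whose value exceeds its threshold."""
    return sorted(
        name for name, limit in thresholds.items() if metrics.get(name, 0.0) > limit
    )


def build_slack_payload(breaches, metrics):
    """Build the JSON-serializable body for a Slack incoming webhook."""
    lines = [f"{name} = {metrics[name]}" for name in breaches]
    return {"text": "Alert: thresholds exceeded\n" + "\n".join(lines)}


def send_alert(webhook_url, payload):
    """POST the payload to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Hypothetical metric snapshot; send_alert is not called here because it
# needs a real webhook URL.
metrics = {"memory_usage_percent": 95.0, "cpu_usage_percent": 40.0, "flask_latency_seconds": 0.8}
breaches = find_breaches(metrics)
payload = build_slack_payload(breaches, metrics)
```

Keeping the threshold check and the payload construction separate from the network call makes both easy to unit test without a Slack workspace.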

4. Versioning and tracking provenance

  • Provenance
    • DVC
      • An open-source version control system for data and models
      • DVC stores metadata about the dataset and the model in .dvc files
    • Process
      • Track modifications --> add the changes to Git --> push a Git tag
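The three-step process above could map to commands roughly as follows. This sketch only builds the command lists and offers an optional runner; the artifact path, commit message, tag name, and the remote name `origin` are assumptions, and the project's actual commands may differ.

```python
import subprocess
from pathlib import PurePosixPath


def versioning_commands(artifact_path, tag, message):
    """Return the commands for: track with DVC, add the changes to Git, push a tag."""
    # `dvc add` writes <artifact>.dvc and updates a .gitignore next to the artifact.
    dvc_file = artifact_path + ".dvc"
    gitignore = str(PurePosixPath(artifact_path).parent / ".gitignore")
    return [
        ["dvc", "add", artifact_path],              # track modification
        ["git", "add", dvc_file, gitignore],        # add changes to Git
        ["git", "commit", "-m", message],
        ["git", "tag", "-a", tag, "-m", message],
        ["git", "push", "origin", tag],             # push the Git tag
    ]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Hypothetical artifact and tag names.
commands = versioning_commands("data/ratings.csv", "v1.0", "Retrain on new ratings")
```

Calling `run_commands(commands)` would execute the steps inside a repository where both Git and DVC are already initialized.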


  • Collect data from Kafka Streaming and data preprocessing for movie recommendation model training
  • Deploy and measure a model inference service
  • Build and operate infrastructure
    • A continuous integration infrastructure for evaluating a model in production
    • A monitoring infrastructure for system health and model quality
    • A continuous deployment infrastructure for automatic periodic retraining and versioning
  • Design and implement a monitoring strategy to detect possible issues in ML systems