MLOps Optimization
This repo serves as a test bed for improving ML serving latency. Massive thanks to Comet ML, which this is based on.
Version 1 - Basic Attempt
Version 1 is a first attempt at building a model: using BERT to make a prediction on some sample text. This is what an initial attempt at creating a Dockerfile looks like.
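As a rough illustration only (not the repo's actual file; the script name and dependency list are assumptions), a naive Dockerfile for this kind of setup might look like:

```dockerfile
# Naive approach: full python base image, default (CUDA-enabled) torch wheel,
# no version pinning, model downloaded from the Hugging Face Hub at runtime
FROM python:3.9

WORKDIR /app

RUN pip install torch transformers

COPY app.py .

# app.py is assumed to load a BERT pipeline and run a prediction on sample text
CMD ["python", "app.py"]
```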
Docker commands:

```sh
docker build -t v1 version_1/
docker image ls
docker system df -v
docker container ls -s
```
We can see that the image is 3.78GB, NOT including the model. After starting the container, that grows to 4.22GB, since BERT still has to be downloaded from transformers at runtime.
Version 2 - Better Software Practices
Version 2 involves switching to the lightweight CPU-only build of torch, as well as a slimmer Python base image. This brings the image size down to 1.12GB, and the size after container start down to 1.56GB. This should be considered the bare minimum that needs to be done; dependency versions are pinned here as well.
The same docker commands from before are used, just switched out to use v2.
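A minimal sketch of the V2-style changes (the exact versions and file names here are illustrative assumptions, not necessarily the repo's):

```dockerfile
# Slim base image instead of the full python image
FROM python:3.9-slim

WORKDIR /app

# CPU-only torch wheel from the PyTorch index, with pinned versions
RUN pip install --no-cache-dir \
        torch==1.13.1+cpu --extra-index-url https://download.pytorch.org/whl/cpu && \
    pip install --no-cache-dir transformers==4.26.0

COPY app.py .

CMD ["python", "app.py"]
```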
Version 3 - Distillation and ONNX
Version 3 introduces further enhancements by switching to a distilled model, in this case DistilBERT, which is roughly 40% smaller than BERT. In addition, we can use ONNX to help reduce model size. The image itself actually becomes larger, at 1.72GB, because we now have both the original DistilBERT model and the ONNX version.
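For illustration, the conversion step could look something like the fragment below; the exact checkpoint, the tool used (Hugging Face Optimum here), and whether it runs at build time or on container start are assumptions rather than the repo's actual setup.

```dockerfile
# Hypothetical ONNX export step for DistilBERT using Hugging Face Optimum
RUN pip install --no-cache-dir "optimum[onnxruntime]" && \
    optimum-cli export onnx --model distilbert-base-uncased-finetuned-sst-2-english onnx/
```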
Version 4 - Multistage Docker Builds
In this version, the Docker image shrinks further via a multi-stage build that copies over only the artifacts actually needed (the tokenizer is also copied in locally, to save time on container start). Now we're down to 1.25GB for the image, and the size stays the same after container start.
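A sketch of what a multi-stage build in this spirit can look like; the stage layout, paths, and exported artifacts are assumptions, not the repo's exact Dockerfile:

```dockerfile
# Builder stage: install everything needed to export the model, then throw the stage away
FROM python:3.9-slim AS builder
WORKDIR /build
RUN pip install --no-cache-dir \
        torch==1.13.1+cpu --extra-index-url https://download.pytorch.org/whl/cpu \
        transformers==4.26.0 "optimum[onnxruntime]"
# Export the model and tokenizer here so nothing needs to be downloaded at container start
RUN optimum-cli export onnx --model distilbert-base-uncased-finetuned-sst-2-english /build/model

# Final stage: copy only the installed packages and exported artifacts; the builder's
# Hugging Face cache (the original PyTorch weights) gets left behind
FROM python:3.9-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.9/site-packages /usr/local/lib/python3.9/site-packages
COPY --from=builder /build/model /app/model
COPY app.py .
CMD ["python", "app.py"]
```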
| Version | Image Size | Container Size |
|---|---|---|
| V1* | 3.783GB | 4.22GB |
| V2* | 1.051GB | 1.61GB |
| V3 | 1.602GB | 1.72GB |
| V4 | 1.125GB | 1.125GB |
*Note: for V1 and V2 the model isn't baked into the image (it's downloaded at container start time), which is why you see the gap between image and container size.
TODOs
- Quantization
- Switching from Python to Go for serving
- Note - something that could potentially help shrink the size further is to go into the installed torch package and prune its tests. Yes, prune tests - it turns out a good chunk of the PyTorch size comes from test modules! Obviously very dangerous, but we could shrink the on-disk torch download size by ~30%. We would need some place to zip up and package the environment manually, rather than downloading everything via `pip` in the Dockerfile. A hypothetical sketch of the idea follows this list.
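Purely as a hypothetical illustration of that last point (the paths, package layout, and actual savings are unverified assumptions, and deleting files from an installed package can easily break it):

```dockerfile
# Risky and illustrative only: strip test directories out of the installed torch package
RUN find /usr/local/lib/python3.9/site-packages/torch \
        -type d \( -name "test" -o -name "tests" \) -prune -exec rm -rf {} +
```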