This repository contains a complete ecosystem for training and deploying a model that scores the readability of a text.
It consists of five services (in order of execution):
- train (where fine-tuning happens)
- optimization (optimizations are applied and the model is converted to ONNX format)
- inference (deploying the model to a Triton server)
- api (exposing the endpoint for making POST requests)
- load_testing (sending ~13k requests to measure inference performance)
To run everything you just need to run docker compose up. All services will be started and stopped in the correct order. At the very end only the api
service will remain running. If you don't need some of the services, just comment out the corresponding lines in docker-compose.yml.
Before you run it, you'll need to place train.csv.zip
(from https://www.kaggle.com/c/commonlitreadabilityprize/overview) into data/input/.
Training (train)
BertForSequenceClassification
was chosen for fine-tuning. After 3 epochs of fine-tuning it was able to achieve a validation RMSE of about 0.655.
The training script is here.
The local notebook is here, the interactive Kaggle version is here.
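For orientation, here is a minimal sketch of such a fine-tuning setup (file paths, hyperparameters, and the bert-base-uncased checkpoint are assumptions; the linked script is the authoritative version):

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

# CommonLit data: "excerpt" holds the text, "target" the readability score.
df = pd.read_csv("data/input/train.csv.zip")
dataset = Dataset.from_pandas(df[["excerpt", "target"]].rename(columns={"target": "labels"}))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = dataset.map(
    lambda batch: tokenizer(batch["excerpt"], truncation=True, padding="max_length", max_length=256),
    batched=True,
)
dataset = dataset.train_test_split(test_size=0.2)

# num_labels=1 turns the classification head into a regression head (MSE loss).
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)

args = TrainingArguments(
    output_dir="data/output/model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
)
Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
).train()
```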
After three epochs the metrics look like this:
| Epoch | Training Loss | Validation Loss | RMSE     |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.673124        | 0.678524 |
| 2     | No log        | 0.683692        | 0.690056 |
| 3     | No log        | 0.649549        | 0.655264 |
Optimization (optimization)
To optimize the model a bit and to deploy it to the Triton server, we convert it to ONNX format using the huggingface optimum
package. The corresponding Dockerfile is here.
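A minimal sketch of what the export could look like with optimum (the export=True flag, checkpoint path, and output path are assumptions; the Dockerfile defines the real steps):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Load the fine-tuned checkpoint and export it to ONNX in one call.
model = ORTModelForSequenceClassification.from_pretrained("data/output/model", export=True)
tokenizer = AutoTokenizer.from_pretrained("data/output/model")

# Save model.onnx plus the tokenizer where the Triton model repository expects them.
model.save_pretrained("data/output/onnx")
tokenizer.save_pretrained("data/output/onnx")
```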
Inference (inference)
To make model serving a bit more reliable and to get access to various metrics (Prometheus), we'll deploy our model to Triton Inference Server - config
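As an illustration, querying the Triton HTTP endpoint from Python could look roughly like this (the model name, port, and tensor names are assumptions; the real ones are defined in the config above):

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Toy token ids; in practice these come from the BERT tokenizer.
input_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int64)
attention_mask = np.ones_like(input_ids)

inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "INT64"),
    httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

result = client.infer(model_name="readability", inputs=inputs)
print(result.as_numpy("logits"))  # regression output: the readability score
```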
API (api)
To expose our model to the external world we'll create a FastAPI application. To get a readability score, make a POST request to the /predict
endpoint with a JSON body containing a text field.
Curl example:

```bash
curl -X POST http://localhost:5000/predict \
  -H 'Content-Type: application/json' \
  -d '{"text":"Some text to score"}'
```
Response example:

```json
["Some text to score",-0.06885720044374466]
```
Load testing (load_testing)
This service takes the training dataset and sends requests with texts from it for 2 minutes using the locust
library.
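A minimal locustfile for this kind of test might look as follows (the host, dataset path, and wait times are assumptions):

```python
import random

import pandas as pd
from locust import HttpUser, between, task

# Reuse the training texts as payloads for the load test.
texts = pd.read_csv("data/input/train.csv.zip")["excerpt"].tolist()

class ReadabilityUser(HttpUser):
    host = "http://localhost:5000"
    wait_time = between(0.01, 0.1)

    @task
    def predict(self):
        self.client.post("/predict", json={"text": random.choice(texts)})
```

It could then be run for two minutes with something like `locust -f locustfile.py --headless -u 100 -r 10 --run-time 2m` (the user count and spawn rate here are placeholders).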
Here are the results of running inference on an RTX 3060. As we can see, it is possible to achieve ~110 requests per second (vs. ~2 rps on an i7-3770).
| Type | Name       | # reqs | # fails   | Avg (ms) | Min (ms) | Max (ms) | Med (ms) | req/s  | failures/s |
|------|------------|--------|-----------|----------|----------|----------|----------|--------|------------|
| POST | /predict   | 13120  | 0 (0.00%) | 1658     | 72       | 3685     | 1700     | 109.88 | 0.00       |
|      | Aggregated | 13120  | 0 (0.00%) | 1658     | 72       | 3685     | 1700     | 109.88 | 0.00       |

Response time percentiles (approximated), in ms:

| Type | Name       | 50%  | 66%  | 75%  | 80%  | 90%  | 95%  | 98%  | 99%  | 99.9% | 99.99% | 100% | # reqs |
|------|------------|------|------|------|------|------|------|------|------|-------|--------|------|--------|
| POST | /predict   | 1700 | 2000 | 2100 | 2100 | 2700 | 2800 | 2800 | 2800 | 3100  | 3700   | 3700 | 13120  |
|      | Aggregated | 1700 | 2000 | 2100 | 2100 | 2700 | 2800 | 2800 | 2800 | 3100  | 3700   | 3700 | 13120  |