This repository demonstrates a minimal example of how to deploy a fine-tuned BERT model using FastAPI.
Clone the repository via

    git clone https://github.com/jaketae/fastapi-bert.git
Using conda, create a new virtual environment and install the dependencies specified in spec-file.txt via

    conda create --name myenv --file spec-file.txt

You can activate the environment at any time in the terminal via

    conda activate myenv
A total of three BERT and BERT-variant models were fine-tuned on the GLUE CoLA dataset. The dataset contains sentences, each labeled according to whether it is grammatically acceptable. The task is therefore a classic binary classification problem. Below is an example entry from the dataset, loaded through HuggingFace Datasets.
    {
        'idx': 0,
        'label': 1,
        'sentence': "Our friends won't buy this analysis, let alone the next one we propose."
    }
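For reference, the dataset can be loaded in a couple of lines. The snippet below is a minimal sketch assuming the standard `glue`/`cola` configuration on HuggingFace Datasets; it is not tied to the notebook's exact code.

```python
# Minimal sketch: load GLUE CoLA through HuggingFace Datasets.
from datasets import load_dataset

dataset = load_dataset("glue", "cola")
print(dataset["train"][0])
# {'idx': 0, 'label': 1, 'sentence': "Our friends won't buy this analysis, ..."}
```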
For training and validation, we use the HuggingFace transformers Trainer API to expedite prototyping. Since a secondary goal of this experiment is to see how different models compare under minimally fine-tuned conditions, we do not delve into hyperparameter search.
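A rough sketch of this kind of Trainer setup is shown below; the checkpoint name, output paths, and hyperparameters here are placeholders, and the arguments actually used are recorded in experiment.ipynb.

```python
# Rough sketch of a Trainer-based fine-tuning loop on CoLA. The checkpoint name,
# output_dir, and hyperparameters below are placeholders; see experiment.ipynb
# for the arguments actually used.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import matthews_corrcoef
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # or distilbert-base-uncased, distilroberta-base
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("glue", "cola")
encoded = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True), batched=True
)

def compute_metrics(eval_pred):
    # Matthews correlation, the metric reported in the results table below
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"matthews_correlation": matthews_corrcoef(labels, predictions)}

training_args = TrainingArguments(
    output_dir="checkpoints",  # placeholder path
    num_train_epochs=5,
    evaluation_strategy="epoch",
    seed=42,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.save_model("model")  # weights later loaded by the FastAPI app
```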
The experiment can be reproduced by simply running all the cells of the Jupyter notebook. Note that without running the experiment, the FastAPI web application cannot be spun up, since there will be no saved model weights for it to load.
Below is a summary of the results of the experiment, seeded at 42. The specific training arguments can be found in experiment.ipynb, the Colab notebook in which the experiment was conducted. Matthews correlation was used as the performance metric.
| | bert-base-uncased | distilbert-base-uncased | distilroberta-base |
|---|---|---|---|
| Epoch 1 | 0.521 | 0.449 | 0.361 |
| Epoch 2 | 0.535 | 0.453 | 0.466 |
| Epoch 3 | 0.572 | 0.497 | 0.527 |
| Epoch 4 | 0.555 | 0.510 | 0.551 |
| Epoch 5 | 0.557 | 0.483 | 0.536 |
The best model, bert-base-uncased at the third epoch, was saved so that it can be loaded by the FastAPI app.
`main.py` contains the code for spinning up the main FastAPI process; `backend.py` demonstrates how the trained model is loaded into memory and run for inference. To optimize serving, we perform basic quantization by converting the `nn.Linear` layers in the BERT model to `torch.qint8`.
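The snippet below sketches what this backend-side loading and quantization might look like; the `model` directory, the `predict` helper, and the choice of which softmax index to return are assumptions rather than the repository's exact code.

```python
# Sketch of backend-style model loading and inference. The "model" directory,
# the predict() helper, and the returned label index are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model")
model = AutoModelForSequenceClassification.from_pretrained("model")
model.eval()

# Dynamic quantization: convert nn.Linear layers to torch.qint8.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def predict(passage: str) -> float:
    # Probability that the passage is grammatically acceptable (label 1 in CoLA).
    inputs = tokenizer(passage, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = quantized_model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```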
Activate the appropriate conda environment, `cd` into the repository directory, then run the FastAPI app by typing

    uvicorn main:app --reload
As the purpose of this project is to demonstrate minimally functional serving, the app does not have a user-facing frontend. Instead, model inference can be run by sending a request to the app with the `curl` command, as follows:

    curl -H "Content-type: application/json" -X POST -d '{"passage":"I are a boy"}' http://localhost:8000/generate
The following is a sample response.

    {"prob": 0.004417025949805975}
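For context, a route that produces this kind of response could look roughly like the following; the Pydantic model name and the import from `backend.py` are hypothetical, with only the /generate path and the passage field taken from the curl example above.

```python
# Rough sketch of a main.py-style app exposing the /generate endpoint.
# The PassageRequest model and the backend.predict import are hypothetical.
from fastapi import FastAPI
from pydantic import BaseModel

from backend import predict  # assumed inference helper in backend.py

app = FastAPI()

class PassageRequest(BaseModel):
    passage: str

@app.post("/generate")
def generate(request: PassageRequest):
    return {"prob": predict(request.passage)}
```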