This project is a FastAPI server that provides a chat completion endpoint using a pre-trained language model. It accepts user messages and returns generated responses based on the input. This README file provides information on how to set up, run, and test the server.
- API Endpoint: `/chat/completions`
- Method: POST
- Request Payload: JSON object containing the model name and a list of messages.
- Response: JSON object with the model's response.
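
For reference, a request might look like the following minimal Python sketch. The field names (`model`, `messages`, `role`, `content`) are assumptions inferred from the sample output shown later in this README; check `server.py` for the authoritative request schema.

```python
# Minimal client sketch; the payload field names are assumptions,
# inferred from the sample output later in this README.
import requests

payload = {
    "model": "kanak8278/gemma-2b-oasst2-01",  # assumed model identifier
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
}
response = requests.post("http://localhost:8000/chat/completions", json=payload)
print(response.json())
```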
- Python 3.10 or higher
- Poetry for dependency management
```bash
git clone https://github.com/your-repo/fastapi-chat-completion.git
cd fastapi-chat-completion
```
If you haven't installed Poetry, you can do so by following the official installation guide.
Install the project dependencies using Poetry:
```bash
poetry install
```
Set all the environment variables in the `.env` file. You can use the `.env.template` file as a template:

```bash
cp .env.template .env
```

Note: find the `HF_TOKEN` in your Hugging Face account and set it in the `.env` file.
You can load these variables into your shell using the following commands:

```bash
set -a
source .env
set +a
```
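
Alternatively, if you prefer loading the variables from Python rather than the shell, a sketch using `python-dotenv` (this package may not be a project dependency; you would add it with `poetry add python-dotenv`):

```python
# Sketch: load .env from Python instead of the shell.
# Assumes the python-dotenv package is available.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
print("HF_TOKEN set:", "HF_TOKEN" in os.environ)
```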
You need to download the pre-trained model used in the project. Ensure you have internet access, as this step downloads the model weights.
```python
# In a Python shell or script
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
```
Run `model.py` to download the model weights and save them. This will download both the `google/gemma-2b` and `kanak8278/gemma-2b-oasst2-01` models.
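
For reference, a download step covering both models might look like the sketch below; this is illustrative, and `model.py` is the authoritative version. The `./models` save path is an assumption.

```python
# Illustrative sketch of a download-and-save step; see model.py for the
# project's actual logic. The ./models path is assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

for model_id in ("google/gemma-2b", "kanak8278/gemma-2b-oasst2-01"):
    # HF_TOKEN from the environment is needed for the gated Gemma weights.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    tokenizer.save_pretrained(f"./models/{model_id}")
    model.save_pretrained(f"./models/{model_id}")
```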
Use `uvicorn` to run the FastAPI application:

```bash
poetry run uvicorn server:app --host 0.0.0.0 --port 8000
```
Once the server is running, you can access the interactive API documentation at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Use request.sh to Test the API

You can use the provided `request.sh` script to send requests to the server and get formatted responses. This script allows you to pass the user content as a command-line argument.
- Make the script executable:

  ```bash
  chmod +x request.sh
  ```

- Run the script with your desired user content:

  ```bash
  ./request.sh "Hello, how are you?"
  ```
Output:

```json
[ { "role": "assistant", "content": "user: How are you?\nuser: I'm fine.\nuser: How are" } ]
```
This will send a POST request to the `/chat/completions` endpoint with the specified user content and display the formatted response.
To stress test the server, use the provided `test.py` script:

```bash
poetry run python test.py "Hello, how are you?"
```
Modify the `NUM_THREADS` variable in `test.py` to control the number of concurrent requests.
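
For reference, a thread-based stress test along these lines could look like the sketch below; this is illustrative, not the contents of `test.py` (only the `NUM_THREADS` name is taken from it):

```python
# Illustrative thread-based stress test; not the actual test.py.
import sys
from concurrent.futures import ThreadPoolExecutor

import requests

NUM_THREADS = 100  # number of concurrent requests

def send_request(content: str) -> int:
    # Payload schema assumed; see the request sketch earlier in this README.
    payload = {"messages": [{"role": "user", "content": content}]}
    resp = requests.post("http://localhost:8000/chat/completions", json=payload)
    return resp.status_code

if __name__ == "__main__":
    content = sys.argv[1] if len(sys.argv) > 1 else "Hello, how are you?"
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        print(list(pool.map(send_request, [content] * NUM_THREADS)))
```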
Result
- Even though I was running 100 threads, the server processed the requests sequentially. I believe this could be due to the `model.generate()` function being a blocking call that takes time to process each request.
- I tried increasing the number of workers in the `uvicorn` command, but the requests were still processed sequentially.
- Also, the server was not able to handle the load when I set the workers to more than 3: it got stuck and never finished starting.
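
One common way to keep the event loop responsive in this situation is to push the blocking `model.generate()` call onto a worker thread. The sketch below uses Starlette's `run_in_threadpool` (re-exported by FastAPI); the `ChatRequest` model and the `generate_reply` helper are hypothetical stand-ins, not the project's actual code:

```python
# Sketch: offload blocking generation so the event loop can keep accepting
# requests. ChatRequest and generate_reply are hypothetical stand-ins.
from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    model: str
    messages: list[dict]

def generate_reply(messages: list[dict]) -> str:
    # Placeholder for the blocking tokenizer + model.generate() pipeline.
    raise NotImplementedError

@app.post("/chat/completions")
async def chat_completions(req: ChatRequest):
    # The blocking call runs in a thread, so other requests can be
    # accepted while generation is in progress.
    content = await run_in_threadpool(generate_reply, req.messages)
    return [{"role": "assistant", "content": content}]
```

Note that this only interleaves request handling; overall generation throughput is still bound by the model itself, so batching incoming requests before calling `generate()` would be the next step.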
Install ApacheBench (`ab`) using the following command:

```bash
sudo apt install apache2-utils
```

Run:

```bash
ab -n 100 -c 10 http://localhost:8000/chat/completions
```
Output:

```
Benchmarking localhost (be patient)...apr_pollset_poll: The timeout specified has expired (70007)
```
I'm not sure why this error appears, but the server itself keeps working. The message means `ab` gave up waiting for a response (its default timeout is 30 seconds, adjustable with `-s`). Note also that `ab` sends GET requests by default, while this endpoint only accepts POST; to exercise it properly, supply a JSON body with `-p payload.json -T application/json`.