This project sets up a language model server using FastAPI, Docker, and Langserve. The server uses the Llama model for natural language processing tasks.
- Docker
- Python 3.8 or later
- Llama model (download and add to the project folder)
The main script sets up and runs the FastAPI server. Replace `"your model name"` with the path to your Llama model:
```python
from langchain_community.llms import LlamaCpp

# Define the Llama model
llm = LlamaCpp(
    model_path="your model name",  # Specify your model path here
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
)
```
Optional: You can also add the model name in `local_inference.py` and `simple-flask.py` to test it without using the Docker container.
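For context, here is a minimal sketch of how the FastAPI app and Langserve route might be wired together in the main script. The prompt variables (`system_message`, `user_message`) mirror the JSON body shown further below, but the file layout and exact chain construction here are assumptions, not the project's actual code:

```python
# Sketch only: assumed wiring of the FastAPI server with Langserve.
# The real main script in this project may differ.
from fastapi import FastAPI
from langchain_community.llms import LlamaCpp
from langchain_core.prompts import PromptTemplate
from langserve import add_routes

# Same LlamaCpp configuration as shown above
llm = LlamaCpp(
    model_path="your model name",  # path to your downloaded Llama model
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
)

# Prompt template using the same input variables as the request body below
prompt = PromptTemplate.from_template(
    "System: {system_message}\nUser: {user_message}\nAssistant:"
)
chain = prompt | llm

app = FastAPI(title="Llama Server")

# Expose the chain at /llama, so POST /llama/invoke runs it
add_routes(app, chain, path="/llama")

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=3000)
```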
- Build the Docker image:

  ```bash
  docker build -t llama-server .
  ```
- Run the Docker container:

  ```bash
  docker run -p 3000:3000 llama-server
  ```
The FastAPI server will be accessible at http://localhost:3000.
To interact with the Llama model, make a POST request to http://127.0.0.1:3000/llama/invoke
with the following JSON body:
```json
{
  "input": {
    "system_message": "You are a helpful assistant",
    "user_message": "Generate a list of 5 funny dog names",
    "max_tokens": 1000
  }
}
```
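If you prefer Python over raw HTTP, a small example using the `requests` library (assumed to be installed) could look like this:

```python
import requests

# Call the Langserve invoke endpoint exposed by the running container
response = requests.post(
    "http://127.0.0.1:3000/llama/invoke",
    json={
        "input": {
            "system_message": "You are a helpful assistant",
            "user_message": "Generate a list of 5 funny dog names",
            "max_tokens": 1000,
        }
    },
)
print(response.json())
```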
This project uses `llama.cpp` for CPU-only execution. Ensure that the Llama model is downloaded and added to the project folder.
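Because inference runs on the CPU, throughput depends on how many threads llama.cpp is allowed to use. As an illustrative (not project-default) example, the `LlamaCpp` wrapper accepts tuning parameters such as `n_ctx` and `n_threads` when the model is constructed:

```python
from langchain_community.llms import LlamaCpp

# Illustrative CPU tuning values; adjust for your own machine
llm = LlamaCpp(
    model_path="your model name",  # same model path as above
    n_ctx=2048,       # context window size
    n_threads=8,      # number of CPU threads llama.cpp may use
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
)
```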