Creating inference wrappers for Large Language Models in multiple frameworks: HuggingFace TGI and NVIDIA Triton Inference Server.
Requirements:
- Docker, docker-compose
- NVIDIA Container Toolkit
- Change `MODEL_NAME` to your HF model ID, or to the path of your local checkpoints.
- Build the server and client, then start the client:

```bash
make build_server
make build_client
make run_client
```
- Open http://localhost:3001 in your browser.
- Navigate to the `/chat_completion` endpoint.
- Try the API by sending requests with the following request body:
```json
{
  "dialog": [
    {
      "role": "system",
      "content": "You are my coding assistant"
    },
    {
      "role": "user",
      "content": "Help me!"
    },
    {
      "role": "assistant",
      "content": "Okay bro!"
    },
    {
      "role": "user",
      "content": "What is Mojo programming?"
    }
  ],
  "model_params": {
    "temperature": 0,
    "top_p": 1,
    "max_gen_len": 1025,
    "stream": false
  },
  "model_name": "codellama/CodeLlama-13b-Instruct-hf"
}
```
- Send a POST request to http://localhost:3001/chat_completion with the request body above (see the Python sketch after this list).
- If `stream` is set to `true`, the response will be a stream of server-sent events (see the streaming sketch below).
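For reference, here is a minimal sketch of calling the endpoint from Python with the `requests` library. The URL and request body mirror the example above; the shape of the JSON response is an assumption and may differ in your build.

```python
import requests

URL = "http://localhost:3001/chat_completion"

body = {
    "dialog": [
        {"role": "system", "content": "You are my coding assistant"},
        {"role": "user", "content": "What is Mojo programming?"},
    ],
    "model_params": {
        "temperature": 0,
        "top_p": 1,
        "max_gen_len": 1025,
        "stream": False,
    },
    "model_name": "codellama/CodeLlama-13b-Instruct-hf",
}

# Non-streaming call: the full completion arrives in a single JSON response.
resp = requests.post(URL, json=body, timeout=120)
resp.raise_for_status()
print(resp.json())
```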
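When `stream` is `true`, the response arrives as server-sent events instead. The sketch below assumes each event is a `data: ...` line and that the stream may end with a `[DONE]` marker; both details depend on the server implementation.

```python
import requests

URL = "http://localhost:3001/chat_completion"

body = {
    "dialog": [{"role": "user", "content": "What is Mojo programming?"}],
    "model_params": {
        "temperature": 0,
        "top_p": 1,
        "max_gen_len": 1025,
        "stream": True,  # ask the server for server-sent events
    },
    "model_name": "codellama/CodeLlama-13b-Instruct-hf",
}

# Read the response incrementally as server-sent events.
with requests.post(URL, json=body, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue  # skip blank separator lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # end marker used by some servers (assumption)
            break
        print(payload)
```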