This repo contains scripts for building Docker images of theroyallab/tabbyAPI, suitable for RunPod or local use.
Image tags:

- 12.4.1-runtime-ubuntu22.04-runpod
- 12.3.2-runtime-ubuntu22.04-runpod
- 12.2.2-runtime-ubuntu22.04-runpod
- 12.1.0-runtime-ubuntu22.04-runpod
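For local use, one of the built images can be started with Docker's NVIDIA GPU support. The command below is only a rough sketch: the repository name yourrepo/tabbyapi is a placeholder for wherever the image was tagged or pushed, the mounted host path is assumed to hold the models and config.yml, and port 7000 matches the example configuration further down.

$ docker run --gpus all \
    -p 7000:7000 \
    -v /path/to/models:/app/models \
    yourrepo/tabbyapi:12.4.1-runtime-ubuntu22.04-runpod  # repository name is a placeholder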
A 2x 48 GB GPU system is required to follow this example.
$ cd /app/models; \
  HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli \
    download \
    turboderp/Llama-3-70B-Instruct-exl2 \
    --local-dir turboderp_Llama-3-70B-Instruct-exl2_6.0bpw \
    --revision 6.0bpw \
    --cache-dir /app/models/.cache

URL: Hugging Face repository turboderp/Llama-3-70B-Instruct-exl2
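Once the download finishes, the weights should be present under the --local-dir path used above; a quick check:

$ ls -lh /app/models/turboderp_Llama-3-70B-Instruct-exl2_6.0bpw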
Adapt /app/models/config.yml to use this model.
network:
  host: 0.0.0.0
  port: 7000
  disable_auth: False
logging:
  prompt: False
  generation_params: False
sampling:
developer:
model:
  max_seq_len: 32768
  model_dir: models
  model_name: turboderp_Llama-3-70B-Instruct-exl2_6.0bpw
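  # gpu_split_auto is disabled so the manual gpu_split below is used;
  # gpu_split lists the VRAM to allocate per GPU in GB (here ~25 GB on the
  # first and ~47 GB on the second 48 GB card), and cache_mode: Q4 selects a
  # 4-bit quantized KV cache to reduce VRAM usage.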
  gpu_split_auto: False
  gpu_split: [25, 47]
  cache_mode: Q4
  fasttensors: true

Restart tabbyAPI so the new configuration is loaded:

$ restart.sh

Follow the startup log:

$ tail -f /app/tabbyAPI.log

The API key appears in the log:

$ grep API /app/tabbyAPI.log

Insert the API key into the authorization header.
$ curl -s http://localhost:7000/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer d160430598b33ef9acd98235d12dc3ae" \
    -d '{
      "model": "turboderp_Llama-3-70B-Instruct-exl2_6.0bpw",
      "messages": [
        {
          "role": "user",
          "content": "Compose a poem that explains the concept of recursion in programming."
        }
      ]
    }'
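To verify the key and see which model is loaded, the model listing route can be queried with the same header, assuming the server exposes the OpenAI-compatible /v1/models endpoint (port and key taken from the example above):

$ curl -s http://localhost:7000/v1/models \
    -H "Authorization: Bearer d160430598b33ef9acd98235d12dc3ae"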
Templates:

- tabbyAPI RunPod template for CUDA 12.1, derived from nvidia/cuda:12.1.0-runtime-ubuntu22.04
- tabbyAPI RunPod template for CUDA 12.2, derived from nvidia/cuda:12.2.2-runtime-ubuntu22.04
- tabbyAPI RunPod template for CUDA 12.3, derived from nvidia/cuda:12.3.2-runtime-ubuntu22.04
- tabbyAPI RunPod template for CUDA 12.4, derived from nvidia/cuda:12.4.1-runtime-ubuntu22.04
Inspiration and scripts from the following projects were used.
Last changed: 2024-05-25