This repository contains a basic demonstration of text generation with LLMs such as Llama. Keep in mind that this is not a reference implementation and should never be used in production.

## Requirements
- Docker
- NVIDIA Container Toolkit
- At least 140GB VRAM (on one or more GPUs)
- NVIDIA Drivers and CUDA
- An LLM converted to `gguf` format, such as Llama 2 converted using the `convert.py` script from the llama.cpp repository: https://github.com/ggerganov/llama.cpp
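As a rough sketch of that conversion step (the model path below is hypothetical — substitute your own checkpoint directory), `convert.py` from llama.cpp is typically invoked like this:

```shell
# Fetch llama.cpp, which provides the convert.py script
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt

# Convert the original model weights to gguf; this typically writes
# ggml-model-f16.gguf into the model directory (hypothetical path shown)
python convert.py /home/$USER/llama/llama-2-70b-chat
```

The resulting `ggml-model-f16.gguf` file is what the container expects to find under `/var/model` in the run step below.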
## Usage

- Clone this repository: `git clone https://github.com/brianlechthaler/llama-docker-demo.git`
- Change directory to the cloned repository: `cd llama-docker-demo`
- Build the Docker image: `docker build -t llama-docker-demo .`
- Run the Docker image:

  ```shell
  docker run --gpus=all --cap-add SYS_RESOURCE -e USE_MLOCK=0 -e MODEL=/var/model/ggml-model-f16.gguf -v /home/$USER/llama/llama-2-70b-chat:/var/model -ti llama-docker-demo "what is a hello world?"
  ```
- Make sure to replace `/home/$USER/llama/llama-2-70b-chat` with the path to the folder containing your `gguf` model if it is located somewhere else.
- You can replace `"what is a hello world?"` with whatever prompt you want.
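For instance, assuming your model lives in a hypothetical host directory `/data/models/llama2` containing `ggml-model-f16.gguf`, the run command from above would become:

```shell
# Hypothetical host path on the left of -v; the container still reads
# the model from /var/model/ggml-model-f16.gguf as set by the MODEL variable
docker run --gpus=all --cap-add SYS_RESOURCE \
  -e USE_MLOCK=0 \
  -e MODEL=/var/model/ggml-model-f16.gguf \
  -v /data/models/llama2:/var/model \
  -ti llama-docker-demo "Explain Docker in one sentence."
```

Only the host side of the `-v` bind mount and the trailing prompt change; the in-container paths stay the same.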