An AI chatbot assistant named Rocky, powered by LLaMA2
- Windows
- (optional) GPU
- Python for Windows: https://www.python.org/downloads/
- ctransformers: https://github.com/marella/ctransformers
- Llama-2-7B-Chat-GGML model: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/tree/main
- (optional) CUDA, GPU driver
- Install Python for Windows
- Clone this repository
git clone https://github.com/by-park/llmrocky.git
- Install prerequisites
pip install ctransformers
If a CUDA-capable GPU is available, use this command instead
pip install ctransformers[cuda]
- Download LLaMA2 (llama-2-7b-chat.ggmlv3.q2_K.bin) and place the model file under the 'model' folder.
- If a GPU is not available, remove the parameter named 'gpu_layers' in main.py
from (with GPU)
llm = AutoModelForCausalLM.from_pretrained("model\\llama-2-7b-chat.ggmlv3.q2_K.bin", model_type="llama", gpu_layers=32)
to (without GPU)
llm = AutoModelForCausalLM.from_pretrained("model\\llama-2-7b-chat.ggmlv3.q2_K.bin", model_type="llama")
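If you prefer one script that handles both cases, a minimal sketch of building the keyword arguments conditionally (the nvidia-smi check and the helper structure are assumptions for illustration, not part of this repository's main.py):

```python
import shutil

# Keyword arguments for AutoModelForCausalLM.from_pretrained.
kwargs = {"model_type": "llama"}

# Crude GPU detection: offload layers only if the NVIDIA driver
# tools are found on PATH. Adjust gpu_layers to fit your VRAM.
if shutil.which("nvidia-smi"):
    kwargs["gpu_layers"] = 32  # same value as the GPU example above

print(kwargs)

# With ctransformers installed and the model downloaded, the model
# would then be loaded as:
# from ctransformers import AutoModelForCausalLM
# llm = AutoModelForCausalLM.from_pretrained(
#     "model\\llama-2-7b-chat.ggmlv3.q2_K.bin", **kwargs)
```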
- Run 'main.py' (press F5 in Python's default IDLE)
python main.py
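For reference, Llama-2 chat models expect prompts wrapped in the [INST]/<<SYS>> template. A minimal helper sketch (the function name and default system message are illustrative assumptions, not taken from main.py):

```python
def format_llama2_prompt(
        user_message: str,
        system_message: str = "You are Rocky, a helpful assistant.") -> str:
    """Wrap a user message in the Llama-2 chat prompt template."""
    return (f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
            f"{user_message} [/INST]")

prompt = format_llama2_prompt("Hello, who are you?")
print(prompt)
```

A prompt formatted this way can then be passed to the loaded model, e.g. llm(prompt) with ctransformers.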
- TensorRT Support
- TensorRT-LLM supports GeForce 40 series GPUs: https://github.com/NVIDIA/TensorRT-LLM/tree/main/windows
- LLaMA example: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md
# With fp16 inference
python3 ../run.py --max_output_len=50 \
    --tokenizer_dir ./tmp/llama/7B/ \
    --engine_dir=./tmp/llama/7B/trt_engines/fp16/1-gpu/
- TensorRT needs an ONNX conversion followed by a TensorRT engine conversion
- Rocky Animation: https://peacelight14.blogspot.com/2011/02/office-assistant.html