This sample shows how to implement a Llama-based model with the OpenVINO runtime.
- Please follow the license on HuggingFace and obtain approval from Meta before downloading Llama checkpoints.
- Please note that this repository is only for functional testing and personal study.
- Linux, Windows
- Python >= 3.9.0
- CPU or GPU compatible with OpenVINO.
- RAM: >=16GB
- vRAM: >=8GB
$ python3 -m venv openvino_env
$ source openvino_env/bin/activate
$ python3 -m pip install --upgrade pip
$ pip install wheel setuptools
$ pip install -r requirements.txt
Set up access tokens:
$ huggingface-cli login --token hf_xxxxxxxxx
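If you prefer to authenticate from Python rather than the CLI, huggingface_hub exposes an equivalent call; a minimal sketch (the token is a placeholder):

```python
from huggingface_hub import login

# Equivalent to `huggingface-cli login --token ...`; substitute your own token.
login(token="hf_xxxxxxxxx")
```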
Export from Transformers:
$ python3 export.py --model_id 'meta-llama/Llama-2-7b-hf' --output {your_path}/Llama-2-7b-hf
or from Optimum-Intel:
$ python3 export_op.py --model_id 'meta-llama/Llama-2-7b-hf' --output {your_path}/Llama-2-7b-hf
or for a GPTQ model:
$ python3 export_op.py --model_id 'TheBloke/Llama-2-7B-Chat-GPTQ' --output {your_path}/Llama-2-7B-Chat-GPTQ
Parameters that can be selected:
- --model_id - The model id from Huggingface_hub (https://huggingface.co/models) or the absolute path to the local directory where the model is located.
- --output - The path where the converted model is saved.
- If you have difficulty accessing huggingface, you can try to use mirror-hf to download.
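export_op.py is this repo's script; as a rough sketch of what an Optimum-Intel export involves (model id and output path follow the commands above, not the script's actual internals):

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
output_dir = "Llama-2-7b-hf"  # substitute {your_path}/Llama-2-7b-hf

# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly.
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```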
$ python3 quantize.py --model_id {your_path}/Llama-2-7b-hf --precision int4 --output {your_path}/Llama-2-7b-hf-int4
Parameters that can be selected:
- --model_id - The path to the directory where the OpenVINO IR model is located.
- --precision - Quantization precision: int8 or int4.
- --output - The path to save the quantized model.
For more information on quantization configuration, please refer to weight compression.
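quantize.py is this repo's script; as a hedged sketch, INT4 weight compression of an already-exported IR with NNCF typically looks like this (paths follow the command above):

```python
import nncf
import openvino as ov

core = ov.Core()
# Read the IR produced by the export step.
model = core.read_model("Llama-2-7b-hf/openvino_model.xml")

# Compress weights to INT4; CompressWeightsMode.INT8_ASYM would cover the int8 case.
compressed = nncf.compress_weights(model, mode=nncf.CompressWeightsMode.INT4_SYM)

ov.save_model(compressed, "Llama-2-7b-hf-int4/openvino_model.xml")
```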
Optimum-Intel OpenVINO pipeline:
$ python3 pipeline/generate_op.py --model_id {your_path}/Llama-2-7b-hf-int4 --prompt "what is openvino ?" --device "CPU"
Parameters that can be selected:
- --model_id - HuggingFace model id or the path to the directory where the OpenVINO IR model is located.
- --prompt - The input prompt for generation.
- --max_sequence_length - Maximum size of output tokens.
- --device - The device to run inference on, e.g. "CPU", "GPU".
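generate_op.py belongs to this repo; a minimal equivalent with the public Optimum-Intel API (same model directory, prompt, and device as the command above) might look like:

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_dir = "Llama-2-7b-hf-int4"  # substitute {your_path}/Llama-2-7b-hf-int4

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = OVModelForCausalLM.from_pretrained(model_dir, device="CPU")

inputs = tokenizer("what is openvino ?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```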
or Restructured pipeline:
$ python3 pipeline/generate.py --model_path {your_path}/Llama-2-7b-hf-int4 --prompt "what is openvino ?" --device "CPU"
Parameters that can be selected:
- --model_path - The path to the directory where the OpenVINO IR model is located.
- --max_sequence_length - Maximum size of output tokens.
- --device - The device to run inference on, e.g. "CPU", "GPU".
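generate.py implements its own decoding loop; for comparison only (this is not the repo's code), the standalone openvino-genai package wraps the same idea in a few lines, assuming the model directory also contains the OpenVINO tokenizer files:

```python
import openvino_genai

# Requires the OpenVINO tokenizer/detokenizer models alongside openvino_model.xml.
pipe = openvino_genai.LLMPipeline("Llama-2-7b-hf-int4", "CPU")
print(pipe.generate("what is openvino ?", max_new_tokens=128))
```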
$ python3 demo/qa_gradio.py --model_id {your_path}/Llama-2-7b-hf-int4
$ python3 quantize.py --model_id 'meta-llama/Llama-2-7b-chat-hf' --output {your_path}/Llama-2-7b-chat-hf-int4
$ streamlit run demo/chat_streamlit.py -- --model_id {your_path}/Llama-2-7b-chat-hf-int4