This source code implements the inference pipeline of Llama 3 and can run locally on Linux.
- Create and activate a virtual environment:
conda create -n llama3 python=3.11
conda activate llama3
- Install the requirements. Note that the transformers and accelerate libraries are only used for exporting the original model to a TorchScript model; they are not needed for inference.
pip install -r requirements.txt
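For reference, the requirements would look roughly like this (a sketch, not the repo's exact pin list; versions are placeholders, and the last two can be dropped for inference-only use):

torch>=2.1
tiktoken
transformers   # export to TorchScript only
accelerate     # export to TorchScript only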
- To export the original model weights to TorchScript (optional, since an exported model is already provided):
- Convert the original checkpoint (can be downloaded here) to safetensors format so it can be loaded by AutoTokenizer (requires at least 16 GB of RAM); see the example invocation after this list:
python3 utils/convert_llama_weights_to_hf.py \
- Export to TorchScript:
python3 utils/export_torchScript.py
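For the conversion step above, a typical invocation of the transformers conversion script looks like the following; the input and output paths are placeholders, and the --llama_version flag assumes a transformers version recent enough to support Llama 3:

python3 utils/convert_llama_weights_to_hf.py \
    --input_dir /path/to/Meta-Llama-3-8B \
    --model_size 8B \
    --llama_version 3 \
    --output_dir ./llama3-hf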
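And a minimal sketch of what a TorchScript export can look like, assuming the converted checkpoint sits in ./llama3-hf (the actual utils/export_torchScript.py may differ):

import torch
from transformers import AutoModelForCausalLM

# Load the converted (safetensors) checkpoint; torchscript=True makes the
# model return plain tensors, which torch.jit.trace can handle.
model = AutoModelForCausalLM.from_pretrained(
    "./llama3-hf",                 # assumed output dir of the conversion step
    torchscript=True,
    torch_dtype=torch.float16,
)
model.eval()

# Trace with a dummy prompt (illustrative; real export scripts often handle
# attention masks and dynamic sequence lengths explicitly).
example_ids = torch.ones(1, 8, dtype=torch.long)
with torch.no_grad():
    traced = torch.jit.trace(model, example_ids)
traced.save("llama3.pt")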
- Run inference:
python3 main.py --tokenizer tokenizer.model --max_gen_len 10 --temperature 0.3
You can also test with a different tokenizer file, e.g. cl100k_base.tiktoken (the GPT-4 tokenizer). A minimal sketch of the generation loop follows.
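This sketch assumes main.py loads the traced module (the file name llama3.pt and the model's (batch, seq) -> logits call signature are assumptions); it uses the cl100k_base encoding and greedy decoding for brevity:

import torch
import tiktoken

model = torch.jit.load("llama3.pt")         # assumed name of the exported module
model.eval()

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer; Llama 3's own
                                            # tokenizer.model is also tiktoken-based
ids = torch.tensor([enc.encode("The capital of France is")], dtype=torch.long)

max_gen_len = 10
with torch.no_grad():
    for _ in range(max_gen_len):
        out = model(ids)
        logits = out[0] if isinstance(out, tuple) else out  # traced HF models return tuples
        next_id = logits[0, -1].argmax()    # greedy; --temperature sampling omitted
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(enc.decode(ids[0].tolist()))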