llama-mps

Experimental fork of Facebook's LLaMa model that runs with GPU acceleration on Apple Silicon M1/M2

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0)

LLaMa MPS fork

This is a fork of https://github.com/markasoftware/llama-cpu which is a fork of https://github.com/facebookresearch/llama. The goal of this fork is to use GPU acceleration on Apple M1/M2 devices.

LLaMa-adapter support has been added in a separate branch!
Multi-modal LLaMa-adapter support has been added in a separate branch!
Llama v2 support has been added in a separate branch!

Please check the original repos for installation instructions. Once you're done, run `torchrun example.py --ckpt_dir ../7B --tokenizer_path ../tokenizer.model --max_batch_size=1`, adjusting the paths to your model checkpoint and tokenizer. You might also need to set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1`.
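
As a quick pre-flight check before launching, a minimal Python sketch (not part of the fork, just an illustrative helper) can set the fallback variable and confirm the MPS backend is usable:

```python
import os

# Enable the CPU fallback for ops that MPS doesn't implement yet;
# it is safest to set this before torch is imported.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

import torch

# torch.backends.mps.is_available() reports whether the Metal backend can be used.
if torch.backends.mps.is_available():
    print("MPS is available, torch", torch.__version__)
else:
    print("MPS is not available; the model will run on CPU only")
```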

This fork is experimental; at its current stage it can run the full, non-quantized model with MPS.

After the model is loaded, inference with max_gen_len=20 takes about 3 seconds on an M1 Max (24-core GPU), versus 12+ minutes on the CPU (running on a single core). For the 7B model, memory usage always exceeds 32 GB of RAM during loading, writing 2-4 GB to the SSD (swap) on every launch, but it consumes less memory once the model is loaded.

If you notice that the model's output is empty or repetitive, try a more recent version of Python/PyTorch. For me, it produced bad outputs with Python 3.8.15 and PyTorch 1.12.1; after switching to Python 3.10 and torch 2.1.0.dev20230309, the model worked as expected and produced high-quality outputs.
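
A quick way to confirm which interpreter and PyTorch build you are actually running (an illustrative check, not part of the fork):

```python
import sys
import torch

# Print the versions relevant to the empty/repetitive-output issue described above.
print("Python :", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("MPS available:", torch.backends.mps.is_available())
print("MPS built    :", torch.backends.mps.is_built())
```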