llama4micro 🦙🔬

A "large" language model running on a microcontroller.

Primary language: C++. License: MIT.

Example run

Background

I was wondering if it's possible to fit a non-trivial language model on a microcontroller. Turns out the answer is some version of yes!

This project uses the Coral Dev Board Micro with its FreeRTOS toolchain. The board has a number of neat hardware features that are not currently used here (notably a TPU, sensors, and a second CPU core). It does, however, also have 64 MB of RAM. That's tiny for LLMs, whose memory footprints are typically measured in gigabytes, but comparatively huge for a microcontroller.
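To make the RAM comparison concrete, here is a back-of-the-envelope sketch of the weight memory a model needs at a given precision. The function names are hypothetical, and the parameter counts used with it are assumptions (for example, reading the "15M" in a name like stories15M as roughly 15 million parameters). It ignores activations, the KV cache, and quantization scale factors, so real usage is higher.

```cpp
#include <cstdint>

// 64 MB of RAM, as stated above (treated here as 64 * 1024 * 1024 bytes).
constexpr std::uint64_t kBoardRamBytes = 64ull * 1024 * 1024;

// Rough weight-memory estimate in bytes for `params` parameters stored at
// `bits_per_weight` bits each. Weights only; everything else is ignored.
constexpr std::uint64_t WeightBytes(std::uint64_t params,
                                    unsigned bits_per_weight) {
  return params * bits_per_weight / 8;
}

// True if the weights alone are smaller than the board's total RAM.
// The usable heap is smaller than the total, so this is an optimistic check.
constexpr bool FitsInRam(std::uint64_t params, unsigned bits_per_weight) {
  return WeightBytes(params, bits_per_weight) < kBoardRamBytes;
}
```

Under these assumptions, `FitsInRam(15000000, 8)` is true, while a hypothetical 7-billion-parameter model at 8 bits, `FitsInRam(7000000000ull, 8)`, is not even close.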

The LLM implementation itself is an adaptation of llama2.c, using the tinyllamas checkpoints trained on the TinyStories dataset. The output quality of the smaller model versions isn't ideal, but it is good enough to generate somewhat coherent (and occasionally weird) stories.

Setup

Clone this repo with its submodules: karpathy/llama2.c (specifically the 8-bit quantized version) and maxbbraun/coralmicro (a fork that maximizes heap size):

git clone --recurse-submodules https://github.com/maxbbraun/llama4micro.git

cd llama4micro

Some of the tools use Python. Install their dependencies:

python3 -m venv venv
. venv/bin/activate

pip install -r llama2.c/requirements.txt
pip install -r coralmicro/scripts/requirements.txt

Download the model and quantize it:

MODEL_NAME=stories15M
wget -P data https://huggingface.co/karpathy/tinyllamas/resolve/main/${MODEL_NAME}.pt

python llama2.c/export.py data/${MODEL_NAME}_q80.bin --version 2 --checkpoint data/${MODEL_NAME}.pt

cp llama2.c/tokenizer.bin data/tokenizer.bin
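For intuition about what the 8-bit quantization step does, here is a minimal sketch of group-wise symmetric int8 quantization, the general idea behind Q8_0-style formats. It is not the exporter's actual code: `QuantizedTensor`, `Quantize`, `Dequantize`, and the group size are illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: int8 values plus one float scale per group.
struct QuantizedTensor {
  std::vector<std::int8_t> q;
  std::vector<float> scale;
};

// Quantizes `x` in groups of `group_size` values. Each group is scaled so
// that its largest magnitude maps to 127.
QuantizedTensor Quantize(const std::vector<float>& x, std::size_t group_size) {
  assert(group_size > 0 && x.size() % group_size == 0);
  QuantizedTensor out;
  out.q.resize(x.size());
  out.scale.resize(x.size() / group_size);
  for (std::size_t g = 0; g < out.scale.size(); ++g) {
    const std::size_t base = g * group_size;
    float max_abs = 0.0f;
    for (std::size_t i = 0; i < group_size; ++i) {
      max_abs = std::max(max_abs, std::fabs(x[base + i]));
    }
    const float scale = max_abs / 127.0f;
    out.scale[g] = scale;
    for (std::size_t i = 0; i < group_size; ++i) {
      // An all-zero group has scale 0; store zeros to avoid dividing by it.
      const float v = (scale > 0.0f) ? x[base + i] / scale : 0.0f;
      out.q[base + i] = static_cast<std::int8_t>(std::lround(v));
    }
  }
  return out;
}

// Reconstructs approximate floats: each value is q * scale of its group.
std::vector<float> Dequantize(const QuantizedTensor& t,
                              std::size_t group_size) {
  std::vector<float> x(t.q.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    x[i] = static_cast<float>(t.q[i]) * t.scale[i / group_size];
  }
  return x;
}
```

The trade-off is roughly a 4x reduction in weight storage compared with 32-bit floats, at the cost of a rounding error of at most half a scale step per value.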

Build and flash the image:

mkdir build
cd build

cmake ..
make -j

python ../coralmicro/scripts/flashtool.py --build_dir . --elf_path llama4micro

Usage

  1. The model loads automatically when the board powers up.
    • This takes ~6 seconds.
    • The green light will turn on when it's ready.
  2. Press the button next to the green light.
    • The green light will turn off.
  3. The model now generates tokens.
    • The results are streamed to the serial port.
    • This happens at a rate of ~2.5 tokens per second.
  4. Generation stops at the end token or once the maximum number of steps is reached.
    • The green light will turn on again.
    • Go back to step 2.
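The steps above amount to a small state machine. The sketch below models it in plain C++ for illustration only: it is not the firmware's code, and `Board`, `State`, and the event handler names are hypothetical.

```cpp
#include <cassert>

enum class State { kLoading, kReady, kGenerating };

// Hypothetical model of the board's user-visible state.
struct Board {
  State state = State::kLoading;
  bool green_led = false;
  int steps = 0;  // Tokens generated in the current run.
};

// Step 1: the model finished loading, so the green light turns on.
void OnModelLoaded(Board& b) {
  b.state = State::kReady;
  b.green_led = true;
}

// Step 2: a button press starts generation, but only when ready.
void OnButtonPressed(Board& b) {
  if (b.state != State::kReady) return;
  b.state = State::kGenerating;
  b.green_led = false;
  b.steps = 0;
}

// Steps 3 and 4: called once per generated token. Generation ends on the
// end token or after `max_steps` tokens, and the green light turns on again.
void OnToken(Board& b, bool is_end_token, int max_steps) {
  assert(max_steps > 0);
  if (b.state != State::kGenerating) return;
  ++b.steps;
  if (is_end_token || b.steps >= max_steps) {
    b.state = State::kReady;
    b.green_led = true;
  }
}
```

Modeling the button as a no-op outside the ready state matches the usage described above, where the light signals when a press will start a new run.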