I wanted to see if it was possible to run a Large Language Model (LLM) on the ESP32. Surprisingly, it is possible, though probably not very useful.
The "Large" Language Model used is actually quite small. It is a 260K parameter tinyllamas checkpoint trained on the tiny stories dataset.
The LLM implementation is based on llama2.c, with minor optimizations to make it run faster on the ESP32.
LLMs require a great deal of memory. Even this small one still requires 1 MB of RAM. I used the LILYGO T-Camera S3 ESP32-S3 because it has 8 MB of embedded PSRAM and a screen.
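Since the weights don't fit in the S3's internal SRAM, they have to live in PSRAM. A minimal sketch of how a checkpoint buffer might be placed there with ESP-IDF's capability-aware allocator (the function name and size constant here are illustrative, not from the repository):

```c
#include <stdio.h>
#include "esp_heap_caps.h"

// Illustrative size: the 260K-parameter checkpoint needs roughly 1 MB.
#define CHECKPOINT_BYTES (1024 * 1024)

void *alloc_checkpoint(void)
{
    // MALLOC_CAP_SPIRAM forces the allocation into external PSRAM,
    // keeping the scarce internal SRAM free for stacks and DMA buffers.
    void *weights = heap_caps_malloc(CHECKPOINT_BYTES, MALLOC_CAP_SPIRAM);
    if (weights == NULL) {
        printf("PSRAM allocation failed\n");
    }
    return weights;
}
```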
With the following changes to llama2.c, I am able to achieve 19.13 tok/s:
- Utilizing both cores of the ESP32 during math-heavy operations (see the dual-core matmul sketch after this list).
- Utilizing some special dot-product functions from the ESP-DSP library that are designed for the ESP32-S3. These functions use some of the few SIMD instructions the ESP32-S3 has (sketched below).
- Maxing out the CPU speed at 240 MHz and the PSRAM speed at 80 MHz, and increasing the instruction cache size (see the sdkconfig snippet below).
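The first change splits llama2.c's matmul across both cores. This is not the exact code from this repository, just a minimal sketch of the idea: a worker task pinned to core 1 computes the upper half of the output rows while the main task on core 0 computes the lower half. The job struct and function names are hypothetical.

```c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/semphr.h"

// Hypothetical shared job description for splitting llama2.c's
// matmul (xout = W * x, where W is d x n) across both cores.
typedef struct {
    float *xout;
    const float *x;
    const float *w;
    int n;
    int d;
} matmul_job_t;

static matmul_job_t s_job;
static SemaphoreHandle_t s_start, s_done;

// Scalar row loop, straight out of llama2.c's matmul.
static void matmul_rows(float *xout, const float *x, const float *w,
                        int n, int start, int end)
{
    for (int i = start; i < end; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}

// Worker pinned to core 1: waits for a job, computes the upper half.
static void matmul_worker(void *arg)
{
    for (;;) {
        xSemaphoreTake(s_start, portMAX_DELAY);
        matmul_rows(s_job.xout, s_job.x, s_job.w, s_job.n,
                    s_job.d / 2, s_job.d);
        xSemaphoreGive(s_done);
    }
}

void matmul_dual_core(float *xout, const float *x, const float *w, int n, int d)
{
    s_job = (matmul_job_t){ xout, x, w, n, d };
    xSemaphoreGive(s_start);               // kick off core 1
    matmul_rows(xout, x, w, n, 0, d / 2);  // core 0 does the lower half
    xSemaphoreTake(s_done, portMAX_DELAY); // wait for core 1 to finish
}

// Call once at startup, before the first matmul_dual_core().
void matmul_init(void)
{
    s_start = xSemaphoreCreateBinary();
    s_done  = xSemaphoreCreateBinary();
    xTaskCreatePinnedToCore(matmul_worker, "matmul1", 4096, NULL, 5, NULL, 1);
}
```

The split is static (half the rows each) because the per-row work is uniform, so there is no need for a work-stealing queue.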
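The inner loop of that matmul is a dot product, which is where ESP-DSP comes in. As a sketch, the scalar loop can be handed to `dsps_dotprod_f32()`, which ESP-DSP maps to an ESP32-S3 SIMD implementation when built for that target (the wrapper name `matmul_dsp` is mine, and the S3-optimized path may expect aligned buffers):

```c
#include "dsps_dotprod.h"

// Same matmul as above, but the inner loop is replaced by ESP-DSP's
// dot product, which uses the ESP32-S3's SIMD instructions.
void matmul_dsp(float *xout, const float *x, const float *w, int n, int d)
{
    for (int i = 0; i < d; i++) {
        // Dot product of row i of W with x, written into xout[i].
        dsps_dotprod_f32(&w[i * n], x, &xout[i], n);
    }
}
```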
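The clock and cache settings are sdkconfig options rather than code. Assuming an ESP32-S3 target with octal PSRAM like the T-Camera S3, the relevant entries look roughly like the following; exact option names can differ between ESP-IDF versions:

```
CONFIG_ESP_DEFAULT_CPU_FREQ_MHZ_240=y
CONFIG_SPIRAM=y
CONFIG_SPIRAM_MODE_OCT=y
CONFIG_SPIRAM_SPEED_80M=y
CONFIG_ESP32S3_INSTRUCTION_CACHE_32KB=y
```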
Building and flashing requires the ESP-IDF toolchain to be installed:
```sh
idf.py build
idf.py -p /dev/{DEVICE_PORT} flash
```
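Once flashed, the generated output can be watched over serial with the standard IDF monitor:

```sh
idf.py -p /dev/{DEVICE_PORT} monitor
```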