/llama-dfdx

LLaMa 7b with CUDA acceleration implemented in rust. Minimal GPU memory needed!

Primary LanguageRustMIT LicenseMIT

LLaMa 7b in rust

This repo contains the popular LLaMa 7b language model, fully implemented in the rust programming language!

Uses dfdx tensors and CUDA acceleration.

This runs LLaMa directly in f16, meaning there is no hardware acceleration on CPU. Using CUDA is heavily recommended.

Here is the 7b model running on an A10 GPU:

How To Run

(Once) Setting up model weights

Download model weights

  1. Install git lfs. On ubuntu you can run sudo apt install git-lfs
  2. Activate git lfs with git lfs install.
  3. Run the following commands to download the model weights in pytorch format (~25 GB):
    1. LLaMa 7b (~25 GB): git clone https://huggingface.co/decapoda-research/llama-7b-hf
    2. LLaMa 13b (~75 GB): git clone https://huggingface.co/decapoda-research/llama-13b-hf
    3. LLaMa 65b (~244 GB): git clone https://huggingface.co/decapoda-research/llama-65b-hf

Convert the model

  1. (Optional) Run python3.x -m venv <my_env_name> to create a python virtual environment, where x is your prefered python version
  2. (Optional, requires 1.) Run source <my_env_name>\bin\activate (or <my_env_name>\Scripts\activate if on Windows) to activate the environment
  3. Run pip install numpy torch
  4. Run python convert.py to convert the model weights to rust understandable format: a. LLaMa 7b: python convert.py b. LLaMa 13b: python convert.py llama-13b-hf c. LLaMa 65b: python convert.py llama-65b-hf

(Once) Compile

You can compile with normal rust commands:

With cuda:

cargo build --release -F cuda

Without cuda:

cargo build --release

Run the executable

With default args:

./target/release/llama-dfdx --model <model-dir> generate "<prompt>"
./target/release/llama-dfdx --model <model-dir> chat
./target/release/llama-dfdx --model <model-dir> file <path to prompt file>

To see what commands/custom args you can use:

./target/release/llama-dfdx --help