Clone the repo
git clone https://github.com/sunnytqin/lit-llama.git
cd lit-llama
Install the dependencies listed in requirements.txt:
pip install -r requirements.txt
You are all set! 🎉
To generate text predictions, you don't need to download the model weights yourself: you should already have read access to my folder containing the lit-llama checkpoint weights. If access turns out to be a problem, please let me know!
Run inference:
python generate.py --model_size 7B
This runs the small (7B) model; the large model is the 30B model by default.
Run the GUI on saved model output:
python awesomegui.py --data [path_to_LLM_output]
For a sample output, use output/sample_output
You only need a basic Python environment (Python 3 + NumPy) to run the GUI, so there is no need to install the full environment!
- The displayed prediction is the deterministic top-k = 1 (greedy) choice, i.e., the token with the highest probability at each step (see the sketch after this list).
- For now, we generate 50 new tokens from the prompt autoregressively. I will run some teacher-forcing samples soon.
- The small model's output is displayed by default; click a token to see its details.
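As a rough illustration of that decoding loop (a minimal sketch, not the repo's actual code; `model` and `input_ids` are placeholder names for a module returning logits and a tokenized prompt):

```python
import torch

@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens=50):
    """Greedily append the most likely next token, feeding it back in each step."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()         # deterministic top-k = 1 pick
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
    return input_ids
```

Because argmax is deterministic, rerunning the same prompt always yields the same 50 tokens.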
Make sure you request 3 GPUs and enough CPU memory to load the 30B weights: GPUs 0 and 1 hold the large model with pipeline parallelism, and GPU 2 holds the small model (sketched below).
salloc -p kempner -t 0-02:00 --mem 240000 --gres=gpu:3
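A minimal sketch of that device layout (an assumption for illustration only; the repo's actual loading code handles the split, and a real pipeline also moves activations between GPUs during the forward pass):

```python
import torch

def place_models(large_model: torch.nn.Module, small_model: torch.nn.Module):
    """Naive pipeline split: first half of the large model's layers on GPU 0,
    second half on GPU 1, and the whole small model on GPU 2."""
    layers = list(large_model.children())
    half = len(layers) // 2
    for layer in layers[:half]:
        layer.to("cuda:0")    # first pipeline stage
    for layer in layers[half:]:
        layer.to("cuda:1")    # second stage; activations cross GPUs in forward()
    small_model.to("cuda:2")  # the 7B model fits on a single GPU
```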
It takes a couple of minutes to load the models, but inference itself is fast.
On GPUs with bfloat16 support, the generate.py script will automatically convert the weights and consume about 14 GB.
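A sketch of the kind of dtype conversion this refers to (an assumption about the mechanism, not the script's exact code; `maybe_cast_bf16` is a hypothetical helper):

```python
import torch

def maybe_cast_bf16(model: torch.nn.Module) -> torch.nn.Module:
    # On bfloat16-capable GPUs, cast the weights down from float32; this
    # roughly halves GPU memory (around 14 GB total here).
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return model.to(dtype=torch.bfloat16)
    return model
```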
See python generate.py --help for more options.