tloen/llama-int8

Quantized inference code for LLaMA models

Python · GPL-3.0

Issues

Does this support llama2 as well?
#21 opened 9 months ago by YaoJiayi
0
Producing nan Tensors
#20 opened a year ago by Bryan-Lavender
0
CUDA out of memory
#19 opened 2 years ago by fengyh3
0
Getting error on generation in Windows
#12 opened 2 years ago by elephantpanda
4
65B on multiple GPUs: CUDA out of memory with 4 x GPU RTX A5000 (24GB) / 96GB in total
#18 opened 2 years ago by scampion
3
LLaMA 13B works on a single RTX 4080 16GB
#17 opened 2 years ago by kcchu
1
Further detail needed - installing bitsandbytes from source
#16 opened 2 years ago by chrisbward
1
Issue for bitsandbytes /// NameError: name 'cuda_setup' is not defined. Did you mean: 'CUDASetup'?
#15 opened 2 years ago by kskim-phd
1
Tracking issue for Mac support
#4 opened 2 years ago by pannous
3
Can 65B run on 4*32G GPU?
#11 opened 2 years ago by zhongtao93
0
Is it possible to save the smaller weights so it doesn't have to convert them each time?
#10 opened 2 years ago by spullara
0
With a single A100 80G, memory is about 96G, error loading 65B
#8 opened 2 years ago by dpyneo
3
Systematic comparison of original models to int8 inferencing
#9 opened 2 years ago by innokean
1
Is 8GB able to run the smallest llama model?
#5 opened 2 years ago by lucasjinreal
4
RTX4090 CUDA out of memory.
#7 opened 2 years ago by WuNein
3
Any chance to share quantized int8 7B and 13B models?
#6 opened 2 years ago by progressionnetwork
0
13B - load is successful on T4, but forward pass fails
#2 opened 2 years ago by deep-diver
0