Issues
Add support for MiniCPM
#289 opened - 0
GPTQ vs bitsandbytes
#288 opened - 0
Error when loading GPTQ model
#287 opened - 1
Syntax changed in triton.testing.do_bench() causing error when running llama_inference.py
#285 opened - 0
Support Mistral.
#284 opened - 0
neox.py needs to add "import math"
#282 opened - 0
LoRA and differences with bitsandbytes
#281 opened - 1
Can I quantize the HF version of the LLaMA model?
#279 opened - 1
Would GPTQ be able to support LLaMa2?
#278 opened - 2
Issue with GPTQ
#274 opened - 0
Can it support the OpenLLaMA model?
#273 opened - 0
llama_inference 4-bit error
#270 opened - 3
Proposed changes to reduce VRAM usage, potentially allowing larger models to be quantized on consumer hardware
#269 opened - 2
AttributeError: 'QuantLinear' object has no attribute 'weight' (t5 branch) (Google/flan-ul2)
#268 opened - 1
CUDA out of memory on flan-ul2
#265 opened - 0
SqueezeLLM support?
#264 opened - 0
What is the right perplexity number?
#263 opened - 2
The detected CUDA version (12.1) mismatches the version that was used to compile PyTorch (11.7)
#262 opened - 4
[Question] What is the expected discrepancy between simulated and actually computed values?
#261 opened - 1
Dependency conflicts for `safetensors`
#260 opened - 0
Finetuning Quantized LLaMA
#259 opened - 0
Compare with llama.cpp int4 quantization?
#257 opened - 0
How to quantize BLOOM after LoRA/P-tuning?
#255 opened - 1
I used python llama.py to generate a quantized model, but I can't find the .safetensors model
#254 opened - 0
Is 3-bit quantization not supported?
#248 opened - 2
Sample code does not work
#243 opened - 0
Unable to run 'python setup_cuda.py install'
#242 opened - 0
Porting GPTQ to CPU?
#240 opened - 2
AttributeError: module 'torch.nn.functional' has no attribute 'scaled_dot_product_attention'
#239 opened - 1
6-bit quantization
#236 opened - 0
Giepeto
#234 opened - 3
fastest-inference-4bit fails to build
#233 opened - 1
Benchmark broken on H100
#231 opened