Issues
Syntax changed in triton.testing.do_bench(), causing an error when running llama_inference.py
#285 opened by prasanna - 0
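A likely cause: newer Triton releases renamed do_bench's `percentiles` keyword to `quantiles`. A minimal compatibility shim, assuming the caller only needs the median timing (keyword names and return shapes are from the Triton versions I know of; verify against the installed release):

```python
import triton.testing

def do_bench_compat(fn):
    """Benchmark fn across old and new triton.testing.do_bench signatures."""
    try:
        # Newer Triton takes `quantiles` and returns the requested quantiles.
        return triton.testing.do_bench(fn, quantiles=(0.5,))[0]
    except TypeError:
        # Older Triton takes `percentiles` instead.
        return triton.testing.do_bench(fn, percentiles=(0.5,))[0]
```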
Add support for MiniCPM
#289 opened by LDLINGLINGLING - 0
GPTQ vs bitsandbytes
#288 opened by iaoxuesheng - 0
Error when loading a GPTQ model
#287 opened by KyrieCui - 1
Dependency conflicts for `safetensors`
#260 opened by Yiximail - 1
_pickle.UnpicklingError: invalid load key, 'v'.
#249 opened by ahnHeejune - 2
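The load key `'v'` is the classic signature of a Git LFS pointer file (its text begins with `version https://git-lfs...`) being fed to torch.load instead of the real weights. A quick check, with the checkpoint path as a placeholder:

```python
# If the file is an LFS pointer rather than real weights, torch.load fails
# with "invalid load key, 'v'". Inspect the header before loading.
path = "llama7b-4bit.pt"  # hypothetical checkpoint path
with open(path, "rb") as f:
    head = f.read(16)
if head.startswith(b"version https://"):
    raise RuntimeError("Git LFS pointer file; run `git lfs pull` first.")
```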
Inference with the saved model fails: AttributeError: module 'torch.backends.cuda' has no attribute 'sdp_kernel'
#271 opened by LuciaIsFine - 2
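torch.backends.cuda.sdp_kernel is a PyTorch 2.0 addition, so inference code that uses it fails on 1.x. A guarded sketch that degrades to a no-op context on older builds:

```python
from contextlib import nullcontext

import torch

# sdp_kernel (a context manager selecting attention backends) only exists
# in PyTorch >= 2.0; fall back to a no-op context elsewhere.
if hasattr(torch.backends.cuda, "sdp_kernel"):
    attn_ctx = torch.backends.cuda.sdp_kernel(enable_flash=True)
else:
    attn_ctx = nullcontext()

with attn_ctx:
    pass  # run the model's forward pass here
```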
Porting GPTQ to CPU?
#240 opened by yiliu30 - 2
The inference speed of a GPTQ 4-bit quantized model
#252 opened by pineking - 0
Support Mistral.
#284 opened by nbollman - 0
neox.py needs to add "import math"
#282 opened by StudyingShao - 0
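For the record, the missing import causes a NameError at the first `math.` reference; the fix is a one-line import at the top of neox.py:

```python
# Top of neox.py: without this, any use of math.sqrt etc. raises
# NameError: name 'math' is not defined.
import math
```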
LoRA and how it differs from bitsandbytes
#281 opened by RonanKMcGovern - 1
Transformers broke again (AttributeError: 'GPTQ' object has no attribute 'inp1')
#280 opened by EyeDeck - 1
Would GPTQ be able to support LLaMA 2?
#278 opened by moonlightian - 0
Can I quantize the HF version of the LLaMA model?
#279 opened by akanyaani - 2
Help: Quantized llama-7b model with custom prompt format produces only gibberish
#276 opened by Glavin001 - 3
Proposed changes to reduce VRAM usage and potentially quantize larger models on consumer hardware
#269 opened by sigmareaver - 1
Issue with GPTQ
#274 opened by d0lphin - 1
High PPL when groupsize != -1 for the OPT model after replacing linear layers with QuantLinear
#275 opened by hyx1999 - 2
Can it support the OpenLLaMA model?
#273 opened by Ted8000 - 0
llama_inference 4-bit error
#270 opened by gjm441 - 2
AttributeError: 'QuantLinear' object has no attribute 'weight' (t5 branch) (Google/flan-ul2)
#268 opened by sigmareaver - 1
CUDA out of memory on flan-ul2
#265 opened by sigmareaver - 4
[Question] What is the expected discrepancy between simulated and actually computed values?
#261 opened by set-soft - 2
The detected CUDA version (12.1) mismatches the version that was used to compile PyTorch (11.7)
#262 opened by siddhsql - 2
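The extension build checks the system CUDA toolkit against the one PyTorch was compiled with; a quick way to see both before reinstalling a matching torch wheel or toolkit:

```python
import torch

# The wheel's CUDA version must match (at least in major version) the
# toolkit that nvcc uses to compile the extension.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)   # e.g. '11.7'
print("GPU available:", torch.cuda.is_available())
```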
Sample code does not work
#243 opened by foamliu - 0
SqueezeLLM support?
#264 opened by nikshepsvn - 0
What is the right perplexity number?
#263 opened by JianbangZ - 0
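Reported perplexities differ mainly in tokenizer, context length, and striding, so numbers are only comparable under one fixed recipe. A minimal sketch of the usual definition, exp of the mean token negative log-likelihood, assuming an HF-style causal LM:

```python
import torch

def perplexity(model, input_ids):
    """exp(mean NLL); `model` and `input_ids` are assumed to be a
    Hugging Face causal LM and a (1, seq_len) token tensor."""
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids).loss  # mean NLL per token
    return torch.exp(loss).item()
```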
Finetuning Quantized LLaMA
#259 opened by Qifeng-Wu99 - 0
Comparison with llama.cpp int4 quantization?
#257 opened by luohao123 - 0
How to quantize BLOOM after LoRA/P-tuning?
#255 opened by moonlightian - 2
AttributeError: module 'torch.nn.functional' has no attribute 'scaled_dot_product_attention'
#239 opened by leszekhanusz - 1
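F.scaled_dot_product_attention likewise only exists from PyTorch 2.0 on. A fallback sketch for older versions (ignoring masks and dropout for brevity):

```python
import math

import torch
import torch.nn.functional as F

def sdpa(q, k, v):
    # Use the fused kernel when available (PyTorch >= 2.0) ...
    if hasattr(F, "scaled_dot_product_attention"):
        return F.scaled_dot_product_attention(q, k, v)
    # ... otherwise compute plain softmax attention (no mask, no dropout).
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return scores.softmax(dim=-1) @ v
```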
I used `python llama.py` to generate a quantized model, but I can't find the .safetensors model
#254 opened by jimi202008 - 0
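In most branches of this repo, llama.py writes a .safetensors file only when a dedicated save flag is passed (flag names vary across branches; check `python llama.py -h`). Saving a state dict in that format by hand is a one-liner with the safetensors package:

```python
import torch
from safetensors.torch import save_file

state = {"weight": torch.zeros(4, 4)}  # hypothetical quantized state dict
save_file(state, "llama7b-4bit.safetensors")
```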
Is 3-bit quantization not supported?
#248 opened by foamliu - 0
Unable to run 'python setup_cuda.py install'
#242 opened by alannoote96 - 0
6-bit quantization
#236 opened by philipturner - 1
fastest-inference-4bit fails to build
#233 opened by lee-b - 0
Giepeto
#234 opened by IsaacGanon - 0
Benchmark broken on H100
#231 opened by FrederikAbitz