Issues
Add support for MiniCPM
#289 opened - 0
GPTQ vs bitsandbytes
#288 opened - 0
Error when loading GPTQ model
#287 opened - 1
Syntax changed in triton.testing.do_bench() causing error when running llama_inference.py
#285 opened - 0
Support Mistral.
#284 opened - 0
neox.py needs to add "import math"
#282 opened - 0
LoRA and differences with bitsandbytes
#281 opened - 1
Can I quantize the HF version of the LLaMA model?
#279 opened - 1
Would GPTQ be able to support LLaMa2?
#278 opened - 2
Issue with GPTQ
#274 opened - 0
Can it support the OpenLLaMA model?
#273 opened - 0
llama_inference 4-bit error
#270 opened - 3
Proposed changes to reduce VRAM usage, potentially allowing larger models to be quantized on consumer hardware
#269 opened - 2
AttributeError: 'QuantLinear' object has no attribute 'weight' (t5 branch) (Google/flan-ul2)
#268 opened - 1
CUDA out of memory on flan-ul2
#265 opened - 0
SqueezeLLM support?
#264 opened - 0
What is the right perplexity number?
#263 opened - 2
The detected CUDA version (12.1) mismatches the version that was used to compile PyTorch (11.7)
#262 opened - 4
[Question] What is the expected discrepancy between simulated and actually computed values?
#261 opened - 1
Dependency conflicts for `safetensors`
#260 opened - 0
Finetuning Quantized LLaMA
#259 opened - 0
Compare with llama.cpp int4 quantization?
#257 opened - 0
How to quantize BLOOM after LoRA/P-tuning?
#255 opened - 1
I used python llama.py to generate a quantized model, but I can't find the .safetensors model
#254 opened - 0
Is 3-bit quantization not supported?
#248 opened - 2
Sample code does not work
#243 opened - 0
Unable to run 'python setup_cuda.py install'
#242 opened - 0
Porting GPTQ to CPU?
#240 opened - 2
AttributeError: module 'torch.nn.functional' has no attribute 'scaled_dot_product_attention'
#239 opened - 1
6-bit quantization
#236 opened - 0
Giepeto
#234 opened - 3
fastest-inference-4bit fails to build
#233 opened - 1
Benchmark broken on H100
#231 opened