Issues
question about the zero_point
#230 opened - 3
running on old gpu with fp32 only
#229 opened - 6
How to inference llama-65b-4bit on multi-GPU
#228 opened - 11
OpenCL support
#224 opened - 2
Errors when compiling with CUDA 12.1
#220 opened - 0
Error on A100: device kernel image is invalid
#219 opened - 2
Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
#217 opened - 1
CUDA kernel sync problem
#216 opened - 2
wbit=16 Conversion Gives Error
#215 opened - 1
CUDA Benchmark on 2bit, 3bit, 4bit models - Why is 3bit slower than 4bit, but faster than 2bit?
#214 opened - 1
4bits on 65B
#213 opened - 0
neox.py generates randrange() error
#207 opened - 2
CUDA: 8bit quantized models are stupid.
#205 opened - 0
why disable tf32?
#192 opened - 4
slower inference speed
#191 opened - 3
Inference with Beam > 1 broken in Triton
#188 opened - 3
Quantizing 7B with 8GB VRAM OOMs
#182 opened - 1
Fused MLP causes assertion error
#179 opened - 2
ERROR: Could not find a version that satisfies the requirement triton==2.0.0 (from versions: none)
#175 opened - 5
Fixing Triton -"Unexpected MMA layout version found" for prevolta GPUs raises new problems
#174 opened - 5
make into a package (like sterlind did)
#173 opened - 1
llama.cpp ERROR
#172 opened - 4
Issue on Multi-GPU on the cuda branch
#170 opened - 1
What is the command to install Triton?
#167 opened - 2
my error
#162 opened - 14
Killed
#160 opened - 2
Installation issue | WSL 2
#158 opened - 25
T5 Benchmark
#157 opened