Issues
question about the zero_point
#230 opened - 3
running on old gpu with fp32 only
#229 opened - 6
How to inference llama-65b-4bit on multi-GPU
#228 opened - 11
OpenCL support
#224 opened - 2
Errors when compiling with CUDA 12.1
#220 opened - 0
Error on A100: device kernel image is invalid
#219 opened - 2
Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
#217 opened - 1
CUDA kernel sync problem
#216 opened - 2
wbit=16 Conversion Gives Error
#215 opened - 1
CUDA Benchmark on 2bit, 3bit, 4bit models - Why is 3bit slower than 4bit, but faster than 2bit?
#214 opened - 1
4bits on 65B
#213 opened - 0
neox.py generates randrange() error
#207 opened - 2
CUDA: 8bit quantized models are stupid.
#205 opened - 0
why disable tf32?
#192 opened - 4
slower inference speed
#191 opened - 3
Inference with Beam > 1 broken in Triton
#188 opened - 3
Quantizing 7B with 8GB VRAM OOMs
#182 opened - 1
Fused MLP causes assertion error
#179 opened - 2
ERROR: Could not find a version that satisfies the requirement triton==2.0.0 (from versions: none)
#175 opened - 5
Fixing Triton -"Unexpected MMA layout version found" for prevolta GPUs raises new problems
#174 opened - 5
make into a package (like sterlind did)
#173 opened - 1
llama.cpp ERROR
#172 opened - 4
Issue on Multi-GPU on the cuda branch
#170 opened - 1
What is the command to install Triton?
#167 opened - 2
my error
#162 opened - 14
Killed
#160 opened - 2
Installation issue | WSL 2
#158 opened - 25
T5 Benchmark
#157 opened