Issues
Question about dequantization
#638 opened by HaoWeiWang - 1
Triton Support
#574 opened by rjmehta1993 - 7
[BUG] Failed to quantize Qwen2.5-Math-72B-Instruct: Measurement/inference error (3): hidden_states
#627 opened by Orion-zhen - 4
[BUG] chat-instruct Llama 3.1 end word "assistant "
#632 opened by Katehuuh - 4
[BUG] exllamav2-0.2.2+cu118.torch2.4.0-cp310-cp310-win_amd64.whl Version seems missing under releases.
#633 opened by Nrgte - 0
[BUG] Random slowdowns in tensor parallel.
#630 opened by Ph0rk0z - 0
[REQUEST] Support Yarn for Qwen 2.5 >32K
#629 opened by Downtown-Case - 9
[BUG] Quantization of Qwen returns garbage
#621 opened by fahadh4ilyas - 4
[BUG] Qwen 2.5 34B returns garbage at certain quantization levels, but not others
#628 opened by Downtown-Case - 6
Curious about Exllama+TP
#571 opened by grimulkan - 5
How to implement paged attention in HF format?
#616 opened by fahadh4ilyas - 6
Error in quant
#587 opened by Orion-zhen - 6
[BUG] 0.2.1 doesn't compile on Opensuse
#620 opened by avidwriter - 4
Batch generation with Exllamav2_HF is weird
#606 opened by fahadh4ilyas - 16
Command R+ is broken?
#612 opened by Ph0rk0z - 4
A doubt regarding filters/tools.
#581 opened by royallavanya140 - 0
How can I solve this problem?
#611 opened by Sultan0ML - 2
Pipeline mode support
#605 opened by laoda513 - 22
lollms exllamav2 binding module not found
#602 opened by Fuckingnameless - 1
Remove tokens and system prompt from generation
#608 opened by kaykyr - 1
Tensor parallelism issues
#598 opened by dirkson - 1
Does NVLink improve tensor parallelism?
#603 opened by bryanhpchiang - 3
Async Stream Generator?
#604 opened by KingBipo - 2
MemoryError despite sufficient system resources
#596 opened by KingBipo - 0
Do you know of any code framework that supports fast attention score calculation similar to flash attention?
#599 opened by qiyuxinlin - 0
Request for multi model support
#595 opened by royallavanya140 - 1
ModuleNotFoundError: No module named 'blessed'
#584 opened by puppetm4st3r - 4
[question] Wrapper Linear API and 2bits
#589 opened by wenhuach21 - 15
problem with cache.
#591 opened by royallavanya140 - 2
Llama 3 speed
#585 opened by freQuensy23-coder - 5
Will it support CPU offloading?
#578 opened by fzyzcjy - 1
Q8 or unquantized cache with what context length for llama 3.1-8b 5.0 bpw exl2?
#575 opened by lovebeatz - 1
name 'flash_attn_func' is not defined
#576 opened by ZhenyaPav - 0
Add more docs and type annotations
#579 opened by Dan-wanna-M - 3
orig_func Quantization error
#573 opened by Masterjp123 - 39
Quantizing Llama 3.1 405B
#565 opened by grimulkan - 1
Enhancement: Docker Image Github Actions
#570 opened by 0x4139 - 1
No prebuilt pip package for version 0.1.8
#567 opened by debasish-mihup - 3
Got error in new model LLama 3.1 : Value for eos_token_id is not of expected type <class 'int'>
#566 opened by GrennKren - 1