[Bug] Bad outputs with fp8 quantization at high RPS
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
I ran an RPS benchmark script with prompts averaging about 1600 input tokens and got bad outputs as the RPS increased. For example:
*给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给追给给给给给给给给迫你。
It seems to be related to quantization and concurrent requests. I've listed some commands below with various models, quantization settings, and max-num-reqs values, along with whether they produced good or bad outputs at high RPS and high running-req counts.
Unfortunately I can't share the exact prompts used, but I'll update as I find other reproducible prompts.
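In the meantime, the benchmark is roughly equivalent to the sketch below (illustrative only, not my exact script; the prompt text, model name, and request count are placeholders): it fires a batch of concurrent chat-completion requests at sglang's OpenAI-compatible endpoint and prints a few outputs.

```python
# Rough stand-in for my RPS benchmark; prompt, model name, and request count are placeholders.
import asyncio
from openai import AsyncOpenAI

# sglang serves an OpenAI-compatible API; the port matches the launch commands below.
client = AsyncOpenAI(base_url="http://localhost:30003/v1", api_key="EMPTY")

# Placeholder long prompt; the real prompts average ~1600 input tokens.
# Keep prompt + max_tokens under --context-length 2048.
PROMPT = "Summarize the following document:\n" + ("lorem ipsum dolor sit amet " * 200)

async def one_request() -> str:
    resp = await client.chat.completions.create(
        model="default",  # placeholder; use the served model path/name
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=256,
        temperature=0.7,
    )
    return resp.choices[0].message.content

async def main(n_concurrent: int = 150) -> None:
    # Enough in-flight requests to push #running-req above ~100,
    # which is roughly where the garbled outputs start to appear for me.
    outputs = await asyncio.gather(*(one_request() for _ in range(n_concurrent)))
    for out in outputs[:5]:
        print(repr(out[:120]))

if __name__ == "__main__":
    asyncio.run(main())
```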
Here's a summary:
- Unquantized models with the `--quantization fp8` flag have bad outputs at high RPS.
- Unquantized models with the `--quantization fp8` and `--max-num-reqs 10` flags have good outputs at high RPS.
- A pre-quantized fp8 model with no `--quantization` or `--max-num-reqs` flag had good outputs at high RPS.
Reproduction
BAD OUTPUTS @ 5.5 RPS
#running-req: 137
CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3.1-70B-Instruct --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048 --quantization fp8
-----------------------------------
GOOD OUTPUTS @ 5.5 RPS
#running-req: 8
CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3.1-70B-Instruct --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048 --quantization fp8 --max-num-reqs 10
-----------------------------------
BAD OUTPUTS @ 5.5 RPS
#running-req: 135
CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048 --quantization fp8
-----------------------------------
GOOD OUTPUTS @ 5.5 RPS
#running-req: 8
CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048 --quantization fp8 --max-num-reqs 10
-----------------------------------
GOOD OUTPUTS @ 5.5 RPS
#running-req: 136
CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B-FP8 --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048
Environment
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 12.2, V12.2.140
CUDA Driver Version: 535.129.03
PyTorch: 2.4.0+cu121
sglang: 0.2.13
flashinfer: 0.1.5+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.112.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.4
multipart: 0.0.9
openai: 1.42.0
anthropic: 0.34.1
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 SYS 0-103 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 SYS 0-103 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS 0-103 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS 0-103 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS 104-207 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS 104-207 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS 104-207 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS 104-207 1 N/A
NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
ulimit soft: 4096
It seems to be a critical bug. Although we have not seen this before, it would be very helpful if you could share some reproducible examples.
In the meantime, you can help us find the source of this bug by trying the following options (one at a time):
- `--disable-cuda-graph`
- `--disable-flashinfer`
- `--disable-flashinfer-sampling`
- `--chunked-prefill-size -1`
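For example, taking the first reproduction command above and adding one flag at a time (illustrative, not a verified fix):
CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3.1-70B-Instruct --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048 --quantization fp8 --disable-cuda-graph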
Hi, I ran into the same problem when I made too many requests at the same time. The output just became 梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦... May I ask if this bug will be fixed any time soon?
Thanks for your excellent library!
v0.2.15 fixes some fp8 weight loading bugs. Could you give it a try?
Actually, I'm already using v0.2.15. I loaded an AWQ quantization of Mistral Large 2407 on 4 A40s. I got normal outputs with 50 simultaneous requests using the asynchronous client from the openai library, but invalid tokens at 100. I haven't tested in more detail, though.
Are you using regex or other constrained generation in your pipeline? If so, you were probably affected by the bug that was fixed here: 47f20da
> Are you using regex or other constrained generation in your pipeline? If so, you were probably affected by the bug that was fixed here: 47f20da

No, I'm not using any constrained generation, just simple text input and output. Currently I work around this problem by submitting only 50 requests at a time, since sglang is still much faster than the other libraries I tried.
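For reference, the workaround is roughly the pattern below (a sketch, not my exact code; the endpoint and model name are placeholders): an asyncio.Semaphore caps the number of in-flight requests at 50.

```python
# Sketch of the concurrency cap used as a workaround; endpoint and model name are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:30003/v1", api_key="EMPTY")

async def capped_request(sem: asyncio.Semaphore, prompt: str) -> str:
    async with sem:  # waits here if 50 requests are already in flight
        resp = await client.chat.completions.create(
            model="default",  # placeholder; use the served model path/name
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256,
        )
        return resp.choices[0].message.content

async def run_all(prompts: list[str]) -> list[str]:
    sem = asyncio.Semaphore(50)  # at most 50 concurrent requests
    return await asyncio.gather(*(capped_request(sem, p) for p in prompts))
```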
> Actually, I'm already using v0.2.15. I loaded an AWQ quantization of Mistral Large 2407 on 4 A40s. I got normal outputs with 50 simultaneous requests using the asynchronous client from the openai library, but invalid tokens at 100. I haven't tested in more detail, though.

I have observed this as well on A100s at TP4 with AWQ, GPTQ, and w8a8.
This is perhaps related: vllm-project/vllm#7228
> It seems to be a critical bug. Although we have not seen this before, it would be very helpful if you could share some reproducible examples.
> In the meantime, you can help us find the source of this bug by trying the following options (one at a time):
> - `--disable-cuda-graph`
> - `--disable-flashinfer`
> - `--disable-flashinfer-sampling`
> - `--chunked-prefill-size -1`

It happens even with those options disabled (tested on 2x H100 with Llama 2 70B fp8).
Has this issue been resolved? I sometimes encounter it too (deepseek-v2.5-fp8). I didn't encounter it on commit 2abe4f1, but it appeared in the latest commit (8f527e2). @zhyncs @merrymercy This looks like a rather serious bug.
I'm still waiting on this as well. Please let me know if I can be of any help in the meantime; I can test any models or configurations.
@qeternity This issue seems unrelated to the model; I encountered the same problem using another model (deepseek-v2) as well.