[Bug] Bad outputs with fp8 quantization at high RPS
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
I ran an RPS benchmark script with prompts averaging about 1600 input tokens and got bad outputs as the RPS increased. For example:
*给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给追给给给给给给给给迫你。
It seems to be related to quantization and concurrent requests. I've listed some commands below with various models, quantization settings, and max-num-reqs values, along with whether they produced good or bad outputs at high RPS and high running-req counts.
Unfortunately I can't share the exact prompts used, but I'll update as I find other reproducible prompts.
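In the meantime, the benchmark is roughly equivalent to the sketch below (illustrative only, not my exact script; the prompt text, model name, and request count are placeholders): it fires a batch of concurrent chat-completion requests at sglang's OpenAI-compatible endpoint and prints a few outputs.

```python
# Rough stand-in for my RPS benchmark; prompt, model name, and request count are placeholders.
import asyncio
from openai import AsyncOpenAI

# sglang serves an OpenAI-compatible API; the port matches the launch commands below.
client = AsyncOpenAI(base_url="http://localhost:30003/v1", api_key="EMPTY")

# Placeholder long prompt; the real prompts average ~1600 input tokens.
# Keep prompt + max_tokens under --context-length 2048.
PROMPT = "Summarize the following document:\n" + ("lorem ipsum dolor sit amet " * 200)

async def one_request() -> str:
    resp = await client.chat.completions.create(
        model="default",  # placeholder; use the served model path/name
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=256,
        temperature=0.7,
    )
    return resp.choices[0].message.content

async def main(n_concurrent: int = 150) -> None:
    # Enough in-flight requests to push #running-req above ~100,
    # which is roughly where the garbled outputs start to appear for me.
    outputs = await asyncio.gather(*(one_request() for _ in range(n_concurrent)))
    for out in outputs[:5]:
        print(repr(out[:120]))

if __name__ == "__main__":
    asyncio.run(main())
```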
Here's a summary:
- Unquantized models with the `--quantization fp8` flag have bad outputs at high RPS.
- Unquantized models with the `--quantization fp8` and `--max-num-reqs 10` flags have good outputs at high RPS.
- A pre-quantized fp8 model with no `--quantization` or `--max-num-reqs` flag had good outputs at high RPS.
Reproduction
BAD OUTPUTS @ 5.5 RPS
#running-req: 137
CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3.1-70B-Instruct --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048 --quantization fp8
-----------------------------------
GOOD OUTPUTS @ 5.5 RPS
#running-req: 8
CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3.1-70B-Instruct --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048 --quantization fp8 --max-num-reqs 10
-----------------------------------
BAD OUTPUTS @ 5.5 RPS
#running-req: 135
CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048 --quantization fp8
-----------------------------------
GOOD OUTPUTS @ 5.5 RPS
#running-req: 8
CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048 --quantization fp8 --max-num-reqs 10
-----------------------------------
GOOD OUTPUTS @ 5.5 RPS
#running-req: 136
CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B-FP8 --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048
Environment
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 12.2, V12.2.140
CUDA Driver Version: 535.129.03
PyTorch: 2.4.0+cu121
sglang: 0.2.13
flashinfer: 0.1.5+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.112.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.4
multipart: 0.0.9
openai: 1.42.0
anthropic: 0.34.1
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 SYS 0-103 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 SYS 0-103 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS 0-103 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS 0-103 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS 104-207 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS 104-207 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS 104-207 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS 104-207 1 N/A
NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
ulimit soft: 4096
It seems to be a critical bug. Although we have not seen this before, it would be very helpful if you could share some reproducible examples.
In the meantime, you can help us find the source of this bug by trying the following options (one at a time):
- `--disable-cuda-graph`
- `--disable-flashinfer`
- `--disable-flashinfer-sampling`
- `--chunked-prefill-size -1`
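For example, taking the first reproduction command above and adding one flag at a time (illustrative, not a verified fix):
CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3.1-70B-Instruct --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048 --quantization fp8 --disable-cuda-graph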
Hi, I ran into the same problem when I made too many requests at the same time. The output just became 梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦... May I ask if this bug will be fixed any time soon?
Thanks for your excellent library!
v0.2.15 fixes some fp8 weight loading bugs. Could you give it a try?
Actually, I'm already using v0.2.15. I loaded an AWQ quantization of Mistral Large 2407 on 4 A40s. I got normal outputs with 50 simultaneous requests using the asynchronous client from the openai library, but invalid tokens at 100. I haven't tested in more detail, though.
Are you using regex or other constrained generation in your pipeline? If so, you were probably affected by the bug that was fixed here: 47f20da
> Are you using regex or other constrained generation in your pipeline? If so, you were probably affected by the bug that was fixed here: 47f20da

No, I'm not using any constrained generation, just simple text input and output. Currently I work around this problem by submitting only 50 requests at a time, since sglang is still much faster than the other libraries I tried.
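For reference, the workaround is roughly the pattern below (a sketch, not my exact code; the endpoint and model name are placeholders): an asyncio.Semaphore caps the number of in-flight requests at 50.

```python
# Sketch of the concurrency cap used as a workaround; endpoint and model name are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:30003/v1", api_key="EMPTY")

async def capped_request(sem: asyncio.Semaphore, prompt: str) -> str:
    async with sem:  # waits here if 50 requests are already in flight
        resp = await client.chat.completions.create(
            model="default",  # placeholder; use the served model path/name
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256,
        )
        return resp.choices[0].message.content

async def run_all(prompts: list[str]) -> list[str]:
    sem = asyncio.Semaphore(50)  # at most 50 concurrent requests
    return await asyncio.gather(*(capped_request(sem, p) for p in prompts))
```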
> Actually, I'm already using v0.2.15. I loaded an AWQ quantization of Mistral Large 2407 on 4 A40s. I got normal outputs with 50 simultaneous requests using the asynchronous client from the openai library, but invalid tokens at 100. I haven't tested in more detail, though.

I have observed this as well on A100s at TP4 with AWQ, GPTQ, and w8a8.
This is perhaps related: vllm-project/vllm#7228
> It seems to be a critical bug. Although we have not seen this before, it would be very helpful if you could share some reproducible examples.
> In the meantime, you can help us find the source of this bug by trying the following options (one at a time):
> - `--disable-cuda-graph`
> - `--disable-flashinfer`
> - `--disable-flashinfer-sampling`
> - `--chunked-prefill-size -1`

It happens even with those options disabled (tested on 2x H100 with Llama 2 70B fp8).
Has this issue been resolved? I sometimes encounter it too (deepseek-v2.5-fp8). I didn't encounter it on commit 2abe4f1, but it appeared in the latest commit (8f527e2). @zhyncs @merrymercy This looks like a rather serious bug.
I'm still waiting on this as well. Please let me know if I can be of any help in the meantime; I can test any models or configurations.
@qeternity This issue seems unrelated to the model; I encountered the same problem using another model (deepseek-v2) as well.