[Bug]: qwen2.5-72b-instruct MATH self-test score differs significantly from the leaderboard score
Model Series
Qwen2.5
What are the models used?
qwen2.5-72b-instruct
What is the scenario where the problem happened?
The qwen2.5-72b-instruct MATH evaluation score differs significantly from the official leaderboard.
Is this a known issue?
- I have followed the GitHub README.
- I have checked the Qwen documentation and cannot find an answer there.
- I have checked the documentation of the related framework and cannot find useful information.
- I have searched the issues and there is not a similar one.
Information about environment
OS: Ubuntu 22.04
Python: Python 3.10.6
GPUs: 8 x NVIDIA A100
NVIDIA driver: 470.141.10 (from nvidia-smi)
CUDA compiler: 12.1 (from nvcc -V)
PyTorch: 2.4.0+cu121 (from python -c "import torch; print(torch.__version__)")
Log output
Benchmark run log:
10/15 16:01:01 - OpenCompass - INFO - Task [qwen2/math]
10/15 16:01:03 - OpenCompass - WARNING - Max Completion tokens for qwen2 is :16384
10/15 16:01:03 - OpenCompass - INFO - Try to load the data from /root/.cache/opencompass/./data/math/math.json
10/15 16:01:03 - OpenCompass - INFO - Start inferencing [qwen2/math]
[2024-10-15 16:01:05,267] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting build dataloader
[2024-10-15 16:01:05,267] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
100%|██████████| 40/40 [58:08<00:00, 87.22s/it]
10/15 16:59:14 - OpenCompass - INFO - time elapsed: 3492.95s
Benchmark result log:
10/15 16:59:20 - OpenCompass - INFO - Try to load the data from /root/.cache/opencompass/./data/math/math.json
10/15 16:59:20 - OpenCompass - INFO - Task [qwen2/math]: {'accuracy': 49.18}
10/15 16:59:20 - OpenCompass - INFO - time elapsed: 2.93s
Description
Launch command: python -m vllm.entrypoints.openai.api_server --port 8000 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --model /oss/Qwen2.5-72B-Instruct --served-model-name qwen2
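As a quick sanity check before benchmarking (an illustration, not part of the original report), the following Python sketch queries the server's /v1/models endpoint, which vLLM's OpenAI-compatible server exposes, to confirm the served model name matches the "qwen2" referenced in the eval config below:

# Sanity check: confirm the vLLM server is up and serving the expected
# model name ("qwen2", set via --served-model-name above).
import requests

resp = requests.get("http://127.0.0.1:8000/v1/models", timeout=10)
resp.raise_for_status()
served = [m["id"] for m in resp.json()["data"]]
print(served)
assert "qwen2" in served, "served model name does not match the eval config"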
Benchmark command:
cat <<'EOF' >eval_openai_api.json
{
  "eval_backend": "OpenCompass",
  "eval_config": {
    "datasets": [
      "math"
    ],
    "models": [
      {
        "path": "qwen2",
        "openai_api_base": "http://127.0.0.1:8000/v1/chat/completions",
        "temperature": 0.0
      }
    ]
  }
}
EOF
cat <<'EOF' >eval_openai_api.py
from evalscope.run import run_task
from evalscope.summarizer import Summarizer

def run_eval():
    task_cfg = 'eval_openai_api.json'
    run_task(task_cfg=task_cfg)
    print('>> Start to get the report with summarizer ...')
    report_list = Summarizer.get_report_from_cfg(task_cfg)
    print(f'\n>> The report list: {report_list}')

run_eval()
EOF
python eval_openai_api.py
@tianshiyisi You can find the scripts for reproducing our results here: https://github.com/QwenLM/Qwen2.5-Math/tree/main?tab=readme-ov-file#evaluation
Thanks @hzhwcmhf, I will reproduce this and compare against the score from
https://qwenlm.github.io/zh/blog/qwen2.5-llm/
The issue has been resolved, thanks. Using the Qwen2.5-Math evaluation scripts to test on the MATH dataset, the Qwen2.5-72B-Instruct model scored 82.8, while the Qwen2.5-Math-72B-Instruct model scored 85.3. The reason is that the two tools construct different prompts, specifically the position of the CoT (chain-of-thought) hint. The former (with the CoT hint appended to the user message) is more likely to lead to the correct answer.
Qwen2.5-Math example:
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "A robot moving forward at a constant speed takes 2.5 hours to travel 1 kilometer. Moving forward at this same constant speed, it takes the robot 90 seconds to travel the length of a particular hallway. How many meters long is the hallway? Please reason step by step, and put your final answer within \\boxed{}."
      }
    ],
    "temperature": 0,
    "top_p": 1.0,
    "top_k": -1
  }' http://127.0.0.1:8000/v1/chat/completions
OpenCompass prompt example:
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2",
    "messages": [
      {
        "role": "system",
        "content": "Please reason step by step, and put your final answer within \\boxed{}."
      },
      {
        "role": "user",
        "content": "A robot moving forward at a constant speed takes 2.5 hours to travel 1 kilometer. Moving forward at this same constant speed, it takes the robot 90 seconds to travel the length of a particular hallway. How many meters long is the hallway?"
      }
    ],
    "temperature": 0,
    "top_p": 1.0,
    "top_k": -1
  }' http://127.0.0.1:8000/v1/chat/completions
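To make the comparison easy to reproduce, here is a small Python sketch (an illustration, not from the original thread) that sends the same problem to the local endpoint with the CoT hint in each of the two positions:

# A/B test: CoT hint in the user turn (Qwen2.5-Math style) vs. in the
# system turn (OpenCompass style), against the same local vLLM server.
import requests

URL = "http://127.0.0.1:8000/v1/chat/completions"
COT = "Please reason step by step, and put your final answer within \\boxed{}."
PROBLEM = ("A robot moving forward at a constant speed takes 2.5 hours to "
           "travel 1 kilometer. Moving forward at this same constant speed, "
           "it takes the robot 90 seconds to travel the length of a "
           "particular hallway. How many meters long is the hallway?")

def ask(messages):
    payload = {"model": "qwen2", "messages": messages,
               "temperature": 0, "top_p": 1.0}
    r = requests.post(URL, json=payload, timeout=600)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

# CoT hint appended to the user message (Qwen2.5-Math style)
print(ask([{"role": "system", "content": "You are a helpful assistant."},
           {"role": "user", "content": f"{PROBLEM} {COT}"}]))

# CoT hint as the system message (OpenCompass style)
print(ask([{"role": "system", "content": COT},
           {"role": "user", "content": PROBLEM}]))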
I want to confirm whether the location of the CoT prompt alone accounts for a roughly 30-point accuracy boost. If so, I may adjust the way I use the system message in the future.
Yes. I verified that evalscope uses the math_gen_265cce.py config from OpenCompass. I modified it to use the math_0shot_gen_393424.py config instead; after re-running the benchmark, the score improved from 49.18 to 75.18.
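For reference, the zero-shot config looks roughly like the sketch below, reconstructed from OpenCompass's dataset-config conventions; the authoritative version is math_0shot_gen_393424.py in the OpenCompass repo, and details here such as max_out_len and the dataset path are assumptions:

# Rough sketch of a zero-shot MATH config in OpenCompass style; the key
# point is that the CoT hint sits in the user (HUMAN) turn, not the system
# prompt. See math_0shot_gen_393424.py in OpenCompass for the exact file.
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import MATHDataset, MATHEvaluator, math_postprocess_v2

math_reader_cfg = dict(input_columns=['problem'], output_column='solution')

math_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(round=[
            # Zero-shot: the CoT instruction is appended to the question.
            dict(role='HUMAN',
                 prompt='{problem}\nPlease reason step by step, and put your '
                        'final answer within \\boxed{}.'),
        ])),
    retriever=dict(type=ZeroRetriever),  # no few-shot examples
    inferencer=dict(type=GenInferencer, max_out_len=1024))

math_eval_cfg = dict(
    evaluator=dict(type=MATHEvaluator, version='v2'),
    pred_postprocessor=dict(type=math_postprocess_v2))

math_datasets = [
    dict(type=MATHDataset,
         abbr='math',
         path='./data/math/math.json',  # assumption: local path from the log
         reader_cfg=math_reader_cfg,
         infer_cfg=math_infer_cfg,
         eval_cfg=math_eval_cfg)
]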
Hello, why do I get only a 58.16 score when I test Qwen2.5-Math-72B-Instruct with the math_0shot_gen_393424.py config in OpenCompass? The model also outputs a lot of extra content. Did you run into the same problem?
@13416157913 please raise that at https://github.com/QwenLM/Qwen2.5-Math