[Bug]: qwen2.5-72b-instruct MATH self-test score differs significantly from the leaderboard score
Model Series
Qwen2.5
What are the models used?
qwen2.5-72b-instruct
What is the scenario where the problem happened?
The qwen2.5-72b-instruct MATH evaluation score differs significantly from the official leaderboard.
Is this a known issue?
- I have followed the GitHub README.
- I have checked the Qwen documentation and cannot find an answer there.
- I have checked the documentation of the related framework and cannot find useful information.
- I have searched the issues and there is not a similar one.
Information about environment
OS: Ubuntu 22.04
Python: Python 3.10.6
GPUs: 8 x NVIDIA A100
NVIDIA driver: 470.141.10 (from nvidia-smi)
CUDA compiler: 12.1 (from nvcc -V)
PyTorch: 2.4.0+cu121 (from python -c "import torch; print(torch.__version__)")
Log output
Benchmark run log:
10/15 16:01:01 - OpenCompass - INFO - Task [qwen2/math]
10/15 16:01:03 - OpenCompass - WARNING - Max Completion tokens for qwen2 is :16384
10/15 16:01:03 - OpenCompass - INFO - Try to load the data from /root/.cache/opencompass/./data/math/math.json
10/15 16:01:03 - OpenCompass - INFO - Start inferencing [qwen2/math]
[2024-10-15 16:01:05,267] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting build dataloader
[2024-10-15 16:01:05,267] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
100%|██████████| 40/40 [58:08<00:00, 87.22s/it]
10/15 16:59:14 - OpenCompass - INFO - time elapsed: 3492.95s
Benchmark result log:
10/15 16:59:20 - OpenCompass - INFO - Try to load the data from /root/.cache/opencompass/./data/math/math.json
10/15 16:59:20 - OpenCompass - INFO - Task [qwen2/math]: {'accuracy': 49.18}
10/15 16:59:20 - OpenCompass - INFO - time elapsed: 2.93s
Description
Launch command: python -m vllm.entrypoints.openai.api_server --port 8000 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --model /oss/Qwen2.5-72B-Instruct --served-model-name qwen2
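As a quick sanity check before benchmarking (an illustration, not part of the original report), the following Python sketch queries the server's /v1/models endpoint, which vLLM's OpenAI-compatible server exposes, to confirm the served model name matches the "qwen2" referenced in the eval config below:

# Sanity check: confirm the vLLM server is up and serving the expected
# model name ("qwen2", set via --served-model-name above).
import requests

resp = requests.get("http://127.0.0.1:8000/v1/models", timeout=10)
resp.raise_for_status()
served = [m["id"] for m in resp.json()["data"]]
print(served)
assert "qwen2" in served, "served model name does not match the eval config"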
Benchmark command:
cat <<'EOF' >eval_openai_api.json
{
  "eval_backend": "OpenCompass",
  "eval_config": {
    "datasets": [
      "math"
    ],
    "models": [
      {
        "path": "qwen2",
        "openai_api_base": "http://127.0.0.1:8000/v1/chat/completions",
        "temperature": 0.0
      }
    ]
  }
}
EOF
cat <<'EOF' >eval_openai_api.py
from evalscope.run import run_task
from evalscope.summarizer import Summarizer

def run_eval():
    task_cfg = 'eval_openai_api.json'
    run_task(task_cfg=task_cfg)
    print('>> Start to get the report with summarizer ...')
    report_list = Summarizer.get_report_from_cfg(task_cfg)
    print(f'\n>> The report list: {report_list}')

run_eval()
EOF
python eval_openai_api.py
@tianshiyisi You can find the scripts for reproducing our results here: https://github.com/QwenLM/Qwen2.5-Math/tree/main?tab=readme-ov-file#evaluation
Thanks @hzhwcmhf, I will reproduce this and compare against the score from
https://qwenlm.github.io/zh/blog/qwen2.5-llm/
The issue has been resolved, thanks. Using the Qwen2.5-Math evaluation scripts to test on the MATH dataset, the Qwen2.5-72B-Instruct model scored 82.8, while the Qwen2.5-Math-72B-Instruct model scored 85.3. The reason is that the two tools construct different prompts, specifically the position of the CoT (chain-of-thought) hint. The former (with the CoT hint appended to the user message) is more likely to lead to the correct answer.
Qwen2.5-Math example:
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "A robot moving forward at a constant speed takes 2.5 hours to travel 1 kilometer. Moving forward at this same constant speed, it takes the robot 90 seconds to travel the length of a particular hallway. How many meters long is the hallway? Please reason step by step, and put your final answer within \\boxed{}."
      }
    ],
    "temperature": 0,
    "top_p": 1.0,
    "top_k": -1
  }' http://127.0.0.1:8000/v1/chat/completions
OpenCompass prompt example:
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2",
    "messages": [
      {
        "role": "system",
        "content": "Please reason step by step, and put your final answer within \\boxed{}."
      },
      {
        "role": "user",
        "content": "A robot moving forward at a constant speed takes 2.5 hours to travel 1 kilometer. Moving forward at this same constant speed, it takes the robot 90 seconds to travel the length of a particular hallway. How many meters long is the hallway?"
      }
    ],
    "temperature": 0,
    "top_p": 1.0,
    "top_k": -1
  }' http://127.0.0.1:8000/v1/chat/completions
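To make the comparison easy to reproduce, here is a small Python sketch (an illustration, not from the original thread) that sends the same problem to the local endpoint with the CoT hint in each of the two positions:

# A/B test: CoT hint in the user turn (Qwen2.5-Math style) vs. in the
# system turn (OpenCompass style), against the same local vLLM server.
import requests

URL = "http://127.0.0.1:8000/v1/chat/completions"
COT = "Please reason step by step, and put your final answer within \\boxed{}."
PROBLEM = ("A robot moving forward at a constant speed takes 2.5 hours to "
           "travel 1 kilometer. Moving forward at this same constant speed, "
           "it takes the robot 90 seconds to travel the length of a "
           "particular hallway. How many meters long is the hallway?")

def ask(messages):
    payload = {"model": "qwen2", "messages": messages,
               "temperature": 0, "top_p": 1.0}
    r = requests.post(URL, json=payload, timeout=600)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

# CoT hint appended to the user message (Qwen2.5-Math style)
print(ask([{"role": "system", "content": "You are a helpful assistant."},
           {"role": "user", "content": f"{PROBLEM} {COT}"}]))

# CoT hint as the system message (OpenCompass style)
print(ask([{"role": "system", "content": COT},
           {"role": "user", "content": PROBLEM}]))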
I want to confirm whether the location of the CoT prompt alone accounts for a roughly 30-point accuracy boost. If so, I may adjust the way I use the system message in the future.
Yes. I verified that evalscope uses the math_gen_265cce.py config from OpenCompass. I modified it to use the math_0shot_gen_393424.py config instead; after re-running the benchmark, the score improved from 49.18 to 75.18.
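For reference, the zero-shot config looks roughly like the sketch below, reconstructed from OpenCompass's dataset-config conventions; the authoritative version is math_0shot_gen_393424.py in the OpenCompass repo, and details here such as max_out_len and the dataset path are assumptions:

# Rough sketch of a zero-shot MATH config in OpenCompass style; the key
# point is that the CoT hint sits in the user (HUMAN) turn, not the system
# prompt. See math_0shot_gen_393424.py in OpenCompass for the exact file.
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import MATHDataset, MATHEvaluator, math_postprocess_v2

math_reader_cfg = dict(input_columns=['problem'], output_column='solution')

math_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(round=[
            # Zero-shot: the CoT instruction is appended to the question.
            dict(role='HUMAN',
                 prompt='{problem}\nPlease reason step by step, and put your '
                        'final answer within \\boxed{}.'),
        ])),
    retriever=dict(type=ZeroRetriever),  # no few-shot examples
    inferencer=dict(type=GenInferencer, max_out_len=1024))

math_eval_cfg = dict(
    evaluator=dict(type=MATHEvaluator, version='v2'),
    pred_postprocessor=dict(type=math_postprocess_v2))

math_datasets = [
    dict(type=MATHDataset,
         abbr='math',
         path='./data/math/math.json',  # assumption: local path from the log
         reader_cfg=math_reader_cfg,
         infer_cfg=math_infer_cfg,
         eval_cfg=math_eval_cfg)
]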
Hello, why do I get only a 58.16 score when I test Qwen2.5-Math-72B-Instruct with the math_0shot_gen_393424.py config in OpenCompass? The model also outputs a lot of extra content. Did you run into the same problem?
@13416157913 please raise that at https://github.com/QwenLM/Qwen2.5-Math