InternLM/InternLM-Math

InternLM2-Math-Plus-7B evaluation on MATH only gets 5.06% accuracy?

LesterGong opened this issue · 11 comments

I used OpenCompass to evaluate InternLM2-Math-Plus-7B on the MATH dataset and only got 5.06% accuracy.

log:
You are using a model of type internlm to instantiate a model of type internlm2. This is not supported for all configurations of models and can yield errors.

I looked at config.json and found "model_type": "internlm". I wonder whether this is the cause.
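A quick way to confirm what the local checkpoint declares is to read the field directly. The sketch below is only an illustration: the helper name read_model_type and the directory argument are hypothetical, and it assumes the checkpoint folder contains a standard config.json. The thread does not confirm that this field is responsible for the low score.

```python
import json
from pathlib import Path


def read_model_type(model_dir):
    """Return the "model_type" value from <model_dir>/config.json, or None if absent.

    Hypothetical helper for illustration; model_dir is the local checkpoint folder.
    """
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("model_type")
```

Calling read_model_type on the local internlm2-math-plus-7b directory and comparing the result with a freshly downloaded copy would show whether the local files are out of date.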

Please provide the detailed OpenCompass config you used. Can you also post some sample inputs and outputs from your evaluation? We will publish instructions for reproducing MATH and GSM8K on internlm2-math-plus as soon as possible.

Thank you for your reply.

My eval_internlm_math_chat.py:

from mmengine.config import read_base
from opencompass.models.huggingface import HuggingFaceCausalLM

with read_base():
    from .datasets.math.math_gen_736506 import math_datasets

    from .models.hf_internlm.my_internlm2_chat_math_7b import models as internlm_math_chat_7b_models

datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
datasets = [*math_datasets]
models = [*internlm_math_chat_7b_models]

And my model config, internlm2_chat_math_7b.py:

from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='internlm2-chat-math-7b-hf',
        path='my internlm2-math-plus-7b local path',
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=2),
        stop_words=['</s>', '<|im_end|>'],
    )
]

Some sample outputs:

"0": {
        "origin_prompt": [
            {
                "role": "HUMAN",
                "prompt": "Question: If the pattern shown continues, what is the letter in the $2010$th position? MATHLETEMATHLETEMATHLETE...\nLet's think step by step\nAnswer:\n"
            }
        ],
        "prediction": " The pattern shown is \"MATHLETE\" repeating every 7 letters. To find the letter in the 2010th position, we need to find the remainder when 2010 is divided by 7.\n$2010 \\div 7 = 287$ with a remainder of 5.\nThis means that the 2010th letter is the same as the 5th letter in the pattern \"MATHLETE\".\nThe 5th letter in \"MATHLETE\" is \"T\".\nSo, the letter in the 2010th position is \"T\". The answer is: $\\boxed{T}$",
        "gold": "A"
    },
"5": {
        "origin_prompt": [
            {
                "role": "HUMAN",
                "prompt": "Question: Let $S = 2010 + 2011 + \\cdots + 4018$. Compute the residue of $S$, modulo 2009.\nLet's think step by step\nAnswer:\n"
            }
        ],
        "prediction": " First, we notice that the sum $S$ is an arithmetic series with first term $a = 2010$, last term $l = 4018$, and number of terms $n = 2009$.\nThe sum of an arithmetic series is given by the formula $S = \\frac{n}{2}(a + l)$.\nSo, we have $S = \\frac{2009}{2}(2010 + 4018) = \\frac{2009}{2}(6028) = 2009 \\cdot 3014$.\nNow, we want to find the residue of $S$ modulo 2009. Since $2009 \\cdot 3014$ is clearly divisible by 2009, the residue of $S$ modulo 2009 is 0. The answer is: $\\boxed{0}$",
        "gold": "0"
    },

I want to know where I went wrong, thank you again!
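Samples in this format can be spot-checked independently of the OpenCompass evaluator. The sketch below is a rough sanity check, not the MATHInternEvaluator logic: it assumes each entry has "prediction" and "gold" fields as in the samples above, takes the last \boxed{...} in the prediction, and compares it to the gold answer by exact string match. The function names extract_boxed and accuracy are made up for this example.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if there is none."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces, e.g. a truncated generation


def accuracy(results):
    """Fraction of entries whose boxed answer equals the gold string exactly."""
    if not results:
        return 0.0
    hits = sum(
        1 for r in results.values()
        if extract_boxed(r["prediction"]) == r["gold"].strip()
    )
    return hits / len(results)
```

If a check like this gives a much higher score than the reported 5.06%, the gap is more likely in answer extraction or post-processing than in the model's generations. Exact string match is stricter than a real math evaluator, so it can undercount equivalent answers.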

Our evaluation script is written with native vLLM, and we are now trying to reproduce your problem using OpenCompass. We will get back to you once we find the difference.

We suspect that the problem comes from the few-shot template; using a zero-shot template could fix it. We will provide code later.

There is something wrong with the few-shot setting for InternLM2-Math or OpenCompass, and we are now trying to fix it. In the meantime, please use the following OpenCompass zero-shot MATH config:
datasets.math.math_0shot_gen_393424
The results for this config are:
[image: screenshot of results]

Hi, I'm sorry, but I still have a problem. You said the few-shot setting might be at fault, but I was using math_gen_736506 before:

# math_gen_736506.py

from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import MATHInternDataset, MATHInternEvaluator, math_intern_postprocess

math_reader_cfg = dict(input_columns=['problem'], output_column='solution')

math_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(round=[
            dict(role='HUMAN', prompt="Question: {problem}\nLet's think step by step\nAnswer:")
        ])),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer, max_out_len=512))

math_eval_cfg = dict(
    evaluator=dict(type=MATHInternEvaluator), pred_postprocessor=dict(type=math_intern_postprocess))

math_datasets = [
    dict(
        type=MATHInternDataset,
        abbr='math',
        path='./data/math/math.json',
        reader_cfg=math_reader_cfg,
        infer_cfg=math_infer_cfg,
        eval_cfg=math_eval_cfg)
]

Isn't this a zero-shot setting?

I followed your suggestion and used datasets.math.math_0shot_gen_393424, but I didn't get the same results as you.
Is a system prompt required, or is role=HUMAN enough?
Or maybe I did something wrong elsewhere. Sorry to bother you again.

This is weird. I used the main branch of OpenCompass from GitHub with datasets.math.math_0shot_gen_393424 and the same model config as yours:

    dict(
        type=HuggingFacewithChatTemplate,
        abbr='internlm2-math-plus-7b',
        path='internlm/internlm2-math-plus-7b',
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
        stop_words=['</s>', '<|im_end|>'],
    ),

But I can't reproduce your result. The only difference is that I use LMDeploy for inference.
Please make sure your model is up to date, and re-download it if necessary.

Hello, what is the official prompt for the MATH dataset?

<|im_start|>user\nProblem:\n{Problem here}\nLet's think step by step\nSolution:\n<|im_end|>\n<|im_start|>assistant\n

If you don't need the ChatML template:
Problem:\n{Problem here}\nLet's think step by step\nSolution:\n
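The two templates above can be written out as a small helper. This is a minimal sketch that assumes the \n sequences in the templates stand for real newlines; the constant names and the function build_math_prompt are hypothetical and not part of any official script.

```python
# Templates copied from the reply above; {problem} marks where the question goes.
CHATML_TEMPLATE = (
    "<|im_start|>user\n"
    "Problem:\n{problem}\nLet's think step by step\nSolution:\n"
    "<|im_end|>\n"
    "<|im_start|>assistant\n"
)
PLAIN_TEMPLATE = "Problem:\n{problem}\nLet's think step by step\nSolution:\n"


def build_math_prompt(problem, chatml=True):
    """Fill the MATH prompt template with a problem statement.

    str.replace is used so that LaTeX braces in the problem are left untouched.
    """
    template = CHATML_TEMPLATE if chatml else PLAIN_TEMPLATE
    return template.replace("{problem}", problem)
```

Note that this wording ("Problem:" / "Solution:") differs from the "Question:" / "Answer:" prompt in math_gen_736506, which may be worth checking when comparing results.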

We are working on reproducing and fixing this problem and will reply to you as soon as possible.

We re-downloaded InternLM2-Math-Plus-7B from Hugging Face and evaluated it with OpenCompass, obtaining 53.48 on math@393424 and 85.29 on gsm8k@1d7fe4. Please re-download the model and try evaluating it with batch_size=1.
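Since OpenCompass model configs are plain lists of dicts, the batch size can be overridden without editing the model file. The snippet below is a sketch under that assumption: with_batch_size is a hypothetical helper, and example_models is a stand-in for the imported model list, with the type field omitted so the snippet runs on its own.

```python
def with_batch_size(models, batch_size=1):
    """Return copies of the model config dicts with batch_size overridden."""
    return [dict(m, batch_size=batch_size) for m in models]


# Stand-in for the model list imported in the eval config (values taken from this thread).
example_models = [
    dict(
        abbr='internlm2-math-plus-7b',
        path='internlm/internlm2-math-plus-7b',
        max_out_len=1024,
        batch_size=8,
        stop_words=['</s>', '<|im_end|>'],
    )
]

models = with_batch_size(example_models, batch_size=1)
```

In the eval script, the same list comprehension could be applied to the imported model list in place of example_models. The original dicts are left unchanged, so the batch_size=8 setting is easy to restore for comparison.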