MMLU result cannot be reproduced
Closed this issue · 2 comments
I followed the description for prosparse-7B and tested accuracy on MMLU with UltraEval. My MMLU average accuracy is 41.69, but the paper reports 45.21.
Here is one sample eval configuration:
{
  "task_name": "mmlu_high-school-microeconomics_gen",
  "path": "datasets/mmlu/data/high-school-microeconomics.jsonl",
  "description": "The following are multiple choice questions (with answers) about high_school_microeconomics.\n\n",
  "transform": "datasets/mmlu/transform_gen_v1.py",
  "fewshot": 5,
  "generate": {
    "method": "generate",
    "params": ""
  },
  "postprocess": "",
  "metric": {
    "accuracy": {
      "evaluation": {
        "type": "prefix_match"
      }
    }
  }
}
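For context, a prefix-match metric scores a generation as correct when it starts with the gold answer. The sketch below is my own simplified stand-in, not UltraEval's actual implementation; the function name and the whitespace normalization are assumptions:

```python
def prefix_match(prediction: str, reference: str) -> bool:
    """Score a generation as correct if it begins with the gold answer.

    Simplified stand-in for a prefix-match metric: leading whitespace
    is stripped before comparison. UltraEval's real implementation may
    normalize differently.
    """
    return prediction.lstrip().startswith(reference.strip())

# A generation that leads with the answer letter matches ...
assert prefix_match(" B. The price will rise", "B")
# ... but one that buries the letter behind preamble does not.
assert not prefix_match("The answer is B", "B")
```

Note that such a metric is unforgiving of any preamble before the answer letter, which makes the generation settings below matter quite a bit.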
generation_config:
{
  "bos_token_id": 1,
  "do_sample": true,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "temperature": 0.6,
  "max_new_tokens": 10,
  "top_p": 0.9,
  "transformers_version": "4.31.0.dev0"
}
prosparse-7B configuration:
{
  "_name_or_path": "SparseLLM/prosparse-llama-2-7b",
  "architectures": [
    "SparseLlamaForCausalLM"
  ],
  "auto_map": {
    "AutoConfig": "configuration_sparsellama.SparseLlamaConfig",
    "AutoModel": "modeling_sparsellama.SparseLlamaForCausalLM",
    "AutoModelForCausalLM": "modeling_sparsellama.SparseLlamaForCausalLM"
  },
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "relu",
  "hidden_act_param": 0.01,
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "sparsellama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.31.0.dev0",
  "use_cache": true,
  "vocab_size": 32000,
  "max_length": 4096
}
It is somewhat hard for me to tell what the problem is, as evaluation is sensitive to a variety of factors (e.g., the vLLM version, the CUDA version, the generation config). I can provide the UltraEval version I used in the attachment (ultraeval-07f99f7e.zip). The evaluation command is:
pip install .; python data_process.py; bash scripts/run_paper.sh --model_size 7b --port <port_number> --ckpt_path <path_to_model> --output_base_path evaluation --hidden_act relu --print_only True --is_hf True
I also notice that you set max_new_tokens to 10 in your generation config, which might cause problems.
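To illustrate the concern: with a 10-token budget, a model that emits preamble before the answer letter can be cut off before the letter ever appears, and the answer is then lost. A toy sketch, using whitespace splitting as a crude stand-in for real subword tokenization:

```python
def answer_survives_truncation(generation: str, gold: str,
                               max_new_tokens: int) -> bool:
    """Truncate a generation to max_new_tokens whitespace "tokens" and
    check whether the gold answer letter survives the truncation.
    Whitespace splitting is a crude stand-in for real tokenization."""
    kept = generation.split()[:max_new_tokens]
    return gold in kept

verbose = "Let me think about this question step by step . The answer is B"
# With a 10-token budget the answer letter is cut off ...
assert not answer_survives_truncation(verbose, "B", max_new_tokens=10)
# ... but with a larger budget it survives.
assert answer_survives_truncation(verbose, "B", max_new_tokens=32)
```

A more generous budget (or a prompt format that forces the letter first) avoids this failure mode.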
Thanks for the code! The result has been reproduced.