dvlab-research/MR-GSM8K

Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs

Python

Issues

How to calculate the scores?
#4 opened a month ago, 2 comments
BUG: error_reason_correctness
#3 opened 2 months ago, 1 comment
the new version
#2 opened 4 months ago, 9 comments
Question about the inconsistency between "model_output_solution_correctness" and "model_output_solution_first_error_step"
#1 opened 5 months ago, 3 comments