# What Makes Math Word Problems Challenging for LLMs? [NAACL 2024]
We formulate and investigate two research questions:
- Which characteristics of an input math word question (sampled from GSM8K) make it complex for an LLM?
- Based on these characteristics, can we predict whether a particular LLM will be able to solve specific input MWPs correctly?
This repository contains the raw data, the extracted feature data, and the code to reproduce and extend the subsequent analysis.

Setup:

```bash
git clone https://github.com/kvadityasrivatsa/analyzing-llms-for-mwps.git
cd analyzing-llms-for-mwps
pip install -r requirements.txt
```
- Build query prompts from GSM8K questions by running the notebook `./code/llm_querying/build_query_prompts.ipynb`.
- Run `./code/llm_querying/vllm_query.py` to query all prompts for one LLM at a time:

```bash
python3 std_query_vllm.py \
    --model-name "meta-llama/Llama-2-13b-chat-hf" \
    --batch-size 1 \
    --query-limit -1 \
    --n-seq 1 \
    --seed $RANDOM \
    --temperature 0.8 \
    --repetition-penalty 1.0 \
    --max-len 2000 \
    --query-paths ../../data/query_datasets/gsm8k_queries.json
```
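The prompt-building step can be sketched as follows. This is a hypothetical illustration, not the notebook's actual code: the template text, the `build_query_prompts` helper, and the record layout are all assumptions.

```python
# Hypothetical sketch of the prompt-building step: wrap each GSM8K
# question in a simple step-by-step instruction and collect the
# prompts into a JSON-serializable list for the query script.
import json

PROMPT_TEMPLATE = (
    "Answer the following math word problem. "
    "Reason step by step, then give the final numeric answer.\n\n"
    "Question: {question}\nAnswer:"
)

def build_query_prompts(questions):
    """Return a list of {'id', 'prompt'} records, one per question."""
    return [
        {"id": i, "prompt": PROMPT_TEMPLATE.format(question=q)}
        for i, q in enumerate(questions)
    ]

if __name__ == "__main__":
    sample = ["Tom has 3 apples and buys 2 more. How many does he have?"]
    print(json.dumps(build_query_prompts(sample), indent=2))
```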
[OR]
The LLM response data used for our work is available here. Download the zip archive, extract it, and place its contents into the `./data` folder for further processing.
- (Optional) Specify a HuggingFace access token here to access gated models such as Llama2.
- Run the notebook `./code/classifier_based_analysis/predicting_success_rate.ipynb` to:
  - Extract linguistic, math, and world-knowledge features from LLM responses on GSM8K.
  - Generate relevant feature distribution statistics.
  - Train statistical classifiers on the extracted features to predict which questions are always or never solved correctly by LLMs.
| LLM Model | HuggingFace Model Name | Pass@1 | Success Rate |
|---|---|---|---|
| Llama2-13B | meta-llama/Llama-2-13b-chat-hf | 28.70 | 37.24 |
| Llama2-70B | meta-llama/Llama-2-70b-chat-hf | 56.80 | 56.09 |
| Mistral-7B | mistralai/Mistral-7B-Instruct-v0.2 | 40.03 | 36.27 |
| MetaMath-13B | meta-math/MetaMath-13B-V1.0 | 72.30 | 63.73 |
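For intuition on the two metric columns, here is a minimal sketch, under the assumption (illustrative only; see the paper for the exact definitions) that Pass@1 scores one completion per question while the success rate averages correctness over several sampled completions per question:

```python
# Illustrative computation of the two metrics in the table above,
# assuming Pass@1 uses one completion per question and the success
# rate averages over n sampled completions per question.
def pass_at_1(correct_first):
    """correct_first: one bool per question (first completion correct?)."""
    return 100.0 * sum(correct_first) / len(correct_first)

def success_rate(correct_counts, n_samples):
    """correct_counts: per-question count of correct completions out of n_samples."""
    per_question = [c / n_samples for c in correct_counts]
    return 100.0 * sum(per_question) / len(per_question)

print(pass_at_1([True, False, True, True]))     # 75.0
print(success_rate([4, 0, 2, 4], n_samples=4))  # 62.5
```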
Classifiers used:
- Logistic Regression
- Decision Tree
- Random Forest
Our work proposes a total of 23 features spanning three types: Linguistic (L), Math (M), and World Knowledge (W).
A detailed description of each feature, along with the corresponding Python extraction functions, can be found in `./code/classifier_based_analysis/feature_extraction.py`.
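As an illustration of what such features look like, here are two hypothetical examples in the spirit of the extraction script: one linguistic and one math feature. The function names and exact definitions are assumptions, not the paper's actual features:

```python
# Hypothetical examples of question-level features: a linguistic
# feature (length in words) and a math feature (count of numeric
# quantities). Illustrative only; see feature_extraction.py for
# the 23 features actually used.
import re

def num_words(question: str) -> int:
    """Linguistic feature: number of whitespace-separated tokens."""
    return len(question.split())

def num_quantities(question: str) -> int:
    """Math feature: count of numeric literals (integers or decimals)."""
    return len(re.findall(r"\d+(?:\.\d+)?", question))

q = "Tom has 3 apples and buys 2.5 kg of pears for 4 dollars."
print(num_words(q), num_quantities(q))  # 13 3
```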
```bibtex
@misc{srivatsa2024makes,
  title={What Makes Math Word Problems Challenging for LLMs?},
  author={KV Aditya Srivatsa and Ekaterina Kochmar},
  year={2024},
  eprint={2403.11369},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```