# What Makes Math Word Problems Challenging for LLMs? [NAACL 2024]
We formulate and investigate two research questions:
- Which characteristics of an input math word question (sampled from GSM8K) make it complex for an LLM?
- Based on these characteristics, can we predict whether a particular LLM will be able to solve specific input MWPs correctly?
This repository contains the raw data, the extracted feature data, and the code to reproduce and extend the subsequent analysis.

Setup:

```bash
git clone https://github.com/kvadityasrivatsa/analyzing-llms-for-mwps.git
cd analyzing-llms-for-mwps
pip install -r requirements.txt
```
- Build query prompts from GSM8K questions by running the notebook `./code/llm_querying/build_query_prompts.ipynb`.
- Run `./code/llm_querying/vllm_query.py` to query all prompts for one LLM at a time:

```bash
python3 std_query_vllm.py \
    --model-name "meta-llama/Llama-2-13b-chat-hf" \
    --batch-size 1 \
    --query-limit -1 \
    --n-seq 1 \
    --seed $RANDOM \
    --temperature 0.8 \
    --repetition-penalty 1.0 \
    --max-len 2000 \
    --query-paths ../../data/query_datasets/gsm8k_queries.json
```
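The prompt-building step can be sketched as follows. This is a hypothetical illustration, not the notebook's actual code: the template text, the `build_query_prompts` helper, and the record layout are all assumptions.

```python
# Hypothetical sketch of the prompt-building step: wrap each GSM8K
# question in a simple step-by-step instruction and collect the
# prompts into a JSON-serializable list for the query script.
import json

PROMPT_TEMPLATE = (
    "Answer the following math word problem. "
    "Reason step by step, then give the final numeric answer.\n\n"
    "Question: {question}\nAnswer:"
)

def build_query_prompts(questions):
    """Return a list of {'id', 'prompt'} records, one per question."""
    return [
        {"id": i, "prompt": PROMPT_TEMPLATE.format(question=q)}
        for i, q in enumerate(questions)
    ]

if __name__ == "__main__":
    sample = ["Tom has 3 apples and buys 2 more. How many does he have?"]
    print(json.dumps(build_query_prompts(sample), indent=2))
```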
[OR]
The LLM response data used for our work is available here. Download the zip archive, extract it, and place its contents into the `./data` folder for further processing.
- (Optional) Specify a HuggingFace access token here to access gated models such as Llama2.
- Run the notebook `./code/classifier_based_analysis/predicting_success_rate.ipynb` to:
  - Extract linguistic, math, and world-knowledge features from LLM responses on GSM8K.
  - Generate relevant feature distribution statistics.
  - Train statistical classifiers on the extracted features to predict which questions are always or never solved correctly by LLMs.
| LLM Model | HuggingFace Model Name | Pass@1 | Success Rate |
|---|---|---|---|
| Llama2-13B | meta-llama/Llama-2-13b-chat-hf | 28.70 | 37.24 |
| Llama2-70B | meta-llama/Llama-2-70b-chat-hf | 56.80 | 56.09 |
| Mistral-7B | mistralai/Mistral-7B-Instruct-v0.2 | 40.03 | 36.27 |
| MetaMath-13B | meta-math/MetaMath-13B-V1.0 | 72.30 | 63.73 |
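For intuition on the two metric columns, here is a minimal sketch, under the assumption (illustrative only; see the paper for the exact definitions) that Pass@1 scores one completion per question while the success rate averages correctness over several sampled completions per question:

```python
# Illustrative computation of the two metrics in the table above,
# assuming Pass@1 uses one completion per question and the success
# rate averages over n sampled completions per question.
def pass_at_1(correct_first):
    """correct_first: one bool per question (first completion correct?)."""
    return 100.0 * sum(correct_first) / len(correct_first)

def success_rate(correct_counts, n_samples):
    """correct_counts: per-question count of correct completions out of n_samples."""
    per_question = [c / n_samples for c in correct_counts]
    return 100.0 * sum(per_question) / len(per_question)

print(pass_at_1([True, False, True, True]))     # 75.0
print(success_rate([4, 0, 2, 4], n_samples=4))  # 62.5
```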
Classifiers used:
- Logistic Regression
- Decision Tree
- Random Forest
Our work proposes a total of 23 features spanning three types: Linguistic (L), Math (M), and World Knowledge (W).
A detailed description of each feature, along with the corresponding Python extraction functions, can be found in `./code/classifier_based_analysis/feature_extraction.py`.
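As an illustration of what such features look like, here are two hypothetical examples in the spirit of the extraction script: one linguistic and one math feature. The function names and exact definitions are assumptions, not the paper's actual features:

```python
# Hypothetical examples of question-level features: a linguistic
# feature (length in words) and a math feature (count of numeric
# quantities). Illustrative only; see feature_extraction.py for
# the 23 features actually used.
import re

def num_words(question: str) -> int:
    """Linguistic feature: number of whitespace-separated tokens."""
    return len(question.split())

def num_quantities(question: str) -> int:
    """Math feature: count of numeric literals (integers or decimals)."""
    return len(re.findall(r"\d+(?:\.\d+)?", question))

q = "Tom has 3 apples and buys 2.5 kg of pears for 4 dollars."
print(num_words(q), num_quantities(q))  # 13 3
```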
```bibtex
@misc{srivatsa2024makes,
  title={What Makes Math Word Problems Challenging for LLMs?},
  author={KV Aditya Srivatsa and Ekaterina Kochmar},
  year={2024},
  eprint={2403.11369},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```