This repository contains the data and code for the AdvScore paper, accepted as a NAACL 2025 main-conference paper 🚀.
AdvScore is a human-centered metric designed to evaluate the true adversarialness of benchmarks by capturing the varying abilities of both models and humans. Additionally, it helps identify poor-quality examples in adversarial datasets.
If you use this work, please cite using the following BibTeX:
```bibtex
@inproceedings{sung2025advscore,
  author    = {Yoo Yeon Sung and Maharshi Gor and Eve Fleisig and Ishani Mondal and Jordan Boyd-Graber},
  title     = {Is your benchmark truly adversarial? {AdvScore}: Evaluating Human-Grounded Adversarialness},
  booktitle = {Proceedings of the NAACL Conference},
  year      = {2025},
  publisher = {Association for Computational Linguistics}
}
```
Before running any scripts, install the required dependencies:

```bash
pip install -r requirements.txt
```
AdvScore relies on Item Response Theory (IRT) models. You have two options:
- Use this repository to train an IRT or MIRT model.
- To replicate our experiments, download the pretrained models for each dataset from this Google Drive link.
Once downloaded, update `CKPT_DIR` in `funcs_mirt.py` with the path to your pretrained models.
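For example (the variable name `CKPT_DIR` comes from `funcs_mirt.py`; the path below is a placeholder):

```python
# In funcs_mirt.py: point CKPT_DIR at the directory holding the
# downloaded pretrained IRT/MIRT checkpoints (placeholder path).
CKPT_DIR = "/path/to/pretrained_irt_models"
```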
To compute AdvScore, you need to collect LLM and human subject responses for the adversarial dataset.
- The experimental datasets are located in the `data/` directory.
- Files ending in `_text.csv` contain model and human binary correctness for each question.
- Models are labeled with their real names.
- Human teams are labeled with numbers or uppercase/lowercase letters.
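As a rough sketch of how these files can be inspected (the exact column layout is an assumption for illustration):

```python
import pandas as pd

# Load one of the correctness files; the path exists in the repo,
# but the column layout described here is assumed.
df = pd.read_csv("data/Advqa_text.csv")
print(df.shape)         # rows = questions, columns = question info + responders
print(df.columns[:10])  # model names and team labels appear as column headers
```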
You can compute AdvScore by calculating each parameter that contributes to the score using the following command:

```bash
python comp_advscore.py
```
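As a rough illustration of the inputs this script consumes, the per-item gap between human and model correctness is one ingredient of adversarialness. The snippet below is a simplified proxy, not the full IRT-based AdvScore, and the column grouping is an assumption:

```python
import pandas as pd

df = pd.read_csv("data/Advqa_text.csv")

# Hypothetical column split: real model names vs. numeric team labels.
model_cols = [c for c in df.columns if c.startswith("gpt")]  # assumption
human_cols = [c for c in df.columns if c.isdigit()]          # assumption

# Per-question gap: positive values mean humans beat models on that item.
gap = df[human_cols].mean(axis=1) - df[model_cols].mean(axis=1)
print(gap.describe())
```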
Incentivized by AdvScore, we recruited experts to write questions that are adversarial (difficult for models but not for humans).
This AdvQA dataset is available at `data/Advqa_text.csv`.
It is also uploaded to Hugging Face and can be accessed here: AdvQA Dataset.
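A minimal loading snippet, assuming the Hugging Face `datasets` library; the dataset ID below is a placeholder, so check the linked page for the actual one:

```python
from datasets import load_dataset

# "<org>/AdvQA" is a placeholder; substitute the ID from the link above.
advqa = load_dataset("<org>/AdvQA")
print(advqa)
```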
We conduct a feature analysis by fitting a logistic regression model on the annotation types of AdvQA. The relevant files are `data/AdvQA_advtype_annots.csv` and `data/features_df.csv`. Run:

```bash
python logreg.py
```
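A minimal sketch of such a fit with scikit-learn, assuming `features_df.csv` holds one row per question with numeric feature columns and a binary target column named `label` (the column name is an assumption; see `logreg.py` for the actual pipeline):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/features_df.csv")
X = df.drop(columns=["label"])  # feature columns (assumed layout)
y = df["label"]                 # binary target (assumed name)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
# Coefficients indicate which annotation features predict adversarialness.
print(dict(zip(X.columns, clf.coef_[0])))
```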
We use a neural approach to train our 2PL IRT model, leveraging the flexibility and scalability of neural networks while maintaining the interpretability of the IRT framework.
The model parameters are learned through backpropagation, with the network architecture designed to mimic the 2PL IRT structure.
The neural 2PL IRT model consists of three main components (see the sketch after this list):
- Item embedding layer representing item difficulties (βᵢ) and discriminations (γᵢ)
- Person embedding layer representing person abilities (θⱼ)
- Sigmoid output layer computing the probability of a correct response
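A minimal PyTorch sketch of this architecture, assuming discriminations are stored as log-values and exponentiated to stay positive (the repository's implementation may differ):

```python
import torch
import torch.nn as nn

class TwoPL(nn.Module):
    """Neural 2PL IRT: P(correct) = sigmoid(gamma_i * (theta_j - beta_i))."""

    def __init__(self, n_items: int, n_persons: int):
        super().__init__()
        self.beta = nn.Embedding(n_items, 1)       # item difficulties (beta_i)
        self.log_gamma = nn.Embedding(n_items, 1)  # log discriminations (gamma_i)
        self.theta = nn.Embedding(n_persons, 1)    # person abilities (theta_j)
        nn.init.normal_(self.beta.weight)          # beta ~ N(0, 1), matching the prior
        nn.init.normal_(self.theta.weight)         # theta ~ N(0, 1), matching the prior
        nn.init.zeros_(self.log_gamma.weight)      # gamma starts at 1 (a simplification)

    def forward(self, item_ids: torch.Tensor, person_ids: torch.Tensor) -> torch.Tensor:
        beta = self.beta(item_ids).squeeze(-1)
        gamma = self.log_gamma(item_ids).squeeze(-1).exp()  # exp keeps gamma positive
        theta = self.theta(person_ids).squeeze(-1)
        return torch.sigmoid(gamma * (theta - beta))
```

Storing `log_gamma` and exponentiating is one common way to enforce positivity; the Gamma prior described below serves the same purpose as a soft constraint.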
Total parameters: 2N + M, where:
- N = number of items
- M = number of subjects
- Includes N difficulty parameters, N discrimination parameters, and M ability parameters
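For example, a dataset with 1,000 items and 50 subjects yields 2 × 1,000 + 50 = 2,050 learnable parameters.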
To enhance regularization and interpretability, we incorporate prior distributions on the model parameters:
- Item difficulties (βᵢ) and person abilities (θⱼ):
- Gaussian prior with mean 0 and variance 1
- Item discriminations (γᵢ):
- Gamma prior with shape k and scale s (written s here to avoid clashing with the ability parameter θ)
Why a Gamma prior for discriminations?
- Ensures positivity
- Allows fine-tuning the model's sensitivity to item discrimination 🎯
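In code, these priors can enter the loss as negative log-density penalties. A sketch continuing the hypothetical `TwoPL` model above (the shape and scale values are placeholders, and PyTorch's `Gamma` takes a rate rather than a scale):

```python
import torch
from torch.distributions import Gamma, Normal

def prior_penalty(model, gamma_shape: float = 2.0, gamma_scale: float = 1.0) -> torch.Tensor:
    std_normal = Normal(0.0, 1.0)                        # N(0, 1) prior
    gamma_prior = Gamma(gamma_shape, 1.0 / gamma_scale)  # rate = 1 / scale
    penalty = -std_normal.log_prob(model.beta.weight).sum()            # difficulties
    penalty = penalty - std_normal.log_prob(model.theta.weight).sum()  # abilities
    penalty = penalty - gamma_prior.log_prob(
        model.log_gamma.weight.exp()                     # discriminations (positive)
    ).sum()
    return penalty
```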
The IRT model training follows these steps (a condensed loop sketch follows the list):
Initialize network weights randomly, sampling from the respective prior distributions
Training Epochs:
- Forward Pass: Compute predicted probabilities for each person-item interaction
- Compute Negative Log-Likelihood Loss
- Add Regularization Terms based on prior distributions
- Backpropagate the gradients and update model parameters
Monitor Validation Performance
- Use early stopping to prevent overfitting
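A condensed sketch of this loop, reusing the hypothetical `TwoPL` and `prior_penalty` pieces above (tensor shapes, the learning rate, and the validation split are assumptions):

```python
import torch
import torch.nn.functional as F

def train(model, items, persons, labels, epochs=500, patience=20, val_frac=0.1):
    # items, persons: LongTensors of indices; labels: FloatTensor of 0/1 correctness.
    n_val = max(1, int(len(labels) * val_frac))
    va, tr = slice(0, n_val), slice(n_val, None)
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    best_val, stale = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        opt.zero_grad()
        probs = model(items[tr], persons[tr])             # forward pass
        loss = F.binary_cross_entropy(probs, labels[tr])  # negative log-likelihood
        loss = loss + prior_penalty(model) / len(labels)  # prior-based regularization
        loss.backward()                                   # backpropagate gradients
        opt.step()                                        # update parameters
        model.eval()
        with torch.no_grad():                             # monitor validation performance
            val = F.binary_cross_entropy(
                model(items[va], persons[va]), labels[va]
            ).item()
        if val < best_val:
            best_val, stale = val, 0
        else:                                             # early stopping
            stale += 1
            if stale >= patience:
                break
```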
Optimizer:
We use the Adam optimizer for its efficiency in handling sparse gradients and its ability to adapt the learning rate for each parameter.
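In the sketch above this corresponds to `torch.optim.Adam(model.parameters(), lr=0.01)`; per-parameter adaptive learning rates matter here because an item's embedding only receives gradients from the interactions that involve that item.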