decrypting-crosswords

Abstract

Cryptic crosswords are puzzles that rely on general knowledge and on the solver's ability to manipulate language at different levels, dealing with various types of wordplay. Previous research suggests that solving such puzzles is challenging even for modern NLP models, including Large Language Models (LLMs). However, there is little to no research on the reasons for their poor performance on this task. In this paper, we establish benchmark results for three popular LLMs: {\tt Gemma2}, {\tt Llama3}, and {\tt ChatGPT}, showing that their performance on this task is still significantly below that of humans. We also investigate why these models struggle to achieve superior performance.

Repo Structure

data

Contains our introduced datasets, including the small explanatory dataset and the Times for the Times dataset. We do not include the original data from Rozner et al.; instead, we upload it to the Hugging Face Hub.
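To make the data layout concrete, here is a hedged sketch of what a single record in the explanatory dataset might look like, together with the usual exact-match check for crossword answers. The field names and the example clue are illustrative assumptions, not the actual dataset schema.

```python
# Hypothetical record format; field names are assumptions, not the real schema.
record = {
    "clue": "Confused tales of material (5)",
    "answer": "SLATE",
    "definition": "material",
    "wordplay": "anagram of 'tales' ('confused' is the anagram indicator)",
}

def check_answer(record, prediction):
    """Case-insensitive exact match, the standard metric for crossword solving."""
    return prediction.strip().upper() == record["answer"].upper()

print(check_answer(record, "slate"))  # → True
```

Exact match is strict by design: a crossword answer must fit the grid, so near-misses count as failures.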

Results

Contains all results for the different experiments.

Outputs

Contains the raw model outputs.

dataset_manipulation

Contains scripts for all data processing.

Prompts

All prompts used in our experiments are included in the prompts.py file.
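As a rough illustration of the kind of template prompts.py contains, the sketch below shows a zero-shot solving prompt. The wording and variable name are assumptions for illustration; the actual prompts in the repository may differ.

```python
# Hypothetical zero-shot prompt template; not the exact wording used in the paper.
ZERO_SHOT_PROMPT = (
    "You are a cryptic crossword expert. Solve the following clue.\n"
    "Reply with the answer only.\n"
    "Clue: {clue}"
)

# Fill the template with a concrete (illustrative) clue.
prompt = ZERO_SHOT_PROMPT.format(clue="Confused tales of material (5)")
print(prompt.splitlines()[-1])  # → Clue: Confused tales of material (5)
```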

Zero-shot Evaluation

Customize the evaluation arguments in eval.sh, then run:

bash eval.sh
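For orientation, a script like eval.sh typically just binds a few arguments and launches the evaluation entry point. The sketch below is a hypothetical outline; the model identifiers, data paths, script name, and flag names are assumptions, so consult the actual eval.sh for the real arguments.

```shell
# Hypothetical sketch of the kind of arguments eval.sh might expose.
# All names below (evaluate.py, flags, paths) are illustrative assumptions.
MODEL="meta-llama/Meta-Llama-3-8B-Instruct"   # or a Gemma 2 / ChatGPT endpoint
DATA="data/times_for_the_times.json"
OUT="outputs/zero_shot"

CMD="python evaluate.py --model $MODEL --data $DATA --output_dir $OUT"
echo "$CMD"
```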

Definition extraction and Wordplay detection

To reproduce these experiments, run:

bash eval_def_wordplay.bash
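To clarify what these two subtasks measure, the sketch below scores definition extraction the way such setups are commonly evaluated: the predicted definition should match the gold definition and be a span of the clue. This mirrors a typical metric under stated assumptions, not necessarily the exact scoring in eval_def_wordplay.bash.

```python
# Hedged sketch of a definition-extraction check; a common setup, not
# necessarily the paper's exact metric.
def score_definition(clue: str, gold: str, predicted: str) -> bool:
    """True iff the prediction equals the gold definition and appears in the clue."""
    pred = predicted.strip().lower()
    return pred == gold.lower() and pred in clue.lower()

# Illustrative clue: 'material' is the definition, the rest is wordplay.
print(score_definition("Confused tales of material (5)", "material", "material"))  # → True
print(score_definition("Confused tales of material (5)", "material", "tales"))     # → False
```

Wordplay detection can be scored analogously as classification accuracy over wordplay types (anagram, hidden word, charade, and so on).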