Cryptic crosswords are puzzles that rely on general knowledge and on the solver's ability to manipulate language on several levels, handling various types of wordplay. Previous research suggests that solving such puzzles is challenging even for modern NLP models, including Large Language Models (LLMs). However, there is little to no research on the reasons for their poor performance on this task. In this paper, we establish benchmark results for three popular LLMs: {\tt Gemma2}, {\tt Llama3}, and {\tt ChatGPT}, showing that their performance on this task is still significantly below that of humans. We also investigate why these models struggle to perform better.
Contains our introduced datasets: the Small explanatory dataset and the Times for the Times dataset. We do not include the original data from Rozner et al., but we have uploaded it to HF.
Contains all results for the different experiments.
Contains the raw model outputs.
Contains the scripts for all data processing.
All prompts used are included in the prompts.py file.
Customize the training arguments in eval.sh, then run:
bash eval.sh
bash eval_def_wordplay.bash