This repository was developed to evaluate Large Language Models (LLMs) through game playing. It includes the following components:
- Extensible game simulation software to test LLMs via grid-based games such as Tic-Tac-Toe, Connect Four, and Gomoku (a rough sketch of such a game description appears after this list). For more information and to try out the game simulations, please see the game simulation page. You can run the game simulations directly from the GitHub.io page, or download the code repository and run them on your own computer.
- A leaderboard to view and compare the results of previous games among LLMs. We welcome submissions to the leaderboard. To review the current status of the leaderboard, please see the leaderboard folder. You can also view it on the GitHub.io page. The results matrix is also available here. We present the results of games among leading LLMs, including Claude 3.5 Sonnet and Claude 3 Sonnet by Anthropic, Gemini 1.5 Pro and Gemini 1.5 Flash by Google, GPT-4 Turbo and GPT-4o by OpenAI, and Llama3-70B by Meta.
- Detailed output files of the game runs, for analyzing the games presented on the leaderboard. Please see the outputs folder.
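As a rough, hypothetical sketch only (the `GridGame` interface, its fields, and the Tic-Tac-Toe example below are illustrative assumptions, not the repository's actual API), a grid-based game can be described declaratively so that new games plug into a shared game loop:

```typescript
// Hypothetical sketch: one way to describe a grid-based game so that
// Tic-Tac-Toe, Connect Four, and Gomoku can share a single game loop.
// This is NOT the repository's actual interface.
interface GridGame {
  name: string;        // e.g. "Tic-Tac-Toe", "Connect Four", "Gomoku"
  rows: number;        // board height
  cols: number;        // board width
  winLength: number;   // contiguous pieces needed to win (3, 4, or 5)
  isLegalMove(board: string[][], row: number, col: number): boolean;
  applyMove(board: string[][], row: number, col: number, player: string): string[][];
}

// Example: Tic-Tac-Toe expressed against this hypothetical interface.
const ticTacToe: GridGame = {
  name: "Tic-Tac-Toe",
  rows: 3,
  cols: 3,
  winLength: 3,
  isLegalMove: (board, row, col) => board[row][col] === "",
  applyMove: (board, row, col, player) =>
    board.map((r, i) => r.map((c, j) => (i === row && j === col ? player : c))),
};
```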
This repository welcomes contributions and suggestions. The LLM Game Benchmark repository is shared under the MIT License.
Game simulation previews: Tic-Tac-Toe | Connect Four | Gomoku
Game Simulation Webpage:
To run simulations of Tic-Tac-Toe, Connect Four, and Gomoku games, please visit the game simulation page. You can use your OpenAI API Key or Google Gemini API Key to run the simulations yourself. Below is a screenshot of the game simulation page.
Interactions with the LLMs:
We have implemented the interaction between each game and the LLMs as shown in the figure below. We utilized the web services provided by OpenAI and Google for their models; you can simply use your own API key to run the game simulations for the OpenAI and Google models. To interact with the LLMs hosted on AWS Bedrock, such as the models developed by Anthropic and Meta, you can use the sample AWS Bedrock code provided in the webservice directory.
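As a minimal sketch of this kind of interaction (assuming the standard OpenAI Chat Completions endpoint; the prompt wording, model name, and move format below are illustrative, not the exact prompts used by the benchmark), a `fetch` call from the browser or Node 18+ could look like this:

```typescript
// Minimal sketch: send the current board to an LLM web service and read back a move.
// The prompt text and the "row,col" reply format are assumptions for illustration.
async function requestMove(apiKey: string, boardText: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [
        { role: "system", content: "You are playing Tic-Tac-Toe as player X." },
        {
          role: "user",
          content: `Current board:\n${boardText}\nReply with the row and column of your move, e.g. "1,2".`,
        },
      ],
    }),
  });
  const data = await response.json();
  // The game loop would parse this reply (e.g. "1,2") into a board position.
  return data.choices[0].message.content;
}
```

The same pattern applies to the Google Gemini web service with its own endpoint and API key, while Anthropic and Meta models are reached through the AWS Bedrock sample code mentioned above.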
Publication:
We have published the details of this study on arXiv.org and submitted it to a leading IEEE journal in the field. If you use this repository, please cite the publication:
- Topsakal, O., Edell, C. J., & Harper, J. B. (2024). Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard. arXiv [cs.AI]. Available at http://arxiv.org/abs/2407.07796
In a previous study, we evaluated the strategic thinking capabilities of various LLMs, including Claude 2.1, Gemini-Pro 1.0, GPT-3.5-Turbo, GPT-4, Llama2-70B, and Mistral Large, by having them play Tic-Tac-Toe through a mobile app. This study builds upon that research with additional games, more in-depth analysis, and user-friendly, web-based game simulation software to evaluate more recent LLMs.
- Topsakal, O., Harper, J.B. Benchmarking Large Language Model (LLM) Performance for Game Playing via Tic-Tac-Toe. Electronics 2024, 13, 1532. https://doi.org/10.3390/electronics13081532
If you have any questions, please contact research.explorations at gmail.