Wordle-Data-Gen: A Wordle Synthetic Data Generator

Project Overview

This project aims to generate synthetic data suitable for training large language models (LLMs). The data is generated by simulating the game of Wordle, which involves guessing a five-letter word within six attempts. The synthetic data represents well-played games, capturing strategies that reveal the most information about the word in the fewest guesses.

Goals

Generate a high-quality dataset of Wordle game simulations that can be used for training LLMs.
Develop a model for the game that can make statistically informed guesses.
Implement scoring rubrics to select the best-played games, so that data quality reflects strategic gameplay rather than mere luck.
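
As an illustration of the last goal, here is a minimal sketch of one possible rubric. It is not the project's actual scoring code: the GameRecord type, its fields, and the weighting are assumptions made for this example. The idea is to reward games solved in few guesses while discounting wins where the final guess was picked from a still-large pool of candidates, which is more likely luck than strategy.

```python
import math
from dataclasses import dataclass


@dataclass
class GameRecord:
    """Hypothetical record of one simulated game (not the project's real type)."""
    secret: str
    guesses: list[str]
    pool_sizes: list[int]  # number of candidate words before each guess

    @property
    def solved(self) -> bool:
        return bool(self.guesses) and self.guesses[-1] == self.secret


def score_game(game: GameRecord, max_turns: int = 6) -> float:
    """Illustrative rubric: efficiency scaled by how certain the winning guess was."""
    if not game.solved or len(game.guesses) > max_turns or not game.pool_sizes:
        return 0.0
    # Fewer guesses gives a higher efficiency (1.0 for a first-turn solve).
    efficiency = (max_turns - len(game.guesses) + 1) / max_turns
    # If many candidates remained before the winning guess, the win was mostly luck.
    certainty = 1.0 / max(game.pool_sizes[-1], 1)
    return efficiency * certainty


def select_top(games: list[GameRecord], top_percent: float) -> list[GameRecord]:
    """Keep the highest-scoring top_percent of games (rounded up to at least one)."""
    ranked = sorted(games, key=score_game, reverse=True)
    keep = math.ceil(len(ranked) * top_percent / 100)
    return ranked[:keep]
```

Under this rubric, a two-guess solve that narrowed the pool to a single word outranks a one-guess solve drawn blindly from a hundred candidates.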

Approach

Data Extraction: Filter a comprehensive list of words to include only five-letter words suitable for Wordle.
Memory and Indexing: Load the words into memory and build hash-set-based indices so that the remaining candidate solutions can be filtered quickly after each guess.
Game Simulation: Model the gameplay by choosing a secret word at random and making guesses based on statistical likelihoods. Simulate up to six turns per game.
Data Selection: Develop a scoring system to select top-performing games, focusing on the efficiency of guesses and information revealed.
Data Output: Write the selected game progressions to a file, formatted to be useful for LLM training.
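
The simulation and filtering steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the function names are hypothetical, feedback is encoded as a string of 'g' (green), 'y' (yellow), and '-' (gray), and the guess heuristic (prefer words whose distinct letters are most common among the remaining candidates) is just one simple stand-in for "statistical likelihoods".

```python
import random
from collections import Counter


def feedback(guess: str, secret: str) -> str:
    """Return Wordle feedback for a guess: 'g' green, 'y' yellow, '-' gray."""
    result = ["-"] * len(guess)
    unmatched = Counter()
    # First pass: mark greens and count secret letters that were not matched.
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            result[i] = "g"
        else:
            unmatched[s] += 1
    # Second pass: mark yellows, limited by how many unmatched copies remain.
    for i, g in enumerate(guess):
        if result[i] == "-" and unmatched[g] > 0:
            result[i] = "y"
            unmatched[g] -= 1
    return "".join(result)


def filter_candidates(candidates: set[str], guess: str, pattern: str) -> set[str]:
    """Keep only words that would have produced the same feedback for this guess."""
    return {w for w in candidates if feedback(guess, w) == pattern}


def pick_guess(candidates: set[str]) -> str:
    """Assumed heuristic: favor words whose distinct letters are most frequent."""
    freq = Counter(ch for w in candidates for ch in set(w))
    # Sorting first makes tie-breaking deterministic.
    return max(sorted(candidates), key=lambda w: sum(freq[ch] for ch in set(w)))


def simulate_game(words: set[str], rng: random.Random, max_turns: int = 6):
    """Play one game; return the secret and a list of (guess, feedback) turns."""
    secret = rng.choice(sorted(words))
    candidates = set(words)
    turns = []
    for _ in range(max_turns):
        guess = pick_guess(candidates)
        pattern = feedback(guess, secret)
        turns.append((guess, pattern))
        if guess == secret:
            break
        candidates = filter_candidates(candidates, guess, pattern)
    return secret, turns
```

The two-pass feedback function handles repeated letters the way Wordle does: a letter is marked yellow only while unmatched copies of it remain in the secret. Filtering by replaying the feedback against every candidate is simple but linear in the pool size; a per-letter, per-position index would be the natural optimization.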

Installation

Ensure you have Python installed on your system. Then, clone this repository and navigate to the project directory.

git clone https://github.com/yourusername/wordle-llm-data.git
cd wordle-llm-data

Install the required Python packages using:

pip install -r requirements.txt

Usage

Run the main script with optional parameters to start generating data. You can specify parameters such as the number of simulations to run and the percentage of top games to select.

python main.py --num_games 10000 --top_percent 10
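
For reference, command-line flags like these are commonly handled with the standard argparse module. The sketch below shows one way this could look. The flag names come from the command above, but the default values, help text, and validation are assumptions and may not match main.py.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags shown above; defaults here are hypothetical."""
    parser = argparse.ArgumentParser(description="Generate synthetic Wordle game data.")
    parser.add_argument("--num_games", type=int, default=1000,
                        help="number of games to simulate (assumed default)")
    parser.add_argument("--top_percent", type=float, default=10.0,
                        help="percentage of top-scoring games to keep (assumed default)")
    args = parser.parse_args(argv)
    # Reject values that would make the run meaningless.
    if args.num_games <= 0:
        parser.error("--num_games must be a positive integer")
    if not 0 < args.top_percent <= 100:
        parser.error("--top_percent must be in the range (0, 100]")
    return args
```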

Contributing

Contributions are welcome! If you have suggestions or improvements, feel free to fork this repository and submit a pull request.