A simple script to validate datasets for OpenAI fine-tuning.
The validator function was blatantly copied from the OpenAI Cookbook:

- [Data preparation and analysis for chat model fine-tuning | OpenAI Cookbook](https://cookbook.openai.com/examples/chat_finetuning_data_prep)
- [openai-cookbook/examples/Chat_finetuning_data_prep.ipynb at b6aeae9bbabe624cd5d766cc96c9a187235dbbda · openai/openai-cookbook](https://github.com/openai/openai-cookbook/blob/b6aeae9bbabe624cd5d766cc96c9a187235dbbda/examples/Chat_finetuning_data_prep.ipynb)
Supported models:

- `gpt-3.5-turbo-0125`
- `gpt-3.5-turbo-1106`
- `gpt-3.5-turbo-0613`
> [!CAUTION]
> `babbage-002` and `davinci-002` use a different format, so this script cannot validate datasets for them.
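For reference, a dataset in the chat format is a JSONL file with one training example per line, where each example holds a `messages` list containing at least one assistant message. A minimal valid line looks like this (the content is illustrative, not taken from the bundled test data):

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2 + 2?"}, {"role": "assistant", "content": "4"}]}
```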
## Requirements

- Python >=3.12
- Poetry >=1.8.1
## Usage

Check out the repository:

```bash
git clone https://github.com/gh640/openai-fine-tuning-validate
```

Install dependencies with Poetry:

```bash
poetry install
```

Run the `openai-fine-tuning-validate` command in a venv Poetry manages:

```bash
poetry run openai-fine-tuning-validate [dataset-file]
```
## Samples

Valid samples:

```bash
poetry run openai-fine-tuning-validate tests/data/dataset-1-simple.jsonl
# => Dataset is valid

poetry run openai-fine-tuning-validate tests/data/dataset-2-multi-turn.jsonl
# => Dataset is valid
```
Invalid samples:

```bash
echo '{}' > invalid.jsonl
poetry run openai-fine-tuning-validate invalid.jsonl
# => {'missing_messages_list': 1}

echo '{"messages": [{"role": "unknown"}]}' > invalid.jsonl
poetry run openai-fine-tuning-validate invalid.jsonl
# =>
# {'example_missing_assistant_message': 1,
#  'message_missing_key': 1,
#  'missing_content': 1,
#  'unrecognized_role': 1}
```
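The error keys in the output come from the Cookbook's format checks. Below is a condensed sketch of those checks, simplified from the notebook linked above; it is not the exact code this package ships, and it omits the Cookbook's handling of `function_call` messages and unrecognized keys:

```python
import json
from collections import defaultdict


def count_format_errors(path: str) -> dict[str, int]:
    """Tally chat-format problems in a JSONL dataset, Cookbook-style (condensed)."""
    errors: defaultdict[str, int] = defaultdict(int)
    with open(path) as f:
        dataset = [json.loads(line) for line in f]

    for example in dataset:
        messages = example.get("messages")
        if not messages:
            errors["missing_messages_list"] += 1
            continue
        for message in messages:
            # Every message needs both a role and a content key.
            if "role" not in message or "content" not in message:
                errors["message_missing_key"] += 1
            if message.get("role") not in ("system", "user", "assistant", "function"):
                errors["unrecognized_role"] += 1
            if not isinstance(message.get("content"), str):
                errors["missing_content"] += 1
        # Each training example must contain at least one assistant reply.
        if not any(m.get("role") == "assistant" for m in messages):
            errors["example_missing_assistant_message"] += 1

    return dict(errors)
```

Run against the second invalid sample above, this sketch yields the same four counters shown in the output.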