Pydantic models to describe, with maximum strictness, the schema of a ChatGPT data export's conversations.json.
Contributions are highly encouraged, from anyone!
This project exists for research purposes, and is not intended for production use. The Pydantic models for ChatGPT conversations are designed to be as strict and precise as possible, and should break as soon as the schema changes.
Consequently, you'll find some unconventional practices in this project, such as:
# model/assistant.py
class Author(Model):
    role: Literal['assistant']
    name: None
    metadata: Literal[{}]  # type:ignore
Annotations like name: None and metadata: Literal[{}] appear useless, but they are important to us. They tell us that, for everyone who has used this project so far, name is always null, and metadata is always an empty object, for every assistant message encountered in the data. This indicates that OpenAI probably isn't using these fields yet. If they do, we'll know immediately, because Pydantic's validation will break.
(Note: yes, Pydantic will enforce Literal[{}] even though this is considered invalid by static type checkers.)
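To illustrate how this kind of annotation surfaces a schema change, here is a minimal sketch assuming Pydantic v2. The Model base class below is a hypothetical stand-in for the project's own base, and forbidding extra fields is an assumption, not something confirmed by the project. The metadata field is left out to keep the sketch small.

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, ValidationError


class Model(BaseModel):
    # Hypothetical stand-in for the project's base class.
    # Assumption: unknown fields should be rejected rather than ignored.
    model_config = ConfigDict(extra='forbid')


class Author(Model):
    role: Literal['assistant']
    name: None  # only null is accepted here


def check(payload: dict) -> list[str]:
    """Return the locations of the fields that failed validation."""
    try:
        Author.model_validate(payload)
    except ValidationError as exc:
        return ['.'.join(str(part) for part in err['loc']) for err in exc.errors()]
    return []


ok = check({'role': 'assistant', 'name': None})
changed = check({'role': 'assistant', 'name': 'gpt'})
extra = check({'role': 'assistant', 'name': None, 'new_field': 1})
```

In this sketch, ok is empty, while changed and extra each report the field that no longer matches the strict model. That is the signal this project relies on.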
Clone this project, and run pip install -r requirements.txt to install dependencies.
Run a ChatGPT export. Place your conversations.json in the root of this project.
Take a look at the code (particularly config and parse_and_validate) to understand what's going on.
Run parse_and_validate.py to process your conversations, see the output, and identify any validation errors.
Based on Pydantic's descriptive error messages (if any), update the models as needed. When you're done, we would greatly appreciate a pull request!
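For orientation, the validation step can be pictured roughly as follows. This is a sketch under stated assumptions, not the actual contents of parse_and_validate.py: it assumes Pydantic v2 and that conversations.json is a JSON array of conversation objects. The Conversation model and the validate_conversations helper are hypothetical and heavily simplified.

```python
import json

from pydantic import BaseModel, ValidationError


class Conversation(BaseModel):
    # Hypothetical, heavily simplified stand-in for the project's real models.
    title: str


def validate_conversations(raw_json: str) -> dict[int, list[str]]:
    """Validate each item of a JSON array; map failing indexes to error summaries."""
    failures: dict[int, list[str]] = {}
    for index, item in enumerate(json.loads(raw_json)):
        try:
            Conversation.model_validate(item)
        except ValidationError as exc:
            failures[index] = [
                f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}"
                for err in exc.errors()
            ]
    return failures


# Inline sample data instead of a real export, so the sketch runs on its own.
sample = json.dumps([{'title': 'Hello'}, {'title': None}])
failures = validate_conversations(sample)
```

Each entry in failures names the conversation index and the field that failed, which is usually enough to decide which model needs updating.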
Since everyone's data is different, contributions are necessary for this project to be effective. If you encounter a validation error, this means either the schema has changed (and you're the first to notice), or your data is different from what we've seen so far.
When updating the models, the general rule of thumb is to be as strict as possible. For example, if a new field named text appears somewhere in the data, you might be tempted to annotate it as str. But if all the values are null, then the annotation should be None until proven otherwise.
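One way to apply this rule is to look at which value types actually occur for a field before choosing an annotation. The observed_types helper below is hypothetical (it is not part of this project) and uses only the standard library.

```python
from collections import Counter


def observed_types(records: list[dict], field: str) -> Counter:
    """Count the Python type names seen for `field` across all records."""
    return Counter(type(r[field]).__name__ for r in records if field in r)


# Toy records standing in for objects found in an export.
records = [{'text': None}, {'text': None}, {'id': 'abc'}]
counts = observed_types(records, 'text')
# Only NoneType was observed, so the strictest fitting annotation is `text: None`.
```

If a str ever shows up in the counts, that is the point at which the annotation should be widened.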
For simplicity and consistency across the project, your model names do not need to be globally unique. For example, each role's module has its own Author model, rather than AssistantAuthor and UserAuthor, etc. The intention is to import and use these modules as namespaces, like assistant.Author, user.Author, etc.
Let's be honest: this is a simple project. We don't care what you use. We'd rather you push messy code than not push at all.
But for transparency:
- Formatter: black, with --skip-string-normalization because single quotes are better :)
- Static Type Checking: pyright