Library to parse messages/activity from the discord data export

Primary language: Python. License: MIT.

discord_data

Library to parse information from the discord data export.

You have to request the data export manually, and it can take a while for Discord to deliver it to you.

This supports both the old CSV and new JSON formats for messages.

Install:

Requires python3.8+. To install with pip, run:

pip install discord_data

Single Export

This takes the messages and activity directories as arguments, like:

>>> from discord_data import parse_messages, parse_activity
>>> next(parse_messages("./discord/october_2020/messages"))
>>> next(parse_activity("./discord/october_2020/activity"))

Message(mid='747951969171275807', dt=datetime.datetime(2020, 8, 25, 22, 54, 5, 726000, tzinfo=datetime.timezone.utc), channel=Channel(cid='464051583559139340', name='general', server_name='Dream World'), content='<:NotLikeThis:237729324885606403>', attachments='')

Activity(event_id='AQICfXBljgG+pYXCTRrwzy6MqgAAAAA=', event_type='start_listening', region_info=RegionInfo(city='cityNameHere', country_code='US', region_code='CA', time_zone='America/Los_Angeles'), fingerprint=Fingerprint(os='Mac OS X', os_version='16.1.0', browser='Discord Client', ip='216.58.195.78', isp=None, device=None, distro=None), timestamp=datetime.datetime(2016, 11, 26, 7, 8, 47))

Each of these returns a Generator, so they only read from the (giant) JSON files as needed. If you want to process all the data, you can call list on it to consume the whole generator:

from discord_data import parse_messages, parse_activity
msg = list(parse_messages("./discord/october_2020/messages"))
acts = list(parse_activity("./discord/october_2020/activity"))
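Because the parse functions are generators, you can also compute aggregates in a single pass without building a list first. Here is a minimal sketch, assuming message objects expose the `channel.name` and `channel.server_name` attributes shown in the example output above. The `Channel` and `Message` tuples are simplified stand-ins so the snippet runs without an export, and `count_by_channel` and `fake_messages` are hypothetical helpers, not part of discord_data:

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable, Iterator, NamedTuple, Union


class Channel(NamedTuple):
    # stand-in mirroring the fields in the example output above
    cid: str
    name: str
    server_name: str


class Message(NamedTuple):
    # stand-in mirroring the fields in the example output above
    mid: str
    dt: datetime
    channel: Channel
    content: str
    attachments: str


def count_by_channel(messages: Iterable[Union[Message, Exception]]) -> Counter:
    """Count messages per (server, channel), consuming the iterable once."""
    counts: Counter = Counter()
    for msg in messages:
        if isinstance(msg, Exception):
            # skip any errors yielded in place of a message
            continue
        counts[(msg.channel.server_name, msg.channel.name)] += 1
    return counts


def fake_messages() -> Iterator[Message]:
    """Hypothetical generator standing in for parse_messages(...)."""
    general = Channel(cid="1", name="general", server_name="Dream World")
    random_ = Channel(cid="2", name="random", server_name="Dream World")
    now = datetime(2020, 8, 25, tzinfo=timezone.utc)
    yield Message("10", now, general, "hi", "")
    yield Message("11", now, random_, "hello", "")
    yield Message("12", now, general, "bye", "")


counts = count_by_channel(fake_messages())
```

With a real export, you would pass the result of `parse_messages(...)` in place of `fake_messages()`.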

The raw activity data includes lots of additional fields; the parsed output only includes the items I thought would be useful. If you want to parse the JSON blobs yourself, you can do so with parse_raw_activity (from discord_data import parse_raw_activity).

If you just want to quickly load the parsed data into a REPL:

python3 -m discord_data ./discord/october_2020

That drops you into a Python shell with access to the activity and messages variables, which hold the parsed data.

Or, to dump it to JSON:

python3 -m discord_data ./discord/october_2020 -o json > discord_data.json

Merge Exports

Exports seem to be complete, but when a server or channel is deleted, all messages in that channel are deleted permanently, so I'd recommend periodically doing an export to make sure you don't lose anything.

I recommend you organize your exports like this:

discord
├── march_2021
│   ├── account
│   ├── activity
│   ├── messages
│   ├── programs
│   ├── README.txt
│   └── servers
└── october_2020
    ├── account
    ├── activity
    ├── messages
    ├── programs
    ├── README.txt
    └── servers
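If you want to check which directories a layout like the one above contains, the per-export subdirectories can be listed with pathlib. This is a rough sketch for illustration only: `find_export_subdirs` is a hypothetical helper, and the library's own directory discovery may work differently.

```python
from pathlib import Path
from typing import List


def find_export_subdirs(export_dir: str, name: str) -> List[Path]:
    """Return every <export_dir>/<export>/<name> directory, sorted by path."""
    root = Path(export_dir)
    # "*/name" matches exactly one level of export folders (e.g. march_2021)
    return sorted(p for p in root.glob(f"*/{name}") if p.is_dir())
```

For example, `find_export_subdirs("./discord", "messages")` would return the two `messages` directories in the tree above.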

The discord folder at the top is the export_dir keyword argument to the merge_activity and merge_messages functions, which call the underlying parse functions.

You can supply either export_dir or paths:

from discord_data import merge_messages

# locates the corresponding `messages` directories in the folder structure
list(merge_messages(export_dir="./discord"))
# supply a list of the message directories yourself
list(merge_messages(paths=["./discord/march_2021/messages", "./discord/october_2020/messages"]))
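Since successive exports largely overlap, merging amounts to chaining them together and dropping repeats. Here is a rough sketch of that idea, assuming the message ID (`mid` in the example output) uniquely identifies a message. `merge_unique` is a hypothetical helper and the dicts are toy data; this may not match how merge_messages is implemented internally.

```python
from typing import Callable, Hashable, Iterable, Iterator, Set, TypeVar

T = TypeVar("T")


def merge_unique(
    sources: Iterable[Iterable[T]], key: Callable[[T], Hashable]
) -> Iterator[T]:
    """Yield items from each source in order, skipping keys already seen."""
    seen: Set[Hashable] = set()
    for source in sources:
        for item in source:
            k = key(item)
            if k in seen:
                continue
            seen.add(k)
            yield item


# toy stand-ins for two parsed exports; message "2" appears in both
october = [{"mid": "1", "content": "hello"}, {"mid": "2", "content": "world"}]
march = [{"mid": "2", "content": "world"}, {"mid": "3", "content": "again"}]

merged = list(merge_unique([october, march], key=lambda m: m["mid"]))
```

Keeping the first occurrence means a message deleted before the later export still survives in the merged result.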

If the format for the discord export changes, the parse/merge functions will still work; they just might yield errors as part of their output. To log and skip those, you can do:

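import logging
from discord_data import merge_messages

logger = logging.getLogger(__name__)

for msg in merge_messages(export_dir="./discord"):
    if isinstance(msg, Exception):
        # the parser yielded an error instead of a message; log it and move on
        logger.warning(msg)
        continue
    # do something with msg
    print(msg.content)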
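If you use the pattern in the loop above in several places, it can be wrapped in a small generator that sets the errors aside for later inspection. This is a minimal sketch: `drop_errors` is a hypothetical helper, not part of discord_data, and the strings below stand in for parsed messages.

```python
from typing import Iterable, Iterator, List, TypeVar, Union

T = TypeVar("T")


def drop_errors(
    results: Iterable[Union[T, Exception]], errors: List[Exception]
) -> Iterator[T]:
    """Yield only the non-error items, appending any exceptions to `errors`."""
    for item in results:
        if isinstance(item, Exception):
            errors.append(item)
        else:
            yield item


# toy input: two results with one error in between
errors: List[Exception] = []
results = ["first", ValueError("unexpected format"), "second"]
clean = list(drop_errors(results, errors))
```

With real data, `results` would be the output of `merge_messages(...)` or one of the parse functions.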

Created to be used as part of HPI.