Important
Any use of this code and the data obtained with its help must adhere to Bluesky's Terms of Service and Community Guidelines.
In particular, you are not allowed to distribute any of the data without explicit permission from the user that piece of data belongs to.
We also do not condone any use of the data obtained with this code for the purposes of:
- training ML/AI models without explicit consent of the users who own the data
- aiding any kind of harassment campaigns against anyone
This is a bunch of code that can download all of Bluesky into a giant table in PostgreSQL.
The structure of that table is roughly (repo, collection, rkey) -> JSON, and it is a good idea to partition it by collection.
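To make the key shape concrete, here is a minimal Python sketch that splits an AT URI of the form at://repo/collection/rkey into that triple. The names `RecordKey` and `parse_at_uri` and the sample URI are illustrative assumptions and are not taken from this codebase.

```python
from typing import NamedTuple


class RecordKey(NamedTuple):
    """Hypothetical in-memory form of the (repo, collection, rkey) key."""
    repo: str        # DID of the account that owns the record
    collection: str  # record type, e.g. app.bsky.feed.post
    rkey: str        # record key within the collection


def parse_at_uri(uri: str) -> RecordKey:
    """Split at://<repo>/<collection>/<rkey> into its three parts."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected at://repo/collection/rkey, got {uri!r}")
    return RecordKey(*parts)


# Made-up sample URI, for illustration only.
key = parse_at_uri("at://did:plc:example/app.bsky.feed.post/3kabc")
print(key.collection)
```

The JSON value stored under each key is the record body itself; the sketch only covers the key side of the mapping.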
NOTE: all of this is valid as of December 2024, when Bluesky had ~24M accounts, ~4.7B records in total, and an average daily peak of ~1000 commits/s.
- Local PLC mirror. Without it you'll get throttled hard all the time. So go to https://github.com/bsky-watch/plc-mirror and set it up now. It'll need a few hours to replicate everything and become useable.
- 32GB of RAM, but the more the better, obviously.
- One decent SATA SSD is plenty fast to keep up. Preferably a dedicated one (definitely not the one your system is installed on). There will be a lot of writes happening, so the disk's write endurance will be used up at a non-negligible rate.
- XFS is mandatory to get decent performance out of ScyllaDB.
With a SATA SSD dedicated to ScyllaDB it can handle about 6000 commits/s from the firehose. The actual number you get might be lower if your CPU is not fast enough.
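As a rough sanity check, the figures quoted above can be combined into a back-of-the-envelope estimate. This Python sketch only does arithmetic on the numbers from this README. The backfill figure rests on the loudly labeled assumption that storing one backfilled record costs about as much as one firehose commit, which may well not hold in practice.

```python
PEAK_COMMITS_PER_SEC = 1_000      # average daily peak quoted in this README
CAPACITY_COMMITS_PER_SEC = 6_000  # throughput quoted for one dedicated SATA SSD
TOTAL_RECORDS = 4_700_000_000     # ~4.7B records quoted in this README

# How much faster than the daily peak the database can ingest.
headroom = CAPACITY_COMMITS_PER_SEC / PEAK_COMMITS_PER_SEC

# ASSUMPTION: one backfilled record costs roughly as much as one commit, and
# nothing else (network, PDS rate limits, CPU) is the bottleneck. Treat the
# result as an optimistic lower bound, not a prediction.
backfill_seconds = TOTAL_RECORDS / CAPACITY_COMMITS_PER_SEC
backfill_days = backfill_seconds / 86_400

print(f"headroom over peak: {headroom:.0f}x")
print(f"optimistic backfill time: {backfill_days:.1f} days")
```

Under these assumptions the headroom is 6x and the optimistic backfill time is about 9.1 days.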
Once a day gets a list of all repos from all known PDSs and adds any that are missing to the database.
Connects to the firehose of each PDS and stores all received records in the database.
If CONSUMER_RELAYS is specified, it will also add any new PDSs to the database that have records sent through a relay.
Goes over all repos that might have missing data, gets a full checkout from the PDS and adds all missing records to the database.
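The "add all missing records" step can be pictured as a set difference between what a full checkout contains and what is already stored. The Python sketch below illustrates only that idea. The function name, the key shape and the sample data are assumptions, and it is not the indexer's actual implementation.

```python
from typing import Dict, Set, Tuple

# Hypothetical key of a record inside a single repo: (collection, rkey).
Key = Tuple[str, str]


def missing_records(checkout: Dict[Key, dict], stored: Set[Key]) -> Dict[Key, dict]:
    """Return the records present in the checkout but absent from the database."""
    return {key: record for key, record in checkout.items() if key not in stored}


# Made-up example: the checkout has two posts, the database already has one.
checkout = {
    ("app.bsky.feed.post", "3kaaa"): {"text": "first"},
    ("app.bsky.feed.post", "3kaab"): {"text": "second"},
}
stored = {("app.bsky.feed.post", "3kaaa")}

to_insert = missing_records(checkout, stored)
print(sorted(to_insert))
```

Only the second post would be written in this example; records that are already stored are left alone.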
- Set up a PLC mirror. It'll need a few hours to fetch all the data. You can use any other implementation too; the only requirement is that a request to /${did} returns a DID document.
- Decide where you want to store the data. It needs to be on XFS, otherwise ScyllaDB's performance will be very poor.
- Copy example.env to .env and edit it to your liking. POSTGRES_PASSWORD can be anything, it will be used on the first start of the postgres container to initialize the database.
- Optional: copy docker-compose.override.yml.example to docker-compose.override.yml to change some parts of docker-compose.yml without actually editing it (and introducing the possibility of merge conflicts later on).
- make init-db
  - This will add the initial set of PDS hosts into the database.
  - You can skip this if you're specifying CONSUMER_RELAYS in docker-compose.override.yml
- make up
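The one requirement placed on the PLC mirror in the steps above (a request to /${did} returns a DID document) can be checked with a few lines of Python. This is a hedged sketch: the mirror address `http://localhost:8080`, the function names and the sample document are assumptions for illustration, and the lookup relies on the usual atproto convention of a service entry of type `AtprotoPersonalDataServer`.

```python
import json
import urllib.request
from typing import Optional


def did_document_url(mirror_base: str, did: str) -> str:
    """Build the /${did} URL for a PLC mirror (mirror_base is an assumption)."""
    return f"{mirror_base.rstrip('/')}/{did}"


def pds_endpoint(did_doc: dict) -> Optional[str]:
    """Return the PDS URL listed in a DID document, or None if there is none."""
    for service in did_doc.get("service", []):
        if service.get("type") == "AtprotoPersonalDataServer":
            return service.get("serviceEndpoint")
    return None


def fetch_did_document(mirror_base: str, did: str) -> dict:
    """Fetch and decode a DID document from the mirror (performs a network call)."""
    with urllib.request.urlopen(did_document_url(mirror_base, did), timeout=10) as resp:
        return json.load(resp)


# Offline demonstration with a made-up DID document.
sample_doc = {
    "id": "did:plc:example",
    "service": [
        {
            "id": "#atproto_pds",
            "type": "AtprotoPersonalDataServer",
            "serviceEndpoint": "https://pds.example.com",
        }
    ],
}
print(did_document_url("http://localhost:8080", "did:plc:example"))
print(pds_endpoint(sample_doc))
```

Against a running mirror, `pds_endpoint(fetch_did_document(base, did))` should return a PDS URL for any known account; an error or an empty result suggests the mirror has not finished replicating or does not serve documents at that path.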
- make status - will show container status and resource usage
- make psql - starts up SQL shell inside the postgres container
- make logs - streams container logs into your terminal
- make sqltop - will show you currently running queries
- make sqldu - will show disk space usage for each table and index
The record indexer exposes a simple HTTP handler that lets you resize its pool while it is running:
curl -s 'http://localhost:11003/pool/resize?size=10'
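If you would rather script this than call curl by hand, the same request can be made from Python. This is a minimal sketch that mirrors the curl command above and nothing more: the function names are made up, and the port and path are copied from that command.

```python
import urllib.parse
import urllib.request


def resize_url(size: int, base: str = "http://localhost:11003") -> str:
    """Build the same URL the curl example above requests."""
    query = urllib.parse.urlencode({"size": size})
    return f"{base}/pool/resize?{query}"


def resize_pool(size: int, base: str = "http://localhost:11003") -> int:
    """Send the request and return the HTTP status code (performs a network call)."""
    with urllib.request.urlopen(resize_url(size, base), timeout=10) as resp:
        return resp.status


print(resize_url(10))
```

Calling `resize_pool(10)` needs the record indexer to be running and reachable on that port.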
With partitioning by collection you can have separate indexes for each record
type. Heavy processing of a particular record type will also be faster,
because all of those records will be in a separate table and PostgreSQL will
just read them sequentially, instead of checking the collection column for
each row.
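The effect can be pictured with a toy model in Python. It is only an illustration of why a per-collection table avoids looking at unrelated rows. It says nothing about PostgreSQL internals, and the rows and function names are made up.

```python
from collections import defaultdict

# Hypothetical rows: (repo, collection, rkey, record JSON).
rows = [
    ("did:plc:alice", "app.bsky.feed.post", "3kaaa", {"text": "hello"}),
    ("did:plc:alice", "app.bsky.feed.like", "3kaab", {}),
    ("did:plc:bob", "app.bsky.feed.post", "3kaac", {"text": "hi"}),
    ("did:plc:bob", "app.bsky.graph.follow", "3kaad", {}),
]


def scan_unpartitioned(rows, collection):
    """Check the collection column of every row; return (matches, rows examined)."""
    examined = 0
    matched = []
    for row in rows:
        examined += 1
        if row[1] == collection:
            matched.append(row)
    return matched, examined


def partition_by_collection(rows):
    """Group rows into one list per collection, like one table per partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[1]].append(row)
    return partitions


def scan_partitioned(partitions, collection):
    """Read only the matching partition; return (matches, rows examined)."""
    matched = list(partitions.get(collection, []))
    return matched, len(matched)


posts_a, examined_a = scan_unpartitioned(rows, "app.bsky.feed.post")
posts_b, examined_b = scan_partitioned(partition_by_collection(rows), "app.bsky.feed.post")
print(examined_a, examined_b)
```

Both scans return the same two posts, but the unpartitioned one examines all four rows while the partitioned one examines only the two it needs.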
You can do the partitioning at any point, but the more data you already have in the database, the longer it will take.
Before doing this you need to run lister at least once in order to create the tables (make init-db does this for you as well).
- Stop all containers except for postgres.
- Run the SQL script in psql.
- Check migrations dir for any additional migrations you might be interested in.
- Once all is done, start the other containers again.