rust-lang/crates.io

Experimental database dumps changelog

pietroalbini opened this issue · 12 comments

This is a low-traffic issue tracking all changes to the experimental database dumps. We recommend subscribing to this issue to get notified whenever we change the contents of the dumps.

The next crates.io deploy (happening in the next few days) will include the following changes to the database dumps:

  • PR #3612: The textsearchable_index_col column will be removed from crates.csv, as that column is an implementation detail of crates.io's search. Users importing the database dumps into a PostgreSQL database will not be affected by this change, as a trigger will populate that column at import time.
  • PR #3611: The version_downloads.csv file will only include the last 90 days of data instead of full day-to-day historical data. Cumulative download counts are still available in crates.csv and versions.csv.
  • PR #3549: The version_authors.csv file will be removed, as that data was deleted from the crates.io database too.
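
Since version_downloads.csv now covers only a rolling window, consumers that need "recent" download numbers can aggregate the per-day rows themselves. The following is a minimal sketch, not part of crates.io: the column names `date`, `downloads` and `version_id` are assumptions about the CSV header, and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def recent_downloads_by_version(csv_file):
    """Sum the per-day rows of a version_downloads-style CSV per version."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        # Column names are assumed; check the header of the real file.
        totals[row["version_id"]] += int(row["downloads"])
    return dict(totals)


# Tiny made-up sample standing in for version_downloads.csv.
sample = io.StringIO(
    "date,downloads,version_id\n"
    "2021-05-01,10,1\n"
    "2021-05-02,5,1\n"
    "2021-05-01,7,2\n"
)
recent = recent_downloads_by_version(sample)
print(recent)
```

The result only reflects the days present in the file. All-time totals should still be taken from the cumulative counts in crates.csv and versions.csv.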

We also plan to make the following changes in the future:

  • Issue #3479: all the data from version_downloads.csv will be moved out of the database dump into separate files, one for each day. This will allow clients interested in this data to download it separately.

Two relevant changes were just deployed:

  • #8155 will delete the badges table
  • #8232 added a new crate_downloads table, which is supposed to replace the crates.downloads column soon. this was done for performance reasons to reduce the amount of bloat in the crates table from the regular downloads column updates. at the moment the data should be in sync, but if everything works out we will stop writing to the crates.downloads column in the near future and eventually remove it.
  • as mentioned in the last update, #8295 is going to disable writes to the crates.downloads column. we will keep the column around for now to avoid unnecessary schema churn, but once the system has shown the expected performance benefits we will most likely remove the column completely.
  • once #8233 is merged and deployed it will remove the crates.downloads column. please use the crate_downloads table instead.
  • #8484 will introduce a new experimental default_versions table with a mapping from crates to their "default" version, i.e. the version that will be shown by the frontend and used in e.g. reverse dependency queries.
  • #8748 added an experimental ZIP file artifact at https://static.crates.io/db-dump.zip. the advantage of this format is that you do not have to decompress the entire archive if you only need the CSV file of a single database table. compared to the tarball, the ZIP file does not have a top-level datetime path prefix; otherwise the files should contain the exact same data.
  • #9756 added a num_no_build column to the versions table
  • #9786 changed the num_no_build column to be "private" since it is reasonably easy to reconstruct it locally if needed (thanks @dtolnay)

in other words: no changes (except for the few days in between the PRs) and sorry for forgetting to mention the change here 🙈
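
For anyone moving from crates.downloads to the crate_downloads table and using the ZIP artifact, the two changes combine roughly as sketched below. This is only an illustration under assumptions: the member paths `data/crates.csv` and `data/crate_downloads.csv`, the column names `id`, `name`, `crate_id` and `downloads`, and the helper names are not confirmed by this issue, and the archive here is a tiny in-memory stand-in with made-up rows rather than the real dump.

```python
import csv
import io
import zipfile


def read_table(archive, member):
    """Read a single CSV member of a ZIP archive into a list of dicts.

    Only the requested member is decompressed, not the whole archive.
    """
    with archive.open(member) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        return list(csv.DictReader(text))


def downloads_by_crate_name(archive, prefix="data/"):
    """Join crates.csv with crate_downloads.csv on the crate id.

    The prefix and the column names are assumptions; adjust them to
    whatever the real archive contains.
    """
    names = {row["id"]: row["name"] for row in read_table(archive, prefix + "crates.csv")}
    return {
        names[row["crate_id"]]: int(row["downloads"])
        for row in read_table(archive, prefix + "crate_downloads.csv")
        if row["crate_id"] in names
    }


# Build a small in-memory archive with made-up rows instead of downloading the dump.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("data/crates.csv", "id,name\n1,alpha\n2,beta\n")
    zf.writestr("data/crate_downloads.csv", "crate_id,downloads\n1,300\n2,120\n")

with zipfile.ZipFile(buffer) as zf:
    totals = downloads_by_crate_name(zf)
print(totals)
```

With a downloaded copy of the dump, `zipfile.ZipFile("db-dump.zip")` would take the place of the in-memory buffer. Listing `zf.namelist()` first is a cheap way to confirm the actual member paths.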

  • #9932 added a new edition column to the versions table
  • #9998 adds a couple more columns to the versions table. the data will be backfilled from the saved crate files in the next couple of days. this will likely increase the size of the database dump, but with deduplicating compression it hopefully should be manageable.
  • #10107 reverts the num_no_build column to be "public" again, to fix the nullability issue during imports
  • #10078 adds categories and keywords columns to the versions table
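
Because columns in the versions table come and go between dumps, as the entries above show, it may be safer for consumers to read the CSV files by header name and treat newer columns as optional. A minimal sketch of that idea follows. The `id` and `num` column names and the `read_versions` helper are assumptions for illustration, and the sample rows are made up.

```python
import csv
import io

# Columns mentioned in this changelog that older dumps may not contain.
OPTIONAL_COLUMNS = ("edition", "num_no_build", "categories", "keywords")


def read_versions(csv_file):
    """Read a versions-style CSV by header name, tolerating missing columns."""
    records = []
    for row in csv.DictReader(csv_file):
        # "id" and "num" are assumed to always be present.
        record = {"id": row["id"], "num": row["num"]}
        for column in OPTIONAL_COLUMNS:
            # Missing columns and empty cells both become None.
            record[column] = row.get(column) or None
        records.append(record)
    return records


# Made-up samples: one without and one with the newer edition column.
older = read_versions(io.StringIO("id,num\n1,1.0.0\n"))
newer = read_versions(io.StringIO("id,num,edition\n2,0.2.0,2021\n"))
print(older[0]["edition"], newer[0]["edition"])
```

Reading by name rather than by position means a newly added or reordered column does not silently shift the values a consumer extracts.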