derat/yambs

Generate enums from database dump

Closed this issue · 3 comments

derat commented

The types in seed/enums.go are woefully incomplete. I'm currently generating them from the following files:

I think that the only source of truth is the MusicBrainz database itself, but I believe that the data dump file containing the relevant (tiny!) tables is mbdump.tar.bz2, which is 4.4 GB in the 20230121-001933 dump. Even discounting the time and bandwidth needed to download it, extracting the tables takes forever due to tar archives just being streams without tables of contents.

However, I just learned from MBS-8736 that monthly "sample" dumps are available at http://ftp.musicbrainz.org/pub/musicbrainz/data/sample/. Looking at http://ftp.musicbrainz.org/pub/musicbrainz/data/sample/20230101-000001/mbdump-sample.tar.xz (266 MB), I think that the full contents of tables like release_status and work_type are included. Even better, these are easy-to-parse TSV files instead of the ugly SQL INSERT INTO statements and PO files that I was parsing before.

So, I should update seed/gen/gen_enums.go to get its data from mbdump-sample.tar.xz.

derat commented

Well, this simplifies the code and definitely finds some new values...

Before (2317 lines, 137 KB):

    544 Language
     50 LinkAttributeType
    430 LinkType
     60 MediumFormat
     15 ReleaseGroupType
      6 ReleasePackaging
      4 ReleaseStatus
     73 WorkAttributeType
     29 WorkType

After (13090 lines, 896 KB):

   7844 Language
   1108 LinkAttributeType
    578 LinkType
     98 MediumFormat
     16 ReleaseGroupType
     22 ReleasePackaging
      6 ReleaseStatus
     73 WorkAttributeType
     29 WorkType

The increase in executable size is negligible (I suspect that most of these enums are being dropped since they aren't referenced in the code), so it's probably just a question of whether these long lists will be hard to use and whether pkg.go.dev will have trouble generating docs from such a big file.

derat commented

Some thoughts about how to prune the results:

  • 20 of the rows in the language table have frequency 2, 441 have frequency 1, and 7383 have frequency 0.
  • 1014 of the rows in the link_attribute_type table look like they're instruments (by virtue of having a root or parent column containing 14). I'm not sure if there's any way to just include the most-common instruments; I can join with the instrument table by UUID but I don't see any frequency data there. Maybe I'll just drop instruments for now.
  • There's probably no harm in dropping link_type rows that don't have an entity_type0 or entity_type1 of recording, release, release_group, or work, since I'm assuming they can't be used when seeding.
derat commented

Moving in the right direction:

    461 Language
   1108 LinkAttributeType
    329 LinkType
     98 MediumFormat
     16 ReleaseGroupType
     22 ReleasePackaging
      6 ReleaseStatus
     73 WorkAttributeType
     29 WorkType

I'm still not sure what to do with all the instruments in link_attribute_type. https://musicbrainz.org/recording/create looks like it deals with this by just providing a text field and showing matching instruments after the user types something.