Generate enums from database dump
The types in `seed/enums.go` are woefully incomplete. I'm currently generating them from the following files:
- https://raw.githubusercontent.com/metabrainz/musicbrainz-server/master/po/attributes.pot
- https://raw.githubusercontent.com/metabrainz/musicbrainz-server/master/po/languages.pot
- https://raw.githubusercontent.com/metabrainz/musicbrainz-server/master/t/sql/initial.sql
I think that the only source of truth is the MusicBrainz database itself, but I believe that the data dump file containing the relevant (tiny!) tables is `mbdump.tar.bz2`, which is 4.4 GB in the 20230121-001933 dump. Even discounting the time and bandwidth needed to download it, extracting the tables takes forever because tar archives are just streams without a table of contents.
However, I just learned from MBS-8736 that monthly "sample" dumps are available at http://ftp.musicbrainz.org/pub/musicbrainz/data/sample/. Looking at http://ftp.musicbrainz.org/pub/musicbrainz/data/sample/20230101-000001/mbdump-sample.tar.xz (266 MB), I think that the full contents of tables like `release_status` and `work_type` are included. Even better, these are easy-to-parse TSV files instead of the ugly SQL `INSERT INTO` statements and PO files that I was parsing before.
So, I should update `seed/gen/gen_enums.go` to get its data from `mbdump-sample.tar.xz`.
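For reference, here's a minimal sketch of how the generator could stream table rows out of the sample dump. It assumes the third-party `github.com/ulikunitz/xz` package for decompression; the `mbdump/release_status` entry path and the raw-column printing are placeholders rather than the actual `gen_enums.go` logic:

```go
// Minimal sketch: stream one table's rows out of mbdump-sample.tar.xz.
package main

import (
	"archive/tar"
	"bufio"
	"fmt"
	"io"
	"log"
	"os"
	"strings"

	"github.com/ulikunitz/xz" // third-party xz decompressor (assumption)
)

func main() {
	f, err := os.Open("mbdump-sample.tar.xz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	xzr, err := xz.NewReader(bufio.NewReader(f))
	if err != nil {
		log.Fatal(err)
	}
	tr := tar.NewReader(xzr)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		} else if err != nil {
			log.Fatal(err)
		}
		if hdr.Name != "mbdump/release_status" { // assumed entry name
			continue
		}
		sc := bufio.NewScanner(tr)
		for sc.Scan() {
			// Columns are tab-separated; NULL is spelled \N if the dump
			// uses PostgreSQL's COPY text format (assumption).
			cols := strings.Split(sc.Text(), "\t")
			fmt.Println(cols)
		}
		if err := sc.Err(); err != nil {
			log.Fatal(err)
		}
	}
}
```

Streaming through `tar.Reader` avoids extracting the whole archive to disk, though every entry before the wanted one still has to be read past.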
Well, this simplifies the code and definitely finds some new values...
Before (2317 lines, 137 KB):

```
 544 Language
  50 LinkAttributeType
 430 LinkType
  60 MediumFormat
  15 ReleaseGroupType
   6 ReleasePackaging
   4 ReleaseStatus
  73 WorkAttributeType
  29 WorkType
```
After (13090 lines, 896 KB):

```
 7844 Language
 1108 LinkAttributeType
  578 LinkType
   98 MediumFormat
   16 ReleaseGroupType
   22 ReleasePackaging
    6 ReleaseStatus
   73 WorkAttributeType
   29 WorkType
```
The increase in executable size is negligible (I suspect that most of these enums are being dropped since they aren't referenced in the code), so it's probably just a question of whether these long lists will be hard to use and whether pkg.go.dev will have trouble generating docs from such a big file.
Some thoughts about how to prune the results (a rough filter sketch in Go follows this list):
- 20 of the rows in the `language` table have `frequency` 2, 441 have `frequency` 1, and 7383 have `frequency` 0.
- 1014 of the rows in the `link_attribute_type` table look like they're instruments (by virtue of having a `root` or `parent` column containing 14). I'm not sure if there's any way to just include the most-common instruments; I can join with the `instrument` table by UUID but I don't see any frequency data there. Maybe I'll just drop instruments for now.
- There's probably no harm in dropping `link_type` rows that don't have an `entity_type0` or `entity_type1` of `recording`, `release`, `release_group`, or `work`, since I'm assuming they can't be used when seeding.
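A rough sketch of what those pruning filters might look like, assuming hypothetical row structs for the parsed dump tables (the struct fields, the \N-to-zero handling, and the instrument root ID of 14 come from the notes above, not from the real `gen_enums.go`):

```go
// Hypothetical row types and pruning filters; field names are assumptions.
package gen

type languageRow struct {
	Name      string
	Frequency int // the language table's frequency column
}

type linkAttributeTypeRow struct {
	Name   string
	Root   int // \N (NULL) assumed to be parsed as 0
	Parent int
}

type linkTypeRow struct {
	Name        string
	EntityType0 string
	EntityType1 string
}

// instrumentRootID is the assumed root/parent ID marking instrument rows.
const instrumentRootID = 14

// seedableEntities are the entity types that matter when seeding.
var seedableEntities = map[string]bool{
	"recording":     true,
	"release":       true,
	"release_group": true,
	"work":          true,
}

// keepLanguage keeps only languages with frequency 1 or 2.
func keepLanguage(r languageRow) bool { return r.Frequency >= 1 }

// keepLinkAttributeType drops rows in the instrument subtree.
func keepLinkAttributeType(r linkAttributeTypeRow) bool {
	return r.Root != instrumentRootID && r.Parent != instrumentRootID
}

// keepLinkType keeps link types that can involve a seedable entity.
func keepLinkType(r linkTypeRow) bool {
	return seedableEntities[r.EntityType0] || seedableEntities[r.EntityType1]
}
```

Filtering languages to `frequency` >= 1 lines up with the 461 Language count below (441 + 20).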
Moving in the right direction:

```
 461 Language
1108 LinkAttributeType
 329 LinkType
  98 MediumFormat
  16 ReleaseGroupType
  22 ReleasePackaging
   6 ReleaseStatus
  73 WorkAttributeType
  29 WorkType
```
I'm still not sure what to do with all the instruments in `link_attribute_type`. https://musicbrainz.org/recording/create looks like it deals with this by just providing a text field and showing matching instruments after the user types something.