Analyzing the size of Homebrew formulae bottles
The Homebrew formula JSON API does not provide package size information for bottles [1].
I aim to retrieve package sizes regularly in order to build a database of (package, version, bottle_arch) -> size pairs for future analysis.
This analysis could capture:
- Package growth over time
- Total estimated archive size
- Spikes in package size indicating significant changes warranting further inspection
- Expired or broken package URLs for rarely-updated formulae with rarely-downloaded bottles
- Packages to target for size optimization, from individual relief to humanity-scale savings.
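As a rough illustration of the archive-size and optimization questions above, here is a minimal sketch. It assumes a hypothetical layout in which each retrieved size lives in its own `.size` file containing a single byte count; the function names `total_bytes` and `largest` are made up for this example.

```sh
# Sketch only: assumes every *.size file under a directory holds one byte count.

# total_bytes DIR: sum of all recorded sizes, in bytes.
total_bytes() {
  find "$1" -type f -name '*.size' -exec cat {} + |
    awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}

# largest DIR N: the N biggest entries as "bytes<TAB>path", biggest first.
largest() {
  find "$1" -type f -name '*.size' -exec awk '{ print $1 "\t" FILENAME }' {} + |
    sort -rn |
    head -n "$2"
}
```

Something like `largest data 20` would then give a first list of candidates for size optimization.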
For now, this is mostly an experiment in doing some data engineering and data science with simple CLI tools like Make and curl, with the useful implications listed above. The guiding principles:
- KISS, to the level of probably dumb.
- Use CLI tools as much as possible; keep code to a minimum.
- Anything that can be installed via Homebrew is fair to use.
- Prioritize concurrency using simple tools such as Make -j, xargs, parallel, fd, ripgrep, etc.
- Rebuilding the database from scratch means losing data, so avoid that.
make formula.json # get the data file
make urls # split it out
make sizes # get the sizes
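For illustration, the `make urls` step might be approximated as follows. This is a sketch, not the Makefile's actual recipe: it assumes the API's JSON is an array of formulae with `name`, `versions.stable`, and `bottle.stable.files.<arch>.url` fields, and the `DIR/<package>/<version>/<arch>.url` layout and both function names are hypothetical.

```sh
# extract_triplets FILE: print "name<TAB>version<TAB>arch<TAB>url" per bottle.
# Requires jq; the field paths are assumptions about the JSON's shape.
extract_triplets() {
  jq -r '.[]
    | .name as $name
    | .versions.stable as $version
    | (.bottle.stable.files // {})
    | to_entries[]
    | [$name, $version, .key, .value.url]
    | @tsv' "$1"
}

# write_url_files DIR: read those TSV rows on stdin and write one file per URL,
# skipping files that already exist so earlier results are never clobbered.
write_url_files() {
  while IFS="$(printf '\t')" read -r name version arch url; do
    dir="$1/$name/$version"
    mkdir -p "$dir"
    [ -e "$dir/$arch.url" ] || printf '%s\n' "$url" > "$dir/$arch.url"
  done
}
```

The two would be chained as `extract_triplets formula.json | write_url_files data`. Keeping one small file per URL is what lets the later stage run in parallel and resume where it left off.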
stateDiagram-v2
Formulajson : Homebrew API \n formula.json
Urls : One URL file per URL
Database : Database (Unspecified Format)
[*] --> Formulajson : retrieve latest database
Formulajson --> Urls : extract triplets, write URL files
state fork_state <<fork>>
state join_state <<join>>
Urls --> fork_state : list URL files
fork_state --> HTTPRequest1 : retrieve package size
fork_state --> HTTPRequest2 : through HEAD requests
fork_state --> HTTPRequestN : to all URLs in files
HTTPRequest1 --> Sizes1 : write size file
HTTPRequest2 --> Sizes2 : write size file
HTTPRequestN --> SizesN : write size file
Sizes1 --> join_state
Sizes2 --> join_state
SizesN --> join_state
join_state --> Database
Database --> [*]
note left of fork_state
One size file per retrieved URL
end note
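The fork/join stage in the diagram could be sketched with `find` and `xargs -P`, which runs a fixed number of requests at once. This is an illustration under assumptions, not the repository's actual recipe: the `.url`/`.size` naming, the `content_length` and `fetch_sizes` names, and the choice to trust the `Content-Length` of the final `200` response are all illustrative, and any authentication the registry may require is not handled.

```sh
# awk program: remember the Content-Length of the most recent response and print
# it only if that response had status 200 (redirects emit several header blocks).
LAST_LENGTH='
  /^HTTP\//                        { status = $2 + 0; len = "" }
  tolower($1) == "content-length:" { gsub(/\r/, "", $2); len = $2 }
  END                              { if (status == 200 && len != "") print len }
'

# content_length: read HTTP response headers on stdin, print the final size.
content_length() { awk "$LAST_LENGTH"; }

# fetch_sizes DIR [JOBS]: HEAD every URL that has no size file yet, JOBS at a time.
fetch_sizes() {
  find "$1" -type f -name '*.url' -print0 |
    LAST_LENGTH="$LAST_LENGTH" xargs -0 -n 1 -P "${2:-2}" sh -c '
      size_file="${1%.url}.size"
      [ -s "$size_file" ] && exit 0   # already retrieved: keep it
      curl -sIL --max-time 30 "$(cat "$1")" | awk "$LAST_LENGTH" > "$size_file.tmp"
      if [ -s "$size_file.tmp" ]; then mv "$size_file.tmp" "$size_file"
      else rm -f "$size_file.tmp"; fi
    ' sh
}
```

Running `fetch_sizes data 2` mirrors the two-requests-at-a-time setting. Because existing size files are skipped and new ones are written through a temporary file, an interrupted run can simply be started again.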
A full run takes me around 80 minutes at two requests at a time, a limit I chose so as not to trigger what looks like some kind of rate limiting at my ISP level [2].
You can check the counts of URL and size files by running something like this:
fd -e url . data | wc -l
fd -e size . data | wc -l
If the numbers are the same, you've got the data for the current formula.json.
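When the two counts differ, it helps to see which URLs still lack a size. A minimal sketch, assuming each `NAME.url` file is paired with a `NAME.size` file beside it (the `missing_sizes` name is hypothetical):

```sh
# missing_sizes DIR: list every URL file with no non-empty size file next to it.
missing_sizes() {
  find "$1" -type f -name '*.url' | while IFS= read -r url_file; do
    [ -s "${url_file%.url}.size" ] || printf '%s\n' "$url_file"
  done
}
```

`missing_sizes data | wc -l` should then equal the difference between the two counts, as long as no size file is empty or orphaned.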
Footnotes
1. A bottle is a pre-packaged archive of a formula available in Homebrew. See https://docs.brew.sh/Bottles for more information.
2. It's not ghcr.io rate-limiting me. My gateway is working fine, but my ISP drops the upstream connection. It's probably some kind of DDoS protection at the DNS level. See notes.txt for ways I might get around this, since curl does a DNS lookup every time it launches.