Simplify popularity counting
tazjin opened this issue · 2 comments
popcount currently uses a silly approach to create the popularity count for store paths, namely by evaluating every expression in nixpkgs in sequence and writing out its path info via structured attributes to $out.
This could be greatly simplified by running nix path-info --json on each path instead, which can further be simplified by reading the paths straight from the store-paths.xz file of each channel.
In an ideal world, I would be able to run this continuously in an environment where Nix itself is not actually available and create popularity data for each commit in nixpkgs.
Alright, given store-paths.xz (I'm using a downloaded copy here), all references can be counted in one pipeline like so:
cat store-paths.xz \
| xz -d \
| xargs nix path-info --json \
| jq -r '.[].references[]' \
| sed -r 's|/nix/store/[a-z0-9]+-||g' \
| sort \
| uniq -c \
| sort -n -r \
| awk '{ print "{\"" $2 "\":" $1 "}"}' \
| jq -c -s '. | add | with_entries(select(.value > 1))' \
> popcount-out
This is assuming that all store paths have been locally realised (I'm doing this on a machine that has pretty much the entire binary cache for current channels on disk). This can be done using:
cat store-paths.xz | xz -d | xargs nix build
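The counting stages in the middle of that pipeline can be tried on their own, without Nix or jq installed. This is only a minimal sketch: `count_references` is a hypothetical helper name, and it prints plain "count name" lines instead of the JSON object the full pipeline produces.

```shell
# Hypothetical helper (not part of popcount): reads store paths, one per
# line, on stdin and prints "<count> <name>" lines, most-referenced first.
count_references() {
  # Strip the "/nix/store/<hash>-" prefix so that different builds of the
  # same package name are counted together, as in the pipeline above.
  sed -E 's|^/nix/store/[a-z0-9]+-||' \
    | sort \
    | uniq -c \
    | awk '{ print $1, $2 }' \
    | sort -k1,1nr -k2,2
}
```

Feeding it a handful of reference paths (for example the output of the `jq -r '.[].references[]'` stage) shows which names would end up with the highest counts.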
Reference data is also contained in just the narinfo files in the cache ... hmmm.
Getting somewhere with this ... I have a branch (not yet pushed) with an alternative popcount implementation that just fetches narinfo files. This does a run over an entire channel in minutes (or in the case of warm edge caches, seconds) rather than hours.
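For illustration, here is a rough sketch of how references could be pulled out of narinfo files without Nix. It is not the code from the unpushed branch. It relies on narinfo files being plain "Key: value" text with a `References:` line that lists store path basenames separated by spaces. The helper names `narinfo_references` and `narinfo_url` are hypothetical, and the default cache URL `https://cache.nixos.org` is an assumption that should be replaced by whichever binary cache serves the channel.

```shell
# Hypothetical helper: reads one narinfo file on stdin and prints the name
# part (hash prefix stripped) of every entry on its "References:" line.
narinfo_references() {
  awk '/^References:/ { for (i = 2; i <= NF; i++) print $i }' \
    | sed -E 's|^[a-z0-9]+-||'
}

# Hypothetical helper: maps a store path to its narinfo URL.
# $1: store path, $2: cache base URL (assumed default: cache.nixos.org)
narinfo_url() {
  base=$(basename "$1")
  # The narinfo file is named after the hash part of the store path.
  printf '%s/%s.narinfo\n' "${2:-https://cache.nixos.org}" "${base%%-*}"
}

# Possible usage (needs network access, so it is left commented out):
#   xz -d < store-paths.xz \
#     | while read -r path; do
#         curl -sf "$(narinfo_url "$path")" | narinfo_references
#       done \
#     | sort | uniq -c | sort -n -r
```

Fetching sequentially like this would be slow for a whole channel. Running the requests in parallel is the obvious next step, but that is left out to keep the sketch short.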
With this setup it's probably going to be a lot more feasible to generate popularity data for every single channel update, instead of just every now and then when I run the script.