Simplify popularity counting

Question

tazjin opened this issue 5 years ago · 2 comments

popcount currently uses a silly approach to create the popularity count for store paths, namely by evaluating every expression in nixpkgs in sequence and writing out its path info via structured attributes to $out.

This could be greatly simplified by running nix path-info --json on each path instead, which can further be simplified by reading straight from the store-paths.xz file of each channel.

In an ideal world I would be able to run this continuously in an environment where Nix itself is not actually available and create popularity data for each commit in nixpkgs.

Answer 1 · 2019-10-10T00:22:39.000Z

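Alright, given store-paths.xz (I'm using a downloaded copy here), all references can be counted in one pipeline like so. The result is a single JSON object mapping each referenced name (with the store hash stripped) to its reference count, keeping only names referenced more than once: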

cat store-paths.xz \
  | xz -d \
  | xargs nix path-info --json \
  | jq -r '.[].references[]' \
  | sed -r 's|/nix/store/[a-z0-9]+-||g' \
  | sort \
  | uniq -c \
  | sort -n -r \
  | awk '{ print "{\"" $2 "\":" $1 "}"}' \
  | jq -c -s '. | add | with_entries(select(.value > 1))' \
  > popcount-out

This assumes that all store paths have been realised locally (I'm doing this on a machine that has pretty much the entire binary cache for current channels on disk). If they haven't been, they can be realised first using:

cat store-paths.xz | xz -d | xargs nix build

Reference data is also contained in just the narinfo files in the cache ... hmmm.
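
To make the narinfo idea a bit more concrete: a narinfo is a small text file of "Key: value" lines, and its References line lists the referenced store paths as space-separated basenames (no /nix/store/ prefix). Below is a minimal sketch of pulling the names out of one. The function name narinfo_refs and the sample data are made up for illustration (the hashes are placeholders, not real store paths).

```shell
# Hypothetical helper: read one narinfo on stdin and print the name of
# each reference on its own line, with the leading store hash stripped.
narinfo_refs() {
  sed -n 's/^References: *//p' \
    | tr ' ' '\n' \
    | sed -e '/^$/d' -e 's/^[a-z0-9]*-//'
}

# Made-up sample narinfo; the hashes are placeholders.
sample='StorePath: /nix/store/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-hello-2.10
Compression: xz
References: bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb-glibc-2.27 cccccccccccccccccccccccccccccccc-bash-4.4'

printf '%s\n' "$sample" | narinfo_refs
```

The output is one name per line, so it should slot in front of the same sort | uniq -c stages as the pipeline above, without needing nix path-info or the first jq call.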

Answer 2 · 2019-10-30T23:25:40.000Z

Getting somewhere with this ... I have a branch (not yet pushed) with an alternative popcount implementation that just fetches narinfo files. This does a run over an entire channel in minutes (or in the case of warm edge caches, seconds) rather than hours.

With this setup it's probably going to be a lot more feasible to generate popularity data for every single channel update, instead of just every now and then when I run the script.
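
For anyone following along, here is a rough sketch of how a store path maps to its narinfo URL. This is not the code from the branch, just an illustration. It assumes the cache serves narinfo files at <cache>/<hash>.narinfo, which is how cache.nixos.org lays them out. The function name narinfo_url and the placeholder store path are made up.

```shell
# Hypothetical helper: map a store path to the URL of its narinfo file.
# The cache defaults to cache.nixos.org; pass a different base URL as $2.
narinfo_url() {
  base=${1#/nix/store/}   # drop the store prefix
  hash=${base%%-*}        # keep everything before the first "-"
  printf '%s/%s.narinfo\n' "${2:-https://cache.nixos.org}" "$hash"
}

# Placeholder store path, not a real one.
narinfo_url /nix/store/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-hello-2.10

# Untested sketch of a fetch loop (needs network access and curl):
#
# xz -dc store-paths.xz | while read -r path; do
#   curl -fsS "$(narinfo_url "$path")" | grep '^References:'
# done
```

Since this only needs xz and an HTTP client, it would also work in an environment where Nix is not installed.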