generate-programs-index crashes sometimes
vcunat opened this issue · 2 comments
Fortunately it doesn't seem to happen too often, so no channel has been blocked for more than a day by this so far.
Example:
PID: 1117916 (generate-progra)
UID: 497 (hydra-mirror)
GID: 65534 (nobody)
Signal: 11 (SEGV)
Timestamp: Mon 2021-11-29 13:42:37 CET (2h 10min ago)
Command Line: generate-programs-index /scratch/hydra-mirror/nixos-files.sqlite /scratch/hydra-mirror/release-nixos-21.11/nixos-21.11beta333771.8e6b3914626/unpack/nixos-21.11beta333771.8e6b3914626/programs.sqlite http://nix-cache.s3.amazonaws.com/ /scratch/hydra-mirror/release-nixos-21.11/nixos-21.11beta333771.8e6b3914626/store-paths /scratch/hydra-mirror/release-nixos-21.11/nixos-21.11beta333771.8e6b3914626/unpack/nixos-21.11beta333771.8e6b3914626/nixpkgs
Executable: /nix/store/n0mg1bi82a8rndk6nq1r1dqgblkavxbr-nixos-channel-native-programs/bin/generate-programs-index
Control Group: /system.slice/update-nixos-21.11.service
Unit: update-nixos-21.11.service
Slice: system.slice
Boot ID: 81eee5b9ce8b408c810c24c5fa1cb38b
Machine ID: d10f095896ae4cb0be9598b874b8ba07
Hostname: bastion
Storage: /var/lib/systemd/coredump/core.generate-progra.497.81eee5b9ce8b408c810c24c5fa1cb38b.1117916.1638189757000000.lz4 (inaccessible, truncated)
Message: Process 1117916 (generate-progra) of user 497 dumped core.
Unfortunately the coredumps don't seem usable, most likely because they are truncated to 2G (the backtrace contains only "??" frames).
TL;DR: I tried to recreate the crash locally, but I failed. @grahamc: I'd suggest configuring the service with a higher coredump size limit, so that the coredumps are actually usable for debugging.
First I tried the exact binaries downloaded from bastion. But when run with the original binary cache argument http://nix-cache.s3.amazonaws.com or with https://cache.nixos.org, the execution was either extremely slow for me or stalled completely, consuming basically no resources. (I'm not sure why.)
So instead I tried without a remote cache. I got the 0.5TB of paths locally and slightly modified the code to allow working with the local daemon store. That seemed to work OK, with 3-4 minutes runtime (if I put the output on tmpfs)... though the 18G RAM consumption was also slightly challenging. Still, I did about 70 runs without a single failure. Therefore it seems likely that the issue comes from the parts of the code working with the binary cache; more confidence about this might be gained by inspecting logs, but we probably want a usable coredump anyway.
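For anyone who wants to repeat this kind of experiment, the repeated runs could be scripted roughly as below. This is only a minimal sketch and not what was actually used here: the helper name run_repeatedly is made up for illustration, and it assumes a POSIX system, where Python reports a child killed by a signal as a negative return code.

```python
import signal
import subprocess


def run_repeatedly(cmd, runs):
    """Run cmd `runs` times and collect the runs that were killed by a signal.

    Returns a list of (run_number, signal_name) tuples; an empty list means
    no run died from a signal such as SIGSEGV.
    """
    crashes = []
    for i in range(1, runs + 1):
        # Output is discarded; only the exit status matters here.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # On POSIX, a negative return code -N means "terminated by signal N".
        if result.returncode < 0:
            crashes.append((i, signal.Signals(-result.returncode).name))
    return crashes
```

It would be called with the full generate-programs-index command line as a list of arguments, e.g. run_repeatedly(["generate-programs-index", ...], 70), with the arguments filled in from the coredump info above.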
And yes, it would be nice to push the currently used flake.lock, as it's clearly using veeery different versions. When I had to recompile the binary, I somehow managed to test locally with exactly the same nix commit and a similar nixpkgs commit (almost identical package versions), but it took me quite some time... and this use case is where flakes are supposed to shine.
This doesn't seem to cause trouble anymore, so I suppose it's OK-ish. Apparently easing the RAM pressure by using zram helped.