NixOS/nixpkgs

builds started failing on Hydra's new hash-named x86 machines

Closed this issue · 11 comments

A couple of days ago, i686 NixOS tests started failing consistently, e.g. https://hydra.nixos.org/build/95616203. I can't reproduce the problem locally, and apparently there's something different about those hash-named build machines (which may have been added, or at least changed, around that time).

i686 tests aren't too important nowadays, I suppose, but we could at least do something simple, such as removing the i686 platform tag from these machines.
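For context, each entry in a Nix remote-builders machines file lists the systems that machine may build for, so dropping the tag would amount to removing `i686-linux` from that field. A sketch (host name, key path, and capacities are placeholders, not the real Hydra config):

```
# /etc/nix/machines -- sketch only
# before: ssh://nixbuild@builder  x86_64-linux,i686-linux  /etc/nix/builder-key  8 1 kvm
ssh://nixbuild@builder  x86_64-linux  /etc/nix/builder-key  8 1 kvm
```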

There's another problem, likely related and much worse: those machines quite often fail an x86_64-linux build with:

checking for references to /build/ in /nix/store/gwawakcjhr48xgf04dhc16fkhw4xdnng-automake-1.15...
invalid ownership on file '/nix/store/gwawakcjhr48xgf04dhc16fkhw4xdnng-automake-1.15/bin/aclocal-1.15'

Again, I could never reproduce these.
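For what it's worth, the error is literally about the uid on the output files: after a build, the daemon stats everything in the store path and expects it to be owned by the right user. A minimal sketch of that kind of check (the temp file is just a stand-in for a store path, and the "expected" uid here is simply the current user):

```shell
# Sketch of the ownership check the error message refers to: compare a file's
# owning uid (via stat) against the expected uid. A filesystem layer that
# reports the wrong uid here would trigger "invalid ownership on file".
f=$(mktemp)
owner=$(stat -c '%u' "$f")   # uid that actually owns the file
expected=$(id -u)            # uid we expect (here: the current user)
if [ "$owner" = "$expected" ]; then
  echo "ownership ok (uid=$owner)"
else
  echo "invalid ownership on file '$f'"
fi
rm -f "$f"
```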

/cc @NixOS/rfc-steering-committee I don't really have an idea about whom to ping, but some of them certainly should know about those new Hydra machines.

Apparently this currently blocks larger rebuilds, even with multiple restart attempts, e.g. see this build. /cc @FRidh, who deals with staging-next a lot, so they're aware this thread exists.

FRidh commented

@grahamc have you seen this?

No, I haven't seen this.

These hash-named x86 machines have Intel Xeon Gold 5120 (Scalable family) CPUs and are transient, so it's a bit lucky that this exact build's machine still exists.

Unluckily, though, I can't log in to it:

$ ssh 1ff2e9d9-8922-4d04-9a35-8771a45f6fa5@sos.ams1.packet.net
[SOS Session Ready. Use ~? for help.]
[Note: You may need to press RETURN or Ctrl+L to get a prompt.]

nixos login: grahamc
^C
[grahamc@Petunia:~]$ ssh grahamc@147.75.85.145
^C

Evidently something very strange happened to it. I've since destroyed that server.

I picked up another one of the machines (b5b77143), which is alive, and boy did something stick out to me!

Look at this selection from top:

top - 10:38:34 up 8 days, 23:38,  1 user,  load average: 4.55, 4.39, 3.99
Tasks: 721 total,   2 running, 717 sleeping,   0 stopped,   2 zombie
%Cpu(s):  2.4 us,  3.0 sy,  0.0 ni, 94.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 385660.9 total,  64843.6 free, 112842.9 used, 207974.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 266543.8 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  809 root      20   0 8775624   2.7g    728 S 134.4   0.7   4421:02 unionfs
35199 nixbld10  20   0   15456  10492   2488 S  25.8   0.0   0:01.23 perl
35734 nixbld14  20   0   74632  49628  11960 R   8.3   0.0   0:00.25 cc1plus

How much would you bet unionfs is the problem? :)

These machines are spot instances and transient, so they never fully "install". In the x86 case, I accidentally left the / mount as a unionfs of the netboot image and a mount point on disk. I've killed these problematic x86 machines until I can fix this by moving the netboot'd / onto the actual disk, avoiding unionfs entirely.
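To spot a machine in this state, it's enough to look at what filesystem actually backs /. A one-line sketch using /proc/mounts; a result like `unionfs` or `overlay` means writes to the store go through a union layer:

```shell
# Print the filesystem type backing / (the last matching /proc/mounts entry
# wins, since later mounts shadow earlier ones).
awk '$2 == "/" { fstype = $3 } END { print fstype }' /proc/mounts
```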

Also, I'm sorry for breaking it, and not noticing sooner. Thank you @vcunat for tracking it down, and thank you @FRidh for the ping!

@grahamc: thanks for the quick reaction. I should've tried mentioning you directly; now I also know who knows these builders best.

The aarch64-linux ones also suffer from this, apparently: this build (step 6).

Yes, indeed. I have terminated those now as well. Same problem with unionfs. I'm traveling this week, which makes it a bit trickier to fix and re-launch these instances, but I'll give it a go!

Thank you for the heads up.

I have updated the filesystem layout, and / is now a ZFS filesystem of its own, with no layering. I'll be re-launching the aarch64 and x86 builders with this new mechanism.
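In NixOS terms, the resulting layout would be roughly a plain fileSystems entry like this (the pool and dataset names are my guesses, not the actual config):

```
# Sketch only: pool/dataset names are assumptions.
fileSystems."/" = {
  device = "rpool/root";   # a plain ZFS dataset, no union layering
  fsType = "zfs";
};
```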

FRidh commented

I think the issue has been resolved, so I'm closing it.