NixOS/nixpkgs

builds started failing on Hydra's new hash-named x86 machines

Closed this issue · 11 comments

A couple of days ago, i686 NixOS tests started failing consistently, e.g. https://hydra.nixos.org/build/95616203. I can't reproduce the problem locally, and apparently there's something different about those hash-named build machines (which may have been added, or at least changed, around that time).

i686 tests aren't too important nowadays, I suppose, but we could at least do something simple, such as removing the i686 platform tag from these machines.
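For context, each entry in a Nix remote-builders machines file lists the systems that machine may build for, so dropping the tag would amount to removing `i686-linux` from that field. A sketch (host name, key path, and capacities are placeholders, not the real Hydra config):

```
# /etc/nix/machines -- sketch only
# before: ssh://nixbuild@builder  x86_64-linux,i686-linux  /etc/nix/builder-key  8 1 kvm
ssh://nixbuild@builder  x86_64-linux  /etc/nix/builder-key  8 1 kvm
```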

There's another problem, likely related and much worse: those machines quite often fail an x86_64-linux build with:

checking for references to /build/ in /nix/store/gwawakcjhr48xgf04dhc16fkhw4xdnng-automake-1.15...
invalid ownership on file '/nix/store/gwawakcjhr48xgf04dhc16fkhw4xdnng-automake-1.15/bin/aclocal-1.15'

Again, I could never reproduce these.
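For what it's worth, the error is literally about the uid on the output files: after a build, the daemon stats everything in the store path and expects it to be owned by the right user. A minimal sketch of that kind of check (the temp file is just a stand-in for a store path, and the "expected" uid here is simply the current user):

```shell
# Sketch of the ownership check the error message refers to: compare a file's
# owning uid (via stat) against the expected uid. A filesystem layer that
# reports the wrong uid here would trigger "invalid ownership on file".
f=$(mktemp)
owner=$(stat -c '%u' "$f")   # uid that actually owns the file
expected=$(id -u)            # uid we expect (here: the current user)
if [ "$owner" = "$expected" ]; then
  echo "ownership ok (uid=$owner)"
else
  echo "invalid ownership on file '$f'"
fi
rm -f "$f"
```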

/cc @NixOS/rfc-steering-committee I don't really have an idea about whom to ping, but some of them certainly should know about those new Hydra machines.

Apparently this currently blocks larger rebuilds, even with multiple restart attempts, e.g. see this build. /cc @FRidh, who deals with staging-next a lot, so they're aware this thread exists.

FRidh commented

@grahamc have you seen this?

No, I haven't seen this.

These hash-named x86 machines have Intel Xeon Gold 5120 (Scalable family) CPUs and are transient, so it's a bit lucky that this exact build's machine still exists.

Unluckily, though, I can't log in to it:

$ ssh 1ff2e9d9-8922-4d04-9a35-8771a45f6fa5@sos.ams1.packet.net
[SOS Session Ready. Use ~? for help.]
[Note: You may need to press RETURN or Ctrl+L to get a prompt.]

nixos login: grahamc
^C
[grahamc@Petunia:~]$ ssh grahamc@147.75.85.145
^C

Evidently something very strange happened to it. I've since destroyed that server.

I picked up another one of the machines (b5b77143), which is alive, and boy did something stick out to me!

Look at this selection from top:

top - 10:38:34 up 8 days, 23:38,  1 user,  load average: 4.55, 4.39, 3.99
Tasks: 721 total,   2 running, 717 sleeping,   0 stopped,   2 zombie
%Cpu(s):  2.4 us,  3.0 sy,  0.0 ni, 94.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 385660.9 total,  64843.6 free, 112842.9 used, 207974.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 266543.8 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  809 root      20   0 8775624   2.7g    728 S 134.4   0.7   4421:02 unionfs
35199 nixbld10  20   0   15456  10492   2488 S  25.8   0.0   0:01.23 perl
35734 nixbld14  20   0   74632  49628  11960 R   8.3   0.0   0:00.25 cc1plus

How much would you bet unionfs is the problem? :)

These machines are spot instances and transient, so they never fully "install". In the x86 case, I accidentally left the / mount as a unionfs of the netboot image and a mount point on disk. I've killed these problematic x86 machines until I can fix this by moving the netboot'd / onto the actual disk, avoiding unionfs entirely.
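To spot a machine in this state, it's enough to look at what filesystem actually backs /. A one-line sketch using /proc/mounts; a result like `unionfs` or `overlay` means writes to the store go through a union layer:

```shell
# Print the filesystem type backing / (the last matching /proc/mounts entry
# wins, since later mounts shadow earlier ones).
awk '$2 == "/" { fstype = $3 } END { print fstype }' /proc/mounts
```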

Also, I'm sorry for breaking it, and not noticing sooner. Thank you @vcunat for tracking it down, and thank you @FRidh for the ping!

@grahamc: thanks for the quick reaction. I should've tried mentioning you directly; now I also know who knows these builders best.

The aarch64-linux ones also suffer from this, apparently: this build (step 6).

Yes, indeed. I have terminated those now as well. Same problem with unionfs. I'm traveling this week, which makes it a bit trickier to fix and re-launch these instances, but I'll give it a go!

Thank you for the heads up.

I have updated the filesystem layout, and / is now a ZFS filesystem of its own, with no layering. I'll be re-launching the aarch64 and x86 builders with this new mechanism.
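In NixOS terms, the resulting layout would be roughly a plain fileSystems entry like this (the pool and dataset names are my guesses, not the actual config):

```
# Sketch only: pool/dataset names are assumptions.
fileSystems."/" = {
  device = "rpool/root";   # a plain ZFS dataset, no union layering
  fsType = "zfs";
};
```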

FRidh commented

I think the issue has been resolved, so I'm closing it.