NixOS/nix

Builder is sometimes unexpectedly killed

Opened this issue · 23 comments

nlewo commented

Nix sometimes fails to build the expression

with import <nix/config.nix>;

{ simple = builtins.derivation {
    name = "simple";
    system = "x86_64-linux";
    PATH = coreutils;
    builder = shell;
    args = ["-c" "touch $out"];
  };
}

Generally, the build works but for ~1% of attempts, it fails with

builder for '/nix/store/94yfdqk4r6vmja7qhqkyy0hl25bvvfjj-simple.drv' failed due to signal 9 (Killed)

I can reproduce on my laptop (nix 2.0) and in a vm (nix 2.0.2) with:

$ nix-build -E 'with import <nix/config.nix>; { simple = builtins.derivation {name="simple";system="x86_64-linux";PATH=coreutils;builder=shell;args=["-c" "touch $out"];};}'

$ i=0; while nix-build -E 'with import <nix/config.nix>; { simple = builtins.derivation {name="simple";system="x86_64-linux";PATH=coreutils;builder=shell;args=["-c" "touch $out"];};}' --check; do echo $i; i=$(($i+1)); done

Note I'm not able to reproduce if a delay is introduced. The following expression has been successfully built 6000 times:

with import <nix/config.nix>;

{ simple = builtins.derivation {
    name = "simple";
    system = "x86_64-linux";
    PATH = coreutils;
    builder = shell;
    args = ["-c" "sleep 0.5; touch $out"];
  };
}

Are you also able to reproduce?

I am not able to reproduce, probably due to my current environment being a bit unusual.

Are you able to capture backtraces or the like? Any details would be useful, thanks! :)

I could not replicate this (I went to 3000).

You should include the hardware you used to replicate this as well as details of the configuration like kernel version and the file system you are using.

@nlewo Interesting find, though.

nlewo commented

@coretemp what version of Nix are you using? I'm not able to reproduce with Nix 1.11.16.
@coretemp Kernel 4.9.86 and ext4 FS.

@dtzWill I'm not able to get any backtraces and logs are not really useful.

The nix-store -vvv --realize logs when it fails to build:

evaluating file '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs/derivation.nix'
evaluating attribute 'simple'
processing attribute 'PATH'
resolved search path element '/nix/var/nix/profiles/per-user/root/channels' to '/nix/var/nix/profiles/per-user/root/channels'
resolved search path element '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs' to '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs'
evaluating file '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs/config.nix'
processing attribute 'args'
processing attribute 'builder'
processing attribute 'name'
processing attribute 'system'
locking this thread to CPU 3
acquiring global GC lock '/nix/var/nix/gc.lock'
acquiring read lock on '/nix/var/nix/temproots/20175'
acquiring write lock on '/nix/var/nix/temproots/20175'
downgrading to read lock on '/nix/var/nix/temproots/20175'
instantiated 'simple' -> '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv'
querying info about missing paths...
starting pool of 3 threads
querying info about missing paths...
starting pool of 3 threads
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': created
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
entered goal loop
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': init
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': created
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': init
acquiring write lock on '/nix/var/nix/temproots/20175'
downgrading to read lock on '/nix/var/nix/temproots/20175'
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': done
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': waitee 'substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv'' done; 0 left
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': goal destroyed
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': loading derivation
acquiring write lock on '/nix/var/nix/temproots/20175'
downgrading to read lock on '/nix/var/nix/temproots/20175'
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': have derivation
acquiring write lock on '/nix/var/nix/temproots/20175'
downgrading to read lock on '/nix/var/nix/temproots/20175'
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': all outputs substituted (maybe)
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': all inputs realised
building path '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple'
added input paths
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': trying to build
locking path '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple'
lock acquired on '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple.lock'
found build user 'nixbld1'
found build user 'nixbld10'
found build user 'nixbld11'
found build user 'nixbld12'
found build user 'nixbld13'
found build user 'nixbld14'
found build user 'nixbld15'
found build user 'nixbld16'
found build user 'nixbld17'
found build user 'nixbld18'
found build user 'nixbld19'
found build user 'nixbld2'
found build user 'nixbld20'
found build user 'nixbld21'
found build user 'nixbld22'
found build user 'nixbld23'
found build user 'nixbld24'
found build user 'nixbld25'
found build user 'nixbld26'
found build user 'nixbld27'
found build user 'nixbld28'
found build user 'nixbld29'
found build user 'nixbld3'
found build user 'nixbld30'
found build user 'nixbld31'
found build user 'nixbld32'
found build user 'nixbld4'
found build user 'nixbld5'
found build user 'nixbld6'
found build user 'nixbld7'
found build user 'nixbld8'
found build user 'nixbld9'
trying user 'nixbld1'
killing all processes running under uid '30001'
executing builder '/nix/store/zqh3l3lyw32q1ayb15bnvg9f24j5v2p0-bash-4.4-p12/bin/bash'
closing leaked FD 3
closing leaked FD 4
closing leaked FD 5
closing leaked FD 6
closing leaked FD 7
closing leaked FD 8
closing leaked FD 9
closing leaked FD 10
closing leaked FD 11
closing leaked FD 12
closing leaked FD 13
closing leaked FD 14
checking outputs of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv'...
waiting for children
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': got EOF
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': build done
killing process 20178
builder process for '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv' finished
killing all processes running under uid '30001'
builder for '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv' failed due to signal 9 (Killed)
lock released on '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple.lock'
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': done
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': goal destroyed
error: build of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv' failed

The logs when it succeeds:

evaluating file '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs/derivation.nix'
evaluating attribute 'simple'
processing attribute 'PATH'
resolved search path element '/nix/var/nix/profiles/per-user/root/channels' to '/nix/var/nix/profiles/per-user/root/channels'
resolved search path element '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs' to '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs'
evaluating file '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs/config.nix'
processing attribute 'args'
processing attribute 'builder'
processing attribute 'name'
processing attribute 'system'
locking this thread to CPU 3
acquiring global GC lock '/nix/var/nix/gc.lock'
acquiring read lock on '/nix/var/nix/temproots/20165'
acquiring write lock on '/nix/var/nix/temproots/20165'
downgrading to read lock on '/nix/var/nix/temproots/20165'
instantiated 'simple' -> '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv'
querying info about missing paths...
starting pool of 3 threads
querying info about missing paths...
starting pool of 3 threads
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': created
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
entered goal loop
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': init
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': created
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': init
acquiring write lock on '/nix/var/nix/temproots/20165'
downgrading to read lock on '/nix/var/nix/temproots/20165'
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': done
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': waitee 'substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv'' done; 0 left
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': goal destroyed
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': loading derivation
acquiring write lock on '/nix/var/nix/temproots/20165'
downgrading to read lock on '/nix/var/nix/temproots/20165'
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': have derivation
acquiring write lock on '/nix/var/nix/temproots/20165'
downgrading to read lock on '/nix/var/nix/temproots/20165'
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': all outputs substituted (maybe)
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': all inputs realised
building path '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple'
added input paths
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': trying to build
locking path '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple'
lock acquired on '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple.lock'
found build user 'nixbld1'
found build user 'nixbld10'
found build user 'nixbld11'
found build user 'nixbld12'
found build user 'nixbld13'
found build user 'nixbld14'
found build user 'nixbld15'
found build user 'nixbld16'
found build user 'nixbld17'
found build user 'nixbld18'
found build user 'nixbld19'
found build user 'nixbld2'
found build user 'nixbld20'
found build user 'nixbld21'
found build user 'nixbld22'
found build user 'nixbld23'
found build user 'nixbld24'
found build user 'nixbld25'
found build user 'nixbld26'
found build user 'nixbld27'
found build user 'nixbld28'
found build user 'nixbld29'
found build user 'nixbld3'
found build user 'nixbld30'
found build user 'nixbld31'
found build user 'nixbld32'
found build user 'nixbld4'
found build user 'nixbld5'
found build user 'nixbld6'
found build user 'nixbld7'
found build user 'nixbld8'
found build user 'nixbld9'
trying user 'nixbld1'
killing all processes running under uid '30001'
executing builder '/nix/store/zqh3l3lyw32q1ayb15bnvg9f24j5v2p0-bash-4.4-p12/bin/bash'
closing leaked FD 3
closing leaked FD 4
closing leaked FD 5
closing leaked FD 6
closing leaked FD 7
closing leaked FD 8
closing leaked FD 9
closing leaked FD 10
closing leaked FD 11
closing leaked FD 12
closing leaked FD 13
closing leaked FD 14
checking outputs of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv'...
waiting for children
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': got EOF
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': build done
killing process 20168
builder process for '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv' finished
killing all processes running under uid '30001'
warning: rewriting hashes in '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple'; cross fingers
scanning for references inside '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple'
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': done
lock released on '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple.lock'
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': goal destroyed
acquiring global GC lock '/nix/var/nix/gc.lock'
/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple

Note also the Nix expression can not be built if sandbox is enabled.

I also use ext4.

I reproduced the issue 3 times (56th iteration, 96th iteration, 994th iteration).

On Archlinux, nixpkgs-18.09pre134746.9430efbe49d

evaluating file '/nix/store/lw4c9f9x4nnhlpby2q8sjkvrcgphnym7-nix-2.0/share/nix/corepkgs/derivation.nix'
evaluating attribute 'simple'
resolved search path element '/nix/store/lw4c9f9x4nnhlpby2q8sjkvrcgphnym7-nix-2.0/share/nix/corepkgs' to '/nix/store/lw4c9f9x4nnhlpby2q8sjkvrcgphnym7-nix-2.0/share/nix/corepkgs'
evaluating file '/nix/store/lw4c9f9x4nnhlpby2q8sjkvrcgphnym7-nix-2.0/share/nix/corepkgs/config.nix'
acquiring global GC lock '/nix/var/nix/gc.lock'
acquiring read lock on '/nix/var/nix/temproots/2776'
acquiring write lock on '/nix/var/nix/temproots/2776'
downgrading to read lock on '/nix/var/nix/temproots/2776'
instantiated 'simple' -> '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv'
querying info about missing paths...
starting pool of 3 threads
querying info about missing paths...
starting pool of 3 threads
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': created
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': woken up
entered goal loop
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': init
substitution of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': created
substitution of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': woken up
substitution of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': init
acquiring write lock on '/nix/var/nix/temproots/2776'
downgrading to read lock on '/nix/var/nix/temproots/2776'
substitution of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': done
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': waitee 'substitution of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv'' done; 0 left
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': woken up
substitution of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': goal destroyed
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': loading derivation
acquiring write lock on '/nix/var/nix/temproots/2776'
downgrading to read lock on '/nix/var/nix/temproots/2776'
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': have derivation
acquiring write lock on '/nix/var/nix/temproots/2776'
downgrading to read lock on '/nix/var/nix/temproots/2776'
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': all outputs substituted (maybe)
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': all inputs realised
building path '/nix/store/v7gvm5w920zr66385aw8y2l3afp3d3kx-simple'
added input paths 
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': woken up
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': trying to build
locking path '/nix/store/v7gvm5w920zr66385aw8y2l3afp3d3kx-simple'
lock acquired on '/nix/store/v7gvm5w920zr66385aw8y2l3afp3d3kx-simple.lock'
executing builder '/nix/store/q1g0rl8zfmz7r371fp5p42p4acmv297d-bash-4.4-p19/bin/bash'
closing leaked FD 3
closing leaked FD 4
closing leaked FD 5
closing leaked FD 6
closing leaked FD 7
closing leaked FD 8
closing leaked FD 9
closing leaked FD 10
closing leaked FD 11
closing leaked FD 12
checking outputs of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv'...
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': got EOF
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': woken up
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': build done
killing process 2781
builder process for '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv' finished
builder for '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv' failed due to signal 9 (Killed)
lock released on '/nix/store/v7gvm5w920zr66385aw8y2l3afp3d3kx-simple.lock'
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': done
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': goal destroyed
error: build of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv' failed

Note also the Nix expression can not be built if sandbox is enabled.

Just curious-- why not? Needed paths aren't available?

nlewo commented

It seems the worker gets the EOF on the child logger and sends a kill signal before the child process reach the terminated state. I don't know yet how to fix this...

I could also reproduce by manually building the master of Nix. I run it without any deamon and it takes more iterations (about 2000) to hit the bug.

@dtzWill With sandboxing, it fails with

...
...
checking outputs of '/nix/store/2nh58nwgxp376phzw8vlvmfdhnl8ky30-simple.drv'...
waiting for children
building of '/nix/store/2nh58nwgxp376phzw8vlvmfdhnl8ky30-simple.drv': read 146 bytes
�while setting up the build environment: executing '/nix/store/zqh3l3lyw32q1ayb15bnvg9f24j5v2p0-bash-4.4-p12/bin/bash': No such file or directory
waiting for children
building of '/nix/store/2nh58nwgxp376phzw8vlvmfdhnl8ky30-simple.drv': got EOF
...
...
nlewo commented

@edolstra I think this has been introduced in 21948de. If I remove the code that kill the process, I never hit this issue.
I don't really know what is happening... maybe the kill signal could be emitted after FDs are removed (in do_exit) and before the process reaches the terminated state: even if the builder succeeds, the builder process is then known as killed.

This appears to be due to coreutils, since it closes stdout/stderr prior to quitting (see https://github.com/coreutils/gnulib/commits/master/lib/closeout.c).

Probably if you change touch $out to touch $out; exit 0, this problem won't occur (since then bash won't do a tail-call of touch).

For example:

$ strace bash -c 'touch x'
...
close(0)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?

whereas

$ strace bash -c 'touch x; exit 0'
...
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f43e2441250) = 28751
...
wait4(-1, 0x7fffc3ac6bd0, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[]})                 = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(0)                           = ?

Good find, but what are you saying. Is it that coreutils would need to be fixed or Nix? I have the impression that what coreutils does is correct, which would suggest that Nix still needs to be fixed.

Nobody is incorrect here. Nix just requires that builders don't close stdout/stderr.

nlewo commented

@edolstra yeah, good catch. Thanks!
I confirme touch $out; exit 0 works well. In fact, I've already observed this (and used it as workaround) but I didn't have understood why it worked...

The point is that it has been really hard to troubleshoot (for me at least:/) and it would be nice to avoid this kind of issue.

This issue comes from way the worker knows the builder process is finished. The git history also confirms this has been a long story, with some other issues (builder which never termintaes).
Why are we waiting for closed file descriptor to know the child is terminated? Just for sake of simplicity?

From the manual:

So we have to build a package. Building something from other stuff is called a derivation in Nix (as 
opposed to sources, which are built by humans instead of computers). We perform a derivation by calling stdenv.mkDerivation. mkDerivation is a function provided by stdenv that builds a package from a set of attributes. A set is just a list of key/value pairs where each key is a string and each value is an arbitrary Nix expression. They take the general form { name1 = expr1; ... nameN = exprN; }.

The manual states arbitrary. Additionally, the manual does not specify any conditions in which a builder should not work when a build process would work outside of Nix (which is the situation we have here).

Please explain how given the above argument you still think Nix does not have a bug here.

I suddenly experience random sigkills after an upgrade to nix 2.3. Not sure if that's the same issue. I also consistenly cannot build the expression mentioned in the beginning of this thread.

Seeing it all of a sudden since yesterday. nix-shell with cmake and make (doing what the derivation says) works fine.

building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': read 34 bytes
[ 0%] Built target oeedger8r-cpp
waiting for children
building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': read 89 bytes
make[1]: Leaving directory '/nix/store/i1cdkbcab2wag039a9wb4xfb5qib0mim-openenclave-sdk'
waiting for children
building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': read 38 bytes
make: *** [Makefile:182: all] Error 2
waiting for children
building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': got EOF
building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': woken up
building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': build done
killing process 29772
builder process for '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv' finished
builder for '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv' failed with exit code 2
lock released on '/nix/store/i1cdkbcab2wag039a9wb4xfb5qib0mim-openenclave-sdk.lock'
building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': done
building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': goal destroyed
error: build of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv' failed

stale commented

I marked this as stale due to inactivity. → More info

I'm seeing this on nix 2.11.

I'm seeing this on remote builds from my CI (from nix flake check). Getting "unexpected Nix daemon error: error: writing to file: Broken pipe" on the nix daemon logs. opensshd logs don't show anything interesting (just the builder user authentication when logging in). It's generally a different build that gets killed with SIGKILL, not a specific derivation.

Are there any flags to that I can use to produce more verbose logs which might be able to help figure out what is sending the SIGKILL?

Edit: I built it with -vvv and noticed that it is the same issue as others mentioned earlier in the thread, where the worker is receiving an EOF from the builder

srid commented

FWIW, I tried the latest master of Nix (thinking that #8049 may have fixed it), but the problem still remains.

I believe we may be seeing this very often in GHC CI, particularly when building Haskell packages.

mupdt commented

We are also experiencing this once in a while in our CI. A very rough estimate is that it happens a bit more than 1 in 100 times.

We just hit it today. This is the derivation we're building:

pkgs.runCommand "foo" {} ''
  mkdir -p $out/bin
  echo "#!${pkgs.bash}/bin/bash" > $out/bin/foo
  echo echo Hello World [we put a UUID here to make sure builds happen all the time] > $out/bin/foo
  chmod u+x $out/bin/foo
  # We're doing exit 0 here to avoid issue https://github.com/NixOS/nix/issues/2176
  exit 0
''

It looks like the exit 0 didn't help.

Nobody is incorrect here. Nix just requires that builders don't close stdout/stderr.

As a consequence of this decision we also have

We have two potential events to indicate termination

  • outputs have closed
  • process has exited

We should wait for both and have a timer to catch both edge cases where one happens but the other takes too long, as also mentioned in the other issue.