Recently, I've noted head.hackage CI jobs failing due to some variation of the following error:
$ nix run -f ./ci -c curl -L "$GHC_TARBALL" > ghc.tar.xz
trace: Warning: `stdenv.lib` is deprecated and will be removed in the next release. Please use `lib` instead. For more information see https://github.com/NixOS/nixpkgs/issues/108938
builder for '/nix/store/fv8zgvlrq596axmxwwlv63hxn28185fg-head-hackage-ci-0.1.0.0.drv' failed due to signal 9 (Killed); last 10 log lines:
  setupCompilerEnvironmentPhase
  Build with /nix/store/qabwi7rkqwppxbp575vq7khy6v2vqv81-ghc-8.10.4.
  unpacking sources
  unpacking source archive /nix/store/6cggp66y182f88lcz0kf9mzpba0mx1h9-ci
  source root is ci
  patching sources
  compileBuildDriverPhase
  setupCompileFlags: -package-db=/tmp/nix-build-head-hackage-ci-0.1.0.0.drv-0/setup-package.conf.d -j16 +RTS -A64M -RTS -threaded -rtsopts
  [1 of 1] Compiling Main ( Setup.hs, /tmp/nix-build-head-hackage-ci-0.1.0.0.drv-0/Main.o )
  Linking Setup ...
cannot build derivation '/nix/store/jj1kfsb60c3zqq2jm7bmablawdjqf7dg-repo.drv': 1 dependencies couldn't be built
error: build of '/nix/store/jj1kfsb60c3zqq2jm7bmablawdjqf7dg-repo.drv' failed
Note that it's not always head-hackage-ci that fails to build. For instance, this one fails when building optparse-applicative, but otherwise the error appears quite similar.
This appears to be somewhat nondeterministic, as it happens on some CI jobs but not on others.
Hmm, it sounds like we may be running out of memory, although this is hard to believe given that we are just building Setup. Are these failures always in the 9.0 job?
$ nix run -f ./ci -c \ # collapsed multi-line command
builder for '/nix/store/j8k0b4qdk0znxqicdca1j385zh4fzgb8-head-hackage-ci-0.1.0.0.drv' failed due to signal 9 (Killed); last 10 log lines:
  Using strip version 2.35 found on system at: /nix/store/276p5pgfjf9ia9jljgkbmb7zky73p5ag-gcc-wrapper-10.3.0/bin/strip
  Using tar found on system at: /nix/store/1v0ryqy409fiwjgj0186v74clp44l04r-gnutar-1.34/bin/tar
  No uhc found
  building
  Preprocessing executable 'head-hackage-ci' for head-hackage-ci-0.1.0.0..
  Building executable 'head-hackage-ci' for head-hackage-ci-0.1.0.0..
  [1 of 6] Compiling LatestVersion ( LatestVersion.hs, dist/build/head-hackage-ci/head-hackage-ci-tmp/LatestVersion.o, dist/build/head-hackage-ci/head-hackage-ci-tmp/LatestVersion.dyn_o )
  [2 of 6] Compiling Types ( Types.hs, dist/build/head-hackage-ci/head-hackage-ci-tmp/Types.o, dist/build/head-hackage-ci/head-hackage-ci-tmp/Types.dyn_o )
cannot build derivation '/nix/store/1c0dcvg53bp950dxwbcx1jd5zzgss547-repo.drv': 1 dependencies couldn't be built
error: build of '/nix/store/1c0dcvg53bp950dxwbcx1jd5zzgss547-repo.drv' failed
Interestingly, the logs stop in yet another location this time.
Yet another one here. I've also reported this to the intermittent CI failures spreadsheet in ghc#21591, as it is extremely tiresome having to babysit every job and restart it, given what feels like 50/50 odds of it randomly dying due to this issue.
I spotted another occurrence of this in ghc-debug's CI here:
[ 1 of 18] Compiling Network.HTTP.Base64 ( Network/HTTP/Base64.hs, dist/build/Network/HTTP/Base64.p_o )
error: builder for '/nix/store/80kxc1395rnq4wa04ss61c0dx8jvq3xv-HTTP-4000.3.15.drv' failed due to signal 9 (Killed)
error: 1 dependencies of derivation '/nix/store/74svkafnlp4kql3yk4w0alhavyfnn5jf-cabal-install-3.4.0.0.drv' failed to build
@angerman determined that Docker simply sends kill -9 signals to containers when the system is under moderate load. We are unlikely to fix this, but the workaround is in place: my system for tracking spurious failures recognizes jobs that fail with a -9 and automatically restarts them. I'd suggest we accept that as the solution to this ticket (i.e., wontfix + workaround exists).
How exactly does the automatic restart workaround function? I've recently discovered a job that failed due to signal 9 (Killed) here, but it was not restarted automatically.
The system was added to the existing ghc-perf-import bot, which gets triggered for each job event. The bot is currently misconfigured, so it isn't doing any of its jobs. I will probably spin my failure-handling stuff into its own service soon.
@angerman this is happening on the zw3rk runners. (Sorry I haven't updated the graph above to use runner name instead of runner id—it's still on my list.) It's only happening on head.hackage jobs, so maybe there's something special about how they run.
Elusive as to why they happen ;-) Increasing load does increase them. If you can figure out why, that would be great! What is triggering the kill? It's not me personally.
@chreekat you have access to the machines. I'm at a loss as to what is going on and why the builds are being killed. I'm happy to make any changes to the configuration; I simply don't understand what's going on with docker here.
I've managed to reproduce this locally. I'm running on a 32 core AMD computer, which is somewhat busy/noisy.
❯ docker --version
Docker version 20.10.21, build v20.10.21
Here's what I ran:
docker run -it --rm nixos/nix:2.13.1
git clone https://gitlab.haskell.org/ghc/head.hackage.git
cd head.hackage
nix --extra-experimental-features nix-command\ flakes print-dev-env -j200 -L
# wait for some Haskell packages to start building
# then cancel the build
nix-collect-garbage -d # not sure if this is necessary
nix --extra-experimental-features nix-command\ flakes print-dev-env -j200 -L
# run the build again and it should fail this time. Or if not rinse and repeat
I'm pretty sure that this is being triggered by the extra preConfigurePhases added by the haskell generic builder, which in turn call "date +%s", which I can see in strace. Why this in particular breaks, though, and why only under docker, is a mystery.
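For reference, one way to look for those calls is to wrap the build in strace and filter for exec events; this is only a sketch of how the observation could be reproduced, assuming strace is available in the container and ptrace is permitted:

# trace exec calls made by the build and its forked builders, then look for
# the `date +%s` invocations coming out of the extra preConfigurePhases
strace -f -e trace=execve \
  nix --extra-experimental-features nix-command\ flakes print-dev-env -j200 -L \
  2>&1 | grep -F '"date"'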
Also, if it makes it any easier to decipher what's going on: the last build is of the SHA package. Stdout looks like this:
SHA> Preprocessing library for SHA-1.6.4.4..
SHA> Building library for SHA-1.6.4.4..
SHA> [1 of 1] Compiling Data.Digest.Pure.SHA ( src/Data/Digest/Pure/SHA.hs, dist/build/Data/Digest/Pure/SHA.o, dist/build/Data/Digest/Pure/SHA.dyn_o )
error: builder for '/nix/store/24sxc21gbhbfxdrpgfgj47rdrgnqsqaa-SHA-1.6.4.4.drv' failed due to signal 9 (Killed)
error: 1 dependencies of derivation '/nix/store/yzcjcdv8b6w0gjjjs904k7m2yvcgn136-authenticate-oauth-1.7.drv' failed to build
error: 1 dependencies of derivation '/nix/store/bh7lz8cn1rcp52a3d1x0hwd0swawnj79-wreq-0.5.4.2.drv' failed to build
error: 1 dependencies of derivation '/nix/store/2812d96ihn3nqsjdxf6v7qdgji1d4j4b-head-hackage-ci-0.1.0.0.drv' failed to build
error: 1 dependencies of derivation '/nix/store/l2d62qcfkmxnkvsl47w0vqfmcjgdc25f-repo.drv' failed to build
error: 1 dependencies of derivation '/nix/store/cnnqqv6vyfyshb0v0yw4ir9v5m0dizxk-head-hackage-build-env-env.drv' failed to build
One possible explanation is that the host is also running a nix daemon, which tries to clean up build sandboxes by killing all processes running under a given build uid. That also ends up killing processes in the Docker container that run under the same uid. I've checked that running pkill -u nixbld1 outside the container reproduces the behaviour we see.
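To make the collision concrete, here is roughly how it can be checked from the host while a container build is running (the nixbld names are the stock ones; <container> is a placeholder for the container's name or id):

# the host and the stock nixos/nix image both define nixbld1..nixbld32,
# typically with the same numeric uids
grep '^nixbld1:' /etc/passwd
docker exec <container> grep '^nixbld1:' /etc/passwd

# pids are global to the host kernel, so the container's builders show up here
# under whatever name the host maps their uid to
ps -eo pid,uid,user,args | grep nixbld

# a per-uid kill on the host (which is what the daemon's sandbox cleanup amounts
# to) therefore also kills the container's builder: signal 9 (Killed)
pkill -u nixbld1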
running a nix build in the container is not insulated from the host.
running a container with the same range of user ids as the host is dangerous.
running nix inside a container and outside a container (both choosing the same nixbld users) can result in the outer one terminating the inner one, because they ended up having the same numerical id?
So in essence, then: we should not be running Docker containers of NixOS (or other OSs with nix) on nix-enabled hosts?
I mean, if we are using nix anyway, the extra layer of Docker probably makes things worse in almost every respect; but this still seems to be a pretty significant limitation.
It also means that if any uids collide between Docker containers and the host, things are on a “good luck” basis?
I think all the options offered so far sound good!
Another possibility is just shifting the uids/gids over a bit in the container: say, instead of starting at 3000, start at 6000. All we need to ensure is that the two sets don't overlap, so I think that would be pretty safe.
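A rough sketch of that, assuming the shadow tools (userdel, groupdel, groupadd, useradd) are available in, or added to, the container image, and using the 6000 base above purely for illustration:

# recreate the image's nixbld build users and group at a higher,
# non-colliding uid/gid base
for i in $(seq 1 32); do userdel "nixbld$i"; done
groupdel nixbld
groupadd -g 6000 nixbld
for i in $(seq 1 32); do
  useradd -u $((6000 + i)) -g nixbld -G nixbld -M -N -r "nixbld$i"
done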
At work we do (1), which is quite nice because you get to share the same store. I think it's worth considering the security implications though (I guess we are trading off one type of sandboxing for another). I can help set this up. Here's an example config extracted from the docs/my work config: $5735
Actually, yeah. The implication of running nix as root on the ephemeral Docker image is pretty minor, I guess.
I think doing (3) would be nice eventually, so as to reduce complexity in the CI jobs. But right now it wouldn't hurt to just add --option build-users-group "" to the few jobs that use the nixos/nix images. (I mean, in GHC).
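For concreteness, a sketch of what such a job step might run inside the nixos/nix image (the build target at the end is a placeholder, not the actual CI installable):

# an empty build-users-group makes nix build as the invoking user (root in the
# throwaway container) instead of dropping to the nixbld* users, so the host
# daemon's per-uid cleanup can no longer reach these builds
nix --extra-experimental-features "nix-command flakes" \
    --option build-users-group "" \
    build .#ci   # placeholder installable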
This is now resolved for head.hackage and GHC thanks to changes to those repos' CI scripts. Ideally we would fix it in our CI config instead. So I opened ci-config#7 to track it.