Recently, I've noticed head.hackage CI jobs failing with some variation of the following error:
```
$ nix run -f ./ci -c curl -L "$GHC_TARBALL" > ghc.tar.xz
trace: Warning: `stdenv.lib` is deprecated and will be removed in the next release. Please use `lib` instead. For more information see https://github.com/NixOS/nixpkgs/issues/108938
builder for '/nix/store/fv8zgvlrq596axmxwwlv63hxn28185fg-head-hackage-ci-0.1.0.0.drv' failed due to signal 9 (Killed); last 10 log lines:
  setupCompilerEnvironmentPhase
  Build with /nix/store/qabwi7rkqwppxbp575vq7khy6v2vqv81-ghc-8.10.4.
  unpacking sources
  unpacking source archive /nix/store/6cggp66y182f88lcz0kf9mzpba0mx1h9-ci
  source root is ci
  patching sources
  compileBuildDriverPhase
  setupCompileFlags: -package-db=/tmp/nix-build-head-hackage-ci-0.1.0.0.drv-0/setup-package.conf.d -j16 +RTS -A64M -RTS -threaded -rtsopts
  [1 of 1] Compiling Main ( Setup.hs, /tmp/nix-build-head-hackage-ci-0.1.0.0.drv-0/Main.o )
  Linking Setup ...
cannot build derivation '/nix/store/jj1kfsb60c3zqq2jm7bmablawdjqf7dg-repo.drv': 1 dependencies couldn't be built
error: build of '/nix/store/jj1kfsb60c3zqq2jm7bmablawdjqf7dg-repo.drv' failed
```
Note that it's not always head-hackage-ci that fails to build. For instance, this one fails when building optparse-applicative, but otherwise the error appears quite similar.
This appears to be somewhat nondeterministic, as it happens on some CI jobs and not on others.
Hmm, it sounds like we may be running out of memory, although this is hard to believe given that we are just building Setup. Are these failures always in the 9.0 job?
```
$ nix run -f ./ci -c \ # collapsed multi-line command
builder for '/nix/store/j8k0b4qdk0znxqicdca1j385zh4fzgb8-head-hackage-ci-0.1.0.0.drv' failed due to signal 9 (Killed); last 10 log lines:
  Using strip version 2.35 found on system at: /nix/store/276p5pgfjf9ia9jljgkbmb7zky73p5ag-gcc-wrapper-10.3.0/bin/strip
  Using tar found on system at: /nix/store/1v0ryqy409fiwjgj0186v74clp44l04r-gnutar-1.34/bin/tar
  No uhc found
  building
  Preprocessing executable 'head-hackage-ci' for head-hackage-ci-0.1.0.0..
  Building executable 'head-hackage-ci' for head-hackage-ci-0.1.0.0..
  [1 of 6] Compiling LatestVersion ( LatestVersion.hs, dist/build/head-hackage-ci/head-hackage-ci-tmp/LatestVersion.o, dist/build/head-hackage-ci/head-hackage-ci-tmp/LatestVersion.dyn_o )
  [2 of 6] Compiling Types ( Types.hs, dist/build/head-hackage-ci/head-hackage-ci-tmp/Types.o, dist/build/head-hackage-ci/head-hackage-ci-tmp/Types.dyn_o )
cannot build derivation '/nix/store/1c0dcvg53bp950dxwbcx1jd5zzgss547-repo.drv': 1 dependencies couldn't be built
error: build of '/nix/store/1c0dcvg53bp950dxwbcx1jd5zzgss547-repo.drv' failed
```
Interestingly, the logs stop in yet another location this time.
Yet another one here. I've also reported this to the intermittent CI failures spreadsheet in ghc#21591, as it is extremely tiresome having to babysit every job and restart it, given what feels like 50/50 odds of it randomly dying from this issue.
I spotted another occurrence of this in ghc-debug's CI here:
```
[ 1 of 18] Compiling Network.HTTP.Base64 ( Network/HTTP/Base64.hs, dist/build/Network/HTTP/Base64.p_o )
error: builder for '/nix/store/80kxc1395rnq4wa04ss61c0dx8jvq3xv-HTTP-4000.3.15.drv' failed due to signal 9 (Killed)
error: 1 dependencies of derivation '/nix/store/74svkafnlp4kql3yk4w0alhavyfnn5jf-cabal-install-3.4.0.0.drv' failed to build
```
@angerman determined that Docker simply sends kill -9 signals to containers when the system is under moderate load. We are unlikely to fix this, but the workaround is in place: my system for tracking spurious failures recognizes jobs that fail with a -9 and automatically restarts them. I'd suggest we accept that as the solution to this ticket (i.e., wontfix + workaround exists).
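For anyone curious what the workaround amounts to, here is a minimal sketch of the "retry jobs killed by signal 9" idea against the GitLab API. The module name, the hard-coded instance URL, and the exact trace pattern are illustrative assumptions; this is not the bot's actual code, just the shape of it.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Hypothetical sketch of the "retry on signal 9" workaround; not the bot's actual code.
module RetryKilledJob (retryIfKilled) where

import qualified Data.ByteString.Char8 as BS
import           Network.HTTP.Simple

-- Assumed GitLab instance; adjust for wherever the jobs actually run.
gitlabApi :: String
gitlabApi = "https://gitlab.haskell.org/api/v4"

-- | Fetch a job's trace and, if the builder was SIGKILLed, ask GitLab to retry the job.
-- Returns True when a retry was issued.
retryIfKilled :: BS.ByteString  -- ^ access token with api scope
              -> Int            -- ^ project id
              -> Int            -- ^ job id
              -> IO Bool
retryIfKilled token projectId jobId = do
  let jobUrl path = gitlabApi <> "/projects/" <> show projectId
                    <> "/jobs/" <> show jobId <> path
      authorize   = setRequestHeader "PRIVATE-TOKEN" [token]
  traceResp <- httpBS . authorize =<< parseRequest (jobUrl "/trace")
  if "signal 9 (Killed)" `BS.isInfixOf` getResponseBody traceResp
    then do
      _ <- httpBS . setRequestMethod "POST" . authorize =<< parseRequest (jobUrl "/retry")
      pure True
    else pure False
```

A retry issued through the API should show up in the pipeline just like a manual restart from the UI, so nobody has to babysit the job.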
How exactly does the automatic restart workaround function? I've recently discovered a job that failed due to signal 9 (Killed) here, but it was not restarted automatically.
The system was added to the existing ghc-perf-import bot, which gets triggered for each job event. The bot is currently misconfigured, so it isn't doing any of its jobs. I will probably spin my failure-handling logic out into its own service soon.
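For context, the "triggered for each job event" part boils down to handling GitLab's job webhook. A minimal sketch of such a handler follows, assuming scotty and the retryIfKilled sketch from the earlier comment; the endpoint path and environment variable name are made up, and the payload fields are the ones GitLab job events normally carry, so treat this as illustrative rather than the bot's real code.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Hypothetical sketch of the per-job-event trigger; not the actual ghc-perf-import code.
module JobEventHook (main) where

import           Control.Monad.IO.Class (liftIO)
import           Data.Aeson (Value, withObject, (.:))
import           Data.Aeson.Types (parseMaybe)
import qualified Data.ByteString.Char8 as BS
import           System.Environment (getEnv)
import           Web.Scotty

import           RetryKilledJob (retryIfKilled)  -- the sketch from the earlier comment

main :: IO ()
main = do
  token <- BS.pack <$> getEnv "GITLAB_TOKEN"   -- assumed environment variable
  scotty 8080 $
    post "/job-event" $ do
      payload <- jsonData :: ActionM Value
      -- GitLab job-event payloads carry the job id, project id and build status.
      let job = flip parseMaybe payload $ withObject "job event" $ \o ->
                  (,,) <$> o .: "build_id"
                       <*> o .: "project_id"
                       <*> o .: "build_status"
      case job of
        Just (jobId, projectId, status)
          | status == ("failed" :: String) ->
              () <$ liftIO (retryIfKilled token projectId jobId)
        _ -> pure ()
      text "ok"
```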
@angerman this is happening on the zw3rk runners. (Sorry I haven't updated the graph above to use runner name instead of runner id—it's still on my list.) It's only happening on head.hackage jobs, so maybe there's something special about how they run.
It's elusive why they happen ;-) Increasing load does increase them. If you can figure out what is triggering the kill, that would be great! It's not me personally.
@chreekat you have access to the machines. I'm at a loss as to what is going on and why the jobs are being killed. I'm happy to make any changes to the configuration; I simply don't understand what's going on with docker here.