setNumCapabilities 257 (or more) crashes a program
Summary
The RTS has MAX_N_CAPABILITIES = 256, but it looks like setNumCapabilities does not check whether the requested number of capabilities is within this limit. If you try to set more capabilities, you sometimes get a segmentation fault.
Steps to reproduce
import GHC.Conc
main = setNumCapabilities 257
(Sometimes a larger value is needed, depending on the environment and GHC version.)
On a host with 256 cores, the following setup seems to crash every time:
printf "import GHC.Conc\nmain = setNumCapabilities 257" | docker run -i --rm haskell:9.10.1-bullseye runghc; echo $?
On another host, I needed to bump the setNumCapabilities argument to 265 to reproduce this issue.
I first discovered this issue in Nix builds, and the following setup also crashes every time for me (it uses GHC 9.6.6 under the hood):
$ nix run nixpkgs/b681065d0919f7eb5309a93cea2cfa84dec9aa88#ghc -- -threaded conc.hs; ./conc
[1] 311106 segmentation fault (core dumped) ./conc
For this crash I have a stack trace showing that the crash happens somewhere in the RTS:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000000049d34f in assignNurseriesToCapabilities ()
[Current thread is 1 (LWP 311106)]
(gdb) bt
#0 0x000000000049d34f in assignNurseriesToCapabilities ()
#1 0x000000000049d96c in storageAddCapabilities ()
#2 0x000000000047eaea in setNumCapabilities ()
#3 0x000000000041042d in base_GHCziConcziSync_setNumCapabilities1_info ()
#4 0x0000000000000000 in ?? ()
I have tried reproducing it with GHCup and have found that it requires more than 257 capabilities to cause a crash. If you bump the setNumCapabilities argument to 259 (on a host with 256 cores), the program crashes under both ghc and runghc, for versions 9.6.6, 9.8.4 and 9.10.1. I have also encountered cases (with 258 capabilities, IIRC) where it does not crash on every program launch, and noticed that ghci and runghc look more susceptible to the problem.
Expected behavior
I would expect such a program not to crash. Maybe it should print a warning or silently ignore such a large argument, or maybe it should dynamically resize the capabilities array.
Context
I first discovered this issue while upgrading our platform to NixOS 24.11: builds were failing on hedgehog tests. As I found out, hedgehog saw that our build machine has 256 cores and planned to run 256 workers, so it set the number of capabilities to the number of cores + 2 (258 in our case) to make some room for IO threads, and then crashed as shown above.
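As a caller-side stopgap, a guard along the following lines might keep a "cores + 2" request within the limit. This is only a sketch: assumedMaxCapabilities is a hard-coded copy of the MAX_N_CAPABILITIES value quoted above (as far as I know the RTS does not expose it to Haskell code), and clampCapabilities and setNumCapabilitiesClamped are hypothetical helper names, not part of base or hedgehog.

```haskell
import Control.Monad (when)
import GHC.Conc (getNumProcessors, setNumCapabilities)
import System.IO (hPutStrLn, stderr)

-- Assumption: mirrors the MAX_N_CAPABILITIES value quoted in this report.
-- It is hard-coded here and may not match every GHC version.
assumedMaxCapabilities :: Int
assumedMaxCapabilities = 256

-- Clamp a requested capability count into [1, assumedMaxCapabilities].
clampCapabilities :: Int -> Int
clampCapabilities = max 1 . min assumedMaxCapabilities

-- Hypothetical wrapper: warn on stderr when the request had to be
-- adjusted, then pass the clamped value to setNumCapabilities.
setNumCapabilitiesClamped :: Int -> IO ()
setNumCapabilitiesClamped n = do
  let n' = clampCapabilities n
  when (n' /= n) $
    hPutStrLn stderr $
      "requested " ++ show n ++ " capabilities, using " ++ show n'
  setNumCapabilities n'

main :: IO ()
main = do
  cores <- getNumProcessors
  -- Same "cores + 2" request as described above, but clamped first.
  setNumCapabilitiesClamped (cores + 2)
```

On a 256-core machine this would request 256 capabilities instead of 258. It only sidesteps the problem on the caller's side; the missing check in the RTS is still there.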
Environment
- GHC versions used: 9.6.6, 9.8.4, 9.10.1
- Operating System: Ubuntu
- System Architecture: x86_64 (reproduced on ARM64 too)