Using `-Dn` can lead to segfaults
Summary
When using the debug RTS, we can use the `-Dn` flag to print non-moving debug info.
One of the things this does is print non-moving segment occupancy statistics.
This code is also called when using the eventlog with the `-ln` flag enabled.
In that codepath, we make sure not to collect information about live words when the mutator is running, as this can lead to issues.
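But it looks like this guard is missing on the debug codepath, where collecting live word data seems to be enabled unconditionally. The backtrace below shows the census being called with collect_live_words=true from a mark that is running with concurrent=true.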
I'm seeing some segmentation faults when running with `-Dn`, like the following:
Thread 3 "nonmoving-mark" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 170171]
0x00000000004f006a in closure_sizeW_ (p=0x42203a7fd8, info=0xfffffffffffffff0) at rts/ClosureSize.c:14
14 switch (info->type) {
(gdb) bt
#0 0x00000000004f006a in closure_sizeW_ (p=0x42203a7fd8, info=0xfffffffffffffff0) at rts/ClosureSize.c:14
#1 0x000000000049768a in closure_sizeW (p=0x42203a7fd8) at rts/include/rts/storage/ClosureMacros.h:418
#2 0x00000000004c745a in nonmovingAllocatorCensus_ (alloc_idx=2, collect_live_words=true) at rts/sm/NonMovingCensus.c:40
#3 0x00000000004c787d in nonmovingPrintAllocatorCensus (collect_live_words=true) at rts/sm/NonMovingCensus.c:138
#4 0x00000000004c6d8e in nonmovingMark_ (mark_queue=0x6afcd0, dead_weaks=0x7ffff6b5de60, resurrected_threads=0x7ffff6b5de68, concurrent=true) at rts/sm/NonMoving.c:1224
#5 0x00000000004c64a3 in nonmovingConcurrentMarkWorker (data=0x0) at rts/sm/NonMoving.c:933
#6 0x00000000004dce09 in start_thread (param=0x6a18e0) at rts/posix/OSThreads.c:218
#7 0x00007ffff7cdfe24 in start_thread () from /nix/store/46m4xx889wlhsdj72j38fnlyyvvvvbyb-glibc-2.37-8/lib/libc.so.6
#8 0x00007ffff7d619b0 in clone3 () from /nix/store/46m4xx889wlhsdj72j38fnlyyvvvvbyb-glibc-2.37-8/lib/libc.so.6
(gdb) p *Bdescr(nonmovingGetSegment(0x42203a7fd8))
$7 = {start = 0x42203a0000, {free = 0x4200000002, nonmoving_segment = {allocator_idx = 2, next_free_snap = 0}}, link = 0x0, u = {back = 0x42203a7ff8, bitmap = 0x42203a7ff8,
scan = 0x42203a7ff8}, gen = 0x6a15c0, gen_no = 1, dest_no = 1, node = 0, flags = 1024, blocks = 8, _padding = {0, 0, 0}}
(gdb) p printClosure(0x42203a7fd8)
0x42203a7fd8: ghc-prim:GHC.Types.:(0x42203622e1, 0x42203683fa)
Solution
Maybe all we need to do is enable live word collection only when we are not doing concurrent collection? If so, this should be a one-line change, and I'd be happy to make an MR.
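To make the proposal concrete, here is a minimal standalone sketch of the guard. It is not the actual RTS code. The `concurrent` and `collect_live_words` names are taken from the backtrace above, but `census_may_collect_live_words`, `print_allocator_census_stub`, and `debug_census_call_site` are hypothetical stand-ins, and the real call site in rts/sm/NonMoving.c may look different.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper (not an RTS function): decide whether the census is
 * allowed to walk closures and count live words. The assumption, taken from
 * the report, is that this is only safe when the mark is not concurrent,
 * i.e. when the mutator is not running alongside it. */
static bool census_may_collect_live_words(bool concurrent)
{
    return !concurrent;
}

/* Stand-in for nonmovingPrintAllocatorCensus(collect_live_words); it only
 * records the argument so the behaviour can be checked. */
static bool last_collect_live_words = false;

static void print_allocator_census_stub(bool collect_live_words)
{
    last_collect_live_words = collect_live_words;
}

/* Models the debug (-Dn) call site inside nonmovingMark_: instead of passing
 * `true` unconditionally, pass the guarded value. */
static void debug_census_call_site(bool concurrent)
{
    print_allocator_census_stub(census_may_collect_live_words(concurrent));

    /* Invariant the fix is meant to establish: never collect live words
     * during a concurrent mark. */
    assert(!(concurrent && last_collect_live_words));
}
```

With this shape, the actual change would amount to replacing the unconditional `true` at the debug call site with something like `!concurrent`, assuming the eventlog path already does the equivalent.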