Intermittent segfaults observed in multi-threaded program (likely via BoundedChan) with GHC 8.10.2
[ TL;DR: a missing dirty_MVAR call in putMVar left an MVar incorrectly marked clean even when it held the last reference to a TSO queue head in a younger generation. The queue head was subsequently moved by the GC, but the MVar's pointer to it was not updated. As a result the MVar's TSO queue was corrupted, with a dangling pointer leading to an unexpected object or simply random memory. ]
Steps to reproduce
The crash occurs many tens of minutes into a run that processes tens of GB of data, so it is by no means immediate or easy to reproduce. It has happened a few times now. It is rather difficult to reproduce, being very sensitive to scheduler timing and workload. Things that made it more likely were:
Limiting the depth of the BoundedChan to 1, thus increasing inter-thread contention
Shrinking the allocation area with -A128k, making GC more frequent.
Running on a bare-metal 16-core/32-thread machine, to get more effective concurrency.
A multi-layer pipeline of BoundedChans between source and sink:
HTTPS or stdin
gunzip
group into chunks of 1k lines
parallel JSON parser/filter
output
Large dataset from internet-wide IP survey, compressed to multiple GB.
internal error: ASSERTION FAILED: file rts/Stats.c, line 502
The stack trace is more useful now:
(gdb) bt
#0  0x00007f6f6dc1e625 in raise () from /lib64/libc.so.6
#1  0x00007f6f6dc078d9 in abort () from /lib64/libc.so.6
#2  0x00000000019f1c5f in rtsFatalInternalErrorFn (s=0x1b0da30 "ASSERTION FAILED: file %s, line %u\n", ap=0x7f6f17ffe8a8) at rts/RtsMessages.c:186
#3  0x00000000019f1891 in barf (s=0x1b0da30 "ASSERTION FAILED: file %s, line %u\n") at rts/RtsMessages.c:48
#4  0x00000000019f18f4 in _assertFail (filename=0x1b11cd9 "rts/Stats.c", linenum=502) at rts/RtsMessages.c:63
#5  0x00000000019fd3f6 in stat_endGC (cap=0x273b760, initiating_gct=0x2834990, live=1064162, copied=1009, slop=194846, gen=0, par_n_threads=16, gc_threads=0x2833650, par_max_copied=153, par_balanced_copied=204, gc_spin_spin=33332578664, gc_spin_yield=33022849, mut_spin_spin=42068217183, mut_spin_yield=41736854, any_work=885, no_work=885, scav_find_work=28) at rts/Stats.c:502
#6  0x0000000001a1357d in GarbageCollect (collect_gen=0, do_heap_census=false, deadlock_detect=false, gc_type=2, cap=0x273b760, idle_cap=0x7f6f04000bb0) at rts/sm/GC.c:941
#7  0x00000000019f8e87 in scheduleDoGC (pcap=0x7f6f17ffee50, task=0x7f6f58000bb0, force_major=false, deadlock_detect=false) at rts/Schedule.c:1836
#8  0x00000000019f71de in schedule (initialCapability=0x273b760, task=0x7f6f58000bb0) at rts/Schedule.c:562
#9  0x00000000019fa10f in scheduleWorker (cap=0x273b760, task=0x7f6f58000bb0) at rts/Schedule.c:2600
#10 0x0000000001a00cf1 in workerStart (task=0x7f6f58000bb0) at rts/Task.c:445
#11 0x00007f6f6fd0b4e2 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f6f6dce36c3 in clone () from /lib64/libc.so.6
The failed assertion is:
500         for (unsigned int i=0; i < par_n_threads; i++) {
501             gc_thread *gct = gc_threads[i];
502             ASSERT(gct->gc_end_cpu >= gct->gc_start_cpu);
503             stats.gc.cpu_ns += gct->gc_end_cpu - gct->gc_start_cpu;
504         }
(gdb) p gct->gc_end_cpu - gct->gc_start_cpu
$7 = 2616009
So it is rather mysterious how the assertion managed to fail...
[ Unless some other thread was modifying these values concurrently, and the core file contains values that are not exactly what the pointers held at the time of the comparison. ]
Disabling statistics, I finally get a more meaningful core:
internal error: ASSERTION FAILED: file rts/PrimOps.cmm, line 1824
With backtrace:
#0  0x00007f4379081625 in raise () from /lib64/libc.so.6
#1  0x00007f437906a8d9 in abort () from /lib64/libc.so.6
#2  0x00000000019f1c5f in rtsFatalInternalErrorFn (s=0x1b0da30 "ASSERTION FAILED: file %s, line %u\n", ap=0x7f43497f5d08) at rts/RtsMessages.c:186
#3  0x00000000019f1891 in barf (s=0x1b0da30 "ASSERTION FAILED: file %s, line %u\n") at rts/RtsMessages.c:48
#4  0x00000000019f18f4 in _assertFail (filename=0x1ace29c "rts/PrimOps.cmm", linenum=1824) at rts/RtsMessages.c:63
#5  0x0000000001a34afc in stg_putMVarzh ()
And the assertion in question is:
ASSERT(StgTSO_block_info(tso) == mvar);
Sadly, I have no symbols for stg_putMVarzh. Is there a way to build the RTS so that this function also has debug symbols and I can look at its local variables?
Compiling with -debug and running with -DS I get a new segfault:
#0  0x0000000001a1ff1a in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:246
#1  0x0000000001a1ff69 in LOOKS_LIKE_INFO_PTR (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:251
#2  0x0000000001a1ffa1 in LOOKS_LIKE_CLOSURE_PTR (p=0x4201cd9268) at includes/rts/storage/ClosureMacros.h:256
#3  0x0000000001a2096a in checkClosure (p=0x4200012000) at rts/sm/Sanity.c:346
#4  0x0000000001a20dc4 in checkHeapChain (bd=0x4200000480) at rts/sm/Sanity.c:473
#5  0x0000000001a21a97 in checkGeneration (gen=0x3066c30, after_major_gc=true) at rts/sm/Sanity.c:845
#6  0x0000000001a21b6d in checkFullHeap (after_major_gc=true) at rts/sm/Sanity.c:864
#7  0x0000000001a21be8 in checkSanity (after_gc=true, major_gc=true) at rts/sm/Sanity.c:873
#8  0x0000000001a1315b in GarbageCollect (collect_gen=1, do_heap_census=false, deadlock_detect=false, gc_type=2, cap=0x2fb10a0, idle_cap=0x7f6d3c000bb0) at rts/sm/GC.c:850
#9  0x00000000019f8e87 in scheduleDoGC (pcap=0x7f6d75ffae50, task=0x7f6d90000bb0, force_major=false, deadlock_detect=false) at rts/Schedule.c:1836
#10 0x00000000019f71de in schedule (initialCapability=0x2fb10a0, task=0x7f6d90000bb0) at rts/Schedule.c:562
#11 0x00000000019fa10f in scheduleWorker (cap=0x2fb10a0, task=0x7f6d90000bb0) at rts/Schedule.c:2600
#12 0x0000000001a00cf1 in workerStart (task=0x7f6d90000bb0) at rts/Task.c:445
#13 0x00007f6da988e4e2 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f6da78666c3 in clone () from /lib64/libc.so.6
It looks like increasing -A makes the crash less likely (or at least slower to manifest). I've simplified the code to eliminate the most CPU-intensive parts; now the crash happens sooner, and fairly reliably with -DS at the default -A. A run with +RTS -A32m -n2m -DS -N12 is still going and hasn't crashed yet, but it takes many more minutes to crunch through the full dataset (~10 minutes to decompress and stream 4.5 GB of compressed data).
@trac-vdukhovni and I chatted about this issue this evening. Happily I now have the code and can reproduce the issue locally under rr. Also, @trac-vdukhovni has confirmed that he has seen the crash under ghc 8.8 as well.
Currently I am looking at an rr trace which fails with the sanity-checker error:
>>> bt
#0  0x0000000001844cb3 in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:246
#1  0x0000000001844d02 in LOOKS_LIKE_INFO_PTR (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:251
#2  0x0000000001844d3a in LOOKS_LIKE_CLOSURE_PTR (p=0x4205e98000) at includes/rts/storage/ClosureMacros.h:256
#3  0x0000000001845703 in checkClosure (p=0x420434c270) at rts/sm/Sanity.c:346
#4  0x0000000001845b5d in checkHeapChain (bd=0x4204301300) at rts/sm/Sanity.c:473
#5  0x0000000001846874 in checkGeneration (gen=0x39dcdb0, after_major_gc=true) at rts/sm/Sanity.c:849
#6  0x0000000001846906 in checkFullHeap (after_major_gc=true) at rts/sm/Sanity.c:864
#7  0x0000000001846981 in checkSanity (after_gc=true, major_gc=true) at rts/sm/Sanity.c:873
#8  0x0000000001837bd2 in GarbageCollect (collect_gen=1, do_heap_census=false, deadlock_detect=false, gc_type=2, cap=0x39cb490, idle_cap=0x748230000bb0) at rts/sm/GC.c:850
#9  0x000000000181cf4f in scheduleDoGC (pcap=0x2b38137cae68, task=0x5d1d78000bb0, force_major=false, deadlock_detect=false) at rts/Schedule.c:1836
#10 0x000000000181b10f in schedule (initialCapability=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:562
#11 0x000000000181e27c in scheduleWorker (cap=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:2600
#12 0x000000000182508b in workerStart (task=0x5d1d78000bb0) at rts/Task.c:445
#13 0x00007fbe9a8efedd in start_thread () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libpthread.so.0
#14 0x00000000700f9aaf in clone () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libc.so.6
This occurred at the end of a major GC.
The p here from frame 3 is a weak pointer with this reachability graph at the beginning of the collection:
However, it appears that this weak pointer is the result of evacuation during the current GC, since if I move back to the beginning of the collection the weak's memory is all 0xaa bytes.
Furthermore, if I set a watchpoint on $weak->header.info I find that it was written by copy_tag, which is copying from a closure:
>>> bt
#0  0x0000000001875cae in copy_tag (p=0x420434c268, info=0x185b570 <stg_WEAK_info>, src=0x4205e98348, size=6, gen_no=1, tag=0) at rts/sm/Evac.c:147
#1  0x0000000001875f4d in copy (p=0x420434c268, info=0x185b570 <stg_WEAK_info>, src=0x4205e98348, size=6, gen_no=1) at rts/sm/Evac.c:286
#2  0x0000000001876ca2 in evacuate (p=0x420434c268) at rts/sm/Evac.c:858
#3  0x000000000183c623 in markWeakPtrList () at rts/sm/MarkWeak.c:431
#4  0x0000000001836b0b in GarbageCollect (collect_gen=1, do_heap_census=false, deadlock_detect=false, gc_type=2, cap=0x39cb490, idle_cap=0x748230000bb0) at rts/sm/GC.c:430
#5  0x000000000181cf4f in scheduleDoGC (pcap=0x2b38137cae68, task=0x5d1d78000bb0, force_major=false, deadlock_detect=false) at rts/Schedule.c:1836
#6  0x000000000181b10f in schedule (initialCapability=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:562
#7  0x000000000181e27c in scheduleWorker (cap=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:2600
#8  0x000000000182508b in workerStart (task=0x5d1d78000bb0) at rts/Task.c:445
#9  0x00007fbe9a8efedd in start_thread () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libpthread.so.0
#10 0x00000000700f9aaf in clone () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libc.so.6
>>> print $src_weak[0]
$12 = {
  header = { info = 0x185b570 <stg_WEAK_info> },
  cfinalizers = 0x1cada60,
  key = 0x4205e98000,
  value = 0x42044d4589,
  finalizer = 0x1cada60,
  link = 0x0
}
>>> print $src_weak[0].key[0]
$14 = {
  header = { info = 0x185b510 <stg_TSO_info> },
  payload = 0x4205e98008
}
>>> x/4a 0x42044d4588
0x42044d4588: 0x15247c8 <base_GHCziConcziSync_ThreadId_con_info>  0x4205e98000
0x42044d4598: 0x1694198 <base_GHCziSTRef_STRef_con_info>  0x4200fae628
Note that the payload of the ThreadId constructor is the same TSO that the weak is keyed on.
So somehow we are evacuating a weak pointer yet not scavenging it.
Confirmed that we do not touch $weak->key again after evacuation until the sanity check.
The block that we allocate into was allocated while evacuating yet another weak pointer:
Thread 20 hit Hardware access (read/write) watchpoint 5: -location $bd->flags
Old value = 1
New value = 0
alloc_todo_block (ws=0x39e6730, size=6) at rts/sm/GCUtils.c:334
334         bd->flags = BF_EVACUATED;
>>> bt
#0  alloc_todo_block (ws=0x39e6730, size=6) at rts/sm/GCUtils.c:334
#1  0x000000000183af41 in todo_block_full (size=6, ws=0x39e6730) at rts/sm/GCUtils.c:298
#2  0x0000000001875c25 in alloc_for_copy (size=6, gen_no=1) at rts/sm/Evac.c:125
#3  0x0000000001875c9a in copy_tag (p=0x39dce20, info=0x185b570 <stg_WEAK_info>, src=0x4200fac1e0, size=6, gen_no=1, tag=0) at rts/sm/Evac.c:144
#4  0x0000000001875f4d in copy (p=0x39dce20, info=0x185b570 <stg_WEAK_info>, src=0x4200fac1e0, size=6, gen_no=1) at rts/sm/Evac.c:286
#5  0x0000000001876ca2 in evacuate (p=0x39dce20) at rts/sm/Evac.c:858
#6  0x000000000183c623 in markWeakPtrList () at rts/sm/MarkWeak.c:431
#7  0x0000000001836b0b in GarbageCollect (collect_gen=1, do_heap_census=false, deadlock_detect=false, gc_type=2, cap=0x39cb490, idle_cap=0x748230000bb0) at rts/sm/GC.c:430
#8  0x000000000181cf4f in scheduleDoGC (pcap=0x2b38137cae68, task=0x5d1d78000bb0, force_major=false, deadlock_detect=false) at rts/Schedule.c:1836
#9  0x000000000181b10f in schedule (initialCapability=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:562
#10 0x000000000181e27c in scheduleWorker (cap=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:2600
#11 0x000000000182508b in workerStart (task=0x5d1d78000bb0) at rts/Task.c:445
#12 0x00007fbe9a8efedd in start_thread () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libpthread.so.0
#13 0x00000000700f9aaf in clone () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libc.so.6
Note how the closure at $weak[0]->key is now a forwarding pointer.
However, the closure layout of stg_WEAK_info declares key to be a non-pointer (since it's a weak reference), so we do not evacuate this field.
We eventually conclude scavenging.
We enter checkSanity and eventually stumble upon the bad weak pointer.
My recollection here is that traverseWeakPtrList (called after each round of scavenging) should have evacuated $weak->key. It's unclear why it didn't. Perhaps $weak is somehow no longer on weak_ptr_list?
Between steps 3 and 4 there is another event: another thread (thread 17) steps in and evacuates $src_weak again (to 0x42076a7388). That is, there is a race to evacuate $src_weak, resulting in two copies. At the end of the collection one of these copies (the one on weak_ptr_list) will be valid, since its key will be evacuated by traverseWeakPtrList, while the other will be invalid but unreachable. Nevertheless, checkSanity stumbles over the latter and crashes. This is, sadly, a spurious crash. I fix the cause in !4408 (closed).
@trac-vdukhovni and I again spoke on the phone. Unfortunately, with the final version of !4408 (closed) I am unable to reproduce the crash with or without sanity checking enabled. I've tried only a few times, but it seems likely that the issue is now merely hidden. @trac-vdukhovni, perhaps you could look into expanding the testcase again?