Intermittent segfaults observed in multi-threaded program (likely via BoundedChan) with GHC 8.10.2
[ TL;DR: A missing dirty_MVAR call in stg_putMVarzh resulted in an MVar incorrectly remaining clean even when it held the last reference to a TSO queue head in a younger generation. The queue head was subsequently moved by the GC, but the MVar's pointer was not updated. As a result the MVar's TSO queue was corrupted with a dangling pointer to an unexpected object or to plain garbage in memory. ]
Steps to reproduce
The crash occurs many tens of minutes into a run that processes tens of GB of data, so it is by no means immediate or easy to reproduce. It has happened a few times now. It is rather difficult to reproduce and is very sensitive to scheduler timing and workload. Things that made it more likely were:
Limiting the depth of the BoundedChan to 1, thus increasing inter-thread contention.
Limiting the nursery (allocation area) size with -A128k, making GCs more frequent.
Running on a bare-metal 16-core/32-thread machine, to get more effective concurrency.
A multi-layer pipeline of BoundedChans between source and sink (a rough sketch follows this list):
HTTPS or stdin
gunzip
group into chunks of 1k lines
parallel JSON parser/filter
output
Large dataset from internet-wide IP survey, compressed to multiple GB.
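For concreteness, the shape of the pipeline is roughly the following. This is only an illustrative sketch (it assumes the BoundedChan package's Control.Concurrent.BoundedChan API, uses stdin in place of the HTTPS download and streaming gunzip, and elides the real JSON parsing and filtering):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.BoundedChan (newBoundedChan, readChan, writeChan)
import Control.Monad (forever, replicateM, replicateM_)
import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
  lineCh  <- newBoundedChan 1   -- depth 1 maximises inter-thread contention
  chunkCh <- newBoundedChan 1
  outCh   <- newBoundedChan 1
  -- source: stdin stands in for the HTTPS download + streaming gunzip (EOF handling elided)
  _ <- forkIO $ forever $ B.getLine >>= writeChan lineCh
  -- group the stream into chunks of 1k lines
  _ <- forkIO $ forever $ do
         chunk <- replicateM 1000 (readChan lineCh)
         writeChan chunkCh chunk
  -- parallel parser/filter workers (the real code parses and filters JSON here)
  replicateM_ 8 $ forkIO $ forever $ do
         chunk <- readChan chunkCh
         writeChan outCh (B.unlines chunk)
  -- output sink
  forever $ readChan outCh >>= B.putStr
```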
internal error: ASSERTION FAILED: file rts/Stats.c, line 502
The stack trace is more useful now:
```
(gdb) bt
#0  0x00007f6f6dc1e625 in raise () from /lib64/libc.so.6
#1  0x00007f6f6dc078d9 in abort () from /lib64/libc.so.6
#2  0x00000000019f1c5f in rtsFatalInternalErrorFn (s=0x1b0da30 "ASSERTION FAILED: file %s, line %u\n", ap=0x7f6f17ffe8a8) at rts/RtsMessages.c:186
#3  0x00000000019f1891 in barf (s=0x1b0da30 "ASSERTION FAILED: file %s, line %u\n") at rts/RtsMessages.c:48
#4  0x00000000019f18f4 in _assertFail (filename=0x1b11cd9 "rts/Stats.c", linenum=502) at rts/RtsMessages.c:63
#5  0x00000000019fd3f6 in stat_endGC (cap=0x273b760, initiating_gct=0x2834990, live=1064162, copied=1009, slop=194846, gen=0, par_n_threads=16, gc_threads=0x2833650, par_max_copied=153, par_balanced_copied=204, gc_spin_spin=33332578664, gc_spin_yield=33022849, mut_spin_spin=42068217183, mut_spin_yield=41736854, any_work=885, no_work=885, scav_find_work=28) at rts/Stats.c:502
#6  0x0000000001a1357d in GarbageCollect (collect_gen=0, do_heap_census=false, deadlock_detect=false, gc_type=2, cap=0x273b760, idle_cap=0x7f6f04000bb0) at rts/sm/GC.c:941
#7  0x00000000019f8e87 in scheduleDoGC (pcap=0x7f6f17ffee50, task=0x7f6f58000bb0, force_major=false, deadlock_detect=false) at rts/Schedule.c:1836
#8  0x00000000019f71de in schedule (initialCapability=0x273b760, task=0x7f6f58000bb0) at rts/Schedule.c:562
#9  0x00000000019fa10f in scheduleWorker (cap=0x273b760, task=0x7f6f58000bb0) at rts/Schedule.c:2600
#10 0x0000000001a00cf1 in workerStart (task=0x7f6f58000bb0) at rts/Task.c:445
#11 0x00007f6f6fd0b4e2 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f6f6dce36c3 in clone () from /lib64/libc.so.6
```
The failed assertion is:
```
500     for (unsigned int i=0; i < par_n_threads; i++) {
501         gc_thread *gct = gc_threads[i];
502         ASSERT(gct->gc_end_cpu >= gct->gc_start_cpu);
503         stats.gc.cpu_ns += gct->gc_end_cpu - gct->gc_start_cpu;
504     }
```
```
(gdb) p gct->gc_end_cpu - gct->gc_start_cpu
$7 = 2616009
```
So it is rather mysterious how the assertion managed to fail...
[ Unless some other thread was modifying these values at the same time, and the core file contains values that are not exactly what the pointers held at the time of the comparison. ]
Disabling statistics, I finally get a more meaningful core:
internal error: ASSERTION FAILED: file rts/PrimOps.cmm, line 1824
With backtrace:
```
#0  0x00007f4379081625 in raise () from /lib64/libc.so.6
#1  0x00007f437906a8d9 in abort () from /lib64/libc.so.6
#2  0x00000000019f1c5f in rtsFatalInternalErrorFn (s=0x1b0da30 "ASSERTION FAILED: file %s, line %u\n", ap=0x7f43497f5d08) at rts/RtsMessages.c:186
#3  0x00000000019f1891 in barf (s=0x1b0da30 "ASSERTION FAILED: file %s, line %u\n") at rts/RtsMessages.c:48
#4  0x00000000019f18f4 in _assertFail (filename=0x1ace29c "rts/PrimOps.cmm", linenum=1824) at rts/RtsMessages.c:63
#5  0x0000000001a34afc in stg_putMVarzh ()
```
And the assertion in question is:
ASSERT(StgTSO_block_info(tso) == mvar);
Sadly, I have no symbols for stg_putMVarzh. Is there a way to build the RTS so that this function also has debug symbols, so that I can look at its local variables?
Compiling with -debug and running with -DS, I get a new segfault:
```
#0  0x0000000001a1ff1a in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:246
#1  0x0000000001a1ff69 in LOOKS_LIKE_INFO_PTR (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:251
#2  0x0000000001a1ffa1 in LOOKS_LIKE_CLOSURE_PTR (p=0x4201cd9268) at includes/rts/storage/ClosureMacros.h:256
#3  0x0000000001a2096a in checkClosure (p=0x4200012000) at rts/sm/Sanity.c:346
#4  0x0000000001a20dc4 in checkHeapChain (bd=0x4200000480) at rts/sm/Sanity.c:473
#5  0x0000000001a21a97 in checkGeneration (gen=0x3066c30, after_major_gc=true) at rts/sm/Sanity.c:845
#6  0x0000000001a21b6d in checkFullHeap (after_major_gc=true) at rts/sm/Sanity.c:864
#7  0x0000000001a21be8 in checkSanity (after_gc=true, major_gc=true) at rts/sm/Sanity.c:873
#8  0x0000000001a1315b in GarbageCollect (collect_gen=1, do_heap_census=false, deadlock_detect=false, gc_type=2, cap=0x2fb10a0, idle_cap=0x7f6d3c000bb0) at rts/sm/GC.c:850
#9  0x00000000019f8e87 in scheduleDoGC (pcap=0x7f6d75ffae50, task=0x7f6d90000bb0, force_major=false, deadlock_detect=false) at rts/Schedule.c:1836
#10 0x00000000019f71de in schedule (initialCapability=0x2fb10a0, task=0x7f6d90000bb0) at rts/Schedule.c:562
#11 0x00000000019fa10f in scheduleWorker (cap=0x2fb10a0, task=0x7f6d90000bb0) at rts/Schedule.c:2600
#12 0x0000000001a00cf1 in workerStart (task=0x7f6d90000bb0) at rts/Task.c:445
#13 0x00007f6da988e4e2 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f6da78666c3 in clone () from /lib64/libc.so.6
```
It looks like increasing -A makes the crash less likely (or at least take longer to manifest). I've simplified the code to eliminate the most CPU-intensive parts, and now the crash happens sooner, and fairly reliably with -DS and the default -A. A run with +RTS -A32m -n2m -DS -N12 is still going and hasn't crashed yet, but it takes many more minutes to crunch through the full dataset (roughly 10 minutes to decompress and stream the 4.5 GB of compressed data).
@trac-vdukhovni and I chatted about this issue this evening. Happily, I now have the code and can reproduce the issue locally under rr. Also, @trac-vdukhovni has confirmed that he has seen the crash under GHC 8.8 as well.
Currently I am looking at an rr trace which fails with the sanity-checker error:
```
>>> bt
#0  0x0000000001844cb3 in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:246
#1  0x0000000001844d02 in LOOKS_LIKE_INFO_PTR (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:251
#2  0x0000000001844d3a in LOOKS_LIKE_CLOSURE_PTR (p=0x4205e98000) at includes/rts/storage/ClosureMacros.h:256
#3  0x0000000001845703 in checkClosure (p=0x420434c270) at rts/sm/Sanity.c:346
#4  0x0000000001845b5d in checkHeapChain (bd=0x4204301300) at rts/sm/Sanity.c:473
#5  0x0000000001846874 in checkGeneration (gen=0x39dcdb0, after_major_gc=true) at rts/sm/Sanity.c:849
#6  0x0000000001846906 in checkFullHeap (after_major_gc=true) at rts/sm/Sanity.c:864
#7  0x0000000001846981 in checkSanity (after_gc=true, major_gc=true) at rts/sm/Sanity.c:873
#8  0x0000000001837bd2 in GarbageCollect (collect_gen=1, do_heap_census=false, deadlock_detect=false, gc_type=2, cap=0x39cb490, idle_cap=0x748230000bb0) at rts/sm/GC.c:850
#9  0x000000000181cf4f in scheduleDoGC (pcap=0x2b38137cae68, task=0x5d1d78000bb0, force_major=false, deadlock_detect=false) at rts/Schedule.c:1836
#10 0x000000000181b10f in schedule (initialCapability=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:562
#11 0x000000000181e27c in scheduleWorker (cap=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:2600
#12 0x000000000182508b in workerStart (task=0x5d1d78000bb0) at rts/Task.c:445
#13 0x00007fbe9a8efedd in start_thread () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libpthread.so.0
#14 0x00000000700f9aaf in clone () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libc.so.6
```
This occurred at the end of a major GC.
The p here from frame 3 is a weak pointer with this reachability graph at the beginning of the collection:
However, it appears that this weak pointer is the result of evacuation during the current GC, since if I move back to the beginning of the collection the weak's memory is all 0xaa bytes.
Furthermore, if I set a watchpoint on $weak->header.info I find that it was written by copy_tag, which is copying from a closure:
```
>>> bt
#0  0x0000000001875cae in copy_tag (p=0x420434c268, info=0x185b570 <stg_WEAK_info>, src=0x4205e98348, size=6, gen_no=1, tag=0) at rts/sm/Evac.c:147
#1  0x0000000001875f4d in copy (p=0x420434c268, info=0x185b570 <stg_WEAK_info>, src=0x4205e98348, size=6, gen_no=1) at rts/sm/Evac.c:286
#2  0x0000000001876ca2 in evacuate (p=0x420434c268) at rts/sm/Evac.c:858
#3  0x000000000183c623 in markWeakPtrList () at rts/sm/MarkWeak.c:431
#4  0x0000000001836b0b in GarbageCollect (collect_gen=1, do_heap_census=false, deadlock_detect=false, gc_type=2, cap=0x39cb490, idle_cap=0x748230000bb0) at rts/sm/GC.c:430
#5  0x000000000181cf4f in scheduleDoGC (pcap=0x2b38137cae68, task=0x5d1d78000bb0, force_major=false, deadlock_detect=false) at rts/Schedule.c:1836
#6  0x000000000181b10f in schedule (initialCapability=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:562
#7  0x000000000181e27c in scheduleWorker (cap=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:2600
#8  0x000000000182508b in workerStart (task=0x5d1d78000bb0) at rts/Task.c:445
#9  0x00007fbe9a8efedd in start_thread () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libpthread.so.0
#10 0x00000000700f9aaf in clone () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libc.so.6
>>> print $src_weak[0]
$12 = { header = { info = 0x185b570 <stg_WEAK_info> }, cfinalizers = 0x1cada60, key = 0x4205e98000, value = 0x42044d4589, finalizer = 0x1cada60, link = 0x0}
>>> print $src_weak[0].key[0]
$14 = { header = { info = 0x185b510 <stg_TSO_info> }, payload = 0x4205e98008}
>>> x/4a 0x42044d4588
0x42044d4588:   0x15247c8 <base_GHCziConcziSync_ThreadId_con_info>     0x4205e98000
0x42044d4598:   0x1694198 <base_GHCziSTRef_STRef_con_info>     0x4200fae628
```
Note that the payload of the ThreadId constructor is the same TSO that the weak is keyed on.
So somehow we are evacuating a weak pointer yet not scavenging it.
Confirmed that we do not touch $weak->key again after evacuation until the sanity check.
The block that we allocate into was allocated while evacuating yet another weak pointer:
```
Thread 20 hit Hardware access (read/write) watchpoint 5: -location $bd->flags
Old value = 1
New value = 0
alloc_todo_block (ws=0x39e6730, size=6) at rts/sm/GCUtils.c:334
334         bd->flags = BF_EVACUATED;
>>> bt
#0  alloc_todo_block (ws=0x39e6730, size=6) at rts/sm/GCUtils.c:334
#1  0x000000000183af41 in todo_block_full (size=6, ws=0x39e6730) at rts/sm/GCUtils.c:298
#2  0x0000000001875c25 in alloc_for_copy (size=6, gen_no=1) at rts/sm/Evac.c:125
#3  0x0000000001875c9a in copy_tag (p=0x39dce20, info=0x185b570 <stg_WEAK_info>, src=0x4200fac1e0, size=6, gen_no=1, tag=0) at rts/sm/Evac.c:144
#4  0x0000000001875f4d in copy (p=0x39dce20, info=0x185b570 <stg_WEAK_info>, src=0x4200fac1e0, size=6, gen_no=1) at rts/sm/Evac.c:286
#5  0x0000000001876ca2 in evacuate (p=0x39dce20) at rts/sm/Evac.c:858
#6  0x000000000183c623 in markWeakPtrList () at rts/sm/MarkWeak.c:431
#7  0x0000000001836b0b in GarbageCollect (collect_gen=1, do_heap_census=false, deadlock_detect=false, gc_type=2, cap=0x39cb490, idle_cap=0x748230000bb0) at rts/sm/GC.c:430
#8  0x000000000181cf4f in scheduleDoGC (pcap=0x2b38137cae68, task=0x5d1d78000bb0, force_major=false, deadlock_detect=false) at rts/Schedule.c:1836
#9  0x000000000181b10f in schedule (initialCapability=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:562
#10 0x000000000181e27c in scheduleWorker (cap=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:2600
#11 0x000000000182508b in workerStart (task=0x5d1d78000bb0) at rts/Task.c:445
#12 0x00007fbe9a8efedd in start_thread () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libpthread.so.0
#13 0x00000000700f9aaf in clone () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libc.so.6
```
Note how the closure at $weak[0]->key is now a forwarding pointer.
However, the closure layout of stg_WEAK_info declares key to be a non-pointer (since it's a weak reference), so we do not evacuate this field.
We eventually finish scavenging.
We enter checkSanity and eventually stumble into the bad weak pointer.
My recollection here is that traverseWeakPtrList (called after each round of scavenging) should have evacuated $weak->key. It's unclear why it didn't. Perhaps $weak is somehow no longer on weak_ptr_list?
Between steps 3 and 4 there is another event: another thread (thread 17) steps in and evacuates $src_weak again (to 0x42076a7388). That is, there is a race to evacuate $src_weak, resulting in two copies. At the end of collection one of these copies (the one on weak_ptr_list) will be valid (since its key will be evacuated by traverseWeakPtrList) and the other will be invalid but unreachable. Nevertheless, checkSanity stumbles over the latter and crashes. This particular crash is therefore sadly spurious. I fix its cause in !4408 (closed).
@trac-vdukhovni and I again spoke on the phone. Unfortunately, with the final version of !4408 (closed) I am unable to reproduce the crash with or without sanity-checking enabled; I've tried only a few times, but it seems likely that the issue is now merely hidden. @trac-vdukhovni, perhaps you could look into expanding the testcase again?
@trac-vdukhovni and I have been communicating via email, where there have been some further interesting revelations. He observed this crash (on a patched tree with additional debug information):
```
#0  0x00007fe3270fa625 in raise () from /lib64/libc.so.6
#1  0x00007fe3270e38d9 in abort () from /lib64/libc.so.6
#2  0x00000000019f506b in rtsFatalInternalErrorFn (s=0x1aa7a65 "he's dead, jim.\n", ap=0x7fe2d3ffad08) at rts/RtsMessages.c:196
#3  0x00000000019f518d in barf (s=s@entry=0x1aa7a65 "he's dead, jim.\n") at rts/RtsMessages.c:58
#4  0x00000000019f52d3 in putMVarError (mvar=0x4201a10880, tso=0x4200e2d010) at rts/RtsMessages.c:50
#5  0x0000000001a1896b in stg_putMVarzh ()
#6  0x0000000000000000 in ?? ()
(gdb) fr 4
#4  0x00000000019f52d3 in putMVarError (mvar=0x4201a10880, tso=0x4200e2d010) at rts/RtsMessages.c:50
50          barf("he's dead, jim.\n");
(gdb) l
45      {
46          errorBelch("putMVarzh: uh oh, blocked mismatch\n");
47          errorBelch("  tso->why_blocked=%d\n", tso->why_blocked);
48          errorBelch("  tso->block_info=%p\n", tso->block_info);
49          errorBelch("  mvar=%p\n", mvar);
50          barf("he's dead, jim.\n");
51      }
52
53      void
54      barf(const char*s, ...)
(gdb) p *tso
$1 = {header = {info = 0x3a22706d61747365}, _link = 0x3235393230363122, global_link = 0x616e222c22363833, stackobj = 0x797370223a22656d, what_next = 26723, why_blocked = 29807, flags = 1634887016, block_info = {closure = 0x736b726f772d7970, prev = 0x736b726f772d7970, bh = 0x736b726f772d7970, throwto = 0x736b726f772d7970, wakeup = 0x736b726f772d7970, fd = 8316866959936158064}, id = 1952804398, saved_errno = 1948396578, dirty = 577073273, bound = 0x6c6176222c226161, cap = 0x303632223a226575, trec = 0x333a303037343a36, blocked_exceptions = 0x3138363a3a373330, bq = 0x7d22656263353a66, alloc_limit = 8315172586696899338, tot_stack_size = 1886216564}
(gdb) p *mvar
$2 = {header = {info = 0x1a19958 <stg_WHITEHOLE_info>}, head = 0x420188d6a3, tail = 0x4200c36678, value = 0x1e9e850}
```
That TSO doesn't look right at all. Reinterpreting tso as ASCII characters reveals that someone scribbled the JSON data his program is working with over the TSO.
Furthermore, looking at the backtraces of the other threads, we find a few interesting cases:
```
(gdb) thread apply all backtrace
...
Thread 37 (Thread 0x7fe2b97fa700 (LWP 2932832)):
#0  reallyLockClosure (p=<optimized out>) at rts/SMPClosureOps.h:63
#1  0x0000000001a1853c in stg_takeMVarzh ()
#2  0x0000000000000000 in ?? ()
...
Thread 25 (Thread 0x7fe2bb7fe700 (LWP 2932829)):
#0  0x00000000009144c7 in cryptonite_aesni_encrypt_block256 ()
#1  0x00000000008ddd6b in cryptonite_aes_generic_gcm_decrypt ()
#2  0x0000000000884e39 in ?? ()
#3  0x0000000000000000 in ?? ()
...
Thread 1 (Thread 0x7fe2d3fff700 (LWP 2932820)):
#0  0x00007fe3270fa625 in raise () from /lib64/libc.so.6
#1  0x00007fe3270e38d9 in abort () from /lib64/libc.so.6
#2  0x00000000019f506b in rtsFatalInternalErrorFn (s=0x1aa7a65 "he's dead, jim.\n", ap=0x7fe2d3ffad08) at rts/RtsMessages.c:196
#3  0x00000000019f518d in barf (s=s@entry=0x1aa7a65 "he's dead, jim.\n") at rts/RtsMessages.c:58
#4  0x00000000019f52d3 in putMVarError (mvar=0x4201a10880, tso=0x4200e2d010) at rts/RtsMessages.c:50
#5  0x0000000001a1896b in stg_putMVarzh ()
#6  0x0000000000000000 in ?? ()
```
So we certainly have foreign calls at play here. Moreover, cryptonite makes quite heavy use of foreign buffers, suggesting that this may either be a library bug or another manifestation of #17760 (closed).
> Moreover, cryptonite makes quite heavy use of foreign buffers, suggesting that this may either be a library bug or another manifestation of #17760 (closed).
Marking withForeignPtr as NOINLINE could help dismiss this hypothesis.
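For instance, one low-effort way to test that hypothesis without rebuilding base is to route the program's uses through a NOINLINE wrapper. This is only a sketch (the wrapper name is made up), but it should have much the same effect as marking withForeignPtr itself NOINLINE: the optimiser never sees the continuation at the call site, so it cannot drop the keep-alive touch that #17760 is about.

```haskell
import Foreign.ForeignPtr (ForeignPtr, withForeignPtr)
import Foreign.Ptr (Ptr)

-- Hypothetical wrapper: because it is NOINLINE, call sites never expose the
-- continuation to the optimiser, so the touch inside withForeignPtr that
-- keeps the buffer alive cannot be optimised away or reordered past the last
-- use of the raw Ptr.
{-# NOINLINE withForeignPtrNoinline #-}
withForeignPtrNoinline :: ForeignPtr a -> (Ptr a -> IO b) -> IO b
withForeignPtrNoinline = withForeignPtr
```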
Yes, I do use HTTPS sometimes, presumably in this trace, as there's no other library that comes to mind that would be decrypting data. However, the data is also compressed, and what we're seeing scribbled over the TSO is text that I'd have expected to see compressed (some bits recur in many lines and some is new in each line), so that suggests that cryptonite did not directly scribble that text into memory it did not own.
The other foreign calls are primarily in streaming-bytestring and bytestring, mostly reading bytestring payloads, but also constructing bytestrings via builders...
> Yes, I do use HTTPS sometimes, presumably in this trace, as there's no other library that comes to mind that would be decrypting data. However, the data is also compressed, and what we're seeing scribbled over the TSO is text that I'd have expected to see compressed (some bits recur in many lines and some is new in each line), so that suggests that cryptonite did not directly scribble that text into memory it did not own.
Yes, this is a fair point. This does indeed suggest that cryptonite isn't at fault.
Unfortunately I have not been able to reproduce the crash on the full program you sent. @trac-vdukhovni, could you describe how you are reproducing this?
So either the TSO is still live and its data was stepped on by some foreign call, or alternatively the TSO is no longer live and its memory has been reused for live data. We'll have to figure out which... I'll keep running tests to see if I can narrow this down. The data in question came in compressed via HTTPS, was decompressed, and then chunks of 1k lines are streamed into bounded channels, parsed as JSON in multiple threads, and filtered, with the filter output concatenated via bytestring builders; that output is then streamed to an output bounded channel and consumed for printing. If there is unsafe memory into which uncompressed raw JSON data is bleeding, it would have to be in or after the streaming gunzip, and before the JSON data is parsed to extract just the desired "key" (we're seeing raw JSON in *tso).
[ The ARR_WORDS holds a DNS domain name, not a URL, but that's not terribly important. ] The domain name in question was almost certainly created via a ByteString builder, executed to yield a strict ByteString, without trimming:
The pinned byte array length was 0x80, and it contained a single domain name, which in this pipeline is most of the time a slice of a larger buffer, but late in the process is converted to a standalone bytestring via the above functions.
So if there is a safety issue it would be in toLazyByteStringWith; otherwise it is in the GC code. Once the domain name is built, it is tested for relevance (read-only operations) and then added back to an output stream (converted back to a builder that concatenates many names, with a more generous buffer allocation).
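Roughly along these lines; this is a sketch rather than the exact code from the program (the helper name is made up, and the 0x80-byte untrimmed first buffer is inferred from the pinned array length noted below):

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.ByteString.Builder (Builder)
import Data.ByteString.Builder.Extra (toLazyByteStringWith, untrimmedStrategy)

-- Hypothetical helper: run a small Builder (a single domain name) to a strict
-- ByteString using an untrimmed 0x80-byte first chunk, so the resulting
-- pinned buffer keeps its full 0x80 length even when the name is shorter.
builderToStrict :: Builder -> B.ByteString
builderToStrict =
  BL.toStrict . toLazyByteStringWith (untrimmedStrategy 0x80 0x80) BL.empty
```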
This has been quite a challenge to get traction on; unfortunately I have been quite unable to reproduce the issue under rr. From my runs under gdb it appears that the issue is either:
the MVar's head MVAR_TSO_QUEUE being reclaimed and overwritten despite being live, or
another thread writing a bogus pointer to the MVar's head.
Unfortunately without rr it is quite tricky to distinguish these two cases.
Hmm, although now I see another failure mode: no obvious corruption, but a case where we end up in putMVarzh with mvar->head->tso->block_info != mvar; both the LHS and the RHS look valid, however.
While I don't have a concrete theory for how this program would trigger #17760 (closed), I felt it would be negligent to continue down the long road of gdb debugging without trying the obvious NOINLINE workaround on withForeignPtr. To my surprise, a run of the program on the 23 GB dataset finished successfully with this workaround. Of course, this doesn't necessarily mean much: it is possible that it passed due to changes in timing. Nevertheless, I'm going to leave this running in a loop overnight to see whether I can tease out a crash.
Well, the program ran 12 times without a crash over the 10.75 hours that I was away from the keyboard. However, now I'm having trouble reproducing the issue even without the withForeignPtr change. Sigh.
So I managed to get things building with !4347 (closed) and sadly it appears to still crash with the same failure mode. I suspect that adding a NOINLINE to withForeignPtr just changed simplification/timing enough to hide the bug.
Unfortunately I think I'm going to need to put this aside for now. While I hate to release %9.0.1 with a known soundness issue, this appears not to be a recent regression, and it has been extremely resistant to reproduction under rr, making debugging quite difficult. @trac-vdukhovni says he has a smaller reproducer, which I'm hoping he will be able to post, but I'm afraid it doesn't make sense to continue trying to diagnose the current reproducer until after %9.0.1 is out.
```
    // There are readMVar/takeMVar(s) waiting: wake up the first one
    tso = StgMVarTSOQueue_tso(q);
    StgMVar_head(mvar) = StgMVarTSOQueue_link(q);
    if (StgMVar_head(mvar) == stg_END_TSO_QUEUE_closure) {
        StgMVar_tail(mvar) = stg_END_TSO_QUEUE_closure;
    }
```
Note how we mutate head yet we never actually mark the MVar as dirty. @trac-vdukhovni says that dirtying the MVar resolves the crash in his case.
While it's plausible that this write would require marking the MVar as dirty, I can also think of an argument why it may not be necessary: the write makes no closures reachable from the MVar that weren't already reachable before. Indeed this is why I concluded that the code was correct in my audit last week.
That being said, I believe the above argument is very subtly incorrect: Dirtiness only reflects whether any objects in a younger generation are reachable from the MVar. This means that the MVar itself may be (correctly) marked as clean before the mutation (since mvar->head lives in the old generation), yet then needs to be dirtied after popping the head (since mvar->head->link lives in a younger generation). That is:
```
 Generation 0                 Generation 1

                              mvar
                              ┌────────────┐
                              │ MVAR       │
                              │ head       │ ──╮
                              └────────────┘   │
                                               │
 qa                           qb               │
 ┌────────────┐               ┌────────────┐   │
 │ TSO_QUEUE  │ ←─────╮       │ TSO_QUEUE  │ ←─╯
 │ head       │       │       │ head       │
 │ link       │       ╰────── │ link       │
 └────────────┘               └────────────┘
```
It's a bit hard to see how this could happen, given that this implies that mvar->head, which was presumably allocated after mvar->head->link, lives in an older generation than mvar->head->link. That being said, I currently can't rule it out. I suspect the turn of events would be something like:
a thread waits on mvar, giving rise to qa (in nursery)
another thread waits on mvar, giving rise to qb (in nursery)
a GC is initiated
a nursery object (perhaps a TSO?) containing a reference to qa is scavenged, evacuating qa to gen0 (this is the critical step)
mvar is scavenged, evacuating qb to gen1; mvar is consequently marked as clean
qb is scavenged but qa has already been evacuated; consequently qb is added to the remembered set
GC finishes
Someone calls putMVar# mvar, popping qb off the head of mvar; now mvar contains a direct reference to qa in gen0 yet is marked as clean.
There are still a number of open questions in the above timeline (e.g. who, other than qb, would have a reference to qa?). Nevertheless, I am guessing that the failure mode looks something like that.
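For reference, the code path at issue is exercised whenever a putMVar wakes one of several queued takeMVar/readMVar waiters; a minimal sketch of that pattern (not the actual reproducer) looks like this:

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM_)

main :: IO ()
main = do
  mv   <- newEmptyMVar
  done <- newEmptyMVar
  -- Threads that call takeMVar while mv is empty are queued on the MVar as
  -- MVAR_TSO_QUEUE entries (the qa/qb objects in the diagram above).
  forM_ [1 :: Int .. 4] $ \_ -> forkIO $ takeMVar mv >>= putMVar done
  -- Each putMVar that finds waiters pops the queue head and wakes one of
  -- them: the stg_putMVarzh branch quoted above.
  forM_ [1 :: Int .. 4] $ \i -> putMVar mv i
  replicateM_ 4 (takeMVar done >>= print)
```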
Here's the sequence of events leading up to the corrupted state in my recording. (I don't know how to distinguish between nursery and gen 0; what feature of the block descriptor lets me tell these apart?)
Each pointer in the MVAR TSO queue is shown with a generation suffix, with 0 possibly also being nursery.
The operation is either "put", "take", or "gc", the latter meaning that the head/tail pointers don't match the output of the previous state because one or the other got moved. After a GC I only know the head/tail pointers; the "?" entries typically become apparent when exposed by a later "put", unless another GC happens too soon.
The letters [cd] in square brackets indicate the before/after state of the MVAR (clean or dirty).
and the MVAR was marked clean as a result, but then putMVarzh moved the head a few times, eventually landing on a gen 0 queue member with the MVAR still marked clean.
I'll run a couple more tests, and post a patch later today...