Intermittent segfaults observed in multi-threaded program (likely via BoundedChan) with GHC 8.10.2
[ TL;DR: A missing dirty_MVAR call in stg_putMVarzh resulted in an MVar incorrectly remaining clean even when it held the last reference to a TSO queue head in a younger generation. The queue head was subsequently moved by the GC, but the MVar's pointer was not updated. As a result the MVar's TSO queue was corrupted with a dangling pointer to an unexpected object or to plain garbage in memory. ]
Steps to reproduce
The crash occurs many tens of minutes into a run that processes tens of GB of data, so it is by no means immediate or easy to reproduce. It has happened a few times now. It is rather difficult to reproduce and is very sensitive to scheduler timing and workload. Things that made it more likely were:
Limiting the depth of the BoundedChan to 1, thus increasing inter-thread contention.
Limiting the nursery (allocation area) size with -A128k, making GCs more frequent.
Running on a bare-metal 16-core/32-thread machine, to get more effective concurrency.
A multi-layer pipeline of BoundedChans between source and sink (a rough sketch follows this list):
HTTPS or stdin
gunzip
group into chunks of 1k lines
parallel JSON parser/filter
output
Large dataset from internet-wide IP survey, compressed to multiple GB.
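For concreteness, the shape of the pipeline is roughly the following. This is only an illustrative sketch (it assumes the BoundedChan package's Control.Concurrent.BoundedChan API, uses stdin in place of the HTTPS download and streaming gunzip, and elides the real JSON parsing and filtering):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.BoundedChan (newBoundedChan, readChan, writeChan)
import Control.Monad (forever, replicateM, replicateM_)
import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
  lineCh  <- newBoundedChan 1   -- depth 1 maximises inter-thread contention
  chunkCh <- newBoundedChan 1
  outCh   <- newBoundedChan 1
  -- source: stdin stands in for the HTTPS download + streaming gunzip (EOF handling elided)
  _ <- forkIO $ forever $ B.getLine >>= writeChan lineCh
  -- group the stream into chunks of 1k lines
  _ <- forkIO $ forever $ do
         chunk <- replicateM 1000 (readChan lineCh)
         writeChan chunkCh chunk
  -- parallel parser/filter workers (the real code parses and filters JSON here)
  replicateM_ 8 $ forkIO $ forever $ do
         chunk <- readChan chunkCh
         writeChan outCh (B.unlines chunk)
  -- output sink
  forever $ readChan outCh >>= B.putStr
```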
internal error: ASSERTION FAILED: file rts/Stats.c, line 502
The stack trace is more useful now:
```
(gdb) bt
#0  0x00007f6f6dc1e625 in raise () from /lib64/libc.so.6
#1  0x00007f6f6dc078d9 in abort () from /lib64/libc.so.6
#2  0x00000000019f1c5f in rtsFatalInternalErrorFn (s=0x1b0da30 "ASSERTION FAILED: file %s, line %u\n", ap=0x7f6f17ffe8a8) at rts/RtsMessages.c:186
#3  0x00000000019f1891 in barf (s=0x1b0da30 "ASSERTION FAILED: file %s, line %u\n") at rts/RtsMessages.c:48
#4  0x00000000019f18f4 in _assertFail (filename=0x1b11cd9 "rts/Stats.c", linenum=502) at rts/RtsMessages.c:63
#5  0x00000000019fd3f6 in stat_endGC (cap=0x273b760, initiating_gct=0x2834990, live=1064162, copied=1009, slop=194846, gen=0, par_n_threads=16, gc_threads=0x2833650, par_max_copied=153, par_balanced_copied=204, gc_spin_spin=33332578664, gc_spin_yield=33022849, mut_spin_spin=42068217183, mut_spin_yield=41736854, any_work=885, no_work=885, scav_find_work=28) at rts/Stats.c:502
#6  0x0000000001a1357d in GarbageCollect (collect_gen=0, do_heap_census=false, deadlock_detect=false, gc_type=2, cap=0x273b760, idle_cap=0x7f6f04000bb0) at rts/sm/GC.c:941
#7  0x00000000019f8e87 in scheduleDoGC (pcap=0x7f6f17ffee50, task=0x7f6f58000bb0, force_major=false, deadlock_detect=false) at rts/Schedule.c:1836
#8  0x00000000019f71de in schedule (initialCapability=0x273b760, task=0x7f6f58000bb0) at rts/Schedule.c:562
#9  0x00000000019fa10f in scheduleWorker (cap=0x273b760, task=0x7f6f58000bb0) at rts/Schedule.c:2600
#10 0x0000000001a00cf1 in workerStart (task=0x7f6f58000bb0) at rts/Task.c:445
#11 0x00007f6f6fd0b4e2 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f6f6dce36c3 in clone () from /lib64/libc.so.6
```
The failed assertion is:
```
500     for (unsigned int i=0; i < par_n_threads; i++) {
501         gc_thread *gct = gc_threads[i];
502         ASSERT(gct->gc_end_cpu >= gct->gc_start_cpu);
503         stats.gc.cpu_ns += gct->gc_end_cpu - gct->gc_start_cpu;
504     }
```
```
(gdb) p gct->gc_end_cpu - gct->gc_start_cpu
$7 = 2616009
```
So it is rather mysterious how the assertion managed to fail...
[ Unless some other thread was modifying these values at the same time, and the core file contains values that are not exactly what the pointers held at the time of the comparison. ]
Disabling statistics, I finally get a more meaningful core:
internal error: ASSERTION FAILED: file rts/PrimOps.cmm, line 1824
With backtrace:
```
#0  0x00007f4379081625 in raise () from /lib64/libc.so.6
#1  0x00007f437906a8d9 in abort () from /lib64/libc.so.6
#2  0x00000000019f1c5f in rtsFatalInternalErrorFn (s=0x1b0da30 "ASSERTION FAILED: file %s, line %u\n", ap=0x7f43497f5d08) at rts/RtsMessages.c:186
#3  0x00000000019f1891 in barf (s=0x1b0da30 "ASSERTION FAILED: file %s, line %u\n") at rts/RtsMessages.c:48
#4  0x00000000019f18f4 in _assertFail (filename=0x1ace29c "rts/PrimOps.cmm", linenum=1824) at rts/RtsMessages.c:63
#5  0x0000000001a34afc in stg_putMVarzh ()
```
And the assertion in question is:
ASSERT(StgTSO_block_info(tso) == mvar);
Sadly, I have no symbols for stg_putMVarzh. Is there a way to build the RTS so that this function also has debug symbols, so that I can look at its local variables?
Compiling with -debug and running with -DS, I get a new segfault:
```
#0  0x0000000001a1ff1a in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:246
#1  0x0000000001a1ff69 in LOOKS_LIKE_INFO_PTR (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:251
#2  0x0000000001a1ffa1 in LOOKS_LIKE_CLOSURE_PTR (p=0x4201cd9268) at includes/rts/storage/ClosureMacros.h:256
#3  0x0000000001a2096a in checkClosure (p=0x4200012000) at rts/sm/Sanity.c:346
#4  0x0000000001a20dc4 in checkHeapChain (bd=0x4200000480) at rts/sm/Sanity.c:473
#5  0x0000000001a21a97 in checkGeneration (gen=0x3066c30, after_major_gc=true) at rts/sm/Sanity.c:845
#6  0x0000000001a21b6d in checkFullHeap (after_major_gc=true) at rts/sm/Sanity.c:864
#7  0x0000000001a21be8 in checkSanity (after_gc=true, major_gc=true) at rts/sm/Sanity.c:873
#8  0x0000000001a1315b in GarbageCollect (collect_gen=1, do_heap_census=false, deadlock_detect=false, gc_type=2, cap=0x2fb10a0, idle_cap=0x7f6d3c000bb0) at rts/sm/GC.c:850
#9  0x00000000019f8e87 in scheduleDoGC (pcap=0x7f6d75ffae50, task=0x7f6d90000bb0, force_major=false, deadlock_detect=false) at rts/Schedule.c:1836
#10 0x00000000019f71de in schedule (initialCapability=0x2fb10a0, task=0x7f6d90000bb0) at rts/Schedule.c:562
#11 0x00000000019fa10f in scheduleWorker (cap=0x2fb10a0, task=0x7f6d90000bb0) at rts/Schedule.c:2600
#12 0x0000000001a00cf1 in workerStart (task=0x7f6d90000bb0) at rts/Task.c:445
#13 0x00007f6da988e4e2 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f6da78666c3 in clone () from /lib64/libc.so.6
```
It looks like increasing -A makes the crash less likely (or at least take longer to manifest). I've simplified the code to eliminate the most CPU-intensive parts, and now the crash happens sooner, and fairly reliably with -DS and the default -A. A run with +RTS -A32m -n2m -DS -N12 is still going and hasn't crashed yet, but it takes many more minutes to crunch through the full dataset (roughly 10 minutes to decompress and stream the 4.5 GB of compressed data).
@trac-vdukhovni and I chatted about this issue this evening. Happily, I now have the code and can reproduce the issue locally under rr. Also, @trac-vdukhovni has confirmed that he has seen the crash under GHC 8.8 as well.
Currently I am looking at an rr trace which fails with the sanity-checker error:
```
>>> bt
#0  0x0000000001844cb3 in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:246
#1  0x0000000001844d02 in LOOKS_LIKE_INFO_PTR (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:251
#2  0x0000000001844d3a in LOOKS_LIKE_CLOSURE_PTR (p=0x4205e98000) at includes/rts/storage/ClosureMacros.h:256
#3  0x0000000001845703 in checkClosure (p=0x420434c270) at rts/sm/Sanity.c:346
#4  0x0000000001845b5d in checkHeapChain (bd=0x4204301300) at rts/sm/Sanity.c:473
#5  0x0000000001846874 in checkGeneration (gen=0x39dcdb0, after_major_gc=true) at rts/sm/Sanity.c:849
#6  0x0000000001846906 in checkFullHeap (after_major_gc=true) at rts/sm/Sanity.c:864
#7  0x0000000001846981 in checkSanity (after_gc=true, major_gc=true) at rts/sm/Sanity.c:873
#8  0x0000000001837bd2 in GarbageCollect (collect_gen=1, do_heap_census=false, deadlock_detect=false, gc_type=2, cap=0x39cb490, idle_cap=0x748230000bb0) at rts/sm/GC.c:850
#9  0x000000000181cf4f in scheduleDoGC (pcap=0x2b38137cae68, task=0x5d1d78000bb0, force_major=false, deadlock_detect=false) at rts/Schedule.c:1836
#10 0x000000000181b10f in schedule (initialCapability=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:562
#11 0x000000000181e27c in scheduleWorker (cap=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:2600
#12 0x000000000182508b in workerStart (task=0x5d1d78000bb0) at rts/Task.c:445
#13 0x00007fbe9a8efedd in start_thread () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libpthread.so.0
#14 0x00000000700f9aaf in clone () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libc.so.6
```
This occurred at the end of a major GC.
The p here from frame 3 is a weak pointer with this reachability graph at the beginning of the collection:
However, it appears that this weak pointer is the result of evacuation during the current GC, since if I move back to the beginning of the collection the weak's memory is all 0xaa bytes.
Furthermore, if I set a watchpoint on $weak->header.info I find that it was written by copy_tag, which is copying from a closure:
```
>>> bt
#0  0x0000000001875cae in copy_tag (p=0x420434c268, info=0x185b570 <stg_WEAK_info>, src=0x4205e98348, size=6, gen_no=1, tag=0) at rts/sm/Evac.c:147
#1  0x0000000001875f4d in copy (p=0x420434c268, info=0x185b570 <stg_WEAK_info>, src=0x4205e98348, size=6, gen_no=1) at rts/sm/Evac.c:286
#2  0x0000000001876ca2 in evacuate (p=0x420434c268) at rts/sm/Evac.c:858
#3  0x000000000183c623 in markWeakPtrList () at rts/sm/MarkWeak.c:431
#4  0x0000000001836b0b in GarbageCollect (collect_gen=1, do_heap_census=false, deadlock_detect=false, gc_type=2, cap=0x39cb490, idle_cap=0x748230000bb0) at rts/sm/GC.c:430
#5  0x000000000181cf4f in scheduleDoGC (pcap=0x2b38137cae68, task=0x5d1d78000bb0, force_major=false, deadlock_detect=false) at rts/Schedule.c:1836
#6  0x000000000181b10f in schedule (initialCapability=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:562
#7  0x000000000181e27c in scheduleWorker (cap=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:2600
#8  0x000000000182508b in workerStart (task=0x5d1d78000bb0) at rts/Task.c:445
#9  0x00007fbe9a8efedd in start_thread () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libpthread.so.0
#10 0x00000000700f9aaf in clone () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libc.so.6
>>> print $src_weak[0]
$12 = { header = { info = 0x185b570 <stg_WEAK_info> }, cfinalizers = 0x1cada60, key = 0x4205e98000, value = 0x42044d4589, finalizer = 0x1cada60, link = 0x0}
>>> print $src_weak[0].key[0]
$14 = { header = { info = 0x185b510 <stg_TSO_info> }, payload = 0x4205e98008}
>>> x/4a 0x42044d4588
0x42044d4588:   0x15247c8 <base_GHCziConcziSync_ThreadId_con_info>     0x4205e98000
0x42044d4598:   0x1694198 <base_GHCziSTRef_STRef_con_info>     0x4200fae628
```
Note that the payload of the ThreadId constructor is the same TSO that the weak is keyed on.
So somehow we are evacuating a weak pointer yet not scavenging it.
Confirmed that we do not touch $weak->key again after evacuation until the sanity check.
The block that we allocate into was allocated while evacuating yet another weak pointer:
```
Thread 20 hit Hardware access (read/write) watchpoint 5: -location $bd->flags
Old value = 1
New value = 0
alloc_todo_block (ws=0x39e6730, size=6) at rts/sm/GCUtils.c:334
334         bd->flags = BF_EVACUATED;
>>> bt
#0  alloc_todo_block (ws=0x39e6730, size=6) at rts/sm/GCUtils.c:334
#1  0x000000000183af41 in todo_block_full (size=6, ws=0x39e6730) at rts/sm/GCUtils.c:298
#2  0x0000000001875c25 in alloc_for_copy (size=6, gen_no=1) at rts/sm/Evac.c:125
#3  0x0000000001875c9a in copy_tag (p=0x39dce20, info=0x185b570 <stg_WEAK_info>, src=0x4200fac1e0, size=6, gen_no=1, tag=0) at rts/sm/Evac.c:144
#4  0x0000000001875f4d in copy (p=0x39dce20, info=0x185b570 <stg_WEAK_info>, src=0x4200fac1e0, size=6, gen_no=1) at rts/sm/Evac.c:286
#5  0x0000000001876ca2 in evacuate (p=0x39dce20) at rts/sm/Evac.c:858
#6  0x000000000183c623 in markWeakPtrList () at rts/sm/MarkWeak.c:431
#7  0x0000000001836b0b in GarbageCollect (collect_gen=1, do_heap_census=false, deadlock_detect=false, gc_type=2, cap=0x39cb490, idle_cap=0x748230000bb0) at rts/sm/GC.c:430
#8  0x000000000181cf4f in scheduleDoGC (pcap=0x2b38137cae68, task=0x5d1d78000bb0, force_major=false, deadlock_detect=false) at rts/Schedule.c:1836
#9  0x000000000181b10f in schedule (initialCapability=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:562
#10 0x000000000181e27c in scheduleWorker (cap=0x39cb490, task=0x5d1d78000bb0) at rts/Schedule.c:2600
#11 0x000000000182508b in workerStart (task=0x5d1d78000bb0) at rts/Task.c:445
#12 0x00007fbe9a8efedd in start_thread () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libpthread.so.0
#13 0x00000000700f9aaf in clone () from /nix/store/xg6ilb9g9zhi2zg1dpi4zcp288rhnvns-glibc-2.30/lib/libc.so.6
```
Note how the closure at $weak[0]->key is now a forwarding pointer.
However, the closure layout of stg_WEAK_info declares key to be a non-pointer (since it's a weak reference), so we do not evacuate this field.
We eventually finish scavenging.
We enter checkSanity and eventually stumble into the bad weak pointer.
My recollection here is that traverseWeakPtrList (called after each round of scavenging) should have evacuated $weak->key. It's unclear why it didn't. Perhaps $weak is somehow no longer on weak_ptr_list?
Between steps 3 and 4 there is another event: another thread (thread 17) steps in and evacuates $src_weak again (to 0x42076a7388). That is, there is a race to evacuate $src_weak, resulting in two copies. At the end of collection one of these copies (the one on weak_ptr_list) will be valid (since its key will be evacuated by traverseWeakPtrList) and the other will be invalid but unreachable. Nevertheless, checkSanity stumbles over the latter and crashes. This particular crash is therefore sadly spurious. I fix its cause in !4408 (closed).
@trac-vdukhovni and I again spoke on the phone. Unfortunately, with the final version of !4408 (closed) I am unable to reproduce the crash with or without sanity-checking enabled; I've tried only a few times, but it seems likely that the issue is now merely hidden. @trac-vdukhovni, perhaps you could look into expanding the testcase again?
@trac-vdukhovni and I have been communicating via email, where there have been some further interesting revelations. He observed this crash (on a patched tree with additional debug information):
```
#0  0x00007fe3270fa625 in raise () from /lib64/libc.so.6
#1  0x00007fe3270e38d9 in abort () from /lib64/libc.so.6
#2  0x00000000019f506b in rtsFatalInternalErrorFn (s=0x1aa7a65 "he's dead, jim.\n", ap=0x7fe2d3ffad08) at rts/RtsMessages.c:196
#3  0x00000000019f518d in barf (s=s@entry=0x1aa7a65 "he's dead, jim.\n") at rts/RtsMessages.c:58
#4  0x00000000019f52d3 in putMVarError (mvar=0x4201a10880, tso=0x4200e2d010) at rts/RtsMessages.c:50
#5  0x0000000001a1896b in stg_putMVarzh ()
#6  0x0000000000000000 in ?? ()
(gdb) fr 4
#4  0x00000000019f52d3 in putMVarError (mvar=0x4201a10880, tso=0x4200e2d010) at rts/RtsMessages.c:50
50          barf("he's dead, jim.\n");
(gdb) l
45      {
46          errorBelch("putMVarzh: uh oh, blocked mismatch\n");
47          errorBelch("  tso->why_blocked=%d\n", tso->why_blocked);
48          errorBelch("  tso->block_info=%p\n", tso->block_info);
49          errorBelch("  mvar=%p\n", mvar);
50          barf("he's dead, jim.\n");
51      }
52
53      void
54      barf(const char*s, ...)
(gdb) p *tso
$1 = {header = {info = 0x3a22706d61747365}, _link = 0x3235393230363122, global_link = 0x616e222c22363833, stackobj = 0x797370223a22656d, what_next = 26723, why_blocked = 29807, flags = 1634887016, block_info = {closure = 0x736b726f772d7970, prev = 0x736b726f772d7970, bh = 0x736b726f772d7970, throwto = 0x736b726f772d7970, wakeup = 0x736b726f772d7970, fd = 8316866959936158064}, id = 1952804398, saved_errno = 1948396578, dirty = 577073273, bound = 0x6c6176222c226161, cap = 0x303632223a226575, trec = 0x333a303037343a36, blocked_exceptions = 0x3138363a3a373330, bq = 0x7d22656263353a66, alloc_limit = 8315172586696899338, tot_stack_size = 1886216564}
(gdb) p *mvar
$2 = {header = {info = 0x1a19958 <stg_WHITEHOLE_info>}, head = 0x420188d6a3, tail = 0x4200c36678, value = 0x1e9e850}
```
That TSO doesn't look right at all. Reinterpreting tso as ASCII characters reveals that someone scribbled the JSON data his program is working with over the TSO.
Furthermore, looking at the backtraces of the other threads, we find a few interesting cases:
```
(gdb) thread apply all backtrace
...
Thread 37 (Thread 0x7fe2b97fa700 (LWP 2932832)):
#0  reallyLockClosure (p=<optimized out>) at rts/SMPClosureOps.h:63
#1  0x0000000001a1853c in stg_takeMVarzh ()
#2  0x0000000000000000 in ?? ()
...
Thread 25 (Thread 0x7fe2bb7fe700 (LWP 2932829)):
#0  0x00000000009144c7 in cryptonite_aesni_encrypt_block256 ()
#1  0x00000000008ddd6b in cryptonite_aes_generic_gcm_decrypt ()
#2  0x0000000000884e39 in ?? ()
#3  0x0000000000000000 in ?? ()
...
Thread 1 (Thread 0x7fe2d3fff700 (LWP 2932820)):
#0  0x00007fe3270fa625 in raise () from /lib64/libc.so.6
#1  0x00007fe3270e38d9 in abort () from /lib64/libc.so.6
#2  0x00000000019f506b in rtsFatalInternalErrorFn (s=0x1aa7a65 "he's dead, jim.\n", ap=0x7fe2d3ffad08) at rts/RtsMessages.c:196
#3  0x00000000019f518d in barf (s=s@entry=0x1aa7a65 "he's dead, jim.\n") at rts/RtsMessages.c:58
#4  0x00000000019f52d3 in putMVarError (mvar=0x4201a10880, tso=0x4200e2d010) at rts/RtsMessages.c:50
#5  0x0000000001a1896b in stg_putMVarzh ()
#6  0x0000000000000000 in ?? ()
```
So we certainly have foreign calls at play here. Moreover, cryptonite makes quite heavy use of foreign buffers, suggesting that this may either be a library bug or another manifestation of #17760 (closed).
> Moreover, cryptonite makes quite heavy use of foreign buffers, suggesting that this may either be a library bug or another manifestation of #17760 (closed).
Marking withForeignPtr as NOINLINE could help dismiss this hypothesis.
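For instance, one low-effort way to test that hypothesis without rebuilding base is to route the program's uses through a NOINLINE wrapper. This is only a sketch (the wrapper name is made up), but it should have much the same effect as marking withForeignPtr itself NOINLINE: the optimiser never sees the continuation at the call site, so it cannot drop the keep-alive touch that #17760 is about.

```haskell
import Foreign.ForeignPtr (ForeignPtr, withForeignPtr)
import Foreign.Ptr (Ptr)

-- Hypothetical wrapper: because it is NOINLINE, call sites never expose the
-- continuation to the optimiser, so the touch inside withForeignPtr that
-- keeps the buffer alive cannot be optimised away or reordered past the last
-- use of the raw Ptr.
{-# NOINLINE withForeignPtrNoinline #-}
withForeignPtrNoinline :: ForeignPtr a -> (Ptr a -> IO b) -> IO b
withForeignPtrNoinline = withForeignPtr
```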
Yes, I do use HTTPS sometimes, presumably in this trace, as there's no other library that comes to mind that would be decrypting data. However, the data is also compressed, and what we're seeing scribbled over the TSO is text that I'd have expected to see compressed (some bits recur in many lines and some is new in each line), so that suggests that cryptonite did not directly scribble that text into memory it did not own.
The other foreign calls are primarily in streaming-bytestring and bytestring, mostly reading bytestring payloads, but also constructing bytestrings via builders...
> Yes, I do use HTTPS sometimes, presumably in this trace, as there's no other library that comes to mind that would be decrypting data. However, the data is also compressed, and what we're seeing scribbled over the TSO is text that I'd have expected to see compressed (some bits recur in many lines and some is new in each line), so that suggests that cryptonite did not directly scribble that text into memory it did not own.
Yes, this is a fair point. This does indeed suggest that cryptonite isn't at fault.
Unfortunately I have not been able to reproduce the crash on the full program you sent. @trac-vdukhovni, could you describe how you are reproducing this?
So either the TSO is still live and its data was stepped on by some foreign call, or alternatively the TSO is no longer live and its memory has been reused for live data. We'll have to figure out which... I'll keep running tests to see if I can narrow this down. The data in question came in compressed via HTTPS, was decompressed, and then chunks of 1k lines are streamed into bounded channels, parsed as JSON in multiple threads, and filtered, with the filter output concatenated via bytestring builders; that output is then streamed to an output bounded channel and consumed for printing. If there is unsafe memory into which uncompressed raw JSON data is bleeding, it would have to be in or after the streaming gunzip, and before the JSON data is parsed to extract just the desired "key" (we're seeing raw JSON in *tso).
[ The ARR_WORDS holds a DNS domain name, not a URL, but that's not terribly important. ] The domain name in question was almost certainly created via a ByteString builder, executed to yield a strict ByteString, without trimming:
The pinned byte array length was 0x80, and it contained a single domain name, which in this pipeline is most of the time a slice of a larger buffer, but late in the process is converted to a standalone bytestring via the above functions.
So if there is a safety issue it would be in toLazyByteStringWith; otherwise it is in the GC code. Once the domain name is built, it is tested for relevance (read-only operations) and then added back to an output stream (converted back to a builder that concatenates many names, with a more generous buffer allocation).
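Roughly along these lines; this is a sketch rather than the exact code from the program (the helper name is made up, and the 0x80-byte untrimmed first buffer is inferred from the pinned array length noted below):

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.ByteString.Builder (Builder)
import Data.ByteString.Builder.Extra (toLazyByteStringWith, untrimmedStrategy)

-- Hypothetical helper: run a small Builder (a single domain name) to a strict
-- ByteString using an untrimmed 0x80-byte first chunk, so the resulting
-- pinned buffer keeps its full 0x80 length even when the name is shorter.
builderToStrict :: Builder -> B.ByteString
builderToStrict =
  BL.toStrict . toLazyByteStringWith (untrimmedStrategy 0x80 0x80) BL.empty
```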
This has been quite a challenge to get traction on; unfortunately I have been quite unable to reproduce the issue under rr. From my runs under gdb it appears that the issue is either:
the MVar's head MVAR_TSO_QUEUE being reclaimed and overwritten despite being live, or
another thread writing a bogus pointer to the MVar's head.
Unfortunately without rr it is quite tricky to distinguish these two cases.
Hmm, although now I see another failure mode: no obvious corruption, but a case where we end up in putMVarzh with mvar->head->tso->block_info != mvar; both the LHS and the RHS look valid, however.
While I don't have a concrete theory for how this program would trigger #17760 (closed), I felt it would be negligent to continue down the long road of gdb debugging without trying the obvious NOINLINE workaround on withForeignPtr. To my surprise, a run of the program on the 23 GB dataset finished successfully with this workaround. Of course, this doesn't necessarily mean much: it is possible that it passed due to changes in timing. Nevertheless, I'm going to leave this running in a loop overnight to see whether I can tease out a crash.
Well, the program ran 12 times without a crash over the 10.75 hours that I was away from the keyboard. However, now I'm having trouble reproducing the issue even without the withForeignPtr change. Sigh.
So I managed to get things building with !4347 (closed) and sadly it appears to still crash with the same failure mode. I suspect that adding a NOINLINE to withForeignPtr just changed simplification/timing enough to hide the bug.
Unfortunately I think I'm going to need to put this aside for now. While I hate to release %9.0.1 with a known soundness issue, this appears not to be a recent regression, and it has been extremely resistant to reproduction under rr, making debugging quite difficult. @trac-vdukhovni says he has a smaller reproducer, which I'm hoping he will be able to post, but I'm afraid it doesn't make sense to continue trying to diagnose the current reproducer until after %9.0.1 is out.
```
    // There are readMVar/takeMVar(s) waiting: wake up the first one
    tso = StgMVarTSOQueue_tso(q);
    StgMVar_head(mvar) = StgMVarTSOQueue_link(q);
    if (StgMVar_head(mvar) == stg_END_TSO_QUEUE_closure) {
        StgMVar_tail(mvar) = stg_END_TSO_QUEUE_closure;
    }
```
Note how we mutate head yet we never actually mark the MVar as dirty. @trac-vdukhovni says that dirtying the MVar resolves the crash in his case.
While it's plausible that this write would require marking the MVar as dirty, I can also think of an argument why it may not be necessary: the write makes no closures reachable from the MVar that weren't already reachable before. Indeed this is why I concluded that the code was correct in my audit last week.
That being said, I believe the above argument is very subtly incorrect: Dirtiness only reflects whether any objects in a younger generation are reachable from the MVar. This means that the MVar itself may be (correctly) marked as clean before the mutation (since mvar->head lives in the old generation), yet then needs to be dirtied after popping the head (since mvar->head->link lives in a younger generation). That is:
```
 Generation 0                 Generation 1

                              mvar
                              ┌────────────┐
                              │ MVAR       │
                              │ head       │ ──╮
                              └────────────┘   │
                                               │
 qa                           qb               │
 ┌────────────┐               ┌────────────┐   │
 │ TSO_QUEUE  │ ←─────╮       │ TSO_QUEUE  │ ←─╯
 │ head       │       │       │ head       │
 │ link       │       ╰────── │ link       │
 └────────────┘               └────────────┘
```
It's a bit hard to see how this could happen, given that this implies that mvar->head, which was presumably allocated after mvar->head->link, lives in an older generation than mvar->head->link. That being said, I currently can't rule it out. I suspect the turn of events would be something like:
a thread waits on mvar, giving rise to qa (in nursery)
another thread waits on mvar, giving rise to qb (in nursery)
a GC is initiated
a nursery object (perhaps a TSO?) containing a reference to qa is scavenged, evacuating qa to gen0 (this is the critical step)
mvar is scavenged, evacuating qb to gen1; mvar is consequently marked as clean
qb is scavenged but qa has already been evacuated; consequently qb is added to the remembered set
GC finishes
Someone calls putMVar# mvar, popping qb off the head of mvar; now mvar contains a direct reference to qa in gen0 yet is marked as clean.
There are still a number of open questions in the above timeline (e.g. who, other than qb, would have a reference to qa?). Nevertheless, I am guessing that the failure mode looks something like that.
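For reference, the code path at issue is exercised whenever a putMVar wakes one of several queued takeMVar/readMVar waiters; a minimal sketch of that pattern (not the actual reproducer) looks like this:

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM_)

main :: IO ()
main = do
  mv   <- newEmptyMVar
  done <- newEmptyMVar
  -- Threads that call takeMVar while mv is empty are queued on the MVar as
  -- MVAR_TSO_QUEUE entries (the qa/qb objects in the diagram above).
  forM_ [1 :: Int .. 4] $ \_ -> forkIO $ takeMVar mv >>= putMVar done
  -- Each putMVar that finds waiters pops the queue head and wakes one of
  -- them: the stg_putMVarzh branch quoted above.
  forM_ [1 :: Int .. 4] $ \i -> putMVar mv i
  replicateM_ 4 (takeMVar done >>= print)
```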
Here's the sequence of events leading up to the corrupted state in my recording. (I don't know how to distinguish between nursery and gen 0; what feature of the block descriptor lets me tell these apart?)
Each pointer in the MVAR TSO queue is shown with a generation suffix, with 0 possibly also being nursery.
The operation is either "put", "take", or "gc", the latter meaning that the head/tail pointers don't match the output of the previous state because one or the other got moved. After a GC I only know the head/tail pointers; the "?" entries typically become apparent when exposed by a later "put", unless another GC happens too soon.
The letters [cd] in square brackets indicate the before/after state of the MVAR (clean or dirty).
and the MVAR was marked clean as a result, but then putMVarzh moved the head a few times, eventually landing on a gen 0 queue member with the MVAR still marked clean.
I'll run a couple more tests, and post a patch later today...