We have seen a few concerning crashes ($3085) on the AArch64 Darwin runners. They appear to share a common thread, all crashing in checkBlockingQueues, called by stg_marked_upd_frame. Looking at the code, I suspect we are missing a barrier in this area.
The problem appears to involve CurrentTSO->bq, which is only modified in two places:
1. in the messageBlackHole logic, where we add ourselves to the blocking queue of the blackhole'd thunk. This appears to have the correct release barrier on the update to tso->bq but doesn't appear to synchronize on the read from owner->bq.
2. in createThread during thread creation (unlikely to be problematic)
Given that Messages.c shows up in a few of the backtraces, it seems likely that (1) is the culprit.
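To make the suspected pattern concrete, here is a minimal sketch in portable C11 atomics of a release store paired with an acquire load on a queue pointer. The `TSO` and `BlockingQueue` types and the `publish_bq`/`read_bq` helpers are hypothetical, simplified stand-ins rather than the RTS definitions, and the sketch assumes a single writer at a time (for example, one holding a lock on the owner).

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real GHC RTS types. */
typedef struct BlockingQueue {
    struct BlockingQueue *link;
    int payload;
} BlockingQueue;

typedef struct TSO {
    _Atomic(BlockingQueue *) bq;
} TSO;

/* Writer: fully initialise the node, then publish it with a release
 * store so the initialisation is ordered before the pointer becomes
 * visible. Assumes one writer at a time. */
static void publish_bq(TSO *owner, BlockingQueue *node, int payload) {
    node->payload = payload;
    node->link = atomic_load_explicit(&owner->bq, memory_order_relaxed);
    atomic_store_explicit(&owner->bq, node, memory_order_release);
}

/* Reader: the acquire load pairs with the release store above. With a
 * relaxed load here, a weakly ordered CPU such as AArch64 could let the
 * reader observe the new pointer but stale node contents. */
static BlockingQueue *read_bq(TSO *owner) {
    return atomic_load_explicit(&owner->bq, memory_order_acquire);
}
```

If the read from owner->bq really does lack this kind of synchronization, the release on the write side alone would not be enough, since a release store only orders memory for readers that acquire.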
So I finally managed to get a pretty clear backtrace:
(lldb) run
Process 2074 launched: '/Users/administrator/ghc/_build/stage1/bin/haddock' (arm64)
Warning: 'Verbosity' is out of scope. If you qualify the identifier, haddock can try to link it anyway.
Warning: 'IOException' is out of scope. If you qualify the identifier, haddock can try to link it anyway.
Warning: 'decodeUtf8' is out of scope. If you qualify the identifier, haddock can try to link it anyway.
Warning: 'RawCommand' is out of scope. If you qualify the identifier, haddock can try to link it anyway.
Process 2074 stopped
* thread #18, name = 'ghc_worker', stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
    frame #0: 0x00000001052889d4 haddock`stg_BLACKHOLE_info + 20
haddock`stg_BLACKHOLE_info:
->  0x1052889d4 <+20>: ldr x15, [x17]
    0x1052889d8 <+24>: dmb sy
    0x1052889dc <+28>: adrp x14, 0
    0x1052889e0 <+32>: add x14, x14, #0x960 ; =0x960
Target 0: (haddock) stopped.
(lldb) disassemble
haddock`stg_BLACKHOLE_info:
    0x1052889c0 <+0>: mov x18, x22
    0x1052889c4 <+4>: ldr x17, [x18, #0x8]
    0x1052889c8 <+8>: dmb sy
    0x1052889cc <+12>: and x15, x17, #0x7
    0x1052889d0 <+16>: cbnz x15, 0x105288b30 ; <+368>
->  0x1052889d4 <+20>: ldr x15, [x17]
    0x1052889d8 <+24>: dmb sy
    0x1052889dc <+28>: adrp x14, 0
(lldb) print $x15
(unsigned long) $3 = 0
(lldb) x/8a $x18
0x7005ae3f78: 0x00000001052889c0 haddock`stg_BLACKHOLE_info
0x7005ae3f80: 0x0000007002ccb899
0x7005ae3f88: 0x0000007005ae3f51
0x7005ae3f90: 0x0000000104fac188 haddock`LsA4_info
(lldb) x/8a 0x0000007002ccb898
0x7002ccb898: 0x0000000103d61290 haddock`ghc_GHCziTypesziName_Name_con_info
0x7002ccb8a0: 0x0000007002cac159
0x7002ccb8a8: 0x0000007002cac169
This suggests that my original supposition of a missing memory barrier is correct since:
we ended up in stg_BLACKHOLE_info
but the indirectee was read to be NULL
and yet the indirectee is clearly now a valid object
However, the read clearly has the appropriate barrier (namely, the dmb at +8). This means that the missing barrier must be at the writer. Looking at the other threads' stacks reveals only one potentially relevant clue:
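For reference, the writer-side ordering that the reader's barrier relies on can be sketched as follows. This is an illustration of the general publication pattern in C11 atomics, not the RTS code: the `Closure` layout, the `INFO_*` constants, and both helper functions are hypothetical names made up for this example.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap closure; NOT the real RTS layout. */
typedef struct Closure {
    _Atomic(uintptr_t) info;   /* stand-in for the info pointer */
    void *indirectee;
} Closure;

enum { INFO_THUNK = 1, INFO_BLACKHOLE = 2 };

/* Writer: store the indirectee first, then publish the new info
 * pointer with release ordering. Without the release, another core
 * may observe INFO_BLACKHOLE before the indirectee store is visible. */
static void update_with_indirection(Closure *c, void *value) {
    c->indirectee = value;
    atomic_store_explicit(&c->info, INFO_BLACKHOLE, memory_order_release);
}

/* Reader: acquire-load the info pointer, then read the indirectee.
 * This plays the role of the load followed by dmb in the disassembly. */
static void *enter(Closure *c) {
    if (atomic_load_explicit(&c->info, memory_order_acquire) == INFO_BLACKHOLE)
        return c->indirectee;
    return NULL; /* not yet updated */
}
```

A reader that sees the blackhole info pointer but a NULL indirectee, as in the backtrace above, is the outcome this pairing is meant to rule out, which is why a correct reader points the finger at the writer.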
thread #16, name = 'ghc_worker'
    frame #0: 0x0000000105261184 haddock`threadPaused(cap=0x0000000109822600, tso=0x000000700480aa38) at ThreadPaused.c:0:17 [opt]
    frame #1: 0x0000000105289b18 haddock`stg_returnToSchedButFirst + 88
    frame #2: 0x000000010525af08 haddock`schedule(initialCapability=<unavailable>, task=0x0000000109822618) at Schedule.c:481:13 [opt]
Unfortunately, the line information and execution context seem to be unavailable for this frame.