Potential memory ordering issue in checkBlockingQueues

added Phighest Tbug backport needed:9.2 runtime crash + 1 deleted label

assigned to @bgamari

Unfortunately this issue is very hard to reproduce.

I believe that I have been able to reproduce this issue with TSAN:

ThreadSanitizer: reported 1 warnings
*** unexpected failure for conc070(ghci)
Actual stderr output differs from expected:
--- /dev/null   2021-07-12 09:54:50.852065245 -0400
+++ "/run/user/1000/ghctest-zx5jmx2r/test   spaces/testsuite/tests/concurrent/should_run/T5558.run/T5558.run.stderr.normalised" 2021-07-16 11:21:56.757800460 -0400
@@ -0,0 +1,85 @@
+==================
+WARNING: ThreadSanitizer: data race (pid=3720838)
+  Atomic read of size 8 at 0x558000709800 by thread T4:
+    #0 __tsan_atomic64_load <null> (libtsan.so.0+0x67489)
+    #1 blackHoleOwner rts/Messages.c:333 (T5558+0x7c4a9d)
+    #2 schedule rts/Schedule.c:517 (T5558+0x7713c3)
+    #3 scheduleWorker rts/Schedule.c:2668 (T5558+0x7731cf)
+    #4 workerStart rts/Task.c:445 (T5558+0x77a916)
+    #5 <null> <null> (libtsan.so.0+0x2e0b6)
+
+  Previous write of size 8 at 0x558000709800 by thread T6:
+    #0 stg_marked_upd_frame_info <null> (T5558+0x7b3201)
+    #1 scheduleWorker rts/Schedule.c:2668 (T5558+0x7731cf)
+    #2 workerStart rts/Task.c:445 (T5558+0x77a916)
+    #3 <null> <null> (libtsan.so.0+0x2e0b6)
+
+  Thread T4 (tid=3720843, running) created by thread T3 at:
+    #0 pthread_create <null> (libtsan.so.0+0x3055b)
+    #1 createOSThread rts/posix/OSThreads.c:166 (T5558+0x7a99cf)
+    #2 startWorkerTask rts/Task.c:497 (T5558+0x77b36a)
+    #3 releaseCapability_ rts/Capability.c:588 (T5558+0x761dd7)
+    #4 suspendThread rts/Schedule.c:2502 (T5558+0x772883)
+    #5 base_GHCziEventziEPoll_new10_info <null> (T5558+0x6afd5f)
+    #6 scheduleWorker rts/Schedule.c:2668 (T5558+0x7731cf)
+    #7 workerStart rts/Task.c:445 (T5558+0x77a916)
+    #8 <null> <null> (libtsan.so.0+0x2e0b6)
+
+  Thread T6 (tid=3720845, running) created by main thread at:
+    #0 pthread_create <null> (libtsan.so.0+0x3055b)
+    #1 createOSThread rts/posix/OSThreads.c:166 (T5558+0x7a99cf)
+    #2 startWorkerTask rts/Task.c:497 (T5558+0x77b36a)
+    #3 releaseCapability_ rts/Capability.c:588 (T5558+0x761dd7)
+    #4 yieldCapability rts/Capability.c:1007 (T5558+0x762392)
+    #5 scheduleYield rts/Schedule.c:705 (T5558+0x770862)
+    #6 schedule rts/Schedule.c:315 (T5558+0x770862)
+    #7 scheduleWaitThread rts/Schedule.c:2651 (T5558+0x77314f)
+    #8 rts_evalLazyIO rts/RtsAPI.c:566 (T5558+0x7c9d53)
+    #9 hs_main rts/RtsMain.c:72 (T5558+0x76a0c5)
+    #10 main <null> (T5558+0x415ec1)
+
+SUMMARY: ThreadSanitizer: data race (/nix/store/vran8acwir59772hj4vscr7zribvp7l5-gcc-9.3.0-lib/lib/libtsan.so.0+0x67489) in __tsan_atomic64_load
+==================
+==================
+WARNING: ThreadSanitizer: data race (pid=3720838)
+  Write of size 8 at 0x558000709800 by thread T6:
+    #0 stg_marked_upd_frame_info <null> (T5558+0x7b3201)
+    #1 scheduleWorker rts/Schedule.c:2668 (T5558+0x7731cf)
+    #2 workerStart rts/Task.c:445 (T5558+0x77a916)
+    #3 <null> <null> (libtsan.so.0+0x2e0b6)
+
+  Previous atomic read of size 8 at 0x558000709800 by thread T4:
+    #0 __tsan_atomic64_load <null> (libtsan.so.0+0x67489)
+    #1 messageBlackHole rts/Messages.c:171 (T5558+0x7c40b1)
+    #2 stg_BLACKHOLE_info <null> (T5558+0x7b126d)
+    #3 scheduleWorker rts/Schedule.c:2668 (T5558+0x7731cf)
+    #4 workerStart rts/Task.c:445 (T5558+0x77a916)
+    #5 <null> <null> (libtsan.so.0+0x2e0b6)
+
+  Thread T6 (tid=3720845, running) created by main thread at:
+    #0 pthread_create <null> (libtsan.so.0+0x3055b)
+    #1 createOSThread rts/posix/OSThreads.c:166 (T5558+0x7a99cf)
+    #2 startWorkerTask rts/Task.c:497 (T5558+0x77b36a)
+    #3 releaseCapability_ rts/Capability.c:588 (T5558+0x761dd7)
+    #4 yieldCapability rts/Capability.c:1007 (T5558+0x762392)
+    #5 scheduleYield rts/Schedule.c:705 (T5558+0x770862)
+    #6 schedule rts/Schedule.c:315 (T5558+0x770862)
+    #7 scheduleWaitThread rts/Schedule.c:2651 (T5558+0x77314f)
+    #8 rts_evalLazyIO rts/RtsAPI.c:566 (T5558+0x7c9d53)
+    #9 hs_main rts/RtsMain.c:72 (T5558+0x76a0c5)
+    #10 main <null> (T5558+0x415ec1)
+
+  Thread T4 (tid=3720843, running) created by thread T3 at:
+    #0 pthread_create <null> (libtsan.so.0+0x3055b)
+    #1 createOSThread rts/posix/OSThreads.c:166 (T5558+0x7a99cf)
+    #2 startWorkerTask rts/Task.c:497 (T5558+0x77b36a)
+    #3 releaseCapability_ rts/Capability.c:588 (T5558+0x761dd7)
+    #4 suspendThread rts/Schedule.c:2502 (T5558+0x772883)
+    #5 base_GHCziEventziEPoll_new10_info <null> (T5558+0x6afd5f)
+    #6 scheduleWorker rts/Schedule.c:2668 (T5558+0x7731cf)
+    #7 workerStart rts/Task.c:445 (T5558+0x77a916)
+    #8 <null> <null> (libtsan.so.0+0x2e0b6)
+
+SUMMARY: ThreadSanitizer: data race (/run/user/1000/ghctest-zx5jmx2r/test   spaces/testsuite/tests/concurrent/should_run/T5558.run/T5558+0x7b3201) in stg_marked_upd_frame_info

mentioned in issue #20127

mentioned in issue #20128

mentioned in issue #20129

So I finally managed to get a pretty clear backtrace:

(lldb) run                                                                                    
Process 2074 launched: '/Users/administrator/ghc/_build/stage1/bin/haddock' (arm64)           
Warning: 'Verbosity' is out of scope.                                                         
    If you qualify the identifier, haddock can try to link it anyway.                         
Warning: 'IOException' is out of scope.                                                       
    If you qualify the identifier, haddock can try to link it anyway.                         
Warning: 'decodeUtf8' is out of scope.                                                        
    If you qualify the identifier, haddock can try to link it anyway.                         
Warning: 'RawCommand' is out of scope.                                                        
    If you qualify the identifier, haddock can try to link it anyway.                         
Process 2074 stopped                                                                          
* thread #18, name = 'ghc_worker', stop reason = EXC_BAD_ACCESS (code=1, address=0x10)        
    frame #0: 0x00000001052889d4 haddock`stg_BLACKHOLE_info + 20                              
haddock`stg_BLACKHOLE_info:                                                                   
->  0x1052889d4 <+20>: ldr    x15, [x17]                                                      
    0x1052889d8 <+24>: dmb    sy                                                              
    0x1052889dc <+28>: adrp   x14, 0                                                          
    0x1052889e0 <+32>: add    x14, x14, #0x960          ; =0x960                              
Target 0: (haddock) stopped.  
(lldb) disassemble                                                                            
haddock`stg_BLACKHOLE_info:
    0x1052889c0 <+0>:   mov    x18, x22
    0x1052889c4 <+4>:   ldr    x17, [x18, #0x8]
    0x1052889c8 <+8>:   dmb    sy
    0x1052889cc <+12>:  and    x15, x17, #0x7
    0x1052889d0 <+16>:  cbnz   x15, 0x105288b30          ; <+368>
->  0x1052889d4 <+20>:  ldr    x15, [x17]
    0x1052889d8 <+24>:  dmb    sy
    0x1052889dc <+28>:  adrp   x14, 0
(lldb) print $x15
(unsigned long) $3 = 0
(lldb) x/8a $x18
0x7005ae3f78: 0x00000001052889c0 haddock`stg_BLACKHOLE_info
0x7005ae3f80: 0x0000007002ccb899
0x7005ae3f88: 0x0000007005ae3f51
0x7005ae3f90: 0x0000000104fac188 haddock`LsA4_info
(lldb) x/8a 0x0000007002ccb898
0x7002ccb898: 0x0000000103d61290 haddock`ghc_GHCziTypesziName_Name_con_info
0x7002ccb8a0: 0x0000007002cac159
0x7002ccb8a8: 0x0000007002cac169

This suggests that my original supposition of a missing memory barrier is correct since:

we ended up in stg_BLACKHOLE_info
but the indirectee was read to be NULL
and yet the indirectee is clearly now a valid object

However, the read clearly has the appropriate barrier (namely, the dmb at +8). This means that the missing barrier must be on the writer side. Looking at the other threads' stacks reveals only one potentially relevant clue:

  thread #16, name = 'ghc_worker'                                                                                                                                                            
    frame #0: 0x0000000105261184 haddock`threadPaused(cap=0x0000000109822600, tso=0x000000700480aa38) at ThreadPaused.c:0:17 [opt]                                                           
    frame #1: 0x0000000105289b18 haddock`stg_returnToSchedButFirst + 88                                                                                                                      
    frame #2: 0x000000010525af08 haddock`schedule(initialCapability=<unavailable>, task=0x0000000109822618) at Schedule.c:481:13 [opt]

Unfortunately, the line information and execution context seem to be unavailable for this frame.
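
For readers less familiar with the ordering requirement discussed above, here is a minimal C11 sketch of the writer/reader pairing involved. It is not the RTS code and not the fix that was eventually merged: `Closure`, `Value`, `THUNK_INFO`, `BLACKHOLE_INFO`, `init_thunk`, `publish_result` and `read_result` are all hypothetical names made up for illustration. The point is only that the reader-side barrier (the `dmb` after the load) is useless unless the writer also orders its store to the indirectee before the store that makes the closure look like a BLACKHOLE.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for RTS structures; not GHC's real definitions. */
typedef struct { long payload; } Value;

typedef struct {
    _Atomic(const void *) info;   /* identifies the kind of closure */
    _Atomic(Value *) indirectee;  /* only meaningful once info is BLACKHOLE_INFO */
} Closure;

/* Two distinct addresses used as illustrative "info pointers". */
static const char THUNK_INFO = 't';
static const char BLACKHOLE_INFO = 'b';

/* Start out as an unevaluated thunk with no indirectee. */
static void init_thunk(Closure *c) {
    atomic_init(&c->indirectee, NULL);
    atomic_init(&c->info, (const void *)&THUNK_INFO);
}

/* Writer side: fill in the indirectee first, then publish the new info
 * pointer with release ordering. With only a relaxed store here, another
 * core could observe the new info pointer while still seeing a NULL
 * indirectee. */
static void publish_result(Closure *c, Value *v) {
    assert(v != NULL);
    atomic_store_explicit(&c->indirectee, v, memory_order_relaxed);
    atomic_store_explicit(&c->info, (const void *)&BLACKHOLE_INFO,
                          memory_order_release);
}

/* Reader side: the acquire load of info pairs with the release store
 * above, so a reader that sees BLACKHOLE_INFO also sees the indirectee. */
static Value *read_result(Closure *c) {
    const void *info = atomic_load_explicit(&c->info, memory_order_acquire);
    if (info != (const void *)&BLACKHOLE_INFO) {
        return NULL; /* not published yet */
    }
    return atomic_load_explicit(&c->indirectee, memory_order_relaxed);
}
```

In a real program `publish_result` and `read_result` would run on different threads. The sketch assumes the reader checks the info pointer before touching the indirectee, which mirrors the shape of the symptom described above but says nothing about where in the RTS the writer-side ordering was actually missing.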

mentioned in commit 630d5a67

mentioned in commit e2c70181

mentioned in commit 1832676a

closed with merge request !6228 (merged)

mentioned in commit b6fa60a1

mentioned in commit fa632ef7

mentioned in commit 39cc88e7

mentioned in commit 44c9ebcb

This has been backported to %9.2.1 in 39cc88e7.

removed backport needed:9.2 label

mentioned in commit 1321b707

mentioned in commit d94243f7
