Data race in GC time accounting
The ThreadSanitizer work (!2642) pointed out a rather nasty data race in our GC time accounting logic, which I believe is responsible for #17964.
Specifically, it reports races of the following form:
```
WARNING: ThreadSanitizer: data race (pid=17126)
  Read of size 8 at 0x7b50000004c0 by thread T4 (mutexes: write M18):
    #0 stat_endGC rts/Stats.c:510 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x587db)
    #1 GarbageCollect rts/sm/GC.c:962 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x76a8a)
    #2 scheduleDoGC rts/Schedule.c:1888 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x50d59)
    #3 schedule rts/Schedule.c:581 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x524f8)
    #4 scheduleWorker rts/Schedule.c:2671 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x54dde)
    #5 workerStart rts/Task.c:446 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x5c624)
    #6 <null> <null> (libtsan.so.0+0x2cb4d)

  Previous write of size 8 at 0x7b50000004c0 by thread T6:
    #0 stat_endGCWorker rts/Stats.c:417 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x5836a)
    #1 gcWorkerThread rts/sm/GC.c:1284 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x73c13)
    #2 yieldCapability rts/Capability.c:924 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x342ad)
    #3 scheduleYield rts/Schedule.c:716 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x51af1)
    #4 schedule rts/Schedule.c:317 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x51af1)
    #5 scheduleWorker rts/Schedule.c:2671 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x54dde)
    #6 workerStart rts/Task.c:446 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x5c624)
    #7 <null> <null> (libtsan.so.0+0x2cb4d)

  Location is heap block of size 512 at 0x7b5000000400 allocated by main thread:
    #0 malloc <null> (libtsan.so.0+0x2e1b3)
    #1 stgMallocBytes rts/RtsUtils.c:64 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x4cf9b)
    #2 initGcThreads rts/sm/GC.c:1075 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x730d5)
    #3 storageAddCapabilities rts/sm/Storage.c:296 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x888a2)
    #4 setNumCapabilities rts/Schedule.c:2338 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x54306)
    #5 <null> <null> (libHSghc-9.1.0.20200917-ghc9.1.0.20200917.so+0x2a28baa)
    #6 scheduleWaitThread rts/Schedule.c:2654 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x54d70)
    #7 rts_evalLazyIO rts/RtsAPI.c:530 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x47c53)
    #8 hs_main rts/RtsMain.c:72 (libHSrts-1.0_thr-ghc9.1.0.20200917.so+0x4b9a6)
    #9 main <null> (ghc+0x435b82)
```
The problem here is that the GC leader thread does not synchronize with the `gcWorkerThread`s after requesting that they terminate. This means that the leader (`T4`, in this case) may arrive in `stat_endGC`, which sums up the GC CPU times of all GC worker threads, before the worker threads have had a chance to call `stat_endGCWorker`.
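To make the missing happens-before edge concrete, here is a minimal, self-contained C sketch of the pattern; this is not the RTS code, and all names (`gc_worker`, `worker_cpu_ns`, `sum_worker_times`, `finished_workers`) are hypothetical stand-ins for `stat_endGCWorker`, the per-`gc_thread` timing fields, and `stat_endGC`. The point is that the leader must block until every worker has published its time before it sums:

```c
/* Hypothetical sketch of the missing synchronization; not the RTS code.
 * Compile with: cc -pthread sketch.c */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define N_WORKERS 4

static long worker_cpu_ns[N_WORKERS]; /* per-worker GC CPU time, cf. gc_thread */
static int finished_workers = 0;      /* workers that have reported; guarded by lock */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t all_done = PTHREAD_COND_INITIALIZER;

/* Analogue of stat_endGCWorker: each worker publishes its own time,
 * then announces that it has finished reporting. */
static void *gc_worker(void *arg)
{
    int i = (int)(intptr_t)arg;
    worker_cpu_ns[i] = 1000 * (i + 1);  /* stand-in for a measured CPU time */

    pthread_mutex_lock(&lock);
    if (++finished_workers == N_WORKERS)
        pthread_cond_signal(&all_done);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Analogue of stat_endGC: the leader waits for every worker's report
 * before summing. Reading worker_cpu_ns[] without this wait is exactly
 * the race in the trace above. */
static long sum_worker_times(void)
{
    pthread_mutex_lock(&lock);
    while (finished_workers < N_WORKERS)
        pthread_cond_wait(&all_done, &lock);
    pthread_mutex_unlock(&lock);

    long total = 0;
    for (int i = 0; i < N_WORKERS; i++)
        total += worker_cpu_ns[i];
    return total;
}

int main(void)
{
    pthread_t tids[N_WORKERS];
    for (int i = 0; i < N_WORKERS; i++)
        pthread_create(&tids[i], NULL, gc_worker, (void *)(intptr_t)i);

    printf("total GC CPU time: %ld ns\n", sum_worker_times());

    for (int i = 0; i < N_WORKERS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
```

The mutex/condition-variable pair here is just one way to establish the required ordering; an actual fix in the RTS could equally reuse its existing GC synchronization machinery, so long as the leader observes every worker's `stat_endGCWorker` before summing.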