Suspicious data race in work-pushing logic
While working on !1603 (closed) I stumbled upon the following data race:
```
WARNING: ThreadSanitizer: data race (pid=27844)
  Atomic write of size 8 at 0x7b3000016030 by main thread:
    #0 __tsan_atomic64_store <null> (libtsan.so.0+0x000000061e52)
    #1 schedulePushWork rts/Schedule.c:814 (ffi014+0x000000736d0f)
    #2 schedule rts/Schedule.c:281 (ffi014+0x000000736d0f)
    #3 scheduleWaitThread rts/Schedule.c:2555 (ffi014+0x000000738efd)
    #4 rts_evalIO rts/RtsAPI.c:460 (ffi014+0x00000072cad1)
    #5 zdmainzdMainzdMainzumkFunc <null> (ffi014+0x000000413935)
    #6 scheduleWaitThread rts/Schedule.c:2555 (ffi014+0x000000738efd)
    #7 rts_evalLazyIO rts/RtsAPI.c:530 (ffi014+0x00000072cdc1)
    #8 hs_main rts/RtsMain.c:72 (ffi014+0x000000730355)
    #9 main <null> (ffi014+0x000000413a92)

  Previous read of size 8 at 0x7b3000016030 by thread T13 (mutexes: write M292306471914397696):
    #0 waitForWorkerCapability rts/Capability.c:656 (ffi014+0x000000726c40)
    #1 yieldCapability rts/Capability.c:926 (ffi014+0x000000726c40)
    #2 scheduleYield rts/Schedule.c:681 (ffi014+0x000000736862)
    #3 schedule rts/Schedule.c:295 (ffi014+0x000000736862)
    #4 scheduleWaitThread rts/Schedule.c:2555 (ffi014+0x000000738efd)
    #5 rts_evalIO rts/RtsAPI.c:460 (ffi014+0x00000072cad1)
    #6 zdmainzdMainzdMainzumkFunc <null> (ffi014+0x000000413935)
    #7 scheduleWaitThread rts/Schedule.c:2555 (ffi014+0x000000738efd)
    #8 rts_evalStableIO rts/RtsAPI.c:511 (ffi014+0x00000072cce7)
    #9 forkOS_createThreadWrapper rts/posix/OSThreads.c:224 (ffi014+0x000000768bf3)
    #10 <null> <null> (libtsan.so.0+0x000000028d5b)

  Location is heap block of size 192 at 0x7b3000015fc0 allocated by thread T13:
    #0 malloc <null> (libtsan.so.0+0x00000002b251)
    #1 stgMallocBytes rts/RtsUtils.c:64 (ffi014+0x00000073192b)
    #2 newTask rts/Task.c:209 (ffi014+0x00000073f87f)
    #3 getTask rts/Task.c:129 (ffi014+0x00000073fc66)
    #4 getTask rts/Task.c:317 (ffi014+0x000000740104)
    #5 newBoundTask rts/Task.c:309 (ffi014+0x000000740104)
    #6 rts_lock rts/RtsAPI.c:586 (ffi014+0x00000072ceb4)
    #7 forkOS_createThreadWrapper rts/posix/OSThreads.c:223 (ffi014+0x000000768bd7)
    #8 <null> <null> (libtsan.so.0+0x000000028d5b)

  Mutex M292306471914397696 is already destroyed.

  Thread T13 (tid=27889, running) created by main thread at:
    #0 pthread_create <null> (libtsan.so.0+0x00000002c010)
    #1 forkOS_createThread rts/posix/OSThreads.c:234 (ffi014+0x000000768f63)
    #2 base_ControlziConcurrent_forkOS1_info <null> (ffi014+0x000000416252)
    #3 scheduleWaitThread rts/Schedule.c:2555 (ffi014+0x000000738efd)
    #4 rts_evalLazyIO rts/RtsAPI.c:530 (ffi014+0x00000072cdc1)
    #5 hs_main rts/RtsMain.c:72 (ffi014+0x000000730355)
    #6 main <null> (ffi014+0x000000413a92)

SUMMARY: ThreadSanitizer: data race (/nix/store/c7hj2bk4aqgpb3q0h5xhq7lag0lq3jm7-gcc-7.4.0-lib/lib/libtsan.so.0+0x61e52) in __tsan_atomic64_store
```
The root cause appears to be the logic for migration of bound threads in `schedulePushWork`. Namely, we do the following:
```c
// We're going to walk through the run queue, migrating threads to other
// capabilities until we have only keep_threads left. We might
// encounter a thread that cannot be migrated, in which case we add it
// to the current run queue and decrement keep_threads.
for (t = cap->run_queue_hd, i = 0;
     t != END_TSO_QUEUE && n > keep_threads;
     t = next)
{
    next = t->_link;
    t->_link = END_TSO_QUEUE;

    // Should we keep this thread?
    if (t->bound == task->incall   // don't move my bound thread
        || tsoLocked(t)            // don't move a locked thread
       ) {
        // ... irrelevant
    }
    // Or migrate it?
    else {
        appendToRunQueue(free_caps[i],t);
        traceEventMigrateThread (cap, t, free_caps[i]->no);
        if (t->bound) {
            t->bound->task->cap = free_caps[i];   // <==== this is the suspicious write
        }
        t->cap = free_caps[i];

        n--;   // we have one fewer threads now
        i++;   // move on to the next free_cap
        if (i == n_free_caps) i = 0;
    }
}
```
In order to safely set `task->cap` we would first need to take `task->lock`: the racing read in `waitForWorkerCapability` is performed while holding that lock (note the mutex held by thread T13 in the trace above). However, we do not do so here.
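For concreteness, here is a minimal sketch of what a locked write could look like. This is hypothetical, not a vetted fix; in particular, whether `task->lock` may safely be acquired while holding the capability locks in `schedulePushWork` would need auditing:

```c
// Sketch only: publish the new capability to the bound task under
// task->lock, mirroring the reading side in waitForWorkerCapability.
if (t->bound) {
    Task *bound_task = t->bound->task;
    ACQUIRE_LOCK(&bound_task->lock);
    bound_task->cap = free_caps[i];
    RELEASE_LOCK(&bound_task->lock);
}
```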
My initial thought was that one could argue this is correct, since `t` is on the run queue of a capability which we own, and therefore `t->bound->task` cannot be running on any other capability. However, I do not believe that this is correct: there may very well be more than one TSO bound to a given task.
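To see how a single task can end up with more than one bound TSO, consider re-entrant foreign calls. The following is a hypothetical sketch; the names `c_reenter` and `hs_step` are invented for illustration:

```c
#include "HsFFI.h"

// Hypothetical Haskell-side declaration:
//   foreign export ccall hs_step :: IO ()
extern void hs_step(void);

// Hypothetical C function called from Haskell via
//   foreign import ccall safe "c_reenter" c_reenter :: IO ()
// The safe foreign call suspends the calling TSO, which remains bound
// to the current Task; re-entering Haskell below pushes a second
// InCall onto that same Task, binding a second TSO to it.
void c_reenter(void)
{
    hs_step();
}
```

At that point the suspended outer TSO and the running inner TSO are both bound to the same task, so knowing where one of a task's bound TSOs sits tells us nothing about what the task itself is currently doing.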