shutdownCapability sometimes loops indefinitely on OSX after forkProcess
The attached Haskell program is a stress test for forkProcess. It starts 100 child processes, each of which makes a single safe FFI call, after which the main process waits for all child processes to terminate.
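For readers without the attachment, the overall shape of the test (fork many children, have each make one call and exit, then wait for all of them) looks roughly like the following plain-C analogue. This is only a sketch: it is not the attached program and does not involve the GHC RTS, so it will not reproduce the hang. The names `child_work` and `fork_and_wait` are hypothetical.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the single foreign call each child makes. */
static int child_work(void)
{
    return 0;
}

/* Fork n children, let each make one call and exit, then wait for all of
 * them. Returns the number of children that exited with status 0, or -1
 * if a fork or waitpid call failed. */
int fork_and_wait(int n)
{
    if (n <= 0) return 0;

    pid_t *pids = malloc(sizeof(pid_t) * (size_t)n);
    if (pids == NULL) return -1;

    int started = 0;
    for (; started < n; started++) {
        pid_t pid = fork();
        if (pid < 0) break;                 /* could not start this child */
        if (pid == 0) _exit(child_work());  /* child: one call, then exit */
        pids[started] = pid;
    }

    int failed = (started < n);
    int clean = 0;
    for (int i = 0; i < started; i++) {
        int status;
        /* A child that never terminates would block the parent here. */
        if (waitpid(pids[i], &status, 0) < 0) { failed = 1; continue; }
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) clean++;
    }

    free(pids);
    return failed ? -1 : clean;
}
```

Calling `fork_and_wait(100)` from `main` mirrors the 100-child setup described above. The blocking `waitpid` is where a parent would sit if one of its children never exited.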
I compile the test with
# gcc -c -o TestForkProcessC.o -g TestForkProcessC.c
# ghc -debug -threaded -fforce-recomp -Wall TestForkProcess.hs TestForkProcessC.o
and then start running it until it fails (that is, until one or more of the child processes fail to terminate):
# while ./TestForkProcess +RTS -N1 ; do echo "OK"; done
Actually, most of the time this happens pretty quickly (often even on the first call to TestForkProcess).
Those child processes that do fail to terminate get stuck in an infinite loop in shutdownCapability, which looks something like:
void shutdownCapability (Capability *cap, Task *task, rtsBool safe)
{
    nat i;
    task->cap = cap;
    for (i = 0; /* i < 50 */; i++) {
        // ... other conditionals omitted
        if (cap->suspended_ccalls && safe) {
            cap->running_task = NULL;
            RELEASE_LOCK(&cap->lock);
            // The IO manager thread might have been slow to start up,
            // so the first attempt to kill it might not have
            // succeeded. Just in case, try again - the kill message
            // will only be sent once.
            ioManagerDie();
            yieldThread();
            continue;
        }
        traceSparkCounters(cap);
        RELEASE_LOCK(&cap->lock);
        break;
    }
}
(Note that I'm only considering the threaded RTS.) In the child processes that loop indefinitely, this cap->suspended_ccalls && safe condition gets triggered time and again.
When it does, it gets stuck waiting for a single InCall. This InCall is created by a call to newInCall in workerStart, i.e., it is created on pthread startup. That raises the question of where this worker task was created; I don't know for sure, but I am fairly sure that it happens during the initialization of the IO manager. (The initialization sequence of the IO manager involves the creation of 4 tasks before we even get to main, so it's a bit hard to navigate.)
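As background only, and not as a diagnosis: POSIX fork duplicates just the calling thread, so any other OS thread that existed in the parent has no counterpart in the child. Whether that explains the InCall that never completes here is an open question. The plain-C sketch below illustrates the fork behaviour on its own. It does not use the RTS, and `ticker`, `sleep_ms` and `ticker_frozen_in_child` are hypothetical names.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static atomic_ulong ticks;
static atomic_int stop_ticker;

/* Background thread: bumps a shared counter until told to stop. */
static void *ticker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_ticker)) {
        atomic_fetch_add(&ticks, 1);
    }
    return NULL;
}

static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Returns 1 if the counter stayed frozen in the forked child (the ticker
 * thread was not carried over), 0 if it kept moving, -1 on error. */
int ticker_frozen_in_child(void)
{
    pthread_t tid;
    atomic_store(&stop_ticker, 0);
    if (pthread_create(&tid, NULL, ticker, NULL) != 0) return -1;
    sleep_ms(10);                       /* give the ticker time to run */

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: only the forking thread exists in this process. */
        unsigned long before = atomic_load(&ticks);
        sleep_ms(50);
        unsigned long after = atomic_load(&ticks);
        _exit(before == after ? 0 : 1);
    }

    int result = -1;
    if (pid > 0) {
        int status;
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            result = (WEXITSTATUS(status) == 0) ? 1 : 0;
    }

    /* Parent: the ticker is still alive here and must be stopped. */
    atomic_store(&stop_ticker, 1);
    pthread_join(tid, NULL);
    return result;
}
```

In the child, the counter keeps whatever value it had at the moment of the fork, because the thread that was incrementing it exists only in the parent.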
I have some further evidence that the I/O manager is involved, although it is not necessarily the cause of the problem. On normal termination, the I/O manager is asked to shut down by the call to ioManagerDie in shutdownCapability, shown above. This will send IO_MANAGER_DIE (0xFE) on the I/O manager's "control pipe" (created in GHC.Event.Thread.startTimerManagerThread). When the timer manager thread receives this (in GHC.Event.TimerManager.handleControlEvent) it calls shutdownManagers, which shuts down the IO manager threads by sending them io_MANAGER_DIE on their respective pipes. This gets received by GHC.Event.Manager.handleControlEvent and the IO manager threads exit. (Note on capitalization: IO_MANAGER_DIE is the C symbol; io_MANAGER_DIE is the Haskell symbol.)
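The general mechanism described above, a single byte written to a control pipe that tells a manager thread to exit, can be sketched in plain C as follows. This is not the GHC.Event code. The names `manager_loop`, `shutdown_via_control_pipe` and `DEMO_MANAGER_DIE` are hypothetical, and only the byte value 0xFE is taken from the description.

```c
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

#define DEMO_MANAGER_DIE 0xFE   /* same byte value as IO_MANAGER_DIE above */

struct manager {
    int ctrl_rd;    /* read end of the control pipe */
    int saw_die;    /* set when the die byte was received */
};

/* Manager thread: block on the control pipe and leave the loop when the
 * die byte arrives (or on EOF or a read error). */
static void *manager_loop(void *arg)
{
    struct manager *m = arg;
    uint8_t msg;
    while (read(m->ctrl_rd, &msg, 1) == 1) {
        if (msg == DEMO_MANAGER_DIE) {
            m->saw_die = 1;
            break;
        }
        /* Any other byte is ignored in this sketch. */
    }
    return NULL;
}

/* Returns 1 if the manager thread exited because it received the die
 * byte, 0 if it exited for another reason, -1 on setup or write failure. */
int shutdown_via_control_pipe(void)
{
    int fds[2];
    if (pipe(fds) != 0) return -1;

    struct manager m = { fds[0], 0 };
    pthread_t tid;
    if (pthread_create(&tid, NULL, manager_loop, &m) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    uint8_t other = 0x01;               /* arbitrary non-die message */
    uint8_t die = DEMO_MANAGER_DIE;
    int ok = write(fds[1], &other, 1) == 1 &&
             write(fds[1], &die, 1) == 1;

    close(fds[1]);          /* EOF also ends the loop if a write failed */
    pthread_join(tid, NULL);
    close(fds[0]);
    return ok ? m.saw_die : -1;
}
```

In this sketch the reader always terminates, because closing the write end delivers EOF. A reader whose die byte never arrives, and whose pipe is never closed, would block in `read` indefinitely.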
When the child process fails to terminate, the first part of this process still happens. The timer manager thread receives IO_MANAGER_DIE and calls shutdownManagers. However, now things go wrong, and it seems they go wrong in one of two ways. The very first thing that shutdownManagers does is acquire the ioManagerLock. Sometimes it gets stuck right there. However, this is not always the case. Sometimes it does manage to acquire the lock, and I can see it going through the loop and sending the shutdown signal to the IO manager thread (I'm saying "the" because I've exclusively been testing with -N1). Either way, in the case that the child process gets stuck, this signal somehow never arrives at the IO manager thread. (That is, I have a print statement in readControlMessage that prints a message when it receives IO_MANAGER_DIE, along with a bit of information about where it was called from, and that print statement never triggers.)
I am not sure where to go from here. Note that I have only been able to reproduce this on OSX/ghc 7.8. I have not been able to reproduce this problem on Linux/7.8 (although there are other problems with forkProcess on Linux, which unfortunately are proving even more elusive). The attached stress test does very often get stuck on Linux/7.4, but of course that's a different I/O manager altogether, so that is probably an unrelated bug.
Trac metadata
Trac field | Value |
---|---|
Version | 7.8.2 |
Type | Bug |
TypeOfFailure | OtherFailure |
Priority | normal |
Resolution | Unresolved |
Component | Compiler |
Test case | |
Differential revisions | |
BlockedBy | |
Related | |
Blocking | |
CC | |
Operating system | |
Architecture |