Glasgow Haskell Compiler / GHC / Issues / #9284
Issue created Jul 08, 2014 by edsko@edsko.net (@trac-edsko)

shutdownCapability sometimes loops indefinitely on OSX after forkProcess

The attached Haskell program is a stress test for forkProcess. It starts 100 child processes, each of which makes a single safe FFI call, after which the main process waits for all child processes to terminate.

I compile the test with

# gcc -c -o TestForkProcessC.o -g TestForkProcessC.c 
# ghc -debug -threaded -fforce-recomp -Wall TestForkProcess.hs TestForkProcessC.o 

and then start running it until it fails (that is, until one or more of the child processes fail to terminate):

# while ./TestForkProcess +RTS -N1 ; do echo "OK"; done

Actually, most of the time this happens pretty quickly (often even on the first call to TestForkProcess).
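
The fork-and-wait shape of the stress test described above can be sketched in plain POSIX C. This is not the attached Haskell program and it does not exercise the GHC RTS or forkProcess at all; it only illustrates the structure of the harness. The names run_fork_stress, child_work and MAX_CHILDREN are hypothetical, and the getpid call is merely a stand-in for the child's single FFI call.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILDREN 100   /* the report uses 100 child processes */

/* Stand-in for the single FFI call each child makes (hypothetical). */
static void child_work(void)
{
    (void)getpid();
}

/* Forks n children, waits for each one, and returns how many exited
 * with status 0. Returns -1 if n is out of range. If fork fails part
 * way through, the children already started are still reaped, so the
 * result is then smaller than n. */
int run_fork_stress(int n)
{
    pid_t pids[MAX_CHILDREN];
    int started = 0;
    int clean = 0;

    if (n < 0 || n > MAX_CHILDREN)
        return -1;

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                /* stop forking, reap what we have */
        if (pid == 0) {
            child_work();
            _exit(0);             /* child: terminate immediately */
        }
        pids[started++] = pid;
    }

    for (int i = 0; i < started; i++) {
        int status = 0;
        /* Blocking wait: a child that never terminates makes the
         * parent hang here, which is how the failure shows up. */
        if (waitpid(pids[i], &status, 0) == pids[i]
            && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            clean++;
    }
    return clean;
}
```

Calling run_fork_stress(100) from a small main and checking that the result is 100 mirrors one iteration of the shell loop above.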

Those child processes that do fail to terminate get stuck in an infinite loop in shutdownCapability, which looks something like:

void shutdownCapability (Capability *cap, Task *task, rtsBool safe)
{
    nat i;
    task->cap = cap;

    for (i = 0; /* i < 50 */; i++) {
        // ... other conditionals omitted

        if (cap->suspended_ccalls && safe) {
            cap->running_task = NULL;
            RELEASE_LOCK(&cap->lock);
            // The IO manager thread might have been slow to start up,
            // so the first attempt to kill it might not have
            // succeeded.  Just in case, try again - the kill message
            // will only be sent once.
            ioManagerDie();
            yieldThread();
            continue;
        }

        traceSparkCounters(cap);
        RELEASE_LOCK(&cap->lock);
        break;
    }
}

(Note that I'm only considering the threaded RTS.) In the child processes that loop indefinitely, the cap->suspended_ccalls && safe condition is triggered time and again.
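
To make the looping behaviour concrete, here is a toy model of the retry loop with the commented-out bound (i < 50) restored. It is not RTS code and not a proposed fix: wait_for_pending, pending_fn and the two example predicates are hypothetical names, and the predicate merely stands in for the cap->suspended_ccalls && safe check. The point is only that a bounded loop can tell "cleared after a few yields" apart from "never clears".

```c
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate: returns true while work is still pending,
 * standing in for the cap->suspended_ccalls && safe check. */
typedef bool (*pending_fn)(void *ctx);

/* Yields until pending() reports false. Returns the number of yields
 * performed, or -1 if the condition is still pending after
 * max_attempts checks. */
int wait_for_pending(pending_fn pending, void *ctx, int max_attempts)
{
    for (int i = 0; i < max_attempts; i++) {
        if (!pending(ctx))
            return i;
        sched_yield();            /* give other threads a chance to run */
    }
    return -1;
}

/* Example predicate: pending until the counter reaches zero. */
bool countdown_pending(void *ctx)
{
    int *remaining = ctx;
    if (*remaining == 0)
        return false;
    (*remaining)--;
    return true;
}

/* Example predicate modelling the reported hang: never clears. */
bool always_pending(void *ctx)
{
    (void)ctx;
    return true;
}
```

With countdown_pending the call returns once the counter reaches zero; with always_pending it returns -1 when the bound is hit, whereas the unbounded loop in shutdownCapability would keep spinning.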

When it does, it gets stuck waiting for a single InCall. This InCall is created by a call to newInCall in workerStart -- i.e., it is created on pthread startup. That raises the question of where this worker task was created; I don't know for sure, but I am fairly sure that it happens during the initialization of the IO manager. (The initialization sequence of the IO manager involves the creation of 4 tasks before we even get to main, so it's a bit hard to navigate.)

I have some further evidence that the I/O manager is involved, although not necessarily the cause of the problem. On normal termination, the I/O manager is asked to shut down by the call to ioManagerDie in shutdownCapability, shown above. This sends IO_MANAGER_DIE (0xFE) on the I/O manager's "control pipe" (created in GHC.Event.Thread.startTimerManagerThread). When the timer manager thread receives this (in GHC.Event.TimerManager.handleControlEvent) it calls shutdownManagers, which shuts down the IO manager threads by sending them io_MANAGER_DIE on their respective pipes. This is received by GHC.Event.Manager.handleControlEvent, and the IO manager threads exit. (Note on capitalization: IO_MANAGER_DIE is the C symbol; io_MANAGER_DIE is the Haskell symbol.)
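
The one-byte control-pipe message described above can be illustrated with a minimal C sketch. This is not GHC's implementation (the real handlers live in the Haskell GHC.Event modules); only the value 0xFE for IO_MANAGER_DIE comes from this report. The functions send_control and read_control and the enum control_msg are hypothetical names chosen for illustration.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define IO_MANAGER_DIE 0xFE   /* value quoted in the report */

/* Hypothetical decoded form of a control byte. */
enum control_msg { CONTROL_DIE, CONTROL_OTHER, CONTROL_EOF, CONTROL_ERROR };

/* Writes one control byte to fd, retrying if interrupted by a signal.
 * Returns 0 on success, -1 on failure. */
int send_control(int fd, unsigned char byte)
{
    ssize_t n;
    do {
        n = write(fd, &byte, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? 0 : -1;
}

/* Reads one control byte from fd and classifies it. Blocks until a
 * byte arrives, the write end is closed, or an error occurs. */
enum control_msg read_control(int fd)
{
    unsigned char byte;
    ssize_t n;
    do {
        n = read(fd, &byte, 1);
    } while (n < 0 && errno == EINTR);
    if (n == 0)
        return CONTROL_EOF;       /* all write ends closed */
    if (n < 0)
        return CONTROL_ERROR;
    return byte == IO_MANAGER_DIE ? CONTROL_DIE : CONTROL_OTHER;
}
```

In this model, a manager loop would call read_control on the read end of its pipe and exit when it sees CONTROL_DIE; a sender and receiver only communicate if they hold the two ends of the same pipe.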

When the child process fails to terminate, the first part of this process still happens. The timer manager thread receives IO_MANAGER_DIE and calls shutdownManagers. However, now things go wrong, and it seems they go wrong in one of two ways. The very first thing that shutdownManagers does is acquire the ioManagerLock. Sometimes it gets stuck right there. However, this is not always the case. Sometimes it does manage to acquire the lock, and I can see it going through the loop and sending the shutdown signal to the IO manager thread (I'm saying "the" because I've exclusively been testing with -N1). Either way, when the child process gets stuck, this signal somehow never arrives at the IO manager thread. (I have a print statement in readControlMessage that prints a message when it receives IO_MANAGER_DIE, along with a bit of information about where it was called from, and that print statement never triggers.)

I am not sure where to go from here. Note that I have only been able to reproduce this on OSX/ghc 7.8. I have not been able to reproduce this problem on Linux/7.8 (although there are other problems with forkProcess on Linux, which unfortunately are proving even more elusive). The attached stress test does very often get stuck on Linux/7.4 but of course that's a different I/O manager altogether and is probably an unrelated bug.

Trac metadata

Trac field              Value
Version                 7.8.2
Type                    Bug
TypeOfFailure           OtherFailure
Priority                normal
Resolution              Unresolved
Component               Compiler
Test case               (none)
Differential revisions  (none)
BlockedBy               (none)
Related                 (none)
Blocking                (none)
CC                      (none)
Operating system        (none)
Architecture            (none)