Calling hs_try_putmvar from an unsafe foreign call can cause the RTS to hang

An unsafe foreign call that calls hs_try_putmvar can cause the RTS to hang, preventing any Haskell threads from making progress. When the program is compiled with -debug, it instead fails an assertion in the scheduler:

internal error: ASSERTION FAILED: file rts/Schedule.c, line 510

    (GHC version 8.4.3 for x86_64_apple_darwin)

Here is a minimal test case which reproduces the assertion. It needs to be built with -debug -threaded and run with +RTS -N2 or higher.

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (MVar, newEmptyMVar, takeMVar)
import Control.Monad (forever)
import Foreign.C.Types (CInt(..))
import Foreign.StablePtr (StablePtr)
import GHC.Conc (PrimMVar, newStablePtrPrimMVar)

foreign import ccall unsafe hs_try_putmvar :: CInt -> StablePtr PrimMVar -> IO ()

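main :: IO ()
main = do
  mvar <- newEmptyMVar

  -- Consumer: keep emptying the MVar so that the puts below can succeed
  forkIO $ forever $ do
    takeMVar mvar

  -- Producer: call hs_try_putmvar directly through the unsafe import
  forkIO $ forever $ do
    -- hs_try_putmvar frees the StablePtr, so create a fresh one each time
    sp <- newStablePtrPrimMVar mvar
    hs_try_putmvar (-1) sp
    threadDelay 1

  -- Let it spin a few times to trigger the bug
  threadDelay 500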

I checked out GHC, added this as a test case, and did some debugging. The specific assertion that fails is ASSERT(task->cap == cap). This seems to happen because of this code in hs_try_putmvar:

Task *task = getTask();
// ...
ACQUIRE_LOCK(&cap->lock);
// If the capability is free, we can perform the tryPutMVar immediately
if (cap->running_task == NULL) {
    cap->running_task = task;
    task->cap = cap;
    RELEASE_LOCK(&cap->lock);
    // ...
    releaseCapability(cap);
} else {
    // ...
}

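Basically it assumes that the calling thread's task doesn't currently hold a capability, so it takes a free one, overwrites task->cap, and then releases it without restoring the previous value of task->cap. That assumption doesn't hold here: an unsafe foreign call does not release the caller's capability, so the task still holds its original capability when hs_try_putmvar runs. Once the call returns, task->cap points at a different capability from the one the scheduler is running on, and the assertion fails.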

Modifying the code to restore the value of task->cap after releasing the capability fixes the assertion. But I don't know enough about the RTS to be sure I'm not missing something here. In particular, is there a problem with the task basically holding two capabilities for a short time?
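To make the failure mode and the proposed fix concrete, here is a toy model in plain C. It is only a sketch under stated assumptions, not the RTS code: the Task and Capability structs are cut down to the two fields discussed above, setting running_task back to NULL stands in for releaseCapability, and every function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the RTS structures (assumption: only these fields matter). */
typedef struct Capability Capability;
typedef struct Task Task;

struct Capability { Task *running_task; };  /* task currently holding this capability */
struct Task       { Capability *cap; };     /* capability this task believes it holds */

/* Models the fast path quoted above: take a free capability, then release it.
   task->cap is left pointing at the borrowed capability. */
void try_put_fast_path(Task *task, Capability *cap)
{
    if (cap->running_task == NULL) {
        cap->running_task = task;
        task->cap = cap;
        /* ... the tryPutMVar itself would happen here ... */
        cap->running_task = NULL;           /* stands in for releaseCapability(cap) */
    }
}

/* Same path, but saving and restoring the caller's previous task->cap. */
void try_put_fast_path_restoring(Task *task, Capability *cap)
{
    if (cap->running_task == NULL) {
        Capability *old_cap = task->cap;    /* remember what the caller held */
        cap->running_task = task;
        task->cap = cap;
        /* ... the tryPutMVar itself would happen here ... */
        cap->running_task = NULL;           /* stands in for releaseCapability(cap) */
        task->cap = old_cap;                /* the restore step proposed above */
    }
}

/* Simulates an unsafe foreign call made by a task that holds cap0 while cap1
   is free. Returns whether the condition from ASSERT(task->cap == cap) still
   holds for cap0 after the call. */
bool simulate_unsafe_call(void (*fast_path)(Task *, Capability *))
{
    Capability cap0 = { NULL };
    Capability cap1 = { NULL };
    Task task = { &cap0 };

    cap0.running_task = &task;              /* the unsafe call keeps cap0 */
    fast_path(&task, &cap1);                /* the free capability cap1 is chosen */

    assert(cap1.running_task == NULL);      /* cap1 has been released again */
    return task.cap == &cap0;
}
```

In this model, simulate_unsafe_call(try_put_fast_path) returns false, which corresponds to the failing assertion, and simulate_unsafe_call(try_put_fast_path_restoring) returns true. The model says nothing about whether briefly holding two capabilities is safe in the real RTS, which is the open question.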

My other thought is that maybe it should check if its task is currently running a capability, and in that case do something else. But I'm not sure what.

Trac metadata

Trac field	Value
Version	8.4.3
Type	Bug
TypeOfFailure	OtherFailure
Priority	normal
Resolution	Unresolved
Component	Runtime System
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture
