Application locks up after getting `epollControl: does not exist` error
Summary
A multi-threaded program can cause an exception (`epollControl: does not exist (No such file or directory)`) inside the IO manager that leaves it in a broken state.
It appears to be a race condition occurring when `threadWaitRead` and `closeFdWith` are called simultaneously from different threads (I am not saying that this is a particularly good idea to do in the first place - but it can happen accidentally such as due to this bug in http-client).
Despite the original issue being caused by a third-party library, I still opted to open this issue as well because it is not obvious from the documentation that this is an expected failure mode. The `closeFdWith` documentation even explicitly states "Close a file descriptor in a concurrency-safe way", which clearly is not the case.
Steps to reproduce
I've managed to boil it down to the following reproducer, which depends only on the `network` and `bytestring` packages in addition to `base`.
What it does is:
- Start a local TCP server in a new thread
- From the main thread, repeatedly connect to and read from said server, while concurrently calling `close` on the socket.
I ran it with `stack run`, so it's compiled with optimizations enabled.
Additionally, I passed the `-N` RTS flag to enable actual multi-threading.
EDIT: It appears that with `-N2` the reproducer locks up even faster (ran on a 12 core machine).
{-# LANGUAGE NumericUnderscores #-}
module Main where

import Control.Concurrent (forkIO, forkFinally, threadDelay)
import Control.Concurrent.MVar (MVar, newMVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, SomeAsyncException, Exception (..), catch, throwIO)
import Control.Monad (unless, forever)
import GHC.Conc.IO (closeFdWith)
import qualified Data.ByteString as BS
import qualified Network.Socket as Sock
import qualified Network.Socket.ByteString as SockBS

main :: IO ()
main = do
  serverReady <- newEmptyMVar
  forkIO (server serverReady `catch` \e -> putStrLn $ "Caught in server " <> show (e :: SomeException))
  takeMVar serverReady
  putStrLn "Local server ready"
  forever $ do
    let
      handler e
        | Just as <- fromException e = throwIO (as :: SomeAsyncException)
        | otherwise = putStrLn $ "Caught in client " <> show e
    reader `catch` handler

reader :: IO ()
reader = do
  putStrLn "Connecting reader"
  sock <- Sock.socket Sock.AF_INET Sock.Stream Sock.defaultProtocol
  Sock.connect sock (Sock.SockAddrInet 9999 $ Sock.tupleToHostAddress (127, 0, 0, 1))
  let
    drain = do
      msg <- SockBS.recv sock 1024
      unless (BS.null msg) drain
  readingBarrier <- newEmptyMVar
  forkIO $ do
    takeMVar readingBarrier
    Sock.close sock `catch` \e -> putStrLn $ "Caught in closer: " <> show (e :: SomeException)
  putMVar readingBarrier ()
  drain

server :: MVar () -> IO ()
server ready = do
  listenSock <- Sock.socket Sock.AF_INET Sock.Stream Sock.defaultProtocol
  Sock.bind listenSock (Sock.SockAddrInet 9999 $ Sock.tupleToHostAddress (127, 0, 0, 1))
  Sock.listen listenSock 5
  putMVar ready ()
  forever $ do
    (clientSock, _) <- Sock.accept listenSock
    forkFinally (handleClient clientSock) (\_ -> Sock.close clientSock)

handleClient :: Sock.Socket -> IO ()
handleClient clientSock = do
  let
    loop = do
      SockBS.send clientSock $ BS.replicate 1024 62
      -- This delay seems to be needed for the reproducer - presumably as otherwise there's always
      -- data ready at the receiving end and it never needs to suspend
      threadDelay 10_000
      loop
  loop
Expected behavior
In terms of the reproducer, the expected behavior is that it just indefinitely prints `Connecting reader` and then a variation of `Caught in client threadWait: invalid argument (Bad file descriptor)` or `Caught in client Network.Socket.recvBuf: invalid argument (Bad file descriptor)`.
In terms of the general behavior of the RTS API, the race should either be won by `threadWaitRead` (which should then get woken up once `closeFdWith` gets its turn), or by `closeFdWith`, in which case `threadWaitRead` should fail immediately.
Actual behavior
What I actually observe is that after some time of the above print loop, it prints
Caught in client modifyFdOnce: invalid argument (Bad file descriptor)
Connecting reader
Caught in closer: epollControl: does not exist (No such file or directory)
And then it just completely locks up.
Initial research
Debugging this problem in our production application with `strace`, I found that the `epoll_ctl` from `threadWaitRead` and the `close` called from `closeFdWith` arrived almost simultaneously, with the `close` interrupting the `epoll_ctl`, causing it to return `EBADF`.
In such a case, the epoll backend throws (https://gitlab.haskell.org/ghc/ghc/-/blob/ghc-9.0.2-release/libraries/base/GHC/Event/EPoll.hsc#L93-105); it appears to only handle `ENOENT`, and only in specific cases.
But from my (limited) understanding of that code, the IO manager never expects IO backends to throw, judging by the use of a `Bool` return value and the check here.
Consequently, when the exception is thrown from there, it misses the cleanup in the else-branch at that location, which means that it leaves the invalid file descriptor in the table that it added here, even though it was never successfully registered with epoll.
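To illustrate the missed cleanup, here is a minimal sketch using only `base`. It is not the GHC code: `Table`, `Backend`, `registerLeaky`, `registerSafe`, `throwingBackend` and `leftovers` are all hypothetical names, and the table is modelled as a plain list of FD numbers. It only shows how an undo that lives in the else-branch is skipped when the backend throws instead of returning `False`, and how `onException` would cover that case.

```haskell
import Control.Exception (IOException, onException, throwIO, try)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical stand-in for the IO manager's table of registered FDs.
type Table = IORef [Int]

-- Hypothetical backend call: signals failure by returning False,
-- or (the case discussed above) by throwing.
type Backend = Int -> IO Bool

-- Insert first, undo only when the backend returns False.
-- If the backend throws, the undo in the else-branch never runs.
registerLeaky :: Table -> Backend -> Int -> IO Bool
registerLeaky table backend fd = do
  modifyIORef' table (fd :)
  ok <- backend fd
  if ok
    then pure True
    else do
      modifyIORef' table (filter (/= fd))
      pure False

-- Same logic, but the undo also runs when the backend throws.
registerSafe :: Table -> Backend -> Int -> IO Bool
registerSafe table backend fd = do
  modifyIORef' table (fd :)
  ok <- backend fd `onException` modifyIORef' table (filter (/= fd))
  if ok
    then pure True
    else do
      modifyIORef' table (filter (/= fd))
      pure False

-- Backend that always fails by throwing (a simulated EBADF).
throwingBackend :: Backend
throwingBackend _ = throwIO (userError "EBADF (simulated)")

-- Run a register function against a fresh table, swallow the simulated
-- error, and return whatever entries are left in the table.
leftovers :: (Table -> Backend -> Int -> IO Bool) -> IO [Int]
leftovers register = do
  table <- newIORef []
  _ <- try (register table throwingBackend 42) :: IO (Either IOException Bool)
  readIORef table

main :: IO ()
main = do
  leftovers registerLeaky >>= print
  leftovers registerSafe >>= print
```

In this model the first line printed is `[42]` (the stale entry survives) and the second is `[]`.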
In the next iteration, this left-over file descriptor number gets reused. Once the closer thread tries to call `close` on the socket, `closeFdWith` will see a registration in its management tables and thus try to unregister the file descriptor, happening here.
But since that table entry is from the FD's previous incarnation, the current instance of it (that the IO manager sees in its table) has never been registered with the epoll instance, and thus we get the `epollControl: does not exist (No such file or directory)` error when it tries to remove it from the epoll instance (that it has never been added to).
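The stale-entry scenario can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the IO manager's actual data structures: `Manager`, `epollDel`, `closeFd` and `staleClose` are hypothetical names, and both the manager's table and the kernel's epoll interest list are represented as lists of FD numbers.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical model: the manager's own table and the kernel-side
-- epoll interest list, both as lists of FD numbers.
data Manager = Manager
  { mgrTable :: IORef [Int]
  , epollSet :: IORef [Int]
  }

-- Simulated EPOLL_CTL_DEL: fails if the FD is not in the interest list.
epollDel :: Manager -> Int -> IO (Either String ())
epollDel mgr fd = do
  registered <- readIORef (epollSet mgr)
  if fd `elem` registered
    then Right () <$ modifyIORef' (epollSet mgr) (filter (/= fd))
    else pure (Left "epollControl: does not exist (No such file or directory)")

-- Simulated close path: unregister from epoll only if the manager's
-- table claims the FD is registered.
closeFd :: Manager -> Int -> IO (Either String ())
closeFd mgr fd = do
  known <- readIORef (mgrTable mgr)
  if fd `elem` known
    then do
      modifyIORef' (mgrTable mgr) (filter (/= fd))
      epollDel mgr fd
    else pure (Right ())

-- FD 7 is in the table (left over from a failed registration) but was
-- never added to the epoll interest list.
staleClose :: IO (Either String ())
staleClose = do
  mgr <- Manager <$> newIORef [7] <*> newIORef []
  closeFd mgr 7

main :: IO ()
main = staleClose >>= print
```

Here `staleClose` returns a `Left` carrying the "does not exist" message, because the table and the interest list disagree about FD 7.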
This error comes at a very inconvenient place between a series of `takeMVar`s and `putMVar`s, meaning that we skip the `putMVar` part of it - explaining the lock up (since subsequent calls to the IO manager no longer have access to its tables from those particular MVars).
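The lock-up mechanism itself is easy to reproduce in isolation. The following sketch (hypothetical names `failing`, `unsafeUpdate`, `safeUpdate`, `stateAfter`; not taken from the IO manager) contrasts a manual `takeMVar`/`putMVar` pair with `modifyMVar_` from `Control.Concurrent.MVar`, which puts the original value back if the action throws.

```haskell
import Control.Concurrent.MVar (MVar, modifyMVar_, newMVar, putMVar, takeMVar, tryReadMVar)
import Control.Exception (IOException, throwIO, try)

-- Hypothetical critical section that fails midway.
failing :: Int -> IO Int
failing _ = throwIO (userError "epollControl: does not exist (simulated)")

-- Manual take/put: an exception between the two skips putMVar,
-- so the MVar stays empty and every later takeMVar blocks.
unsafeUpdate :: MVar Int -> (Int -> IO Int) -> IO ()
unsafeUpdate var f = do
  x <- takeMVar var
  y <- f x
  putMVar var y

-- modifyMVar_ restores the original contents if the action throws.
safeUpdate :: MVar Int -> (Int -> IO Int) -> IO ()
safeUpdate = modifyMVar_

-- Run an update that throws, swallow the error, and report what is
-- left in the MVar afterwards (Nothing means it is empty).
stateAfter :: (MVar Int -> (Int -> IO Int) -> IO ()) -> IO (Maybe Int)
stateAfter update = do
  var <- newMVar 0
  _ <- try (update var failing) :: IO (Either IOException ())
  tryReadMVar var

main :: IO ()
main = do
  stateAfter unsafeUpdate >>= print
  stateAfter safeUpdate >>= print
```

This prints `Nothing` for the manual variant and `Just 0` for the `modifyMVar_` variant. Whether the same kind of exception-safe wrapper fits the IO manager's multi-MVar locking is a separate question; the sketch only shows why a throw between take and put leaves later callers blocked.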
I hope that helps!
Environment
- GHC version used: 9.0.2 (but also observed this in a (closed-source) production application in 8.10.7)
Optional:
- Operating System: Ubuntu 20.04
- System Architecture: x86-64