Skip to content

GHC.Event.Thread.closeFdWith isn't concurrency-safe

While looking at some preparatory work for !8073 I noticed that GHC.Event.Thread.closeFdWith is extremely suspicious. Specifically, it:

  1. iterates over the capabilities and takes the event callback table MVar of each
  2. iterates over the capabilities and calls GHC.Event.Manager.closeFd_ on each previously-taken table to remove the to-be-closed fd
  3. closes the fd
  4. iterates over the capabilities and for each:
    • puts the event callback table MVar
    • runs the event notifier registered for the closed fd, if any

All of this happens under an uninterruptibleMask_.

First, a bit of background: closeFdWith is needed because POSIX does not specify the behavior of select or poll when an fd being blocked on in one thread is closed by another thread. Consequently, the IO manager must (atomically) ensure that all events registered against the fd are dispatched before the fd is closed.

However, AFAICT, the above implementation can very easily break if the number of capabilities is concurrently changed. Specifically I can imagine the following problematic interleaving:


       Capability 1                       │                  Capability 2
       ────────────                       │                  ────────────

0. `closeFdWith fd` is called             │

1. Takes event callback tables for all N  │
   capabilities                           │

                                          │ 2. Calls `setNumCapabilities (N+1)`, blocks
                                          │    waiting to acquire all capabilities

3. Yields its capability (e.g. due to GC) │

                                          │ 3. Takes capability, and now holds all capabilities.
                                          │    A new capability is spun-up and all capabilities
                                          │    are again released.

                                          │ 4. A thread is scheduled to capability N+1

                                          │ 5. The thread registers an event for `fd` and blocks on `select`/`poll`

6. `closeFdWith` continues, calling       │
   `closeFd_` on N capabilities.          │
   This removes the fd from the event     │
   list of all capabilities and           │
   wakes up those that are blocked        │
   on the fd                              │

7. the fd is closed                       │

                                          │ 8. The thread is now blocked on `fd`, which was closed while
                                          │    the thread was blocked, triggering undefined behavior.

This is quite problematic.

Edited by Ben Gamari
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information