Skip to content

Memory effects in doomed STM transactions

Problem

GHC's STM implementation does "lazy" validation meaning that speculative execution can continue even after observing an inconsistent view of memory. Transactions in the state are called "doomed transactions". While the runtime tries to avoids some effects in doomed transactions, such as entering an infinite loop, it does not successfully control all memory effects. In particular there are many (I do not have an exhaustive list) allocations that eventually end up in the allocate function which is nicely documented with the following:

allocate() never fails; if it returns, the returned value is a valid address. If the nursery is already full, then another block is allocated from the global block pool. If we need to get memory from the OS and that operation fails, then the whole process will be killed.

https://github.com/ghc/ghc/blob/master/rts/sm/Storage.c#L819

A doomed transaction that demands a large allocation that the OS cannot fulfill will kill the process. Here is a program that reliably fails for me:

-- oom.hs
{-# LANGUAGE BangPatterns #-}
import GHC.Conc
import Control.Concurrent

import qualified Data.ByteString as B

forever act = act >> forever act

check True = return ()
check False = retry

main = do
    tx <- newTVarIO 0
    ty <- newTVarIO 1
    tz <- newTVarIO 0

    done <- newTVarIO False

    _ <- forkIO $ forever $ atomically $ do
        -- Only read tx and ty
        x <- readTVar tx
        y <- readTVar ty

        if x > y  -- This should always be False
          then do
            -- We only get here in a doomed transaction.  Commenting out the next two lines
            -- but leaving the write to done and the program runs as expected because the
            -- doomed transaction is detected at commit time.
            let !big = B.length $ B.replicate maxBound 0  -- The big allocation!
            writeTVar tz big
            writeTVar done True
          else return ()

    let mut = forever $ atomically $ do
        y <- readTVar ty
        x <- readTVar tx

        if x > 1000
          then do
            -- When we get big enough, start over.
            writeTVar tx 0
            writeTVar ty 1  -- Always holds semantically that tx < ty
          else do
            -- increment x and y both
            writeTVar ty (succ y)
            writeTVar tx (succ x)  -- tx < ty

    -- Give lots of opportunities to witness inconsistent memory.
    _ <- forkIO mut
    _ <- forkIO mut
    _ <- forkIO mut

    atomically $ readTVar done >>= check

    putStrLn "Done"

Running leads to out of memory:

> ghc oom.hs -fno-omit-yields -threaded
[1 of 1] Compiling Main             ( oom.hs, oom.o )
Linking oom.exe ...
> ./oom.exe +RTS -N
oom.exe: out of memory

Fixing

When a potentially bad memory effect is about to happen in some thread, we need to ensure that we validate any running transaction and if it fails have some way to back out of the operation and abort the transaction. This might be tricky on two fronts 1) finding all the critical allocations and 2) finding places where we can both detect (1) and abort the transaction. The places I'm concerned about most are array allocation such as the ByteString in the example and maybe Integer allocation. Another fix is a different STM implementation altogether that (at a performance cost or trade-off) doesn't allow doomed transactions (to appear Haskell Symposium 2016 :D). I think we can at least address this particular example without going so far.

A related issue is more general memory leaks from doomed transactions. Consider a program with a memoized Fibonacci function. The semantics of the transactions written by the programmer may ensure that no value higher then 10 is ever demanded of that function, yet in a doomed transaction the invariant goes wrong and the program demands 100. The program will detect the doomed transaction eventually in this case, but not before it has allocated a large live blob never to be touched again.

Trac metadata
Trac field Value
Version 8.0.1
Type Bug
TypeOfFailure OtherFailure
Priority normal
Resolution Unresolved
Component Runtime System
Test case
Differential revisions
BlockedBy
Related
Blocking
CC simonmar
Operating system
Architecture
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information