
STM: Locking approach during transaction validation can cause lack of progress

Summary

The current STM implementation can result in situations where even two read-only transactions running in parallel have no guarantee of making progress. (Nothing suggests this behaviour is limited to read-only transactions, however.)

This is one of the STM issues surfaced by #24142 (closed)

Steps to reproduce

This can be reproduced with these files:

stm-starvation.cabal :

cabal-version:      3.0
name:               stm-starvation

version:            0.1.0.0
maintainer:         Xander van der Goot
build-type:         Simple

extra-doc-files:    README

common opt
    ghc-options:
      -Wall -Wpartial-fields -O2 -rtsopts -debug -threaded -Wunused-packages
      "-with-rtsopts=-N"
    build-depends:
      async,
      base,
      stm,

executable map
    import:           opt
    main-is:          Map.hs
    hs-source-dirs:   app
    default-language: Haskell2010
    build-depends:
      stm-containers,
      list-t

Map.hs:

{-# LANGUAGE NumericUnderscores #-}

import Control.Concurrent (ThreadId, myThreadId)
import qualified Control.Concurrent.Async as Async
import Control.Concurrent.STM (STM, atomically)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import GHC.IO (unsafePerformIO)
import qualified ListT
import qualified StmContainers.Map as STMMap
import System.IO (hFlush, stdout)

main :: IO ()
main = do
  let l = [(x,x) | x <- [1 :: Integer .. 100_000]]
  stmmap <- STMMap.newIO
  atomically $ mapM_ (\(x, y) -> STMMap.insert x y stmmap) l
  print "Initialized the STM Map."
  hFlush stdout
  -- Show that a read from a single thread does succeed
  readMap stmmap
  -- read the stm map from two threads concurrently
  foldr1 Async.concurrently_ [readMap stmmap, readMap stmmap]

-- | Irrevocable counter to keep track of transaction restarts.
restartCounter :: ThreadId -> IORef Integer -> STM ()
restartCounter threadId counter = unsafePerformIO $ do
  r <- readIORef counter
  print (threadId, r)
  hFlush stdout
  modifyIORef' counter (+ 1)
  return (return ())

readMap :: (Show a, Show b) => STMMap.Map a b -> IO ()
readMap stmmap = do
  threadId <- myThreadId
  counter <- newIORef (0 :: Integer)
  _ <- atomically $ do
    restartCounter threadId counter
    ListT.toList (STMMap.listT stmmap)
  return ()

Just run the program: it will take an unreasonably long time to finish, or might never finish, depending on the system it is run on and on luck.

Expected behavior

Since read-only transactions cannot conflict with each other, I would have expected a guarantee of progress.

Analysis

When a running thread returns to the scheduler it validates its current STM transaction. This involves optimistically locking all the TVars in the transaction. If locking any of the TVars fails, validation fails: we assume the failure is due to a data conflict and abort the transaction, forcing it to be restarted.

This is problematic if multiple threads running transactions over the same set of TVars return to the scheduler at the same time. Both threads will try to lock the same set of TVars in parallel, guaranteeing that one of them fails, so one of the two transactions is aborted and restarted.

Since it is timing-dependent which of the two transactions gets forcibly restarted, it is possible that both transactions are restarted repeatedly, in the worst case infinitely often, preventing any progress.

For example, this might look something like:

Thread1: run_t -> validate -> run_t -> validate -> run_t -> validate fails -> abort -> restart -> run_t -> ..
Thread2: run_t -> validate fails -> abort -> restart -> run_t -> validate -> run_t -> validate -> run_t -> ..

Thread1: validate -> run_t -> validate -> run_t -> ...
Thread2: validate fails -> abort -> restart -> ...
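The lock-all validation step described above can be modelled with a small, self-contained sketch (this is my simplification for illustration, not the RTS code; a TVar's lock is modelled as an `IORef Bool`). It shows that validation fails as soon as another validator merely holds one of the locks, even though no value has changed:

```haskell
import Control.Monad (forM)
import Data.IORef

-- Simplified model of a TVar's lock bit: True means locked.
type Lock = IORef Bool

-- Attempt to take the lock; succeeds only if it was free.
tryLock :: Lock -> IO Bool
tryLock ref = atomicModifyIORef' ref (\locked -> (True, not locked))

unlock :: Lock -> IO ()
unlock ref = writeIORef ref False

-- Validation as described above (simplified): lock every TVar in the
-- transaction; a single tryLock failure fails the whole validation,
-- even though no value has actually changed.
validate :: [Lock] -> IO Bool
validate vars = do
  oks <- forM vars tryLock
  mapM_ unlock [v | (v, True) <- zip vars oks]  -- release what we did get
  return (and oks)

main :: IO ()
main = do
  vars <- mapM (const (newIORef False)) [1 .. 3 :: Int]
  -- Another thread is mid-validation and currently holds one of the locks:
  _ <- tryLock (head vars)
  ok <- validate vars
  print ok  -- False: validation fails although memory is still consistent
```

With two real threads validating the same set concurrently, at least one thread's `validate` is guaranteed to see a taken lock, which is exactly the abort/restart ping-pong in the traces above.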

Solutions

I think the underlying issue is that the current algorithm causes transaction validation to fail when a TVar is only temporarily locked, even if memory is still consistent with the transaction's view of it.

The solution is to allow us to distinguish between locks that fail because of an inconsistent memory state and locks that fail for "spurious" reasons (the TVar is merely locked by another validator at that instant).
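One possible shape of such a distinction, as a hypothetical sketch (the names `Var`, `tryLockChecked`, and the version scheme are mine, not the RTS's): give each TVar a version number alongside its lock bit, and on a failed lock compare the version the transaction read against the current one. If it is unchanged, the failure is spurious and the validator can wait or retry instead of aborting:

```haskell
import Data.IORef

-- Hypothetical model: each TVar carries a lock bit and a version number,
-- so a failed lock attempt can be classified instead of always aborting.
data Var = Var { lockBit :: IORef Bool, versionRef :: IORef Int }

data LockResult = Acquired | SpuriousBusy | RealConflict
  deriving (Eq, Show)

-- On lock failure, compare the version the transaction originally read
-- with the current one: if it is unchanged, the TVar is merely locked
-- temporarily (e.g. by another validator) and the failure is spurious,
-- so the transaction can retry the lock rather than abort.
tryLockChecked :: Var -> Int -> IO LockResult
tryLockChecked var expected = do
  got <- atomicModifyIORef' (lockBit var) (\l -> (True, not l))
  if got
    then return Acquired
    else do
      cur <- readIORef (versionRef var)
      return (if cur == expected then SpuriousBusy else RealConflict)

main :: IO ()
main = do
  v <- Var <$> newIORef False <*> newIORef 0
  _ <- tryLockChecked v 0           -- take the lock ourselves: Acquired
  print =<< tryLockChecked v 0      -- locked, version unchanged: SpuriousBusy
  modifyIORef' (versionRef v) (+ 1) -- simulate a committed write
  print =<< tryLockChecked v 0      -- locked, version changed: RealConflict
```

A real implementation would of course need to read the version consistently with respect to the lock holder; the sketch only illustrates the classification the prose argues for.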

Environment

  • GHC version used: Any

Optional:

  • Operating System:
  • System Architecture: Any SMP
Edited by Simon Peyton Jones