Validation of STM transactions by the scheduler causes large overhead, potentially preventing progress.
Summary
The scheduler validating STM transactions in every scheduler loop can lead to scheduler time dominating a program's runtime.
Details
When a thread returns to the scheduler we call schedulePostRunThread, which validates active transactions through stmValidateNestOfTransactions. stmValidateNestOfTransactions itself calls validate_and_acquire_ownership on the current transaction and on all the transactions that enclose it, with acquire_all set to true and retain_ownership set to false. validate_and_acquire_ownership then walks over all the TVars in the transaction log, locks them, and checks their expected values against their actual ones. If any entry fails the check we ultimately condemn the whole transaction. So far so good; in the case at hand this never happens, as we only perform reads.
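To make the shape of that walk concrete, here is a simplified sketch in Haskell. This is not the actual rts/STM.c code (which is C and additionally performs an atomic lock/unlock per TVar); LogEntry and validateLog are made-up names for illustration only.

import Data.IORef (IORef, readIORef)

-- One entry of a (much simplified) transaction log: the reference that was
-- read and the value that was observed when it was read.
data LogEntry a = LogEntry
    { entryRef  :: IORef a
    , seenValue :: a
    }

-- Conceptual validation: the transaction is still valid only if every entry
-- still holds the value observed at read time. The real
-- validate_and_acquire_ownership does this walk over the transaction record
-- and also pays an atomic lock and unlock per TVar, so its cost is linear
-- in the number of TVars read.
validateLog :: Eq a => [LogEntry a] -> IO Bool
validateLog entries = and <$> mapM stillValid entries
  where
    stillValid (LogEntry ref seen) = do
        current <- readIORef ref
        return (current == seen)

The important point is that this full walk is repeated on every return to the scheduler, independent of how much progress the transaction has made since the last validation.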
This is already bad with just one capability:
- As the number of TVars increases, each read gets slower, resulting in fewer reads per second. (See #24410)
- As read performance decreases we return to the scheduler more frequently relative to the progress made by the transaction, because of context switches. Each time we return to the scheduler we pay the full cost of validate_and_acquire_ownership, which also scales with the number of TVars in the transaction.
But with multiple capabilities this gets even worse:
- Each thread will perform the validate_and_acquire_ownership check whenever it returns to the scheduler. This means scheduling will take longer.
- As we spend more time in scheduler code, less time is spent on productive work in between context switches. This means we end up using more cycles on overhead per unit of work.
- In the worst case most of the runtime is spent doing validation of transactions, with little time spent doing productive work that progresses the transactions.
- If two threads perform validation in parallel they might also run into #24446, preventing either thread from making progress.
Some napkin math:
In the example I'm looking at we deal with 100k TVars. According to "Evaluating the Cost of Atomic Operations on Modern Architectures", a CAS takes on the order of 10-100ns.
Near the end of the run we will perform roughly 400k CAS operations during each context switch, since both threads perform one CAS to lock and one to unlock each of the 100k TVars.
If we assume 50ns per CAS, a full context switch will take around 20ms from the cost of the CAS operations alone. By default our context switch interval is also 20ms.
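For reference, the same napkin math written out in Haskell. All numbers are the assumptions stated above, not measurements, and the function name is mine.

-- Napkin-math estimate of the CAS cost of a single context switch:
-- 2 threads, each performing one lock CAS and one unlock CAS per TVar,
-- at an assumed 50ns per CAS.
estimatedSwitchCostMs :: Double
estimatedSwitchCostMs = totalCas * nsPerCas / 1000000
  where
    tvarCount           = 100000
    threads             = 2
    casPerTVarPerThread = 2   -- one CAS to lock, one to unlock
    totalCas            = tvarCount * threads * casPerTVarPerThread  -- 400,000
    nsPerCas            = 50
-- estimatedSwitchCostMs == 20.0, i.e. about one full 20ms time slice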
I think it's plausible that we run into situations where we get:
- x: Mutator is running
- x+0ms: Ticker triggers a context switch.
- x+0-19ms: Scheduler runs validate_and_acquire_ownership
- x+19-20ms: Mutator runs - we perform a handful of reads
- x+20ms: Ticker triggers another context switch.
This would repeat until the runtime of the scheduler eventually exceeds 20ms, at which point we would "lose" one context switch and the mutator would get a full 20ms slice to perform reads. But before we get there we might spend a long time during which 95%+ of the runtime is scheduler overhead.
However it's hard to confirm this, as any kind of instrumentation causes enough overhead to significantly distort the pattern. #24446 also affects this.
Steps to reproduce
So far I have investigated this using the following reproducer:
{-# LANGUAGE NumericUnderscores #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE BangPatterns #-} -- needed for the !s bang pattern in readTList

module Main where

import Control.Concurrent (ThreadId, myThreadId, forkIO)
import Control.Concurrent.STM (STM, atomically, newTVar, readTVar, TVar(..))
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import GHC.IO (unsafePerformIO)
import System.IO (hFlush, stdout)
import Data.Time.Clock
import Control.Concurrent.MVar
import Control.Monad

main = do
    let nums = [
          10_000 :: Int
          -- ,50_000
          ,100_000
          -- ,150_000
          -- ,200_000
          -- ,300_000
          -- ,400_000
          ]
    forM_ nums $ \tvar_count -> do
        print $ "Measuring: " ++ show tvar_count
        (tvars :: [TVar Int]) <- atomically $ mapM (newTVar) [1 .. tvar_count]
        putStrLn "Created TVars"
        start_time <- getCurrentTime
        s1 <- runAsync $ atomically $ do
            readTList tvars
        s2 <- runAsync $ atomically $ do
            readTList tvars
        s1 <- takeMVar s1
        s2 <- takeMVar s2
        print $ "done, " ++ show (s1,s2)
        done_time <- getCurrentTime
        print $ done_time `diffUTCTime` start_time

runAsync :: (IO a) -> IO (MVar a)
runAsync act = do
    mvar <- newEmptyMVar
    _tid <- forkIO $ do
        r <- act
        putMVar mvar r
    return mvar

readTList tvars = do
    values <- mapM readTVar tvars :: STM [Int]
    !s <- return $! sum values
    return $! s
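To hit the multi-capability case described above, the reproducer presumably needs to be compiled with -threaded (it depends on the stm and time packages) and run with at least two capabilities, e.g. with +RTS -N2 -RTS, so that the two forked threads can actually validate in parallel.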
Expected behavior
Scheduling performance should not be significantly affected by the number of TVars involved in transactions.
Environment
- GHC version used:
Optional:
- Operating System:
- System Architecture: