Validation of STM transactions by the scheduler causes large overhead, potentially preventing progress.
Summary
The scheduler validating STM transactions in every scheduler loop can lead to scheduler time dominating a program's runtime.
Details
When a thread returns to the scheduler we call schedulePostRunThread, which validates active transactions through stmValidateNestOfTransactions. stmValidateNestOfTransactions itself calls validate_and_acquire_ownership on the current transaction and on all the transactions that enclose it, with acquire_all set to true and retain_ownership set to false. validate_and_acquire_ownership then walks over all the TVars in the transaction log, locks them, and checks their expected values against their actual ones. If any entry fails the check we ultimately condemn the whole transaction. So far so good; in the case at hand this never happens, as we only perform reads.
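To make the shape of that walk concrete, here is a simplified sketch in Haskell. This is not the actual rts/STM.c code (which is C and additionally performs an atomic lock/unlock per TVar); LogEntry and validateLog are made-up names for illustration only.

import Data.IORef (IORef, readIORef)

-- One entry of a (much simplified) transaction log: the reference that was
-- read and the value that was observed when it was read.
data LogEntry a = LogEntry
    { entryRef  :: IORef a
    , seenValue :: a
    }

-- Conceptual validation: the transaction is still valid only if every entry
-- still holds the value observed at read time. The real
-- validate_and_acquire_ownership does this walk over the transaction record
-- and also pays an atomic lock and unlock per TVar, so its cost is linear
-- in the number of TVars read.
validateLog :: Eq a => [LogEntry a] -> IO Bool
validateLog entries = and <$> mapM stillValid entries
  where
    stillValid (LogEntry ref seen) = do
        current <- readIORef ref
        return (current == seen)

The important point is that this full walk is repeated on every return to the scheduler, independent of how much progress the transaction has made since the last validation.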
This is already bad with just one capability:
- As the number of TVars increases, each read gets slower, resulting in fewer reads per second. (See #24410)
- As read performance decreases we return to the scheduler more frequently relative to the progress made by the transaction, because of context switches. Each time we return to the scheduler we pay the full cost of validate_and_acquire_ownership, which also scales with the number of TVars in the transaction.
But with multiple capabilities this gets even worse:
- Each thread will perform the validate_and_acquire_ownership check whenever it returns to the scheduler. This means scheduling will take longer.
- As we spend more time in scheduler code, less time is spent on productive work in between context switches. This means we end up using more cycles on overhead per unit of work.
- In the worst case most of the runtime is spent doing validation of transactions, with little time spent doing productive work that progresses the transactions.
- If two threads perform validation in parallel they might also run into #24446, preventing either thread from making progress.
Some napkin math:
In the example I'm looking at we deal with 100k TVars. According to "Evaluating the Cost of Atomic Operations on Modern Architectures", a CAS takes on the order of 10-100ns.
Near the end of the run we will perform roughly 400k CAS operations during each context switch, since both threads perform one CAS to lock and one to unlock each of the 100k TVars.
If we assume 50ns per CAS, a full context switch will take around 20ms from the cost of the CAS operations alone. By default our context switch interval is also 20ms.
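For reference, the same napkin math written out in Haskell. All numbers are the assumptions stated above, not measurements, and the function name is mine.

-- Napkin-math estimate of the CAS cost of a single context switch:
-- 2 threads, each performing one lock CAS and one unlock CAS per TVar,
-- at an assumed 50ns per CAS.
estimatedSwitchCostMs :: Double
estimatedSwitchCostMs = totalCas * nsPerCas / 1000000
  where
    tvarCount           = 100000
    threads             = 2
    casPerTVarPerThread = 2   -- one CAS to lock, one to unlock
    totalCas            = tvarCount * threads * casPerTVarPerThread  -- 400,000
    nsPerCas            = 50
-- estimatedSwitchCostMs == 20.0, i.e. about one full 20ms time slice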
I think it's plausible that we run into situations where we get:
- x: Mutator is running
- x+0ms: Ticker triggers a context switch.
- x+0-19ms: Scheduler runs validate_and_acquire_ownership
- x+19-20ms: Mutator runs - we perform a handful of reads
- x+20ms: Ticker triggers another context switch.
This would repeat until the runtime of the scheduler eventually exceeds 20ms, at which point we would "lose" one context switch and the mutator would get a full 20ms slice to perform reads. But before we get there we might spend a long time during which 95%+ of the runtime is scheduler overhead.
However it's hard to confirm this, as any kind of instrumentation causes enough overhead to significantly distort the pattern. #24446 also affects this.
Steps to reproduce
So far I have investigated this using the following reproducer:
{-# LANGUAGE NumericUnderscores #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE BangPatterns #-} -- needed for the !s bang pattern in readTList

module Main where

import Control.Concurrent (ThreadId, myThreadId, forkIO)
import Control.Concurrent.STM (STM, atomically, newTVar, readTVar, TVar(..))
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import GHC.IO (unsafePerformIO)
import System.IO (hFlush, stdout)
import Data.Time.Clock
import Control.Concurrent.MVar
import Control.Monad

main = do
    let nums = [
          10_000 :: Int
          -- ,50_000
          ,100_000
          -- ,150_000
          -- ,200_000
          -- ,300_000
          -- ,400_000
          ]
    forM_ nums $ \tvar_count -> do
        print $ "Measuring: " ++ show tvar_count
        (tvars :: [TVar Int]) <- atomically $ mapM (newTVar) [1 .. tvar_count]
        putStrLn "Created TVars"
        start_time <- getCurrentTime
        s1 <- runAsync $ atomically $ do
            readTList tvars
        s2 <- runAsync $ atomically $ do
            readTList tvars
        s1 <- takeMVar s1
        s2 <- takeMVar s2
        print $ "done, " ++ show (s1,s2)
        done_time <- getCurrentTime
        print $ done_time `diffUTCTime` start_time

runAsync :: (IO a) -> IO (MVar a)
runAsync act = do
    mvar <- newEmptyMVar
    _tid <- forkIO $ do
        r <- act
        putMVar mvar r
    return mvar

readTList tvars = do
    values <- mapM readTVar tvars :: STM [Int]
    !s <- return $! sum values
    return $! s
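To hit the multi-capability case described above, the reproducer presumably needs to be compiled with -threaded (it depends on the stm and time packages) and run with at least two capabilities, e.g. with +RTS -N2 -RTS, so that the two forked threads can actually validate in parallel.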
Expected behavior
Scheduling performance should not be significantly affected by the number of TVars involved in transactions.
Environment
- GHC version used:
Optional:
- Operating System:
- System Architecture: