STM fails to validate read.

This issue was brought up by napping in #haskell with this paste:

http://hpaste.org/85134 (the first paste in particular)

The code is:

import Control.Concurrent.STM
import Control.Concurrent
import Control.Monad

main = do
  dog <- newTVarIO False
  cat <- newTVarIO False
  let unset = do
        d <- readTVar dog
        c <- readTVar cat
        if (d || c) then retry else return ()
      setDog = unset >> writeTVar dog True
      setCat = unset >> writeTVar cat True
      oops = do
        d <- readTVar dog
        c <- readTVar cat
        guard (d && c)
      reset = do
        writeTVar dog False
        writeTVar cat False
      reset' = do
        d <- readTVar dog
        c <- readTVar cat
        guard (d || c)
        reset

  forkIO (atomically oops >> putStrLn "Oh Noes!")
  forever (do
    forkIO (atomically setDog)
    forkIO (atomically setCat)
    atomically reset'
    atomically reset')

When run it produces:

$ ghc --make test.hs -threaded -rtsopts
$ ./test +RTS -N2
Oh Noes!
test: thread blocked indefinitely in an STM transaction

The second message is just a consequence of entering an unexpected state. The first message indicates that both the transactions cat and dog committed at the same time.

It does this for HEAD and 7.6.

I've sketched out an interleaving that leads to this. TRec entries are in the first and third column and TVar's are in the second column. Each entry has a TVar name and the expected value followed by the new value and then a number of updates if it has been read. TVars list their value and their number of updates.

  A TRec       TVar         B TRec
                                     -- Transactions start
cat F F      cat F 0      cat F F    -- Initial reads.
dog F F      dog F 0      dog F F  

cat F T                   dog F T    -- Local writes in TRec's
                        
                                     -- Validation:

                          cat F F 0  -- B reads num_updates from cat (with
                                  ^  --   consistency check with value)
cat F T    cat A 0                |  -- A acquires lock for cat (atomic cas)
dog F F 0        ^                |  -- A reads num_updates from dog (with
        ^        |                |  --    consistency check with value)
        |      dog B 0    dog F T |  -- B acquires lock for dog (atomic cas)
        |        |   ^            |  
        |        |   |            |
Success 0        |   0            |  -- read check for A
Success          0                0  -- read check for B

           cat A 1                   -- Increment version
           cat T 1                   -- Unlock with new value
               dog B 1               -- Increment version
               dog B T               -- Unlock with new value

What is clear here is that the version number is not enough to check in check_read_only because there is a gap between locking and incrementing the version. We need to know atomically that the TVar is not locked and it's version number is the same.

I need to read through the right parts of Keir Fraser's thesis carefully, but it seems like the read phase here is only helpful in preventing a commit that writes back the exact value we have already seen while we are in the middle of committing.

The code for check_read_only is here:

static StgBool check_read_only(StgTRecHeader *trec STG_UNUSED) {
  StgBool result = TRUE;

  ASSERT (config_use_read_phase);
  IF_STM_FG_LOCKS({
    FOR_EACH_ENTRY(trec, e, {
      StgTVar *s;
      s = e -> tvar;
      if (entry_is_read_only(e)) {
        TRACE("%p : check_read_only for TVar %p, saw %ld", trec, s, e -> num_updates);
        if (s -> num_updates != e -> num_updates) {
          // ||s -> current_value != e -> expected_value) {
          TRACE("%p : mismatch", trec);
          result = FALSE;
          BREAK_FOR_EACH;
        }
      }
    });
  });

  return result;
}

If I restore the commented out line (which appears commented out in the first commit of this code that I can find) I can't reproduce the issue, but I think there is still a problem due to the ordering of those checks: we could observe the version as the same, while it is locked, have the TVar unlock, then observe the value the same.

Switching the order we can only observe the TVar unlocked if the update has been incremented (as long as we are on an architecture that ensures this such as x86).

Does this seem right? One issue is that given the interleaving that above with this added check both transactions could fail to commit. I think the algorithms from Fraser avoid this, but I think it always involves invalidating another transaction (i.e. killing off any other transactions observed to be in the read check phase with an overlapping TVar in a way that results in only one winner (see page 21 section 4.4 in Concurrent Programming Without Locks)).

As a side note, the issue can be avoided by ensuring that the reads become writes and avoiding the read only check. But you can only become a write if the values do not have matching pointers, switching to Ints instead of Bools gets you there.

Trac metadata

Trac field	Value
Version	7.7
Type	Bug
TypeOfFailure	OtherFailure
Priority	normal
Resolution	Unresolved
Component	Runtime System
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information

Admin message

STM fails to validate read.