Reducing the number of capabilities to 1 makes it work for those inputs:
$ ./ghc-bug 7 +RTS -N1
As a side note, the problem only happens randomly with small inputs (on my hardware) and seems to go away with bigger inputs. (The original testcase felt a bit more deterministic, but I think the testcase in the ticket is good enough.)
I only tested this with GHC 7.8.4 (on Debian), but people on IRC reported the same behavior with GHC 7.10.1 on OS X and Debian
Similar bug: #10218 (closed) (-fno-cse and -flate-dmd-anal didn't help with this)
import Control.Applicative
import Control.Monad
import Control.Parallel.Strategies
import System.Environment

newtype ParList a = ParList { unParList :: [a] }

nil :: ParList a
nil = ParList []

cons :: a -> ParList a -> ParList a
cons x (ParList xs) = ParList (x:xs)

instance Functor ParList where
  fmap = liftM

instance Applicative ParList where
  pure  = return
  (<*>) = ap

instance Monad ParList where
  return = ParList . return
  {- v code that doesn't work -}
  (ParList xs) >>= f = ParList (withStrategy (parListChunk 8 rseq) (xs >>= unParList . f))
  -- (ParList xs) >>= f = ParList (concat (parMap rseq (unParList . f) xs))
  {- ^ code that works -}

type Pair = (Int, [Int])

loop' :: Pair -> ParList Pair
loop' (size, qns) = go 1
  where
    go n | n > size  = nil
         | otherwise = cons (size, n:qns) (go (n+1))

worker :: Int -> Pair -> [Pair]
worker n = unParList . go n
  where
    go 1 = loop'
    go n = loop' >=> go (n-1)

main :: IO ()
main = do
  [n] <- (read <$>) <$> getArgs
  print $ length (worker n (n, []))
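For reference, a minimal way to build and run this (a sketch, assuming the module is saved as ghc-bug.hs; -threaded and -rtsopts are needed for the +RTS -N options, and the parallel package turns on -feager-blackholing by itself via its cabal file):

ghc -O -threaded -rtsopts ghc-bug.hs -o ghc-bug
./ghc-bug 7 +RTS -N2

Any -N greater than 1 should do; with -N1 it works, as noted above.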
I simplified the problem a little. The first command line argument is the chunk size given to parListChunk, which seems to matter. I get the same result with ghc-7.8.3 and ghc-7.10.1 on OS X.
I'd love it to be fixed in 7.10.2, but first someone needs to volunteer to look into what is wrong, and hopefully find out how to fix it. Is it a bug in a library? In the RTS? In the compiler?
I inlined the relevant parts of the parallel package into michaelt's example for the sake of testing across multiple versions of GHC. In the process I discovered that building with -feager-blackholing is necessary to reproduce the bug. Well, not surprising. (The parallel package specifies this flag in its cabal file.) To be specific, I am building it as
ghc -threaded -O -rtsopts par -fforce-recomp -feager-blackholing
This arranges not to black-hole a thunk that is used at most once. Can you try with -fkill-one-shot? That switches off this optimisation, and I bet it'll make the program work.
Assuming that does fix it, the next question is: is it just the CoreToStg upd_flag? (-fkill-one-shot affects more things than just that one spot.) The next thing I'd do would be just to comment out the SingleEntry case above, rebuild the compiler, and check that the bug is gone. I'm 95% sure that it will be, but it's worth checking.
Next: if some thunk is being marked SingleEntry, and that is causing the bug:
Which thunk is it?
Is it really single-entry? (I.e., is the analysis right or not?)
If you -ddump-stg and look for thunks marked \s, those are the single-entry ones. There probably aren't very many. If it's very few, we could look at the "is the analysis right?" question. If it's many, we'll need to find a way to narrow it down somehow.
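Concretely, something along these lines should show them (illustrative only; the exact dump format differs between GHC versions):

ghc -threaded -O -feager-blackholing -fforce-recomp -ddump-stg par.hs

In the dump, updatable thunks appear as \u [...] bindings and single-entry ones as \s [...], in the same notation used further down this thread.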
Lazy blackholing will still take place for thunks that are not blackholed by eager blackholing, because we have no way to distinguish between an eager-blackholed and a lazy-blackholed thunk in the runtime. We had bugs in this area in the past, see #5226 (closed). I'm not sure this helps, but it's possible that the cardinality analysis is correct and this is a runtime bug.
My gut feeling is that this is a bug in the RTS that was uncovered (for this program) by the cardinality analysis patch, but I have no real evidence for this.
The mechanism for attaching source files must be right before my eyes, but I can't find it, so here is the reduced module:
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import Control.Applicative
import Control.Monad
import GHC.Exts

newtype Eval a = Eval (State# RealWorld -> (# State# RealWorld, a #))

instance Functor Eval where
  fmap = liftM

instance Applicative Eval where
  (<*>) = ap; pure = return

instance Monad Eval where
  return x = Eval $ \s -> (# s, x #)
  Eval x >>= k = Eval $ \s -> case x s of
    (# s', a #) -> case k a of Eval f -> f s'

rparWith s a = Eval $ \s0 -> spark# r s0
  where r = case s a of Eval f -> case f realWorld# of (# _, a' #) -> a'

runEval :: Eval a -> a
runEval (Eval x) = case x realWorld# of (# _, a #) -> a

main :: IO ()
main = do
  -- print $ length (pf 'x')   -- either statement works at least on and off
  print (program 'y')          -- but I seem to lose the effect if I use both statements

program = pchunk . concatMap (pchunk . concatMap (pchunk . concatMap (pchunk . show) . show) . show) . show
  where
    -- the effect seems to vanish if I eta expand pchunk
    pchunk = runEval . fmap concat . mapM (rparWith (mapM (\x -> Eval $ \s -> seq# x s))) . chunk'

    -- the effect seems to disappear if I reject splitAt in favor
    -- of a pattern match  chunk' (a:b:c:xs) = [a,b,c] : chunk' xs
    chunk' :: [a] -> [[a]]
    chunk' [] = []
    chunk' xs = as : chunk' bs
      where (as, bs) = splitAt 3 xs
And a bit more compressed, for what it may be worth:
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts

newtype Eval a = Eval { runEval :: State# RealWorld -> (# State# RealWorld, a #) }

-- inline sequence :: [Eval a] -> Eval [a]
well_sequenced :: [Eval a] -> Eval [a]
well_sequenced = foldr op (Eval $ \s -> (# s, [] #))
  where
    op e es = Eval $ \s -> case runEval e s of
      (# s', a #) -> case runEval es s' of
        (# s'', as #) -> (# s'', a:as #)

-- seemingly demonic use of spark#
ill_sequenced :: [Eval a] -> Eval [a]
ill_sequenced as = Eval $ spark# (case well_sequenced as of
  Eval f -> case f realWorld# of (# _, a' #) -> a')

main :: IO ()
main = print ((layer . layer . layer . layer . layer) show 'y')
  where
    layer :: (Char -> String) -> (Char -> String)
    layer f = (\(Eval x) -> case x realWorld# of (# _, as #) -> concat as)
            . well_sequenced
            . map ill_sequenced                        -- all is well with well_sequenced
            . map (map (\x -> Eval $ \s -> (# s, x #)))
            . chunk'
            . concatMap f
            . show

    -- chunk' as in the module above
    chunk' :: [a] -> [[a]]
    chunk' [] = []
    chunk' xs = as : chunk' bs
      where (as, bs) = splitAt 3 xs
Sorry, I'm spamming the Trac a bit. Notice that in the ultra-simplified module, now attached properly, the wrapping with Lift that parallel uses for rparWith is nowhere to be found. If I wrap stuff in my ill_sequenced with Lift, I can't get the effect. If, though, that use of Lift in the definition of rparWith is required by whatever is going on with spark# and some of these other opaque-to-me primitives, then there is a question whether it is used enough: the original program is doing an end-run around this. It is presumably obviously undesirable, but if in rbarton's par.hs I complicate the definition of rpar, which is
then it seems all is well again. That probably destroys all the desired effects; but if it doesn't, then the problem may just be that the library is letting the user get too close to spark#, which is practically naked in rpar.
I took a look at the generated STG for par2.hs before and after the cardinality analysis commit. Before, there are no \s thunks at all. After, there are two. One is in
which presumably comes from the use of concat in layer. This happens even when I build the program with -fkill-absence -fkill-one-shot. Could that be because base was built with cardinality analysis enabled? I don't entirely see how, but I can try rebuilding the libraries with -fkill-absence -fkill-one-shot.
Anyway, I guess the main question, which I'm not sure how to answer, is whether it is correct for this thunk to be marked as single-entry.
The other single-entry thunk is, I think, very similar and arises from concatMap:
Inlining concat and concatMap made no difference: the program still loops and a \s thunk is still generated inside "main_mygo". This happens for any combination of the -fkill-absence and -fkill-one-shot flags, and in every version I tested (the 7.7 commit adding cardinality analysis, 7.8.4, 7.10.1 and HEAD).
I then tried removing the SingleEntry case as Simon suggested in ticket:10414#comment:101567, and that did generate a \u thunk instead, and the program no longer <<loop>>s.
So, one conclusion is that -fkill-absence/-fkill-one-shot don't fully disable cardinality analysis as they are expected to.
I think I've finally come up with at least part of a plausible explanation. Tell me if this seems right...
Suppose I have a heap object whose initial value is
x = \u [] concat [[1],[]]
    -- using "\u []" as syntax for "updatable thunk",
    -- otherwise normal Core syntax (not STG)
and suppose first some thread evaluates x to WHNF. Using the definition of concat as main_go from above, this will allocate a single-entry thunk y = \s [] concat [], and then calling (++) will lead to
x = 1 : z
y = \s [] concat []
z = \u [] (++) [] y
At this point the heap has the following property
(*) The heap contains a single-entry thunk (y) and a regular thunk (z) such that entering the regular thunk will cause the single-entry thunk to be entered as well.
(The key point is that the single-entry thunk has already been allocated on the heap, in contrast to a situation in which entering a regular thunk would cause a new single-entry thunk to be allocated, possibly entered, and then become garbage, all before evaluation of that regular thunk is complete.)
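For reference, here is a hand-written sketch of the definitions involved (shape only: main_go is whatever worker GHC actually generates for concat, and myAppend stands in for the Prelude (++)):

-- A sketch of what concat amounts to after inlining; the worker in the
-- STG dump is named something like main_go.
myConcat :: [[a]] -> [a]
myConcat []         = []
myConcat (xs : xss) = xs `myAppend` myConcat xss
  -- the thunk built for  myConcat xss  plays the role of  y = \s [] concat []
  -- in the heap picture above

-- The usual list append.  Its second argument is entered at most once on any
-- evaluation path, which is what can justify marking a thunk passed in that
-- position as single-entry.
myAppend :: [a] -> [a] -> [a]
myAppend []     ys = ys
myAppend (x:xs) ys = x : myAppend xs ys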
Now, consider the following execution path of two threads A and B.
Both threads enter z simultaneously (before either manages to overwrite it with a black hole, if eager blackholing was enabled when we compiled (++); otherwise before either manages to update it to an indirection).
Thread A does the case analysis inside (++) and enters y, and overwrites it with a black hole before thread B does anything else.
Now thread B does the case analysis inside (++) and enters y, but y has been overwritten with a black hole so thread B blocks. But y is never going to be updated, so thread B will block forever. This is bad!
Finally thread A finishes evaluating y (to []) and then updates z accordingly. But thread B is still blocking on the black hole and even if it could get unblocked by some mechanism (say in the GC) there's no longer enough information on the heap to recover the correct value of y and allow thread B to continue.
This doesn't exactly explain why the programs in this ticket <<loop>>, but a thread becoming permanently blocked is equally bad behavior I think.
Some extra evidence that something like this is going on is that if I inline the definition of (++) into par2.hs as well, so that it is compiled with eager blackholing enabled, the program <<loop>>s much less frequently, just a few percent of the time as opposed to over half the time. That would match up with a smaller window of simultaneity in the first step of the execution trace above.
If this analysis is correct, then assuming we want to continue to allow threads to enter thunks in an unsynchronized way (to avoid prohibitive locking costs), it seems we have to ensure that the condition (*) never holds, at least when eager blackholing is enabled. Generating single-entry thunks is still okay as long as they never survive as live heap objects after the thunk that allocated them has been reduced to WHNF.
I also came across this comment in rts/ThreadPaused.c (I think it is concerning a different scenario):
// NB. Blackholing is *compulsory*, we must either do lazy
// blackholing, or eager blackholing consistently.  See Note
// [upd-black-hole] in sm/Scav.c.
Does this mean that every module in the entire program, including all the libraries the program links against like base, needs to be compiled with the same presence or absence of -feager-blackholing? The User's Guide doesn't mention anything about this and if that is the case, the flag seems essentially unusable except for those who are building their own GHC from source. I'm hoping the comment is just worded unclearly.
With lazy blackholing, a single-entry thunk will never be black-holed. Lazy black-holing only black-holes thunks with an update frame on the stack, and single-entry thunks do not push an update frame. So that could explain why -feager-blackholing is required.
Rather to my surprise, the code generator emits code to black-hole even a single-entry thunk. I can't see any good reason why this happens. Look at StgCmmClosure.blackHoleOnEntry.
Re ticket:10414#comment:101799, we need Simon to say. It would be pretty bad if every module had to be compiled consistently. However, I note that the comment in Scav.c is in code that messes with update frames, and so single-entry thunks probably don't matter.
Even if all this is right, we still don't know why we get <<loop>>. I'm assuming that this means we have entered a black hole that is under evaluation by this thread, but suddenly I'm not sure. (For a black hole being evaluated by another thread, we'd block.)
I suggest you try making blackHoleOnEntry return False for single-entry thunks.
Thanks for all the awesome debugging work here, I think we finally have all the pieces!
@rwbarton's analysis seems plausible to me too. A single-entry thunk can still be entered multiple times in a parallel setting, and if eager blackholing is on then it is possible that a thread can get blocked indefinitely.
Threads blocked indefinitely are detected by the GC as deadlocked, and they receive an exception, which results in the <<loop>> message.
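For comparison, this is the same exception that an ordinary self-dependent thunk raises. A minimal sketch, which prints <<loop>> at least when compiled without optimisation:

-- The thunk for x re-enters its own black hole; the RTS raises
-- NonTermination, whose Show instance is the string <<loop>>.
main :: IO ()
main = print (let x = x + 1 :: Int in x)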
The fix should be simple: just don't do eager blackholing for single-entry thunks.
Regarding this comment:
// NB. Blackholing is *compulsory*, we must either do lazy
// blackholing, or eager blackholing consistently.  See Note
// [upd-black-hole] in sm/Scav.c.
It means every update frame must refer to a black hole by the time the GC runs; it's an invariant the GC relies on. It shouldn't be a problem here because it only applies to update frames. I can reword the comment so it's clearer.
The actual cause of the <<loop>> here seems to be that two threads are each blocking on a black hole that is being evaluated, or more likely has been evaluated but not updated, by the other thread. I attached a complete -Ds log above, but the relevant lines are
...
0.011574 7ff6be7fc700: cap 1: thread 6 stopped (blocked on black hole owned by thread 5)
...
0.011808 7ff6c6700740: cap 0: thread 5 stopped (blocked on black hole owned by thread 6)
...
I didn't work out exactly how this can arise, but it probably involves two single-entry thunks and two ordinary thunks whose evaluations force both of the single-entry thunks, but in different orders.
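For example, a heap shape like the following would produce exactly that log (my own sketch, in the same pseudo-notation as above; f and g are just placeholders for code that forces both single-entry thunks):

y1 = \s [] ...            y2 = \s [] ...
z1 = \u [] f y1 y2        -- forces y1 first, then y2
z2 = \u [] g y2 y1        -- forces y2 first, then y1

If one thread enters z1 and eagerly black-holes y1 while the other enters z2 and black-holes y2, each thread then ends up blocked on the black hole owned by the other.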
Changing blackHoleOnEntry for single-entry thunks as suggested did fix par2.hs. I'm going to test the other examples in this ticket now.
The complex program yongqli linked, with all the fancy imports, was a little erratic; I increased the size of the CSV a lot, to make it more reliably bad somewhere along the way. I then just used the scheme of running ./bugcsv +RTS -N_ | grep loop 500 times each with -N5 and -N3. (-N2 doesn't seem to bring out the pathology with this program.) With ghc-7.10.1 I got
bugcsv: <<loop>>
about 200 times either way, but with the patched HEAD, blessed silence.