Reducing the number of capabilities to 1 makes it work for those inputs:
$ ./ghc-bug 7 +RTS -N1
As a side note, the problem only happens randomly with small inputs (on my hardware) and seems to go away with bigger inputs. (The original testcase felt a bit more deterministic, but I think the testcase in the ticket is good enough.)
I only tested this with GHC 7.8.4 (on Debian), but people on IRC reported the same behavior with GHC 7.10.1 on OS X and Debian
Similar bug: #10218 (closed) (-fno-cse and -flate-dmd-anal didn't help with this)
import Control.Applicative
import Control.Monad
import Control.Parallel.Strategies
import System.Environment

newtype ParList a = ParList { unParList :: [a] }

nil :: ParList a
nil = ParList []

cons :: a -> ParList a -> ParList a
cons x (ParList xs) = ParList (x:xs)

instance Functor ParList where
  fmap = liftM

instance Applicative ParList where
  pure  = return
  (<*>) = ap

instance Monad ParList where
  return = ParList . return
  {- v code that doesn't work -}
  (ParList xs) >>= f = ParList (withStrategy (parListChunk 8 rseq) (xs >>= unParList . f))
  -- (ParList xs) >>= f = ParList (concat (parMap rseq (unParList . f) xs))
  {- ^ code that works -}

type Pair = (Int, [Int])

loop' :: Pair -> ParList Pair
loop' (size, qns) = go 1
  where
    go n | n > size  = nil
         | otherwise = cons (size, n:qns) (go (n+1))

worker :: Int -> Pair -> [Pair]
worker n = unParList . go n
  where
    go 1 = loop'
    go n = loop' >=> go (n-1)

main :: IO ()
main = do
  [n] <- (read <$>) <$> getArgs
  print $ length (worker n (n, []))
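For reference, a minimal way to build and run this (a sketch, assuming the module is saved as ghc-bug.hs; -threaded and -rtsopts are needed for the +RTS -N options, and the parallel package turns on -feager-blackholing by itself via its cabal file):

ghc -O -threaded -rtsopts ghc-bug.hs -o ghc-bug
./ghc-bug 7 +RTS -N2

Any -N greater than 1 should do; with -N1 it works, as noted above.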
I simplified the problem a little. The first command line argument is the chunk size given to parListChunk, which seems to matter. I get the same result with ghc-7.8.3 and ghc-7.10.1 on OS X.
I'd love it to be fixed in 7.10.2, but first someone needs to volunteer to look into what is wrong, and hopefully find out how to fix it. Is it a bug in a library? In the RTS? In the compiler?
I inlined the relevant parts of the parallel package into michaelt's example for the sake of testing across multiple versions of GHC. In the process I discovered that building with -feager-blackholing is necessary to reproduce the bug. Well, not surprising. (The parallel package specifies this flag in its cabal file.) To be specific, I am building it as
ghc -threaded -O -rtsopts par -fforce-recomp -feager-blackholing
This arranges not to black-hole a thunk that is used at most once. Can you try with -fkill-one-shot? That switches off this optimisation, and I bet it'll make the program work.
Assuming that does fix it, the next question is: is it just the CoreToStg upd_flag? (-fkill-one-shot affects more things than just that one spot.) The next thing I'd do would be just to comment out the SingleEntry case above, rebuild the compiler, and check that the bug is gone. I'm 95% sure that it will be, but it's worth checking.
Next: if some thunk is being marked SingleEntry, and that is causing the bug:
Which thunk is it?
Is it really single-entry? (I.e., is the analysis right or not?)
If you -ddump-stg and look for thunks marked \s, those are the single-entry ones. There probably aren't very many. If it's very few, we could look at the "is the analysis right?" question. If it's many, we'll need to find a way to narrow it down somehow.
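Concretely, something along these lines should show them (illustrative only; the exact dump format differs between GHC versions):

ghc -threaded -O -feager-blackholing -fforce-recomp -ddump-stg par.hs

In the dump, updatable thunks appear as \u [...] bindings and single-entry ones as \s [...], in the same notation used further down this thread.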
Lazy blackholing will still take place for thunks that are not blackholed by eager blackholing, because we have no way to distinguish between an eager-blackholed and a lazy-blackholed thunk in the runtime. We had bugs in this area in the past, see #5226 (closed). I'm not sure this helps, but it's possible that the cardinality analysis is correct and this is a runtime bug.
My gut feeling is that this is a bug in the RTS that was uncovered (for this program) by the cardinality analysis patch, but I have no real evidence for this.
The mechanism for attaching source files must be right before my eyes, but I can't find it, so here is the reduced module:
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import Control.Applicative
import Control.Monad
import GHC.Exts

newtype Eval a = Eval (State# RealWorld -> (# State# RealWorld, a #))

instance Functor Eval where
  fmap = liftM

instance Applicative Eval where
  (<*>) = ap; pure = return

instance Monad Eval where
  return x = Eval $ \s -> (# s, x #)
  Eval x >>= k = Eval $ \s -> case x s of
    (# s', a #) -> case k a of Eval f -> f s'

rparWith s a = Eval $ \s0 -> spark# r s0
  where r = case s a of Eval f -> case f realWorld# of (# _, a' #) -> a'

runEval :: Eval a -> a
runEval (Eval x) = case x realWorld# of (# _, a #) -> a

main :: IO ()
main = do
  -- print $ length (pf 'x')   -- either statement works at least on and off
  print (program 'y')          -- but I seem to lose the effect if I use both statements

program = pchunk . concatMap (pchunk . concatMap (pchunk . concatMap (pchunk . show) . show) . show) . show
  where
    -- the effect seems to vanish if I eta expand pchunk
    pchunk = runEval . fmap concat . mapM (rparWith (mapM (\x -> Eval $ \s -> seq# x s))) . chunk'

    -- the effect seems to disappear if I reject splitAt in favor
    -- of a pattern match  chunk' (a:b:c:xs) = [a,b,c] : chunk' xs
    chunk' :: [a] -> [[a]]
    chunk' [] = []
    chunk' xs = as : chunk' bs
      where (as, bs) = splitAt 3 xs
And a bit more compressed, for what it may be worth:
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts

newtype Eval a = Eval { runEval :: State# RealWorld -> (# State# RealWorld, a #) }

-- inline sequence :: [Eval a] -> Eval [a]
well_sequenced :: [Eval a] -> Eval [a]
well_sequenced = foldr op (Eval $ \s -> (# s, [] #))
  where
    op e es = Eval $ \s -> case runEval e s of
      (# s', a #) -> case runEval es s' of
        (# s'', as #) -> (# s'', a:as #)

-- seemingly demonic use of spark#
ill_sequenced :: [Eval a] -> Eval [a]
ill_sequenced as = Eval $ spark# (case well_sequenced as of
  Eval f -> case f realWorld# of (# _, a' #) -> a')

main :: IO ()
main = print ((layer . layer . layer . layer . layer) show 'y')
  where
    layer :: (Char -> String) -> (Char -> String)
    layer f = (\(Eval x) -> case x realWorld# of (# _, as #) -> concat as)
            . well_sequenced
            . map ill_sequenced                        -- all is well with well_sequenced
            . map (map (\x -> Eval $ \s -> (# s, x #)))
            . chunk'
            . concatMap f
            . show

    -- chunk' as in the module above
    chunk' :: [a] -> [[a]]
    chunk' [] = []
    chunk' xs = as : chunk' bs
      where (as, bs) = splitAt 3 xs
Sorry, I'm spamming the Trac a bit. Notice that in the ultra-simplified module, now attached properly, the wrapping with Lift that parallel uses for rparWith is nowhere to be found. If I wrap stuff in my ill_sequenced with Lift, I can't get the effect. If, though, that use of Lift in the definition of rparWith is required by whatever is going on with spark# and some of these other opaque-to-me primitives, then there is a question whether it is used enough: the original program is doing an end-run around this. It is presumably obviously undesirable, but if in rbarton's par.hs I complicate the definition of rpar, which is
then it seems all is well again. That probably destroys all the desired effects; but if it doesn't, then the problem may just be that the library is letting the user get too close to spark#, which is practically naked in rpar.
I took a look at the generated STG for par2.hs before and after the cardinality analysis commit. Before, there are no \s thunks at all. After, there are two. One is in
which presumably comes from the use of concat in layer. This happens even when I build the program with -fkill-absence -fkill-one-shot. Could that be because base was built with cardinality analysis enabled? I don't entirely see how, but I can try rebuilding the libraries with -fkill-absence -fkill-one-shot.
Anyway, I guess the main question, which I'm not sure how to answer, is whether it is correct for this thunk to be marked as single-entry.
The other single-entry thunk is, I think, very similar and arises from concatMap:
Inlining concat and concatMap made no difference: the program still loops and a \s thunk is still generated inside "main_mygo". This happens for any combination of the -fkill-absence and -fkill-one-shot flags, and in every version I tested (the 7.7 commit adding cardinality analysis, 7.8.4, 7.10.1 and HEAD).
I then tried removing the SingleEntry case as Simon suggested in ticket:10414#comment:101567, and that did generate a \u thunk instead, and the program no longer <<loop>>s.
So, one conclusion is that -fkill-absence/-fkill-one-shot don't fully disable cardinality analysis as they are expected to.
I think I've finally come up with at least part of a plausible explanation. Tell me if this seems right...
Suppose I have a heap object whose initial value is
x = \u [] concat [[1],[]]
    -- using "\u []" as syntax for "updatable thunk",
    -- otherwise normal Core syntax (not STG)
and suppose first some thread evaluates x to WHNF. Using the definition of concat as main_go from above, this will allocate a single-entry thunk y = \s [] concat [], and then calling (++) will lead to
x = 1 : z
y = \s [] concat []
z = \u [] (++) [] y
At this point the heap has the following property
(*) The heap contains a single-entry thunk (y) and a regular thunk (z) such that entering the regular thunk will cause the single-entry thunk to be entered as well.
(The key point is that the single-entry thunk has already been allocated on the heap, in contrast to a situation in which entering a regular thunk would cause a new single-entry thunk to be allocated, possibly entered, and then become garbage, all before evaluation of that regular thunk is complete.)
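For reference, here is a hand-written sketch of the definitions involved (shape only: main_go is whatever worker GHC actually generates for concat, and myAppend stands in for the Prelude (++)):

-- A sketch of what concat amounts to after inlining; the worker in the
-- STG dump is named something like main_go.
myConcat :: [[a]] -> [a]
myConcat []         = []
myConcat (xs : xss) = xs `myAppend` myConcat xss
  -- the thunk built for  myConcat xss  plays the role of  y = \s [] concat []
  -- in the heap picture above

-- The usual list append.  Its second argument is entered at most once on any
-- evaluation path, which is what can justify marking a thunk passed in that
-- position as single-entry.
myAppend :: [a] -> [a] -> [a]
myAppend []     ys = ys
myAppend (x:xs) ys = x : myAppend xs ys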
Now, consider the following execution path of two threads A and B.
Both threads enter z simultaneously (before either manages to overwrite it with a black hole, if eager blackholing was enabled when we compiled (++); otherwise before either manages to update it to an indirection).
Thread A does the case analysis inside (++) and enters y, and overwrites it with a black hole before thread B does anything else.
Now thread B does the case analysis inside (++) and enters y, but y has been overwritten with a black hole so thread B blocks. But y is never going to be updated, so thread B will block forever. This is bad!
Finally thread A finishes evaluating y (to []) and then updates z accordingly. But thread B is still blocking on the black hole and even if it could get unblocked by some mechanism (say in the GC) there's no longer enough information on the heap to recover the correct value of y and allow thread B to continue.
This doesn't exactly explain why the programs in this ticket <<loop>>, but a thread becoming permanently blocked is equally bad behavior I think.
Some extra evidence that something like this is going on is that if I inline the definition of (++) into par2.hs as well, so that it is compiled with eager blackholing enabled, the program <<loop>>s much less frequently, just a few percent of the time as opposed to over half the time. That would match up with a smaller window of simultaneity in the first step of the execution trace above.
If this analysis is correct, then assuming we want to continue to allow threads to enter thunks in an unsynchronized way (to avoid prohibitive locking costs), it seems we have to ensure that the condition (*) never holds, at least when eager blackholing is enabled. Generating single-entry thunks is still okay as long as they never survive as live heap objects after the thunk that allocated them has been reduced to WHNF.
I also came across this comment in rts/ThreadPaused.c (I think it is concerning a different scenario):
// NB. Blackholing is *compulsory*, we must either do lazy
// blackholing, or eager blackholing consistently.  See Note
// [upd-black-hole] in sm/Scav.c.
Does this mean that every module in the entire program, including all the libraries the program links against like base, needs to be compiled with the same presence or absence of -feager-blackholing? The User's Guide doesn't mention anything about this and if that is the case, the flag seems essentially unusable except for those who are building their own GHC from source. I'm hoping the comment is just worded unclearly.
With lazy blackholing, a single-entry thunk will never be black-holed. Lazy black-holing only black-holes thunks with an update frame on the stack, and single-entry thunks do not push an update frame. So that could explain why -feager-blackholing is required.
Rather to my surprise, the code generator emits code to black-hole even a single-entry thunk. I can't see any good reason why this happens. Look at StgCmmClosure.blackHoleOnEntry.
Re ticket:10414#comment:101799, we need Simon to say. It would be pretty bad if every module had to be compiled consistently. However, I note that the comment in Scav.c is in code that messes with update frames, and so single-entry thunks probably don't matter.
Even if all this is right, we still don't know why we get <<loop>>. I'm assuming that this means we have entered a black hole that is under evaluation by this thread, but suddenly I'm not sure. (For a black hole being evaluated by another thread, we'd block.)
I suggest you try making blackHoleOnEntry return False for single-entry thunks.
Thanks for all the awesome debugging work here, I think we finally have all the pieces!
@rwbarton's analysis seems plausible to me too. A single-entry thunk can still be entered multiple times in a parallel setting, and if eager blackholing is on then it is possible that a thread can get blocked indefinitely.
Threads blocked indefinitely are detected by the GC as deadlocked, and they receive an exception, which results in the <<loop>> message.
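For comparison, this is the same exception that an ordinary self-dependent thunk raises. A minimal sketch, which prints <<loop>> at least when compiled without optimisation:

-- The thunk for x re-enters its own black hole; the RTS raises
-- NonTermination, whose Show instance is the string <<loop>>.
main :: IO ()
main = print (let x = x + 1 :: Int in x)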
The fix should be simple: just don't do eager blackholing for single-entry thunks.
Regarding this comment:
// NB. Blackholing is *compulsory*, we must either do lazy
// blackholing, or eager blackholing consistently.  See Note
// [upd-black-hole] in sm/Scav.c.
It means every update frame must refer to a black hole by the time the GC runs; it's an invariant the GC relies on. It shouldn't be a problem here because it only applies to update frames. I can reword the comment so it's clearer.
The actual cause of the <<loop>> here seems to be that two threads are each blocking on a black hole that is being evaluated, or more likely has been evaluated but not updated, by the other thread. I attached a complete -Ds log above, but the relevant lines are
...
0.011574 7ff6be7fc700: cap 1: thread 6 stopped (blocked on black hole owned by thread 5)
...
0.011808 7ff6c6700740: cap 0: thread 5 stopped (blocked on black hole owned by thread 6)
...
I didn't work out exactly how this can arise, but it probably involves two single-entry thunks and two ordinary thunks whose evaluations force both of the single-entry thunks, but in different orders.
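For example, a heap shape like the following would produce exactly that log (my own sketch, in the same pseudo-notation as above; f and g are just placeholders for code that forces both single-entry thunks):

y1 = \s [] ...            y2 = \s [] ...
z1 = \u [] f y1 y2        -- forces y1 first, then y2
z2 = \u [] g y2 y1        -- forces y2 first, then y1

If one thread enters z1 and eagerly black-holes y1 while the other enters z2 and black-holes y2, each thread then ends up blocked on the black hole owned by the other.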
Changing blackHoleOnEntry for single-entry thunks as suggested did fix par2.hs. I'm going to test the other examples in this ticket now.
The complex program yongqli linked, with all the fancy imports, was a little erratic; I increased the size of the CSV a lot, to make it more reliably bad somewhere along the way. I then just used the scheme of running ./bugcsv +RTS -N_ | grep loop 500 times each with -N5 and -N3. (-N2 doesn't seem to bring out the pathology with this program.) With ghc-7.10.1 I got
bugcsv: <<loop>>
about 200 times either way, but with the patched HEAD, blessed silence.