The T9198 test shows rather unstable metrics in head, but the real issue is not the wobbly numbers but the scale: in GHC 8.10.4 and 9.0.1, adding more continuation terms results in linear cost increases, whereas with head the cost doubles with every term, as reported in !4890 (comment 343948).
Steps to reproduce
Compile testsuite/tests/perf/compiler/T9198.hs with +RTS -s, with varying numbers of additional `a` continuation terms. GHC's runtime and memory allocation grow exponentially; the same is not observed with, for example, GHC 8.10.4.
Expected behavior
Linear cost. With 8.10.4 and the continuation count unmodified I get:
$ ghc -O -fasm testsuite/tests/perf/compiler/T9198.hs +RTS -s
 116,102,480 bytes allocated in the heap
  ...
  MUT     time    0.069s  (  0.572s elapsed)
  GC      time    0.111s  (  0.111s elapsed)
  Total   time    0.181s  (  0.689s elapsed)
With head I get:
 514,989,696 bytes allocated in the heap
  ...
  MUT     time    0.212s  (  0.604s elapsed)
  GC      time    0.366s  (  0.366s elapsed)
  Total   time    0.581s  (  0.979s elapsed)
Environment
GHC version used: head (9.1.20210402)
I think this may be one of those examples in which the types grow to exponential size, so it may not be fully soluble. But it is very mysterious that things have got worse, so insight into that would be great.
Another mystery is why the compiler stats wobble around by ±5% without apparent cause, which messes up CI.
In ghc-8.10.4 the types are similarly massive, but compilation finishes quickly. It seems that the zonking of the type argument of begin is now forced, whereas it wasn't before.
I have insight. And the insight is that Quick Look is the culprit.
(On the call, we thought that GHC 9.0 exhibited the bad behavior, which was a solid alibi for Quick Look, because Quick Look is not in 9.0. But we were wrong on the call: 9.0 is just as fast as 8.10.)
The problem is in the second type argument (t) to begin. The concrete type we use to instantiate t is utterly massive. Prior to Quick Look, however, the type had sharing; now it has much, much less sharing (though still some).
To see why, let's trace through type-checking the body of main both pre- and post-QL.
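For orientation, here is a minimal sketch of the shape involved. This is not the verbatim contents of T9198.hs: the signatures of begin and a are reconstructed from the instantiations in the trace below, and end is a hypothetical name for the final consumer.

```haskell
module Main where

-- Reconstructed shape (an assumption, not the actual test file): begin
-- hands a unit action to its continuation, and each `a` extends that
-- action and passes it along.
begin :: Monad m => (m () -> t) -> t
begin k = k (return ())

a :: IO x -> (IO () -> t) -> t
a io k = k (io >> return ())

-- Hypothetical final consumer that just returns the accumulated action.
end :: IO () -> IO ()
end = id

-- Each extra `a` is one more "continuation term"; the instantiation of
-- begin's type argument t grows with every one of them.
main :: IO ()
main = begin a a a a a end
```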
Pre-QL:
We instantiate the type of begin with fresh variables m1 and t1: begin :: (m1 () -> t1) -> t1.
We then see that begin is applied to an argument. matchActualFunTysPart requires that begin have a function type. It does, with argument type m1 () -> t1.
We now type-check the argument a, instantiating it with fresh variables a1 and t2: a :: IO a1 -> (IO () -> t2) -> t2
We now match the expected argument type m1 () -> t1 against actual argument type IO a1 -> (IO () -> t2) -> t2.
a. Unify m1 := IO
b. Unify a1 := ()
c. Unify t1 := (IO () -> t2) -> t2
Repeat with the next a, eventually unifying t2 := (IO () -> t3) -> t3 and so on.
This creates a type with lots of sharing, because t1 refers to two copies of t2, which refers to two copies of t3, and so on. Easy to zonk, given that the zonker tracks an environment mapping metavariables to their fully-zonked counterparts (and thus we do not need to walk over the expanded type, ever).
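As a toy illustration (this is not GHC's zonker, just the idea): with an environment of already-zonked metavariables, each solution in the chain t1 := (IO () -> t2) -> t2, t2 := (IO () -> t3) -> t3, ... is visited once, and later occurrences reuse the shared result, so the work stays linear even though the fully expanded tree is exponentially large.

```haskell
import qualified Data.IntMap.Strict as IM

-- Toy types: metavariables, arrows, and a unit placeholder (standing in
-- for IO ()).
data Ty = TyVar Int | Fun Ty Ty | UnitTy

type Subst   = IM.IntMap Ty   -- metavariable solutions
type ZonkEnv = IM.IntMap Ty   -- metavariables already zonked

-- Zonk while threading the environment: every solution is zonked at most
-- once, and later occurrences of a metavariable reuse (share) that result.
zonk :: Subst -> ZonkEnv -> Ty -> (ZonkEnv, Ty)
zonk s env (Fun a b) =
  let (env1, a') = zonk s env  a
      (env2, b') = zonk s env1 b
  in (env2, Fun a' b')
zonk s env (TyVar v) =
  case IM.lookup v env of
    Just z  -> (env, z)                       -- already zonked: reuse it
    Nothing -> case IM.lookup v s of
      Nothing  -> (env, TyVar v)
      Just sol -> let (env', z) = zonk s env sol
                  in (IM.insert v z env', z)
zonk _ env UnitTy = (env, UnitTy)

-- The pre-QL chain for n continuation terms: t_i := (() -> t_{i+1}) -> t_{i+1}.
chain :: Int -> Subst
chain n = IM.fromList
  [ (i, Fun (Fun UnitTy (TyVar (i + 1))) (TyVar (i + 1))) | i <- [1 .. n] ]

main :: IO ()
main =
  -- Only n environment entries are created and visited, despite the fully
  -- expanded type having roughly 2^n nodes.
  let (env, _) = zonk (chain 30) IM.empty (TyVar 1)
  in print (IM.size env)
```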
Post-QL:
We instantiate the type of begin with fresh variables m1 and t1: begin :: (m1 () -> t1) -> t1.
We then see that begin is applied to N arguments, and we thus must ensure that begin's type can accommodate N arguments. This unifies t1 := t2 -> t3 -> t4 -> ... -> tN.
We now type-check the argument a, instantiating it with fresh variables a1 and u1: a :: IO a1 -> (IO () -> u1) -> u1
We now match the expected argument type m1 () -> t2 -> t3 -> ... -> tN against IO a1 -> (IO () -> u1) -> u1
a. Unify m1 := IO
b. Unify a1 := ()
c. Unify t2 := IO () -> u1
d. Unify u1 := t3 -> ... -> tN
Repeat with the next a, eventually unifying t3 := IO () -> u2, u2 := t4 -> ... -> tN, t4 := IO () -> u3, u3 := t5 -> ... -> tN, etc.
Conclusion:
The post-QL type-checking still has sharing, but not nearly as much as the pre-QL world, and this loss of sharing is what causes the problem. What's vexing is that I don't see a fix here: delaying looking at the arguments and unifying is key to the QL algorithm, so I don't see a way to restore the old behavior, short of reimplementing the old type-checker and choosing which to use based on the presence of -XImpredicativeTypes... but that's silly.
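To make the conclusion concrete, here is a small toy model (hypothetical names, not GHC code) that builds both substitutions for N continuation terms and measures the total number of nodes they store; () stands in for IO (). In this model the pre-QL substitution grows linearly in N and the post-QL one quadratically, while in both cases the fully expanded type is vastly larger than the substitution itself.

```haskell
import qualified Data.Map.Strict as M

data Ty = V String | Arr Ty Ty | UnitTy

-- Pre-QL: t_i := (() -> t_{i+1}) -> t_{i+1}; every solution has constant size.
preQL :: Int -> M.Map String Ty
preQL n = M.fromList
  [ ("t" ++ show i, Arr (Arr UnitTy next) next)
  | i <- [1 .. n], let next = V ("t" ++ show (i + 1)) ]

-- Post-QL: t1 := t2 -> ... -> tN, then t_i := () -> u_{i-1} and
-- u_{i-1} := t_{i+1} -> ... -> tN; each u restates the remaining spine.
postQL :: Int -> M.Map String Ty
postQL n = M.fromList $
    ("t1", spine 2)
  : concat [ [ ("t" ++ show i, Arr UnitTy (V ("u" ++ show (i - 1))))
             , ("u" ++ show (i - 1), spine (i + 1)) ]
           | i <- [2 .. n - 1] ]
  where
    spine k = foldr1 Arr [ V ("t" ++ show j) | j <- [k .. n] ]

-- Total number of nodes stored across all solutions in a substitution.
substSize :: M.Map String Ty -> Int
substSize = sum . map size . M.elems
  where
    size (Arr a b) = 1 + size a + size b
    size _         = 1

main :: IO ()
main = mapM_ report [5, 10, 20, 40]
  where
    report n = putStrLn $
      show n ++ ": pre-QL " ++ show (substSize (preQL n))
             ++ ", post-QL " ++ show (substSize (postQL n))
```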
Red herring:
I thought that the IVAR rule from the paper (implemented separately in tcInstFun) was the root of the problem, because it eagerly looks ahead to count all the arguments when unifying a function type. But commenting out that code didn't help, because the more routine IARG rule still does much the same work, only one-at-a-time instead of all-at-once. (IVAR is necessary for the best possible impredicative inference, but that's not at issue here, so commenting it out seemed like a sensible debugging step.)
Does that actually work? So my analysis above isn't the cause?
(It's clear your patch is an improvement over the status quo. But I still think the older typechecking algorithm has more sharing. Maybe your patch means the new algorithm has enough sharing?)
It seems to work spectacularly! With the patch, even after adding eight more continuation terms (which previously needed tens of GB of memory), I see:
  70,211,216 bytes allocated in the heap
  37,587,208 bytes copied during GC
  11,936,456 bytes maximum residency (4 sample(s))
     208,184 bytes maximum slop
          28 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        10 colls,     0 par    0.031s   0.031s     0.0031s    0.0062s
  Gen  1         4 colls,     0 par    0.074s   0.074s     0.0186s    0.0213s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time    3.106s  (  3.479s elapsed)
  GC      time    0.105s  (  0.105s elapsed)
  EXIT    time    0.001s  (  0.005s elapsed)
  Total   time    3.214s  (  3.591s elapsed)
which compares quite well against 8.10.4:
 125,999,280 bytes allocated in the heap
  62,644,624 bytes copied during GC
  11,542,640 bytes maximum residency (6 sample(s))
     164,168 bytes maximum slop
          31 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        66 colls,     0 par    0.058s   0.058s     0.0009s    0.0032s
  Gen  1         6 colls,     0 par    0.122s   0.122s     0.0203s    0.0287s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time    2.671s  (  3.138s elapsed)
  GC      time    0.180s  (  0.180s elapsed)
  EXIT    time    0.001s  (  0.004s elapsed)
  Total   time    2.854s  (  3.325s elapsed)
Mind you, while the memory blowup is resolved, there is still (also in GHC 8.10.4) considerable runtime cost from adding more terms, which is not reflected in proportionally more memory usage. Thus with 12 additional continuations I see (head):
  74,273,712 bytes allocated in the heap
  37,752,480 bytes copied during GC
  12,001,304 bytes maximum residency (4 sample(s))
     217,064 bytes maximum slop
          30 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        10 colls,     0 par    0.032s   0.032s     0.0032s    0.0073s
  Gen  1         4 colls,     0 par    0.080s   0.080s     0.0200s    0.0346s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.003s  (  0.003s elapsed)
  MUT     time   47.380s  ( 47.747s elapsed)
  GC      time    0.112s  (  0.112s elapsed)
  EXIT    time    0.001s  (  0.008s elapsed)
  Total   time   47.496s  ( 47.870s elapsed)
vs. GHC 8.10.4:
 129,913,000 bytes allocated in the heap
  77,780,256 bytes copied during GC
  19,245,896 bytes maximum residency (7 sample(s))
     246,968 bytes maximum slop
          47 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        60 colls,     0 par    0.073s   0.073s     0.0012s    0.0068s
  Gen  1         7 colls,     0 par    0.159s   0.159s     0.0227s    0.0388s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time   42.332s  ( 42.800s elapsed)
  GC      time    0.232s  (  0.232s elapsed)
  EXIT    time    0.001s  (  0.004s elapsed)
  Total   time   42.567s  ( 43.038s elapsed)
  75,025,112 bytes allocated in the heap
  37,730,176 bytes copied during GC
  11,991,904 bytes maximum residency (4 sample(s))
     210,080 bytes maximum slop
          30 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        10 colls,     0 par    0.027s   0.027s     0.0027s    0.0058s
  Gen  1         4 colls,     0 par    0.067s   0.067s     0.0168s    0.0213s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.003s  (  0.003s elapsed)
  MUT     time   95.087s  ( 95.450s elapsed)
  GC      time    0.094s  (  0.094s elapsed)
  EXIT    time    0.001s  (  0.011s elapsed)
  Total   time   95.185s  ( 95.558s elapsed)
vs. GHC 8.10.4:
 130,921,240 bytes allocated in the heap
  77,996,752 bytes copied during GC
  19,241,480 bytes maximum residency (7 sample(s))
     243,192 bytes maximum slop
          47 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        61 colls,     0 par    0.053s   0.053s     0.0009s    0.0041s
  Gen  1         7 colls,     0 par    0.130s   0.130s     0.0185s    0.0296s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   84.241s  ( 84.699s elapsed)
  GC      time    0.182s  (  0.183s elapsed)
  EXIT    time    0.001s  (  0.005s elapsed)
  Total   time   84.426s  ( 84.888s elapsed)
So the compiler runtime still looks exponential (also in 8.10.4), but it now runs in something like linear memory. Is the exponential runtime cost unavoidable?
Sure it actually works! (HEAD then allocates less than 8.10, rather than 10x more.) With QL we may get 20 unification variables rather than 10, but each should be zonked exactly once. No big deal. By throwing away huge chunks of the (supposedly threaded) meta_tv_env we were simply re-zonking many things many times: an asymptotic disaster.
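To illustrate the failure mode (a toy model in the same spirit as the zonker sketch earlier, not the actual patch): if the recursion hands each child the original environment rather than the one returned by the previous child, the updates are thrown away, shared metavariables are re-zonked at every occurrence, and the work becomes proportional to the size of the fully expanded type.

```haskell
import qualified Data.IntMap.Strict as IM

data Ty = TyVar Int | Fun Ty Ty | UnitTy

type Subst   = IM.IntMap Ty   -- metavariable solutions
type ZonkEnv = IM.IntMap Ty   -- metavariables already zonked

-- Buggy variant: note that `env`, not `env1`, is passed when zonking `b`,
-- so everything learned while zonking `a` is forgotten.
zonkDroppingEnv :: Subst -> ZonkEnv -> Ty -> (ZonkEnv, Ty)
zonkDroppingEnv s env (Fun a b) =
  let (_env1, a') = zonkDroppingEnv s env a   -- updates discarded
      (env2,  b') = zonkDroppingEnv s env b   -- starts from scratch
  in (env2, Fun a' b')
zonkDroppingEnv s env (TyVar v) =
  case IM.lookup v env of
    Just z  -> (env, z)
    Nothing -> case IM.lookup v s of
      Nothing  -> (env, TyVar v)
      Just sol -> let (env', z) = zonkDroppingEnv s env sol
                  in (IM.insert v z env', z)
zonkDroppingEnv _ env UnitTy = (env, UnitTy)

-- On the chain t_i := (() -> t_{i+1}) -> t_{i+1} this does ~2^n visits.
chain :: Int -> Subst
chain n = IM.fromList
  [ (i, Fun (Fun UnitTy (TyVar (i + 1))) (TyVar (i + 1))) | i <- [1 .. n] ]

main :: IO ()
main =
  let (env, _) = zonkDroppingEnv (chain 20) IM.empty (TyVar 1)
  in print (IM.size env)   -- quick at 20 terms; try 30 and it crawls
```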
Would the proposal of lazy type substitution avoid explicitly building the jumbo outer type t in this case? The type t built in this MR is ultimately never used: the entire expression ends up being just IO (), and the generated Core makes no mention of the jumbo type. Its construction ends up being almost entirely futile and very expensive.
A human reader can reason inductively about the types of the various partial substitutions, concluding that the function is well-typed, that the final type is IO (), and that the explicit expansion of the jumbo type parameter t of begin is never actually needed. So if delayed type substitutions could accomplish something akin to what a naive human reader of the code would do, that sounds rather interesting to me.
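For what it's worth, here is a very rough sketch of how I picture the delayed-substitution idea (hypothetical types and names, not anything in GHC or the MR): keep the metavariable solutions in a map and only look through the head of a type on demand, so the jumbo instantiation of t is never materialised unless something actually inspects it.

```haskell
import qualified Data.IntMap.Strict as IM

type MetaVar   = Int
type Solutions = IM.IntMap Ty     -- metavariable := solution

data Ty = TyVar MetaVar | Fun Ty Ty | UnitTy
  deriving Show

-- Look through solved metavariables only at the head of a type; the
-- sub-types stay unexpanded until somebody actually needs them.
headNormal :: Solutions -> Ty -> Ty
headNormal sols (TyVar v) =
  case IM.lookup v sols of
    Just sol -> headNormal sols sol
    Nothing  -> TyVar v
headNormal _ ty = ty

-- A consumer that only needs a function's result type never forces the
-- (potentially jumbo) argument type.
resultType :: Solutions -> Ty -> Maybe Ty
resultType sols ty =
  case headNormal sols ty of
    Fun _ res -> Just res
    _         -> Nothing

main :: IO ()
main =
  let sols = IM.fromList
        [ (1, Fun (Fun UnitTy (TyVar 2)) (TyVar 2))
        , (2, Fun (Fun UnitTy (TyVar 3)) (TyVar 3)) ]
  in print (resultType sols (TyVar 1))   -- Just (TyVar 2): t2 stays unexpanded
```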
A possibly related (but under-specified) issue was recently posted on Reddit:
https://www.reddit.com/r/haskell/comments/mpfwso/memory_pb_with_gch_810/
where there's apparently some sort of blowup in deep applicative expressions. It could be the same sort of thing, in which case, perhaps delayed substitution might not merely be useful in this artificial test case, but something that could help more broadly?
A side note (not on this concrete proposal): there are "calculi of explicit substitution", cf. http://perso.ens-lyon.fr/pierre.lescanne/publications.html#es; the POPL '94 paper seems a good intro/overview. The stated goal is "efficient implementation of ... abstract machines". It would surely need some work to apply/adapt this to the concrete goings-on in GHC.