It seems we allocate about 15% more with much higher residency to boot.
$ winpty cabal new-run phash-bench -w /e/ghc-8.10.1/bin/ghc.exe -- -n100 +RTS -s
Warning: Unknown/unsupported 'ghc' version detected (Cabal 3.1.0.0 supports
'ghc' version < 8.10): E:/ghc-8.10.1/bin/ghc.exe is version 8.10.1
Resolving dependencies...
Up to date
benchmarking fileHash/cat.png
benchmarking fileHash/frog.jpeg
benchmarking fileHash/frog.png
19,486,559,856 bytes allocated in the heap
18,837,168 bytes copied during GC
3,863,392 bytes maximum residency (301 sample(s))
507,040 bytes maximum slop
26 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 17311 colls, 0 par 0.172s 0.097s 0.0000s 0.0001s
Gen 1 301 colls, 0 par 0.094s 0.116s 0.0004s 0.0015s
INIT time 0.000s ( 0.001s elapsed)
MUT time 6.594s ( 6.769s elapsed)
GC time 0.266s ( 0.213s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 6.859s ( 6.983s elapsed)
Alloc rate 2,955,307,655 bytes per MUT second
Productivity 96.1% of total user, 96.9% of total elapsed
Andi@Horzube MINGW64 /e/tmp/phash
$ winpty cabal new-run phash-bench -w /e/ghc-8.8.1/bin/ghc.exe -- -n100 +RTS -s
Resolving dependencies...
Up to date
benchmarking fileHash/cat.png
benchmarking fileHash/frog.jpeg
benchmarking fileHash/frog.png
16,839,850,160 bytes allocated in the heap
272,864 bytes copied during GC
3,863,376 bytes maximum residency (301 sample(s))
507,056 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 14835 colls, 0 par 0.141s 0.086s 0.0000s 0.0001s
Gen 1 301 colls, 0 par 0.156s 0.119s 0.0004s 0.0018s
INIT time 0.000s ( 0.001s elapsed)
MUT time 5.344s ( 5.713s elapsed)
GC time 0.297s ( 0.205s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 5.641s ( 5.918s elapsed)
Alloc rate 3,151,316,989 bytes per MUT second
Productivity 94.7% of total user, 96.5% of total elapsed
This likely rules out recent codegen changes for a start.
A little less then half of the increase in allocations seems to originate from readImage in the hip library.
But the module itself has a large number of code changes with the module at the core of phash resulting in 21k/25k lines of core in 8.8/8.10.
I can reduce the hip part of the regression to this:
{-# NOINLINE load #-}load what = do !thing <- readImage what :: IO (Either String (Image RSU Y Double)) Right !img <- pure thing writeImage "tmp.png" (img) return img
Sadly this still expands to 16k lines of Core which no obvious culprint. The code for 8.10 is generally smaller than 8.8 in the regressing cases. So I suspect we fail to specialize or inline things at one place or another.
But its hard to say where from the volume of core generated alone.
Which constructs a map function by passing dictionaries to a less specialized version of map ($fArrayVScse_$cmap). The dictionaries we pass show the same pattern again and the pattern repeats for a few levels of nesting.
In the end (for the read >>= write case) I looked at we "only" allocate an additional 6 words per pixel. I strongly suspect the stronger specialization allows ww to happen somewhere where it doesn't with 8.10.
I did not track down where exactly this happens.
@sgraf812 I know you looked into things related to specialization, do you have any insights on changes which might have caused this change in behaviour?
I bundled the read/write regression repro in this repo: https://github.com/AndreasPK/hip-bench.git for whoever wants to dig deeper. I lack the experience with the specializer to dig deeper at this time.
I have taken a brief look. I haven’t worked out all the details yet, but I have found a bit of a smoking gun: on GHC 8.10, basicUnsafeNew fails to specialize where GHC 8.8 specializes it. What’s so interesting about basicUnsafeNew? It has the following definition:
Note that m is still polymorphic. The changes in !668 (merged) are specifically designed to handle cases like these differently, since sometimes these added big lambdas would cause specialization to fail. My hunch is that those changes also unintentionally broke specializing functions like basicUnsafeNew, but I haven’t verified that hunch yet. More to follow, most likely!
So in some ways 8.10 can specialise more because it can look through an arrow (->) to find another (=>). But in some ways you have discovered that it specialises less because it insists on having a witness for each dictionary argument. That is an unexpected consequence of your patch.
Consider
f :: forall a b. (Eq a, Ord b) => blahf = <f_body>g :: forall b Ord b => blahg = /\b \(d:Ord b). ...(f @Int @b dEqInt d)...
We'd like to specialise that call to f, even though one of its dictionary arguments isn't in scope at f's definition site:
$sf :: forall b. Ord b => blah[Int/a]$sf = /\b \(d:Ord b). <f_body> @Int @b @dEqInt dRULE forall b (dInt :: Eq Int) (dOrd :: Ord b). f @Int b dInt dOrd = $sf b dOrd
We even have the mechanism available: when d goes out of scope, change SpecDict d to UnspecArg.
So in some ways 8.10 can specialise more because it can look through an arrow (->) to find another (=>). But in some ways you have discovered that it specialises less because it insists on having a witness for each dictionary argument. That is an unexpected consequence of your patch.
Yes, precisely! (Well, except the bit about it being my patch—I can take credit for neither the improvement nor the regression. )
We even have the mechanism available: when d goes out of scope, change SpecDict d to UnspecArg.
Unfortunately, that doesn’t completely solve the problem on its own. Consider this small program, which reproduces the issue:
{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}moduleT17966whereclassCabwherem::Showc=>a->b->c->StringinstanceShowb=>CBoolbwheremabc=showa++showb++showc{-# INLINABLE [0] m #-}f::(Cab,Showc)=>a->b->c->Stringfabc=mabc++"!"{-# INLINABLE [0] f #-}x::Stringx=fTrue()(Just42)
GHC 8.8 specializes both f and m (though it only specializes m at its first two arguments), but GHC 8.10 only specializes f. If we desugar the definition of x, it becomes apparent why:
Here we pass $cm $fShow() as a dictionary, and that in isolation is a specializable call according to GHC 8.8. But on GHC 8.10, it isn’t that the second dictionary for $cm is locally-bound, we don’t even have it at all! So we still have to generate call info even if the call looks like it isn’t fully-saturated.
An alternative approach to consider is that, after specializing f, we end up with a call to $cm applied to two dictionaries, not just one, but we don’t immediately see it because we floated $cm $fShow() out. I don’t know if it’s worth arranging to discover that case, but its existence was surprising to me to discover.
Ah yes, I was conflating two conversations -- it was Sandy Maguire in this (excellent) patch
commit 2d0cf6252957b8980d89481ecd0b79891da4b14bAuthor: Sandy Maguire <sandy@sandymaguire.me>Date: Thu May 16 12:12:10 2019 -0400 Let the specialiser work on dicts under lambdas Following the discussion under #16473, this change allows the specializer to work on any dicts in a lambda, not just those that occur at the beginning. For example, if you use data types which contain dictionaries and higher-rank functions then once these are erased by the optimiser you end up with functions such as: ``` go_s4K9 Int# -> forall (m :: * -> *). Monad m => (forall x. Union '[State (Sum Int)] x -> m x) -> m () ``` The dictionary argument is after the Int# value argument, this patch allows `go` to be specialised.
It took me some while to grok this, but what you mean is this. In a more extreme version we might have this:
f :: forall a b. (Eq a, Ord b) => blahg = ...map (f ty1 ty2 dEq)...
So f is applied to one dictionary, but not two. (In your test case the forall is around the other way.)
This example and the one I gave earlier are closely related; I could equally have written
g = ...map (\dOrd. f ty1 ty2 dEq dOrd)...
and now the call is saturated but with a dictionary that goes out of scope, in my previous comment.
Hmm. There is actually no reason for us to require that a call is saturated, in order to specialise it. Provided we have some SpecDict arguments, we can specialise regardless. Is that an opportunity you feel inclined to take up?
So f is applied to one dictionary, but not two. (In your test case the forall is around the other way.)
Yes, that’s right.
This example and the one I gave earlier are closely related; I could equally have written
That is true, but note that, at the Haskell level, doing this eta expansion is somewhat more involved. (Is it even possible to do, or will the simple optimizer take it back out again before the specializer sees it?)
Hmm. There is actually no reason for us to require that a call is saturated, in order to specialise it. Provided we have some SpecDict arguments, we can specialise regardless. Is that an opportunity you feel inclined to take up?
I don’t know! I think I understand what’s going on, but I have no idea what the tradeoffs are in terms of specialization of real programs. (In all honesty, all this partial specialization business makes me a little uncomfortable as it is; GHC’s optimizer already generates a lot of code, and this seems like it can only make that worse.)
So I don’t think I have a horse in this race, at least not yet. The only reason I bring it up at all is that it’s something GHC 8.8 did that 8.10 does not.
Yes, specialisation can generate a lot of code, but it's extraordinarily effective in making programs run faster!
Indeed! I wouldn’t want to imply otherwise. (I’m more worried about the potential pitfalls of overlapping specializations than of the concept of partial specialization in general.)
Let's hope someone has the bandwidth to take this up. I hate the idea that 8.10 programs in the wild are running slower.
I will consider looking into how hard it would be to fix after !2913 (closed) is finished. If it turns out to be a lot of work, I don’t want to commit to it (I’ve been working on !2913 (closed) on my own time, and it’s already eaten up a lot of that!). But if it’s easy, I probably might as well do it, seeing as I already have the relevant context in my head.
@bgamari, do you intend to backport this to %8.10.2? I'll optimistically assign the milestone accordingly, but feel free to change according to your preferences.