Add heuristic-based optimization flag for marking Generic methods as INLINE[1] (#11068)
Fixes #11068 (closed).
Thanks for doing this, @arybczak. I have a request and a question:

- This MR is sorely lacking a perf regression test to demonstrate the benefits of this patch. Can you create such a test and add it to `testsuite/tests/perf`? The existing test cases there should provide a template of how to do this. See also https://gitlab.haskell.org/ghc/ghc/wikis/building/running-tests/performance-tests.
- I worry that always inlining `from`/`to` will cause a blowup in compilation times in certain cases, especially when the implementations of each method are large. Is there a tradeoff between inlining these methods and compile times? The compile-time performance of derived `Generic` instances is something that many people are sensitive to; see #5642.
> This MR is sorely lacking a perf regression test to demonstrate the benefits of this patch. Can you create such a test and add it to `testsuite/tests/perf`?

Side comment: If we could tell GHC to not try to optimize the `from` and `to` at all (let the generator generate optimal code, as it probably does already), and if they are `INLINE`, then compilation of derived `Generic` instances would be top-notch. cc @alp (`INLINE` is needed so that the use sites don't regress.)

https://gitlab.haskell.org/ghc/ghc/-/wikis/building/running-tests/performance-tests says

> Performance tests measure some metric (e.g. number of bytes allocated) of ghc or generated code,

but I fail to find what metrics of generated code (or GHC for that matter) are available.

Edited by Oleg Grenrus

> https://gitlab.haskell.org/ghc/ghc/-/wikis/building/running-tests/performance-tests says "Performance tests measure some metric (e.g. number of bytes allocated) of ghc or generated code", but I fail to find what metrics of generated code (or GHC for that matter) are available.
GHC's perf tests allow you to measure both runtime and compile-time statistics. For example, here is a test case in `testsuite/tests/deriving/perf/all.T` that measures the number of bytes allocated during compilation:

```
test('T10858',
     [ collect_compiler_stats('bytes allocated', 8),
       only_ways(['normal'])],
     compile,
     ['-O'])
```
> If we could tell GHC to not try to optimize the `from` and `to` at all (let the generator generate optimal code, as it probably does already), and if they are `INLINE`, then compilation of derived `Generic` instances would be top-notch.

I don't really understand what you mean by "let the generator generate optimal code". `TcGenGenerics` generates `LHsBinds` for derived `from`/`to` implementations, not direct Core, so some amount of compilation (i.e., optimization) seems inevitable.

Yes. But the correct test here would be in the spirit of `inspection-testing`: that in a multi-module setup, the optimisations actually happen. Measuring them via allocations is quite indirect (though a good double check).

> I don't really understand what you mean by "let the generator generate optimal code".

Ah. It would be great if we could generate Core directly. It's quite handy, from my experience writing type-checker plugins.

> Yes. But the correct test here would be in the spirit of `inspection-testing`: that in a multi-module setup, the optimisations actually happen. Measuring them via allocations is quite indirect (though a good double check).

To be clear, you aren't just limited to checking the number of bytes allocated during compilation. There are other tests that check if certain things have been optimized away by `grep`'ing through the `-ddump-simpl` output. (See, for instance, this test case which checks for the absence of `heq_sel` in the dumped Core.) I'm not sure if that's what you had in mind for a potential test case in this MR, but it's another tool in your toolbelt.

> Ah. It would be great if we could generate Core directly.

Perhaps so, but that would require a pretty different approach from the way `deriving` currently works. If that would provide a shorter path to making derived `from`/`to` implementations more efficient, perhaps we should open a separate issue about this.
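Concretely, an `inspection-testing` style check of this property could look something like the following sketch (the module, type, and function names here are made up for illustration; whether the obligation actually holds depends on the optimization level and the size of the type):

```haskell
{-# LANGUAGE TemplateHaskell #-}
{-# OPTIONS_GHC -O -fplugin Test.Inspection.Plugin #-}
module GenericRoundTrip where

import GHC.Generics (Generic (..))
import Test.Inspection

-- A small example type; deriving Generic gives us from/to.
data T = MkT Int Bool deriving Generic

-- Round-tripping through the generic representation...
roundTrip :: T -> T
roundTrip = to . from

-- ...should compile to the same Core as the identity, if the
-- Rep machinery is fully optimized away.
direct :: T -> T
direct x = x

-- The plugin fails compilation if the two Core terms differ.
inspect $ 'roundTrip === 'direct
```

This is exactly the "test in a multi-module setup" idea: put the type in one module, the round-trip in another, and let the obligation fail if the generics survive.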
added 1 commit
- 1af87492 - Mark methods in derived Generic instances as INLINE[1] (#11068 (closed))
To address the second point, there's https://github.com/arybczak/generic-lens/tree/perf. The relevant benchmark can be run with `cabal build generic-optics:perf`.

It has a data type with 100 fields and 50 use sites of `view`/`over` on half of the applicable field optics. You can also control the number of constructors from 1 to 4 via CPP. The results of GHC resource usage are as follows (with -j4 -A32m):

8.10.1:

| constructors | memory[M] | time[s] |
|---|---|---|
| 1 | 974 | 7.43 |
| 2 | 1382 | 13.12 |
| 3 | 1611 | 20.33 |
| 4 | 2934 | 28.141 |

8.11.0 patched:

| constructors | memory[M] | time[s] |
|---|---|---|
| 1 | 766 | 6.85 |
| 2 | 1364 | 13.05 |
| 3 | 2010 | 30.92 |
| 4 | 3181 | 48.89 |

So for up to 2 constructors, inlining `from`/`to` actually leads to faster compilation AND faster code, since the generics then optimize away.

For 3 and more constructors (this of course also depends on the number of fields each constructor has), GHC no longer optimizes away the generic representation and inlining becomes useless, leading to increased resource consumption during compilation and a huge blowup in the generated Core.
This is one data point. I did some testing with deriving NFData for 1 constructor / 100 fields, and while GHC takes slightly more memory and time to compile, it optimizes the generics away, something that currently doesn't happen, leading to much faster runtime.

It looks like always marking these methods as INLINE is not a good idea. However, since we have access to the data type, we can emit INLINE pragmas based on the number of constructors/fields the data type has.

The most common cases are either pure sums or pure products, and:

- pure sums optimize well for up to 40 constructors
- pure products optimize well for up to 100+ fields (I didn't find an upper bound)

It looks like 2 constructors also optimize pretty well (2 constructors with 100 fields tested positive), but after the 3rd constructor the workable number of fields drops quite sharply, to around 10.

@RyanGlScott does it sound reasonable to mark these as INLINE based on this heuristic? Then we'll get the best of both worlds.
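A minimal Haskell sketch of the heuristic described above (hypothetical names and structure, not GHC's actual deriving code; the thresholds are the ones measured in this thread):

```haskell
-- Hypothetical shape summary of a data type, as the deriving machinery
-- would see it before deciding whether to emit INLINE pragmas.
data Shape = Shape
  { numConstructors :: Int
  , maxFields       :: Int  -- largest field count of any constructor
  }

-- Thresholds from the measurements above:
--   pure sums (no fields) optimize well up to ~40 constructors;
--   1-2 constructors optimize well even with ~100 fields;
--   from 3 constructors on, only ~10 fields per constructor work.
shouldInlineFromTo :: Shape -> Bool
shouldInlineFromTo (Shape cons fields)
  | fields == 0 = cons <= 40
  | cons <= 2   = fields <= 100
  | otherwise   = fields <= 10
```

The real implementation would inspect the `DataCon`s of the type being derived, but the decision itself is just this kind of table lookup.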
Edited by Andrzej Rybczak

Thanks for making these perf tests. Some further questions:

- The results in !2965 (comment 261707) are for compile times, not runtime, correct? If so, how are you determining if the resulting code is faster?
- How are you verifying that the "generics optimize away"?
- How confident are you that these results generally apply to all uses of `GHC.Generics`? I'm slightly worried that the results we see here are simply an artifact of the way GHC's optimizer happens to work over the `generic-lens` library. You say that you also tested generic `NFData` instances as well, which is great, although I'm not sure where these tests live.
- Having heuristics guide whether we inline things is great, but it is a bit alarming to see that compile times increased in some common cases. Given that heuristics are never perfect, do you think it would be worthwhile to have GHC flags to control how `from`/`to` are inlined?
- I had criterion tests elsewhere. I pushed them to the branch. Results for one constructor (large is 100 fields, small is 20):

8.10.1:

```
benchmarking large/view/lens/th
time     7.121 ns  (7.099 ns .. 7.153 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     7.121 ns  (7.103 ns .. 7.174 ns)
std dev  66.70 ps  (28.55 ps .. 119.8 ps)

benchmarking large/view/lens/generic
time     143.0 ns  (142.8 ns .. 143.5 ns)
         0.999 R²  (0.997 R² .. 1.000 R²)
mean     142.9 ns  (142.8 ns .. 143.1 ns)
std dev  399.8 ps  (267.3 ps .. 559.2 ps)

benchmarking large/over/lens/th
time     72.31 ns  (72.05 ns .. 72.75 ns)
         1.000 R²  (0.999 R² .. 1.000 R²)
mean     72.45 ns  (72.10 ns .. 73.74 ns)
std dev  1.535 ns  (304.8 ps .. 3.039 ns)
variance introduced by outliers: 29% (moderately inflated)

benchmarking large/over/lens/generic
time     265.1 ns  (264.4 ns .. 265.8 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     264.9 ns  (264.4 ns .. 265.7 ns)
std dev  1.473 ns  (680.2 ps .. 2.311 ns)

benchmarking large/deepseq
time     285.8 ns  (284.6 ns .. 287.8 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     285.6 ns  (284.8 ns .. 287.8 ns)
std dev  3.062 ns  (1.423 ns .. 5.354 ns)

benchmarking small/view/lens/th
time     6.853 ns  (6.844 ns .. 6.861 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     6.851 ns  (6.844 ns .. 6.866 ns)
std dev  24.97 ps  (14.42 ps .. 40.68 ps)

benchmarking small/view/lens/generic
time     30.76 ns  (30.72 ns .. 30.81 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     30.75 ns  (30.72 ns .. 30.81 ns)
std dev  90.77 ps  (31.65 ps .. 155.3 ps)

benchmarking small/over/lens/th
time     18.20 ns  (18.18 ns .. 18.22 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     18.19 ns  (18.18 ns .. 18.22 ns)
std dev  42.02 ps  (25.32 ps .. 65.63 ps)

benchmarking small/over/lens/generic
time     62.26 ns  (62.17 ns .. 62.35 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     62.23 ns  (62.16 ns .. 62.34 ns)
std dev  212.1 ps  (124.9 ps .. 325.0 ps)

benchmarking small/deepseq
time     48.47 ns  (48.40 ns .. 48.54 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     48.47 ns  (48.41 ns .. 48.57 ns)
std dev  191.0 ps  (126.9 ps .. 267.3 ps)
```

8.11.0 patched:

```
benchmarking large/view/lens/th
time     7.613 ns  (7.588 ns .. 7.643 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     7.612 ns  (7.592 ns .. 7.658 ns)
std dev  71.30 ps  (34.41 ps .. 118.0 ps)

benchmarking large/view/lens/generic
time     7.343 ns  (7.338 ns .. 7.351 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     7.342 ns  (7.338 ns .. 7.352 ns)
std dev  14.94 ps  (6.868 ps .. 25.20 ps)

benchmarking large/over/lens/th
time     70.38 ns  (70.27 ns .. 70.54 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     70.38 ns  (70.30 ns .. 70.60 ns)
std dev  297.6 ps  (111.0 ps .. 554.8 ps)

benchmarking large/over/lens/generic
time     70.29 ns  (70.23 ns .. 70.36 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     70.30 ns  (70.25 ns .. 70.39 ns)
std dev  166.5 ps  (104.4 ps .. 249.7 ps)

benchmarking large/deepseq
time     106.8 ns  (106.8 ns .. 106.9 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     106.8 ns  (106.8 ns .. 106.9 ns)
std dev  176.8 ps  (132.2 ps .. 256.7 ps)

benchmarking small/view/lens/th
time     7.403 ns  (7.390 ns .. 7.421 ns)
         1.000 R²  (0.999 R² .. 1.000 R²)
mean     7.395 ns  (7.388 ns .. 7.406 ns)
std dev  21.87 ps  (12.55 ps .. 32.57 ps)

benchmarking small/view/lens/generic
time     7.642 ns  (7.632 ns .. 7.659 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     7.643 ns  (7.635 ns .. 7.655 ns)
std dev  24.17 ps  (15.14 ps .. 36.18 ps)

benchmarking small/over/lens/th
time     18.14 ns  (18.10 ns .. 18.20 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     18.24 ns  (18.17 ns .. 18.37 ns)
std dev  246.1 ps  (169.2 ps .. 372.5 ps)
variance introduced by outliers: 15% (moderately inflated)

benchmarking small/over/lens/generic
time     18.04 ns  (18.03 ns .. 18.06 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     18.05 ns  (18.04 ns .. 18.08 ns)
std dev  44.99 ps  (29.05 ps .. 76.24 ps)

benchmarking small/deepseq
time     21.54 ns  (21.53 ns .. 21.56 ns)
         1.000 R²  (1.000 R² .. 1.000 R²)
mean     21.55 ns  (21.53 ns .. 21.56 ns)
std dev  37.61 ps  (26.88 ps .. 48.96 ps)
```

As you can see, the difference is very significant. Here you don't even have to look at the Core to see that the generics optimized away, since the runtime of the generic-based version is the same as that of the code generated with TH.
- By looking at the output of `-ddump-simpl -dsuppress-all`.
- Not sure. Only more testing could tell. But it looks like GHC is pretty good at optimizing a lot of stuff away (judging from my work on the `optics` library, where I was surprised how much it could do, given that enough inlining takes place), and the only thing that usually stands in its way is things not inlining.
- Having flags that unconditionally enable/disable inlining of `from`/`to` sounds like a good idea so that people can experiment with more "exotic" cases, though I'd say in addition to having these heuristics. The heuristics could then be made a bit more conservative, but it should "just work" in the most typical cases (at least for pure sums/products).
Speaking of common cases, do you mean NFData? Cutting the benchmark file down to only generating the NFData instance for X (1 constructor, 100 fields), the compilation stats are as follows:
- 8.10.1: 272M, 1.27s
- 8.11.0: 281M, 1.38s
So it's a slight decrease in compile-time performance, but the runtime of `rnf` is 265 ns vs 106 ns. Seems like a good trade-off to me.

Edited by Andrzej Rybczak

@arybczak BTW you can use https://github.com/bgamari/criterion-compare to make it easier to visually compare the criterion results.
@sjakobi Thanks, here's the comparison:

| Benchmark | 8.10.1 | 8.11.0 |
|---|---|---|
| large/view/lens/th | -6.4% | 7.57e-9 |
| large/view/lens/generic | 1893.8% | 7.33e-9 |
| large/over/lens/th | 1.5% | 7.05e-8 |
| large/over/lens/generic | 278.6% | 7.06e-8 |
| large/deepseq | 167.4% | 1.07e-7 |
| small/view/lens/th | -6.7% | 7.36e-9 |
| small/view/lens/generic | 306.4% | 7.59e-9 |
| small/over/lens/th | 1.3% | 1.80e-8 |
| small/over/lens/generic | 246.0% | 1.80e-8 |
| small/deepseq | 125.2% | 2.15e-8 |
| GeoMean (calculated) | 137.6% | 1.00 |

| Benchmark | 8.10.1 | 8.11.0 |
|---|---|---|
| large/view/lens/th | 7.09e-9 | 6.9% |
| large/view/lens/generic | 1.46e-7 | -95.0% |
| large/over/lens/th | 7.15e-8 | -1.5% |
| large/over/lens/generic | 2.67e-7 | -73.6% |
| large/deepseq | 2.86e-7 | -62.6% |
| small/view/lens/th | 6.87e-9 | 7.2% |
| small/view/lens/generic | 3.08e-8 | -75.4% |
| small/over/lens/th | 1.82e-8 | -1.3% |
| small/over/lens/generic | 6.23e-8 | -71.1% |
| small/deepseq | 4.84e-8 | -55.6% |
| GeoMean (calculated) | 1.00 | -57.9% |

Edited by Andrzej Rybczak
That's intentional; it shows how much slower the version without INLINE pragmas is. I added it the other way around as well, to satisfy your curiosity :P

Edited by Andrzej Rybczak

Those numbers definitely seem promising. My hope is that you can condense some small fragment of `generic-lens` + your test harness down into a test case that we can check into the GHC test suite.

> Not sure. Only more testing could tell. But it looks like GHC is pretty good at optimizing a lot of stuff away (judging from my work on the `optics` library, where I was surprised how much it could do, given that enough inlining takes place), and the only thing that usually stands in its way is things not inlining.

It's likely that GHC's decision to inline `from`/`to` depends on the surrounding context to some degree, but I think I agree that this is a pretty good litmus test for determining whether `from`/`to` will inline in general. As a result, I think that this would be a decent basis for a heuristic.

> Having flags that unconditionally enable/disable inlining of `from`/`to` sounds like a good idea so that people can experiment with more "exotic" cases, though I'd say in addition to having these heuristics.

Sure, I definitely agree that the heuristics should be the default. Moreover, I was thinking of making the flags slightly more fine-grained than just a simple inline/don't-inline switch. For example, you note in !2965 (comment 261707) that after a certain number of products or sums, `from`/`to` stop inlining effectively. Perhaps we could introduce `-fmax-inline-derived-generic-products`/`-fmax-inline-derived-generic-sums` to allow the user to configure what these numbers should be? It's possible that there are better ways to specify these settings, but that seems like the general direction we would want to go with this.

Edited by Ryan Scott

Luckily, the basic code for generating generic lenses is pretty short (around 100 lines of code) and doesn't require anything other than base, so it's feasible.

I'm thinking that the test should have 2 modules: one where the generic lenses and the data types (for the various constructor/field pairs used in the heuristics) are defined, and a second one where we instantiate the lenses. Then we can grep the Core output of the second module to verify that no generics are left. Does that make sense?
> Perhaps we could introduce -fmax-inline-derived-generic-products/-fmax-inline-derived-generic-sums to allow the user to configure what these numbers should be?

It's slightly tricky, because the maximum number of fields changes depending on the number of constructors, so ideally it would be something like `-fmax-inline-derived-generic-sop=1,100` or `-fmax-inline-derived-generic-sop=3,10`, and you could pass many pairs to the compiler.

However, I was thinking that perhaps we could use ANN pragmas to override the heuristics and instruct GHC to generate INLINE pragmas per type? `GHC.Generics` could define something like

```haskell
data GenericInline = GenericAlwaysInline | GenericAutoInline | GenericNoInline
```

and you could then write

```haskell
{-# ANN type X GenericAlwaysInline #-}
data X = ...
```

to bypass the heuristics and always emit INLINE pragmas (or use `GenericAutoInline` for the heuristics / `GenericNoInline` to never emit them).

> I'm thinking that the test should have 2 modules: one where the generic lenses and the data types (for the various constructor/field pairs used in the heuristics) are defined, and a second one where we instantiate the lenses. Then we can grep the Core output of the second module to verify that no generics are left. Does that make sense?

That makes sense to me.

> It's slightly tricky, because the maximum number of fields changes depending on the number of constructors, so ideally it would be something like `-fmax-inline-derived-generic-sop=1,100` or `-fmax-inline-derived-generic-sop=3,10`, and you could pass many pairs to the compiler.

That is a good point. A sensible upper bound on the number of fields for a data type with one constructor may not be a sensible upper bound for multiple constructors. I like your suggested design of passing multiple sum/product pairs to communicate this information to GHC. Luckily, there is already some machinery in GHC's command-line parser to support this kind of use case. For instance, the `-package-id` flag allows passing multiple arguments in this fashion.

Another tricky part about making this flag work is that a user could pass nonsense like `-fmax-inline-derived-generic-sop=1,1,1,1,1,1`. I think the simplest solution would be to have a pass over the user input that parses it into the format `<int>,<int>` and throws an error message if the input does not match that format.

> However, I was thinking that perhaps we could use ANN pragmas to override the heuristics and instruct GHC to generate INLINE pragmas per type?

That sounds incredibly fancy, and I'd be inclined to stick to a simpler solution unless absolutely necessary.
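The `<int>,<int>` validation pass discussed above could be sketched as follows (`parseSopPair` is a hypothetical standalone helper for illustration, not GHC's actual flag-parsing code):

```haskell
import Data.Char (isDigit)

-- Hypothetical validation for a flag argument of the form
-- <constructors>,<fields>: accept exactly two integers separated by
-- a single comma, and reject anything else with an error message.
parseSopPair :: String -> Either String (Int, Int)
parseSopPair s =
  case break (== ',') s of
    (cs, ',' : fs)
      | not (null cs), all isDigit cs
      , not (null fs), all isDigit fs
      -> Right (read cs, read fs)
    _ -> Left ("expected <int>,<int>, got: " ++ show s)
```

With this shape, `parseSopPair "1,100"` yields `Right (1,100)`, while nonsense such as `"1,1,1,1,1,1"` or `"1,"` falls through to the error case, since the part after the first comma must itself be a bare integer.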
Does this MR cause that? It would be really weird, because `T5642` doesn't do anything with the generated instance.

Edited by Andrzej Rybczak

Yes, this MR is most likely responsible for that increase. Although `T5642` doesn't do anything with the generated instance, there are a few other ways in which you could explain the regression. It might store unfoldings, for example, where before it might not have stored them. Or it might introduce auxiliary bindings for the pragmas that have to be type-checked, which is exactly the scenario where `T5642` blows up. It's hard to say without comparing `-ddump-timings` and `-ddump-tc`.

Edited by Sebastian Graf
@arybczak Are you still looking into this?
@AndreasK no, I didn't touch it due to time constraints. I also don't depend on generics enough to make it happen.
Feel free to pick it up if you want.
I made it INLINE[1] because AFAIK it's the middle phase, so it gives the biggest potential to align it with your code (assuming you hit a situation where it matters).
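For context on the phase numbers: GHC's simplifier runs its phases counting down (by default 2, 1, 0), and `INLINE [1]` keeps the definition opaque in phase 2, only allowing inlining from phase 1 onward. A minimal illustration of the pragma being discussed (schematic, not the generated code):

```haskell
module PhaseDemo where

step :: Int -> Int
step x = x + 1
-- Inactive in phase 2, active in phases 1 and 0: early phases see
-- `step` as opaque, so RULES mentioning `step` can still fire there,
-- while later phases inline its (INLINE-preserved) unfolding.
{-# INLINE [1] step #-}
```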
mentioned in issue #18523 (closed)
added 1080 commits

- 1af87492...4365d77a - 1079 commits from branch ghc:master
- f6c10c7e - Mark methods in derived Generic instances as INLINE[1] (#11068 (closed))
added 1 commit
- 379a233f - Mark methods in derived Generic instances as INLINE[1] (#11068 (closed))
added 1 commit
- bc89cfda - Mark methods in derived Generic instances as INLINE[1] (#11068 (closed))
added 1 commit
- 4bb77a35 - Mark methods in derived Generic instances as INLINE[1] (#11068 (closed))
added 27 commits

- 4bb77a35...b81350bb - 26 commits from branch ghc:master
- 7f59f6ad - Mark methods in derived Generic instances as INLINE[1] (#11068 (closed))
added 5 commits

- 7f59f6ad...1033a720 - 4 commits from branch ghc:master
- 9eb15e10 - Mark methods in derived Generic instances as INLINE[1] (#11068 (closed))
added 1 commit
- 7f59f6ad - Mark methods in derived Generic instances as INLINE[1] (#11068 (closed))