
Add heuristic-based optimization flag for marking Generic methods as INLINE[1] (#11068)

Merged Andrzej Rybczak requested to merge arybczak/ghc:11068-inline0-generic into master
Edited by Andrzej Rybczak

Merge request reports

Merge request pipeline #26342 passed with warnings for 998803dc

Merged by Marge Bot on Oct 16, 2020, 1:57am UTC

Pipeline #26385 passed with warnings for 998803dc on master

Activity

  • Thanks for doing this, @arybczak. I have a request and a question:

    1. This MR is sorely lacking a perf regression test to demonstrate the benefits of this patch. Can you create such a test and add it to testsuite/tests/perf? The existing test cases there should provide a template of how to do this. See also https://gitlab.haskell.org/ghc/ghc/wikis/building/running-tests/performance-tests.
    2. I worry that always inlining from/to will cause a blowup in compilation times in certain cases, especially when the implementations of each method are large. Is there a tradeoff between inlining these methods and compile times? The compile-time performance of derived Generic instances is something that many people are sensitive to—see #5642.
    • Side comment: If we could tell GHC not to try to optimize from and to at all (letting the generator generate optimal code, as it probably does already), and if they are INLINE, then compilation of derived Generic instances would be top-notch. cc @alp (INLINE is needed so that the use sites don't regress).

      https://gitlab.haskell.org/ghc/ghc/-/wikis/building/running-tests/performance-tests says

      Performance tests measure some metric (e.g. number of bytes allocated) of ghc or generated code,

      but I fail to find what metrics of generated code (or GHC for that matter) are available.

      Edited by Oleg Grenrus
    • https://gitlab.haskell.org/ghc/ghc/-/wikis/building/running-tests/performance-tests says

      Performance tests measure some metric (e.g. number of bytes allocated) of ghc or generated code,

      but I fail to find what metrics of generated code (or GHC for that matter) are available.

      GHC's perf tests allow you to measure both runtime and compile-time statistics. For example, here is a test case in testsuite/tests/deriving/perf/all.T that measures the number of bytes allocated during compilation:

      test('T10858',
           [ collect_compiler_stats('bytes allocated',8),
            only_ways(['normal'])],
           compile,
           ['-O'])
    • If we could tell GHC to not try to optimize the from and to at all (let the generator generate optimal code, as it probably does already), and if they are INLINE, then compilation of derived Generic instances would be top-notch.

      I don't really understand what you mean by "let the generator generate optimal code". TcGenGenerics generates LHsBinds for derived from/to implementations, not direct Core, so some amount of compilation (i.e., optimization) seems inevitable.

    • Yes. But the correct test here would be in the spirit of inspection-testing: that in a multi-module setup, the optimisations actually happen. Measuring them via allocations is quite indirect (though a good double check).

      I don't really understand what you mean by "let the generator generate optimal code".

      Ah. It would be great if we could generate Core directly. It's quite handy from my experience writing type-checker plugins.

    • Yes. But the correct test here would be in the spirit of inspection-testing: that in a multi-module setup, the optimisations actually happen. Measuring them via allocations is quite indirect (though a good double check).

      To be clear, you aren't just limited to checking the number of bytes allocated during compilation. There are other tests that check if certain things have been optimized away by grep'ing through the -ddump-simpl output. (See, for instance, this test case which checks for the absence of heq_sel in the dumped Core.) I'm not sure if that's what you had in mind for a potential test case in this MR, but it's another tool in your toolbelt.

      Ah. It would be great if we could generate Core directly.

      Perhaps so, but that would require a pretty different approach from the way deriving currently works. If that would provide a shorter path to making derived from/to implementations more efficient, perhaps we should open a separate issue about this.

  • added 1 commit

  • Author Contributor

    To address the second point, there's https://github.com/arybczak/generic-lens/tree/perf.

    Relevant benchmark can be run with cabal build generic-optics:perf.

    It has a data type with 100 fields and 50 use sites of view/over over half of the relevant field optics. You can also control the number of constructors (from 1 to 4) via CPP.

    The results of GHC resource usage are as follows (with -j4 -A32m):

    8.10.1

    constructors  memory [M]  time [s]
    1              974         7.43
    2             1382        13.12
    3             1611        20.33
    4             2934        28.141

    8.11.0 patched

    constructors  memory [M]  time [s]
    1              766         6.85
    2             1364        13.05
    3             2010        30.92
    4             3181        48.89

    So for up to 2 constructors, inlining from/to actually leads to faster compilation AND faster code, since the generic representation then optimizes away.

    For 3 or more constructors (this of course also depends on the number of fields each constructor has), GHC no longer optimizes away the generic representation and inlining becomes useless, leading to increased resource consumption during compilation and a huge blowup in the generated Core.

    This is one data point. I did some testing with deriving NFData for 1 constructor / 100 fields, and while GHC takes slightly more memory and time to compile, it optimizes the generics away (which doesn't currently happen), leading to much faster runtime.

    It looks like always marking these methods as INLINE is not a good idea. However, since we have access to the data type, we can emit INLINE pragmas based on the number of constructors/fields the data type has.

    The most common shapes are pure sums or pure products, and:

    • pure sums optimize well for up to 40 constructors
    • pure products optimize well for up to 100+ fields (I didn't find upper bound)

    Two constructors also optimize pretty well (2 constructors with 100 fields tested positive), but from the third constructor on, the manageable number of fields drops quite sharply, to around 10.

    @RyanGlScott does it sound reasonable to mark these as INLINE based on this heuristic? Then we'll get the best of both worlds.
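    The proposed heuristic could be sketched roughly as follows. This is an illustrative sketch only: the name shouldInlineFromTo is hypothetical (not a GHC internal), and the thresholds are the ones from the measurements above.

```haskell
-- Hypothetical sketch of the proposed heuristic, not actual GHC code.
-- `cons` is the number of constructors; `fields` is the maximum number
-- of fields in any single constructor.
shouldInlineFromTo :: Int -> Int -> Bool
shouldInlineFromTo cons fields
  | fields == 0 = cons <= 40    -- pure sums optimize well up to ~40 constructors
  | cons <= 2   = True          -- products and 2-constructor sums handle 100+ fields
  | otherwise   = fields <= 10  -- from the 3rd constructor on, the limit drops to ~10
```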

    Edited by Andrzej Rybczak
    • Thanks for making these perf tests. Some further questions:

      1. The results in !2965 (comment 261707) are for compile times, not runtime, correct? If so, how are you determining if the resulting code is faster?
      2. How are you verifying that the "generics optimize away"?
      3. How confident are you that these results generally apply to all uses of GHC.Generics? I'm slightly worried that the results we see here are simply an artifact of the way GHC's optimizer happens to work over the generic-lens library. You say that you also tested generic NFData instances as well, which is great, although I'm not sure where these tests live.
      4. Having heuristics guide whether we inline things is great, but it is a bit alarming to see that compile times increased in some common cases. Given that heuristics are never perfect, do you think it would be worthwhile to have GHC flags to control how from/to are inlined?
    • Author Contributor
      1. I had criterion tests elsewhere. I pushed them to the branch.

      Results for one constructor (large is 100 fields, small is 20):

      8.10.1:

      benchmarking large/view/lens/th
      time                 7.121 ns   (7.099 ns .. 7.153 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 7.121 ns   (7.103 ns .. 7.174 ns)
      std dev              66.70 ps   (28.55 ps .. 119.8 ps)
      
      benchmarking large/view/lens/generic
      time                 143.0 ns   (142.8 ns .. 143.5 ns)
                           0.999 R²   (0.997 R² .. 1.000 R²)
      mean                 142.9 ns   (142.8 ns .. 143.1 ns)
      std dev              399.8 ps   (267.3 ps .. 559.2 ps)
      
      benchmarking large/over/lens/th
      time                 72.31 ns   (72.05 ns .. 72.75 ns)
                           1.000 R²   (0.999 R² .. 1.000 R²)
      mean                 72.45 ns   (72.10 ns .. 73.74 ns)
      std dev              1.535 ns   (304.8 ps .. 3.039 ns)
      variance introduced by outliers: 29% (moderately inflated)
      
      benchmarking large/over/lens/generic
      time                 265.1 ns   (264.4 ns .. 265.8 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 264.9 ns   (264.4 ns .. 265.7 ns)
      std dev              1.473 ns   (680.2 ps .. 2.311 ns)
      
      benchmarking large/deepseq
      time                 285.8 ns   (284.6 ns .. 287.8 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 285.6 ns   (284.8 ns .. 287.8 ns)
      std dev              3.062 ns   (1.423 ns .. 5.354 ns)
      
      benchmarking small/view/lens/th
      time                 6.853 ns   (6.844 ns .. 6.861 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 6.851 ns   (6.844 ns .. 6.866 ns)
      std dev              24.97 ps   (14.42 ps .. 40.68 ps)
      
      benchmarking small/view/lens/generic
      time                 30.76 ns   (30.72 ns .. 30.81 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 30.75 ns   (30.72 ns .. 30.81 ns)
      std dev              90.77 ps   (31.65 ps .. 155.3 ps)
      
      benchmarking small/over/lens/th
      time                 18.20 ns   (18.18 ns .. 18.22 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 18.19 ns   (18.18 ns .. 18.22 ns)
      std dev              42.02 ps   (25.32 ps .. 65.63 ps)
      
      benchmarking small/over/lens/generic
      time                 62.26 ns   (62.17 ns .. 62.35 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 62.23 ns   (62.16 ns .. 62.34 ns)
      std dev              212.1 ps   (124.9 ps .. 325.0 ps)
      
      benchmarking small/deepseq
      time                 48.47 ns   (48.40 ns .. 48.54 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 48.47 ns   (48.41 ns .. 48.57 ns)
      std dev              191.0 ps   (126.9 ps .. 267.3 ps)

      8.11.0 patched:

      benchmarking large/view/lens/th
      time                 7.613 ns   (7.588 ns .. 7.643 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 7.612 ns   (7.592 ns .. 7.658 ns)
      std dev              71.30 ps   (34.41 ps .. 118.0 ps)
      
      benchmarking large/view/lens/generic
      time                 7.343 ns   (7.338 ns .. 7.351 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 7.342 ns   (7.338 ns .. 7.352 ns)
      std dev              14.94 ps   (6.868 ps .. 25.20 ps)
      
      benchmarking large/over/lens/th
      time                 70.38 ns   (70.27 ns .. 70.54 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 70.38 ns   (70.30 ns .. 70.60 ns)
      std dev              297.6 ps   (111.0 ps .. 554.8 ps)
      
      benchmarking large/over/lens/generic
      time                 70.29 ns   (70.23 ns .. 70.36 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 70.30 ns   (70.25 ns .. 70.39 ns)
      std dev              166.5 ps   (104.4 ps .. 249.7 ps)
      
      benchmarking large/deepseq
      time                 106.8 ns   (106.8 ns .. 106.9 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 106.8 ns   (106.8 ns .. 106.9 ns)
      std dev              176.8 ps   (132.2 ps .. 256.7 ps)
      
      benchmarking small/view/lens/th
      time                 7.403 ns   (7.390 ns .. 7.421 ns)
                           1.000 R²   (0.999 R² .. 1.000 R²)
      mean                 7.395 ns   (7.388 ns .. 7.406 ns)
      std dev              21.87 ps   (12.55 ps .. 32.57 ps)
      
      benchmarking small/view/lens/generic
      time                 7.642 ns   (7.632 ns .. 7.659 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 7.643 ns   (7.635 ns .. 7.655 ns)
      std dev              24.17 ps   (15.14 ps .. 36.18 ps)
      
      benchmarking small/over/lens/th
      time                 18.14 ns   (18.10 ns .. 18.20 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 18.24 ns   (18.17 ns .. 18.37 ns)
      std dev              246.1 ps   (169.2 ps .. 372.5 ps)
      variance introduced by outliers: 15% (moderately inflated)
      
      benchmarking small/over/lens/generic
      time                 18.04 ns   (18.03 ns .. 18.06 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 18.05 ns   (18.04 ns .. 18.08 ns)
      std dev              44.99 ps   (29.05 ps .. 76.24 ps)
      
      benchmarking small/deepseq
      time                 21.54 ns   (21.53 ns .. 21.56 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 21.55 ns   (21.53 ns .. 21.56 ns)
      std dev              37.61 ps   (26.88 ps .. 48.96 ps)

      As you can see, the difference is very significant. Here you don't even have to look at the Core to see that the generics optimized away, since the runtime of the generic-based version is the same as that of the TH-generated code.

      2. By looking at the output of -ddump-simpl -dsuppress-all.

      3. Not sure. Only more testing could tell. But it looks like GHC is pretty good at optimizing a lot of stuff away (judging from my work on the optics library, where I was surprised how much it could do, given that enough inlining takes place) and the only thing that usually stands in its way is things not inlining.

      4. Having flags that unconditionally enable/disable inlining of from/to sounds like a good idea, so that people can experiment with more "exotic" cases, though I'd say in addition to having these heuristics. The heuristics could then be made a bit more conservative, but it should "just work" in the most typical cases (at least pure sums/products).

      Speaking of common cases, do you mean NFData? Cutting the benchmark file down to only generate the NFData instance for X (1 constructor, 100 fields), the compilation stats are as follows:

      • 8.10.1: 272M, 1.27s
      • 8.11.0: 281M, 1.38s

      So it's a slight decrease in performance, but the runtime of rnf is 265ns vs 106ns. Seems like a good trade-off to me.

      Edited by Andrzej Rybczak
    • @arybczak BTW you can use https://github.com/bgamari/criterion-compare to make it easier to visually compare the criterion results.

    • Author Contributor

      @sjakobi Thanks, here's the comparison:

      Benchmark                8.10.1     8.11.0
      large/view/lens/th       -6.4%      7.57e-9
      large/view/lens/generic  1893.8%    7.33e-9
      large/over/lens/th       1.5%       7.05e-8
      large/over/lens/generic  278.6%     7.06e-8
      large/deepseq            167.4%     1.07e-7
      small/view/lens/th       -6.7%      7.36e-9
      small/view/lens/generic  306.4%     7.59e-9
      small/over/lens/th       1.3%       1.80e-8
      small/over/lens/generic  246.0%     1.80e-8
      small/deepseq            125.2%     2.15e-8
      GeoMean (calculated)     137.6%     1.00

      Benchmark                8.10.1     8.11.0
      large/view/lens/th       7.09e-9    6.9%
      large/view/lens/generic  1.46e-7    -95.0%
      large/over/lens/th       7.15e-8    -1.5%
      large/over/lens/generic  2.67e-7    -73.6%
      large/deepseq            2.86e-7    -62.6%
      small/view/lens/th       6.87e-9    7.2%
      small/view/lens/generic  3.08e-8    -75.4%
      small/over/lens/th       1.82e-8    -1.3%
      small/over/lens/generic  6.23e-8    -71.1%
      small/deepseq            4.84e-8    -55.6%
      GeoMean (calculated)     1.00       -57.9%
      Edited by Andrzej Rybczak
    • Thanks! I suspect that you've picked the wrong run as the baseline though? I.e. the difference should be shown for the patched version.

      (To rule out other causes of perf differences, you could also try benchmarking against the patch's base commit.)

  • Author Contributor

    That's intentional: it shows how much slower the version without INLINE pragmas is. I added the other direction as well, to satisfy your curiosity :P

    Edited by Andrzej Rybczak
    • Those numbers definitely seem promising. My hope is that you can condense some small fragment of generic-lens + your test harness down into a test case that we can check into the GHC test suite.

      Not sure. Only more testing could tell. But looks like GHC is pretty good at optimizing a lot of stuff away (judging from my work on the optics library where I was surprised how much it could do, given enough inlining takes place) and the only thing that usually stands in its way is things not inlining.

      It's likely that GHC's decision to inline from/to depends on the surrounding context to some degree, but I think I agree that this is a pretty good litmus test for determining whether from/to will inline in general. As a result, I think that this would be a decent basis for a heuristic.

      Having flags unconditionally enabling/disabling inlining of from/to sounds like a good idea so that people can experiment with more "exotic" cases, though I'd say in addition to having these heuristics. They could be made a bit more conservative then, but it should "just work" in the most typical cases (at least pure sums/products).

      Sure, I definitely agree that the heuristics should be the default. Moreover, I was thinking of making the flags slightly more fine-grained than just a simple inline/don't-inline switch. For example, you note in !2965 (comment 261707) that after a certain number of products or sums, from/to stop inlining effectively. Perhaps we could introduce -fmax-inline-derived-generic-products/-fmax-inline-derived-generic-sums to allow the user to configure what these numbers should be? It's possible that there are better ways to specify these settings, but that seems like the general direction we would want to go with this.

      Edited by Ryan Scott
    • Author Contributor

      Luckily, the basic code for generating generic lenses is pretty short (around 100 lines) and doesn't require anything other than base, so it's feasible.

      I'm thinking that the test should have 2 modules: one where the generic lenses and data types (for the various constructor/field pairs used in the heuristics) are defined, and a second where we instantiate the lenses. Then we can grep the Core output of the second module to verify that no generics are left. Does that make sense?
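      As a minimal sketch (a stand-in for the ~100-line generic-lens core, which is omitted here), the use-site module could look like the following. The data type and the roundTrip helper are hypothetical test fixtures, not code from the MR: if the derived from/to inline, the -ddump-simpl output for this module should contain no GHC.Generics constructors such as M1, K1, or :*:.

```haskell
{-# LANGUAGE DeriveGeneric #-}
import GHC.Generics

-- Hypothetical test fixture: a small record with a derived Generic instance.
data Pair = Pair Int Int
  deriving (Eq, Show, Generic)

-- Round-trips through the generic representation; with from/to inlined,
-- this should simplify to the identity function in the dumped Core.
roundTrip :: Pair -> Pair
roundTrip = to . from
```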

      Perhaps we could introduce -fmax-inline-derived-generic-products/-fmax-inline-derived-generic-sums to allow the user to configure what these numbers should be?

      It's slightly tricky, because the maximum number of fields changes depending on the number of constructors, so ideally it would be something like -fmax-inline-derived-generic-sop=1,100 or -fmax-inline-derived-generic-sop=3,10, and you could pass multiple pairs to the compiler.

      However, I was thinking that perhaps we could use ANN pragmas to override heuristics and instruct GHC to generate INLINE pragmas per type? GHC.Generics could define something like

      data GenericInline
        = GenericAlwaysInline
        | GenericAutoInline 
        | GenericNoInline

      and you could then write

      {-# ANN type X GenericAlwaysInline #-} 
      data X = ...

      to bypass heuristics and always emit INLINE pragmas (or use GenericAutoInline for heuristics / GenericNoInline to never emit them).

    • I'm thinking that the test should have 2 modules - one where generic lenses and data types (for various constructor/field pairs used in heuristics) are defined and the second where we instantiate lenses. Then we can grep the core output of the second module to verify that no generics are left. Does that make sense?

      That makes sense to me.

      It's slightly tricky, because max fields change depending on number of constructors, so ideally it would be something like -fmax-inline-derived-generic-sop=1,100 or -fmax-inline-derived-generic-sop=3,10 and you could pass many pairs to the compiler.

      That is a good point. A sensible upper bound on the number of fields for a data type with one constructor may not be a sensible upper bound for multiple constructors. I like your suggested design of passing multiple sum/product pairs to communicate this information to GHC. Luckily, there is already some machinery in GHC's command-line parser to support this kind of use case. For instance, the -package-id flag allows passing multiple arguments in this fashion.

      Another tricky part about making this flag work is that a user could pass nonsense like -fmax-inline-derived-generic-sop=1,1,1,1,1,1. I think the simplest solution would be to have a pass over the user input that parses it into the format <int>,<int> and throws an error message if the input does not match that format.
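      That validation pass could be sketched like this. To be clear, the flag name is the hypothetical one from this discussion (it does not exist in GHC), and parseSopPair is an illustrative helper, not GHC code; nonsense like "1,1,1,1" is rejected because the second component is not a single integer.

```haskell
import Text.Read (readMaybe)

-- Hypothetical validation for a -fmax-inline-derived-generic-sop=<int>,<int>
-- argument: split on the first comma and require both halves to parse as Int.
parseSopPair :: String -> Either String (Int, Int)
parseSopPair s =
  case break (== ',') s of
    (consStr, ',' : fieldsStr)
      | Just cons   <- readMaybe consStr
      , Just fields <- readMaybe fieldsStr
      -> Right (cons, fields)
    _ -> Left ("malformed argument " ++ show s ++ "; expected <int>,<int>")
```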

      However, I was thinking that perhaps we could use ANN pragmas to override heuristics and instruct GHC to generate INLINE pragmas per type?

      That sounds incredibly fancy, and I'd be inclined to stick to a simpler solution unless absolutely necessary.

    • I think we definitely don't want to regress by another 43% in T5642. Deriving Generic for large sums already causes quite a horrible compile-time blowup.

    • Author Contributor

      Does this MR cause that? It would be really weird, because T5642 doesn't do anything with the generated instance.

      Edited by Andrzej Rybczak
    • Yes, this MR is most likely responsible for that increase. Although T5642 doesn't do anything with the generated instance, there are a few other ways in which you could explain the regression. It might store unfoldings, for example, where before it might not have stored them. Or introduce auxiliary bindings for the pragmas that have to be type-checked, which is exactly the scenario where T5642 blows up.

      It's hard to say without comparing -ddump-timings and -ddump-tc.

      Edited by Sebastian Graf
  • @arybczak Are you still looking into this?

  • Also at a glance I wonder why it's INLINE[1] instead of just INLINE.

  • Author Contributor

    @AndreasK no, I didn't touch it due to time constraints. I also don't depend on generics enough to make it happen.

    Feel free to pick it up if you want.

    I made it INLINE[1] because AFAIK it's the middle phase, so it gives the biggest potential to align it with your own code (assuming you hit a situation where it matters).
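    For context, phased INLINE activation works like this toy example (illustrative only, not the MR's actual Generic code). GHC's simplifier runs phases 2, 1, 0 by default, and INLINE [1] means "inline in phase 1 and later", so phase 2 still sees the function un-inlined and rewrite rules matching it can fire first.

```haskell
-- Toy demonstration of a phased INLINE pragma; addOne/addTwo are
-- hypothetical names chosen for illustration.
addOne :: Int -> Int
addOne x = x + 1
{-# INLINE [1] addOne #-}

addTwo :: Int -> Int
addTwo = addOne . addOne  -- addOne inlines here only during phases 1 and 0
```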

  • mentioned in issue #18523 (closed)

  • Andrzej Rybczak added 1080 commits

  • added 1 commit

  • added 1 commit

  • added 1 commit

  • Andrzej Rybczak added 27 commits

  • Andrzej Rybczak added 5 commits

  • added 1 commit
