
Add heuristic-based optimization flag for marking Generic methods as INLINE[1] (#11068)

Merged Andrzej Rybczak requested to merge arybczak/ghc:11068-inline0-generic into master
Edited by Andrzej Rybczak

Merge request reports

Merge request pipeline #26342 passed with warnings for 998803dc

Merged by Marge Bot on Oct 16, 2020, 1:57am UTC

Pipeline #26385 passed with warnings for 998803dc on master

Activity

  • Thanks for doing this, @arybczak. I have a request and a question:

    1. This MR is sorely lacking a perf regression test to demonstrate the benefits of this patch. Can you create such a test and add it to testsuite/tests/perf? The existing test cases there should provide a template of how to do this. See also https://gitlab.haskell.org/ghc/ghc/wikis/building/running-tests/performance-tests.
    2. I worry that always inlining from/to will cause a blowup in compilation times in certain cases, especially when the implementations of each method are large. Is there a tradeoff between inlining these methods and compile times? The compile-time performance of derived Generic instances is something that many people are sensitive to—see #5642.
    • Side comment: If we could tell GHC not to try to optimize from and to at all (letting the generator generate optimal code, as it probably does already), and if they are INLINE, then compilation of derived Generic instances would be top-notch. cc @alp (INLINE is needed so that the use sites don't regress).

      https://gitlab.haskell.org/ghc/ghc/-/wikis/building/running-tests/performance-tests says

      Performance tests measure some metric (e.g. number of bytes allocated) of ghc or generated code,

      but I fail to find what metrics of generated code (or GHC for that matter) are available.

      Edited by Oleg Grenrus
    • https://gitlab.haskell.org/ghc/ghc/-/wikis/building/running-tests/performance-tests says

      Performance tests measure some metric (e.g. number of bytes allocated) of ghc or generated code,

      but I fail to find what metrics of generated code (or GHC for that matter) are available.

      GHC's perf tests allow you to measure both runtime and compile-time statistics. For example, here is a test case in testsuite/tests/deriving/perf/all.T that measures the number of bytes allocated during compilation:

      test('T10858',
           [ collect_compiler_stats('bytes allocated',8),
            only_ways(['normal'])],
           compile,
           ['-O'])
    • If we could tell GHC to not try to optimize the from and to at all (let the generator generate optimal code, as it probably does already), and if they are INLINE, then compilation of derived Generic instances would be top-notch.

      I don't really understand what you mean by "let the generator generate optimal code". TcGenGenerics generates LHsBinds for derived from/to implementations, not direct Core, so some amount of compilation (i.e., optimization) seems inevitable.

    • Yes. But the correct test here would be in the spirit of inspection-testing: that in a multi-module setup, the optimisations actually happen. Measuring them via allocations is quite indirect (though a good double check).

      I don't really understand what you mean by "let the generator generate optimal code".

      Ah. It would be great if we could generate Core directly. It's quite handy from my experience writing type-checker plugins.

    • Yes. But the correct test here would be in the spirit of inspection-testing: that in a multi-module setup, the optimisations actually happen. Measuring them via allocations is quite indirect (though a good double check).

      To be clear, you aren't just limited to checking the number of bytes allocated during compilation. There are other tests that check if certain things have been optimized away by grep'ing through the -ddump-simpl output. (See, for instance, this test case which checks for the absence of heq_sel in the dumped Core.) I'm not sure if that's what you had in mind for a potential test case in this MR, but it's another tool in your toolbelt.

      Ah. It would be great if we could generate Core directly.

      Perhaps so, but that would require a pretty different approach from the way deriving currently works. If that would provide a shorter path to making derived from/to implementations more efficient, perhaps we should open a separate issue about this.

  • added 1 commit

  • Author Contributor

    To address the second point, there's https://github.com/arybczak/generic-lens/tree/perf.

    Relevant benchmark can be run with cabal build generic-optics:perf.

    It has a data type with 100 fields and 50 use sites of view/over over half of the relevant field optics. You can also control the number of constructors (from 1 to 4) via CPP.

    The results of GHC resource usage are as follows (with -j4 -A32m):

    8.10.1

    constructors  memory [M]  time [s]
    1              974         7.43
    2             1382        13.12
    3             1611        20.33
    4             2934        28.141

    8.11.0 patched

    constructors  memory [M]  time [s]
    1              766         6.85
    2             1364        13.05
    3             2010        30.92
    4             3181        48.89

    So for up to 2 constructors, inlining from/to actually leads to faster compilation AND faster code, since the generic representation then optimizes away.

    For 3 or more constructors (this of course also depends on the number of fields each constructor has), GHC no longer optimizes away the generic representation and inlining becomes useless, leading to increased resource consumption during compilation and a huge blowup in the generated Core.

    This is one data point. I did some testing with deriving NFData for 1 constructor / 100 fields, and while GHC takes slightly more memory and time to compile, it optimizes the generics away (which doesn't currently happen), leading to much faster runtime.

    It looks like always marking these methods as INLINE is not a good idea. However, since we have access to the data type, we can emit INLINE pragmas based on the number of constructors/fields the data type has.

    The most common shapes are pure sums or pure products, and:

    • pure sums optimize well for up to 40 constructors
    • pure products optimize well for up to 100+ fields (I didn't find upper bound)

    Two constructors also optimize pretty well (2 constructors with 100 fields tested positive), but from the third constructor on, the manageable number of fields drops quite sharply, to around 10.

    @RyanGlScott does it sound reasonable to mark these as INLINE based on this heuristic? Then we'll get the best of both worlds.
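    The proposed heuristic could be sketched roughly as follows. This is an illustrative sketch only: the name shouldInlineFromTo is hypothetical (not a GHC internal), and the thresholds are the ones from the measurements above.

```haskell
-- Hypothetical sketch of the proposed heuristic, not actual GHC code.
-- `cons` is the number of constructors; `fields` is the maximum number
-- of fields in any single constructor.
shouldInlineFromTo :: Int -> Int -> Bool
shouldInlineFromTo cons fields
  | fields == 0 = cons <= 40    -- pure sums optimize well up to ~40 constructors
  | cons <= 2   = True          -- products and 2-constructor sums handle 100+ fields
  | otherwise   = fields <= 10  -- from the 3rd constructor on, the limit drops to ~10
```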

    Edited by Andrzej Rybczak
    • Thanks for making these perf tests. Some further questions:

      1. The results in !2965 (comment 261707) are for compile times, not runtime, correct? If so, how are you determining if the resulting code is faster?
      2. How are you verifying that the "generics optimize away"?
      3. How confident are you that these results generally apply to all uses of GHC.Generics? I'm slightly worried that the results we see here are simply an artifact of the way GHC's optimizer happens to work over the generic-lens library. You say that you also tested generic NFData instances as well, which is great, although I'm not sure where these tests live.
      4. Having heuristics guide whether we inline things is great, but it is a bit alarming to see that compile times increased in some common cases. Given that heuristics are never perfect, do you think it would be worthwhile to have GHC flags to control how from/to are inlined?
    • Author Contributor
      1. I had criterion tests elsewhere. I pushed them to the branch.

      Results for one constructor (large is 100 fields, small is 20):

      8.10.1:

      benchmarking large/view/lens/th
      time                 7.121 ns   (7.099 ns .. 7.153 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 7.121 ns   (7.103 ns .. 7.174 ns)
      std dev              66.70 ps   (28.55 ps .. 119.8 ps)
      
      benchmarking large/view/lens/generic
      time                 143.0 ns   (142.8 ns .. 143.5 ns)
                           0.999 R²   (0.997 R² .. 1.000 R²)
      mean                 142.9 ns   (142.8 ns .. 143.1 ns)
      std dev              399.8 ps   (267.3 ps .. 559.2 ps)
      
      benchmarking large/over/lens/th
      time                 72.31 ns   (72.05 ns .. 72.75 ns)
                           1.000 R²   (0.999 R² .. 1.000 R²)
      mean                 72.45 ns   (72.10 ns .. 73.74 ns)
      std dev              1.535 ns   (304.8 ps .. 3.039 ns)
      variance introduced by outliers: 29% (moderately inflated)
      
      benchmarking large/over/lens/generic
      time                 265.1 ns   (264.4 ns .. 265.8 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 264.9 ns   (264.4 ns .. 265.7 ns)
      std dev              1.473 ns   (680.2 ps .. 2.311 ns)
      
      benchmarking large/deepseq
      time                 285.8 ns   (284.6 ns .. 287.8 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 285.6 ns   (284.8 ns .. 287.8 ns)
      std dev              3.062 ns   (1.423 ns .. 5.354 ns)
      
      benchmarking small/view/lens/th
      time                 6.853 ns   (6.844 ns .. 6.861 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 6.851 ns   (6.844 ns .. 6.866 ns)
      std dev              24.97 ps   (14.42 ps .. 40.68 ps)
      
      benchmarking small/view/lens/generic
      time                 30.76 ns   (30.72 ns .. 30.81 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 30.75 ns   (30.72 ns .. 30.81 ns)
      std dev              90.77 ps   (31.65 ps .. 155.3 ps)
      
      benchmarking small/over/lens/th
      time                 18.20 ns   (18.18 ns .. 18.22 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 18.19 ns   (18.18 ns .. 18.22 ns)
      std dev              42.02 ps   (25.32 ps .. 65.63 ps)
      
      benchmarking small/over/lens/generic
      time                 62.26 ns   (62.17 ns .. 62.35 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 62.23 ns   (62.16 ns .. 62.34 ns)
      std dev              212.1 ps   (124.9 ps .. 325.0 ps)
      
      benchmarking small/deepseq
      time                 48.47 ns   (48.40 ns .. 48.54 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 48.47 ns   (48.41 ns .. 48.57 ns)
      std dev              191.0 ps   (126.9 ps .. 267.3 ps)

      8.11.0 patched:

      benchmarking large/view/lens/th
      time                 7.613 ns   (7.588 ns .. 7.643 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 7.612 ns   (7.592 ns .. 7.658 ns)
      std dev              71.30 ps   (34.41 ps .. 118.0 ps)
      
      benchmarking large/view/lens/generic
      time                 7.343 ns   (7.338 ns .. 7.351 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 7.342 ns   (7.338 ns .. 7.352 ns)
      std dev              14.94 ps   (6.868 ps .. 25.20 ps)
      
      benchmarking large/over/lens/th
      time                 70.38 ns   (70.27 ns .. 70.54 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 70.38 ns   (70.30 ns .. 70.60 ns)
      std dev              297.6 ps   (111.0 ps .. 554.8 ps)
      
      benchmarking large/over/lens/generic
      time                 70.29 ns   (70.23 ns .. 70.36 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 70.30 ns   (70.25 ns .. 70.39 ns)
      std dev              166.5 ps   (104.4 ps .. 249.7 ps)
      
      benchmarking large/deepseq
      time                 106.8 ns   (106.8 ns .. 106.9 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 106.8 ns   (106.8 ns .. 106.9 ns)
      std dev              176.8 ps   (132.2 ps .. 256.7 ps)
      
      benchmarking small/view/lens/th
      time                 7.403 ns   (7.390 ns .. 7.421 ns)
                           1.000 R²   (0.999 R² .. 1.000 R²)
      mean                 7.395 ns   (7.388 ns .. 7.406 ns)
      std dev              21.87 ps   (12.55 ps .. 32.57 ps)
      
      benchmarking small/view/lens/generic
      time                 7.642 ns   (7.632 ns .. 7.659 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 7.643 ns   (7.635 ns .. 7.655 ns)
      std dev              24.17 ps   (15.14 ps .. 36.18 ps)
      
      benchmarking small/over/lens/th
      time                 18.14 ns   (18.10 ns .. 18.20 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 18.24 ns   (18.17 ns .. 18.37 ns)
      std dev              246.1 ps   (169.2 ps .. 372.5 ps)
      variance introduced by outliers: 15% (moderately inflated)
      
      benchmarking small/over/lens/generic
      time                 18.04 ns   (18.03 ns .. 18.06 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 18.05 ns   (18.04 ns .. 18.08 ns)
      std dev              44.99 ps   (29.05 ps .. 76.24 ps)
      
      benchmarking small/deepseq
      time                 21.54 ns   (21.53 ns .. 21.56 ns)
                           1.000 R²   (1.000 R² .. 1.000 R²)
      mean                 21.55 ns   (21.53 ns .. 21.56 ns)
      std dev              37.61 ps   (26.88 ps .. 48.96 ps)

      As you can see, the difference is very significant. Here you don't even have to look at the Core to see that the generics optimized away, since the runtime of the generic-based version is the same as that of the TH-generated code.

      2. By looking at the output of -ddump-simpl -dsuppress-all.

      3. Not sure. Only more testing could tell. But it looks like GHC is pretty good at optimizing a lot of stuff away (judging from my work on the optics library, where I was surprised how much it could do, given that enough inlining takes place) and the only thing that usually stands in its way is things not inlining.

      4. Having flags that unconditionally enable/disable inlining of from/to sounds like a good idea, so that people can experiment with more "exotic" cases, though I'd say in addition to having these heuristics. The heuristics could then be made a bit more conservative, but it should "just work" in the most typical cases (at least pure sums/products).

      Speaking of common cases, do you mean NFData? Cutting the benchmark file down to only generate the NFData instance for X (1 constructor, 100 fields), the compilation stats are as follows:

      • 8.10.1: 272M, 1.27s
      • 8.11.0: 281M, 1.38s

      So it's a slight decrease in performance, but the runtime of rnf is 265ns vs 106ns. Seems like a good trade-off to me.

      Edited by Andrzej Rybczak
    • @arybczak BTW you can use https://github.com/bgamari/criterion-compare to make it easier to visually compare the criterion results.

    • Author Contributor

      @sjakobi Thanks, here's the comparison:

      Benchmark                8.10.1     8.11.0
      large/view/lens/th       -6.4%      7.57e-9
      large/view/lens/generic  1893.8%    7.33e-9
      large/over/lens/th       1.5%       7.05e-8
      large/over/lens/generic  278.6%     7.06e-8
      large/deepseq            167.4%     1.07e-7
      small/view/lens/th       -6.7%      7.36e-9
      small/view/lens/generic  306.4%     7.59e-9
      small/over/lens/th       1.3%       1.80e-8
      small/over/lens/generic  246.0%     1.80e-8
      small/deepseq            125.2%     2.15e-8
      GeoMean (calculated)     137.6%     1.00

      Benchmark                8.10.1     8.11.0
      large/view/lens/th       7.09e-9    6.9%
      large/view/lens/generic  1.46e-7    -95.0%
      large/over/lens/th       7.15e-8    -1.5%
      large/over/lens/generic  2.67e-7    -73.6%
      large/deepseq            2.86e-7    -62.6%
      small/view/lens/th       6.87e-9    7.2%
      small/view/lens/generic  3.08e-8    -75.4%
      small/over/lens/th       1.82e-8    -1.3%
      small/over/lens/generic  6.23e-8    -71.1%
      small/deepseq            4.84e-8    -55.6%
      GeoMean (calculated)     1.00       -57.9%
      Edited by Andrzej Rybczak
    • Thanks! I suspect that you've picked the wrong run as the baseline though? I.e. the difference should be shown for the patched version.

      (To rule out other causes of perf differences, you could also try benchmarking against the patch's base commit.)

  • Author Contributor

    That's intentional: it shows how much slower the version without INLINE pragmas is. I added the other direction as well, to satisfy your curiosity :P

    Edited by Andrzej Rybczak
    • Those numbers definitely seem promising. My hope is that you can condense some small fragment of generic-lens + your test harness down into a test case that we can check into the GHC test suite.

      Not sure. Only more testing could tell. But looks like GHC is pretty good at optimizing a lot of stuff away (judging from my work on the optics library where I was surprised how much it could do, given enough inlining takes place) and the only thing that usually stands in its way is things not inlining.

      It's likely that GHC's decision to inline from/to depends on the surrounding context to some degree, but I think I agree that this is a pretty good litmus test for determining whether from/to will inline in general. As a result, I think that this would be a decent basis for a heuristic.

      Having flags unconditionally enabling/disabling inlining of from/to sounds like a good idea so that people can experiment with more "exotic" cases, though I'd say in addition to having these heuristics. They could be made a bit more conservative then, but it should "just work" in the most typical cases (at least pure sums/products).

      Sure, I definitely agree that the heuristics should be the default. Moreover, I was thinking of making the flags slightly more fine-grained than just a simple inline/don't-inline switch. For example, you note in !2965 (comment 261707) that after a certain number of products or sums, from/to stop inlining effectively. Perhaps we could introduce -fmax-inline-derived-generic-products/-fmax-inline-derived-generic-sums to allow the user to configure what these numbers should be? It's possible that there are better ways to specify these settings, but that seems like the general direction we would want to go with this.

      Edited by Ryan Scott
    • Author Contributor

      Luckily, the basic code for generating generic lenses is pretty short (around 100 lines) and doesn't require anything other than base, so it's feasible.

      I'm thinking that the test should have 2 modules: one where the generic lenses and data types (for the various constructor/field pairs used in the heuristics) are defined, and a second where we instantiate the lenses. Then we can grep the Core output of the second module to verify that no generics are left. Does that make sense?
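      As a minimal sketch (a stand-in for the ~100-line generic-lens core, which is omitted here), the use-site module could look like the following. The data type and the roundTrip helper are hypothetical test fixtures, not code from the MR: if the derived from/to inline, the -ddump-simpl output for this module should contain no GHC.Generics constructors such as M1, K1, or :*:.

```haskell
{-# LANGUAGE DeriveGeneric #-}
import GHC.Generics

-- Hypothetical test fixture: a small record with a derived Generic instance.
data Pair = Pair Int Int
  deriving (Eq, Show, Generic)

-- Round-trips through the generic representation; with from/to inlined,
-- this should simplify to the identity function in the dumped Core.
roundTrip :: Pair -> Pair
roundTrip = to . from
```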

      Perhaps we could introduce -fmax-inline-derived-generic-products/-fmax-inline-derived-generic-sums to allow the user to configure what these numbers should be?

      It's slightly tricky, because the maximum number of fields changes depending on the number of constructors, so ideally it would be something like -fmax-inline-derived-generic-sop=1,100 or -fmax-inline-derived-generic-sop=3,10, and you could pass multiple pairs to the compiler.

      However, I was thinking that perhaps we could use ANN pragmas to override heuristics and instruct GHC to generate INLINE pragmas per type? GHC.Generics could define something like

      data GenericInline
        = GenericAlwaysInline
        | GenericAutoInline 
        | GenericNoInline

      and you could then write

      {-# ANN type X GenericAlwaysInline #-} 
      data X = ...

      to bypass heuristics and always emit INLINE pragmas (or use GenericAutoInline for heuristics / GenericNoInline to never emit them).

    • I'm thinking that the test should have 2 modules - one where generic lenses and data types (for various constructor/field pairs used in heuristics) are defined and the second where we instantiate lenses. Then we can grep the core output of the second module to verify that no generics are left. Does that make sense?

      That makes sense to me.

      It's slightly tricky, because max fields change depending on number of constructors, so ideally it would be something like -fmax-inline-derived-generic-sop=1,100 or -fmax-inline-derived-generic-sop=3,10 and you could pass many pairs to the compiler.

      That is a good point. A sensible upper bound on the number of fields for a data type with one constructor may not be a sensible upper bound for multiple constructors. I like your suggested design of passing multiple sum/product pairs to communicate this information to GHC. Luckily, there is already some machinery in GHC's command-line parser to support this kind of use case. For instance, the -package-id flag allows passing multiple arguments in this fashion.

      Another tricky part about making this flag work is that a user could pass nonsense like -fmax-inline-derived-generic-sop=1,1,1,1,1,1. I think the simplest solution would be to have a pass over the user input that parses it into the format <int>,<int> and throws an error message if the input does not match that format.
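      That validation pass could be sketched like this. To be clear, the flag name is the hypothetical one from this discussion (it does not exist in GHC), and parseSopPair is an illustrative helper, not GHC code; nonsense like "1,1,1,1" is rejected because the second component is not a single integer.

```haskell
import Text.Read (readMaybe)

-- Hypothetical validation for a -fmax-inline-derived-generic-sop=<int>,<int>
-- argument: split on the first comma and require both halves to parse as Int.
parseSopPair :: String -> Either String (Int, Int)
parseSopPair s =
  case break (== ',') s of
    (consStr, ',' : fieldsStr)
      | Just cons   <- readMaybe consStr
      , Just fields <- readMaybe fieldsStr
      -> Right (cons, fields)
    _ -> Left ("malformed argument " ++ show s ++ "; expected <int>,<int>")
```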

      However, I was thinking that perhaps we could use ANN pragmas to override heuristics and instruct GHC to generate INLINE pragmas per type?

      That sounds incredibly fancy, and I'd be inclined to stick to a simpler solution unless absolutely necessary.

    • I think we definitely don't want to regress by another 43% in T5642. Deriving Generic for large sums already causes quite a horrible compile-time blowup.

    • Author Contributor

      Does this MR cause that? It would be really weird, because T5642 doesn't do anything with the generated instance.

      Edited by Andrzej Rybczak
    • Yes, this MR is most likely responsible for that increase. Although T5642 doesn't do anything with the generated instance, there are a few other ways in which you could explain the regression. It might store unfoldings, for example, where before it might not have stored them. Or introduce auxiliary bindings for the pragmas that have to be type-checked, which is exactly the scenario where T5642 blows up.

      It's hard to say without comparing -ddump-timings and -ddump-tc.

      Edited by Sebastian Graf
  • @arybczak Are you still looking into this?

  • Also at a glance I wonder why it's INLINE[1] instead of just INLINE.

  • Author Contributor

    @AndreasK no, I didn't touch it due to time constraints. I also don't depend on generics enough to make it happen.

    Feel free to pick it up if you want.

    I made it INLINE[1] because AFAIK it's the middle phase, so it gives the biggest potential to align it with your own code (assuming you hit a situation where it matters).
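    For context, phased INLINE activation works like this toy example (illustrative only, not the MR's actual Generic code). GHC's simplifier runs phases 2, 1, 0 by default, and INLINE [1] means "inline in phase 1 and later", so phase 2 still sees the function un-inlined and rewrite rules matching it can fire first.

```haskell
-- Toy demonstration of a phased INLINE pragma; addOne/addTwo are
-- hypothetical names chosen for illustration.
addOne :: Int -> Int
addOne x = x + 1
{-# INLINE [1] addOne #-}

addTwo :: Int -> Int
addTwo = addOne . addOne  -- addOne inlines here only during phases 1 and 0
```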

  • mentioned in issue #18523 (closed)

  • Andrzej Rybczak added 1080 commits

  • added 1 commit

  • added 1 commit

  • added 1 commit

  • Andrzej Rybczak added 27 commits

  • Andrzej Rybczak added 5 commits

  • added 1 commit
