Skip to content
Snippets Groups Projects
Forked from Glasgow Haskell Compiler / GHC
Source project has a limited visibility.
  • Sebastian Graf's avatar
    c261f220
    Nested CPR light unleashed (#18174) · c261f220
    Sebastian Graf authored and Marge Bot's avatar Marge Bot committed
    This patch enables worker/wrapper for nested constructed products, as described
    in `Note [Nested CPR]`. The machinery for expressing Nested CPR was already
    there, since !5054. Worker/wrapper is equipped to exploit Nested CPR annotations
    since !5338. CPR analysis already handles applications in batches since !5753.
    This patch just needs to flip a few more switches:
    
    1. In `cprTransformDataConWork`, we need to look at the field expressions
       and their `CprType`s to see whether the evaluation of the expressions
       terminates quickly (= is in HNF) or if they are put in strict fields.
       If that is the case, then we retain their CPR info and may unbox nestedly
       later on. More details in `Note [Nested CPR]`.
    2. Enable nested `ConCPR` signatures in `GHC.Types.Cpr`.
    3. In the `asConCpr` call in `GHC.Core.Opt.WorkWrap.Utils`, pass CPR info of
       fields to the `Unbox`.
    4. Instead of giving CPR signatures to DataCon workers and wrappers, we now have
       `cprTransformDataConWork` for workers and treat wrappers by analysing their
       unfolding. As a result, the code from GHC.Types.Id.Make went away completely.
    5. I deactivated worker/wrappering for recursive DataCons and wrote a function
       `isRecDataCon` to detect them. We really don't want to give `repeat` or
       `replicate` the Nested CPR property.
       See Note [CPR for recursive data structures] for which kind of recursive
       DataCons we target.
    6. Fix a couple of tests and their outputs.
    
    I also documented that CPR can destroy sharing and lead to asymptotic increase
    in allocations (which is tracked by #13331/#19326) in
    `Note [CPR for data structures can destroy sharing]`.
    
    Nofib results:
    ```
    --------------------------------------------------------------------------------
            Program         Allocs    Instrs
    --------------------------------------------------------------------------------
       ben-raytrace          -3.1%     -0.4%
       binary-trees          +0.8%     -2.9%
       digits-of-e2          +5.8%     +1.2%
              event          +0.8%     -2.1%
     fannkuch-redux          +0.0%     -1.4%
               fish           0.0%     -1.5%
             gamteb          -1.4%     -0.3%
            mkhprog          +1.4%     +0.8%
         multiplier          +0.0%     -1.9%
                pic          -0.6%     -0.1%
            reptile         -20.9%    -17.8%
          wave4main          +4.8%     +0.4%
               x2n1        -100.0%     -7.6%
    --------------------------------------------------------------------------------
                Min         -95.0%    -17.8%
                Max          +5.8%     +1.2%
     Geometric Mean          -2.9%     -0.4%
    ```
    The huge wins in x2n1 (loopy list) and reptile (see #19970) are due to
    refraining from unboxing (:). Other benchmarks like digits-of-e2 or wave4main
    regress because of that. Ultimately there are no great improvements due to
    Nested CPR alone, but at least it's a win.
    Binary sizes decrease by 0.6%.
    
    There are a significant number of metric decreases. The most notable ones (>1%):
    ```
           ManyAlternatives(normal) ghc/alloc   771656002.7   762187472.0  -1.2%
           ManyConstructors(normal) ghc/alloc  4191073418.7  4114369216.0  -1.8%
          MultiLayerModules(normal) ghc/alloc  3095678333.3  3128720704.0  +1.1%
                  PmSeriesG(normal) ghc/alloc    50096429.3    51495664.0  +2.8%
                  PmSeriesS(normal) ghc/alloc    63512989.3    64681600.0  +1.8%
                  PmSeriesV(normal) ghc/alloc    62575424.0    63767208.0  +1.9%
                     T10547(normal) ghc/alloc    29347469.3    29944240.0  +2.0%
                    T11303b(normal) ghc/alloc    46018752.0    47367576.0  +2.9%
                     T12150(optasm) ghc/alloc    81660890.7    82547696.0  +1.1%
                     T12234(optasm) ghc/alloc    59451253.3    60357952.0  +1.5%
                     T12545(normal) ghc/alloc  1705216250.7  1751278952.0  +2.7%
                     T12707(normal) ghc/alloc   981000472.0   968489800.0  -1.3% GOOD
                     T13056(optasm) ghc/alloc   389322664.0   372495160.0  -4.3% GOOD
                     T13253(normal) ghc/alloc   337174229.3   341954576.0  +1.4%
                     T13701(normal) ghc/alloc  2381455173.3  2439790328.0  +2.4%  BAD
                       T14052(ghci) ghc/alloc  2162530642.7  2139108784.0  -1.1%
                     T14683(normal) ghc/alloc  3049744728.0  2977535064.0  -2.4% GOOD
                     T14697(normal) ghc/alloc   362980213.3   369304512.0  +1.7%
                     T15164(normal) ghc/alloc  1323102752.0  1307480600.0  -1.2%
                     T15304(normal) ghc/alloc  1304607429.3  1291024568.0  -1.0%
                     T16190(normal) ghc/alloc   281450410.7   284878048.0  +1.2%
                     T16577(normal) ghc/alloc  7984960789.3  7811668768.0  -2.2% GOOD
                     T17516(normal) ghc/alloc  1171051192.0  1153649664.0  -1.5%
                     T17836(normal) ghc/alloc  1115569746.7  1098197592.0  -1.6%
                    T17836b(normal) ghc/alloc    54322597.3    55518216.0  +2.2%
                     T17977(normal) ghc/alloc    47071754.7    48403408.0  +2.8%
                    T17977b(normal) ghc/alloc    42579133.3    43977392.0  +3.3%
                     T18923(normal) ghc/alloc    71764237.3    72566240.0  +1.1%
                      T1969(normal) ghc/alloc   784821002.7   773971776.0  -1.4% GOOD
                      T3294(normal) ghc/alloc  1634913973.3  1614323584.0  -1.3% GOOD
                      T4801(normal) ghc/alloc   295619648.0   292776440.0  -1.0%
                    T5321FD(normal) ghc/alloc   278827858.7   276067280.0  -1.0%
                      T5631(normal) ghc/alloc   586618202.7   577579960.0  -1.5%
                      T5642(normal) ghc/alloc   494923048.0   487927208.0  -1.4%
                      T5837(normal) ghc/alloc    37758061.3    39261608.0  +4.0%
                      T9020(optasm) ghc/alloc   257362077.3   254672416.0  -1.0%
                      T9198(normal) ghc/alloc    49313365.3    50603936.0  +2.6%  BAD
                      T9233(normal) ghc/alloc   704944258.7   685692712.0  -2.7% GOOD
                      T9630(normal) ghc/alloc  1476621560.0  1455192784.0  -1.5%
                      T9675(optasm) ghc/alloc   443183173.3   433859696.0  -2.1% GOOD
                     T9872a(normal) ghc/alloc  1720926653.3  1693190072.0  -1.6% GOOD
                     T9872b(normal) ghc/alloc  2185618061.3  2162277568.0  -1.1% GOOD
                     T9872c(normal) ghc/alloc  1765842405.3  1733618088.0  -1.8% GOOD
       TcPlugin_RewritePerf(normal) ghc/alloc  2388882730.7  2365504696.0  -1.0%
                      WWRec(normal) ghc/alloc   607073186.7   597512216.0  -1.6%
    
                      T9203(normal) run/alloc   107284064.0   102881832.0  -4.1%
              haddock.Cabal(normal) run/alloc 24025329589.3 23768382560.0  -1.1%
               haddock.base(normal) run/alloc 25660521653.3 25370321824.0  -1.1%
           haddock.compiler(normal) run/alloc 74064171706.7 73358712280.0  -1.0%
    ```
    The biggest exception to the rule is T13701 which seems to fluctuate as usual
    (not unlike T12545). T14697 has a similar quality, being a generated
    multi-module test. T5837 is small enough that it similarly doesn't measure
    anything significant besides module loading overhead.
    T13253 simply does one additional round of Simplification due to Nested CPR.
    
    There are also some apparent regressions in T9198, T12234 and PmSeriesG that we
    (@mpickering and I) were simply unable to reproduce locally. @mpickering tried
    to run the CI script in a local Docker container and actually found that T9198
    and PmSeriesG *improved*. In MRs that were rebased on top this one, like !4229,
    I did not experience such increases. Let's not get hung up on these regression
    tests, they were meant to test for asymptotic regressions.
    
    The build-cabal test improves by 1.2% in -O0.
    
    Metric Increase:
        T10421
        T12234
        T12545
        T13035
        T13056
        T13701
        T14697
        T18923
        T5837
        T9198
    Metric Decrease:
        ManyConstructors
        T12545
        T12707
        T13056
        T14683
        T16577
        T18223
        T1969
        T3294
        T9203
        T9233
        T9675
        T9872a
        T9872b
        T9872c
        T9961
        TcPlugin_RewritePerf
    c261f220
    History
    Nested CPR light unleashed (#18174)
    Sebastian Graf authored and Marge Bot's avatar Marge Bot committed
    This patch enables worker/wrapper for nested constructed products, as described
    in `Note [Nested CPR]`. The machinery for expressing Nested CPR was already
    there, since !5054. Worker/wrapper is equipped to exploit Nested CPR annotations
    since !5338. CPR analysis already handles applications in batches since !5753.
    This patch just needs to flip a few more switches:
    
    1. In `cprTransformDataConWork`, we need to look at the field expressions
       and their `CprType`s to see whether the evaluation of the expressions
       terminates quickly (= is in HNF) or if they are put in strict fields.
       If that is the case, then we retain their CPR info and may unbox nestedly
       later on. More details in `Note [Nested CPR]`.
    2. Enable nested `ConCPR` signatures in `GHC.Types.Cpr`.
    3. In the `asConCpr` call in `GHC.Core.Opt.WorkWrap.Utils`, pass CPR info of
       fields to the `Unbox`.
    4. Instead of giving CPR signatures to DataCon workers and wrappers, we now have
       `cprTransformDataConWork` for workers and treat wrappers by analysing their
       unfolding. As a result, the code from GHC.Types.Id.Make went away completely.
    5. I deactivated worker/wrappering for recursive DataCons and wrote a function
       `isRecDataCon` to detect them. We really don't want to give `repeat` or
       `replicate` the Nested CPR property.
       See Note [CPR for recursive data structures] for which kind of recursive
       DataCons we target.
    6. Fix a couple of tests and their outputs.
    
    I also documented that CPR can destroy sharing and lead to asymptotic increase
    in allocations (which is tracked by #13331/#19326) in
    `Note [CPR for data structures can destroy sharing]`.
    
    Nofib results:
    ```
    --------------------------------------------------------------------------------
            Program         Allocs    Instrs
    --------------------------------------------------------------------------------
       ben-raytrace          -3.1%     -0.4%
       binary-trees          +0.8%     -2.9%
       digits-of-e2          +5.8%     +1.2%
              event          +0.8%     -2.1%
     fannkuch-redux          +0.0%     -1.4%
               fish           0.0%     -1.5%
             gamteb          -1.4%     -0.3%
            mkhprog          +1.4%     +0.8%
         multiplier          +0.0%     -1.9%
                pic          -0.6%     -0.1%
            reptile         -20.9%    -17.8%
          wave4main          +4.8%     +0.4%
               x2n1        -100.0%     -7.6%
    --------------------------------------------------------------------------------
                Min         -95.0%    -17.8%
                Max          +5.8%     +1.2%
     Geometric Mean          -2.9%     -0.4%
    ```
    The huge wins in x2n1 (loopy list) and reptile (see #19970) are due to
    refraining from unboxing (:). Other benchmarks like digits-of-e2 or wave4main
    regress because of that. Ultimately there are no great improvements due to
    Nested CPR alone, but at least it's a win.
    Binary sizes decrease by 0.6%.
    
    There are a significant number of metric decreases. The most notable ones (>1%):
    ```
           ManyAlternatives(normal) ghc/alloc   771656002.7   762187472.0  -1.2%
           ManyConstructors(normal) ghc/alloc  4191073418.7  4114369216.0  -1.8%
          MultiLayerModules(normal) ghc/alloc  3095678333.3  3128720704.0  +1.1%
                  PmSeriesG(normal) ghc/alloc    50096429.3    51495664.0  +2.8%
                  PmSeriesS(normal) ghc/alloc    63512989.3    64681600.0  +1.8%
                  PmSeriesV(normal) ghc/alloc    62575424.0    63767208.0  +1.9%
                     T10547(normal) ghc/alloc    29347469.3    29944240.0  +2.0%
                    T11303b(normal) ghc/alloc    46018752.0    47367576.0  +2.9%
                     T12150(optasm) ghc/alloc    81660890.7    82547696.0  +1.1%
                     T12234(optasm) ghc/alloc    59451253.3    60357952.0  +1.5%
                     T12545(normal) ghc/alloc  1705216250.7  1751278952.0  +2.7%
                     T12707(normal) ghc/alloc   981000472.0   968489800.0  -1.3% GOOD
                     T13056(optasm) ghc/alloc   389322664.0   372495160.0  -4.3% GOOD
                     T13253(normal) ghc/alloc   337174229.3   341954576.0  +1.4%
                     T13701(normal) ghc/alloc  2381455173.3  2439790328.0  +2.4%  BAD
                       T14052(ghci) ghc/alloc  2162530642.7  2139108784.0  -1.1%
                     T14683(normal) ghc/alloc  3049744728.0  2977535064.0  -2.4% GOOD
                     T14697(normal) ghc/alloc   362980213.3   369304512.0  +1.7%
                     T15164(normal) ghc/alloc  1323102752.0  1307480600.0  -1.2%
                     T15304(normal) ghc/alloc  1304607429.3  1291024568.0  -1.0%
                     T16190(normal) ghc/alloc   281450410.7   284878048.0  +1.2%
                     T16577(normal) ghc/alloc  7984960789.3  7811668768.0  -2.2% GOOD
                     T17516(normal) ghc/alloc  1171051192.0  1153649664.0  -1.5%
                     T17836(normal) ghc/alloc  1115569746.7  1098197592.0  -1.6%
                    T17836b(normal) ghc/alloc    54322597.3    55518216.0  +2.2%
                     T17977(normal) ghc/alloc    47071754.7    48403408.0  +2.8%
                    T17977b(normal) ghc/alloc    42579133.3    43977392.0  +3.3%
                     T18923(normal) ghc/alloc    71764237.3    72566240.0  +1.1%
                      T1969(normal) ghc/alloc   784821002.7   773971776.0  -1.4% GOOD
                      T3294(normal) ghc/alloc  1634913973.3  1614323584.0  -1.3% GOOD
                      T4801(normal) ghc/alloc   295619648.0   292776440.0  -1.0%
                    T5321FD(normal) ghc/alloc   278827858.7   276067280.0  -1.0%
                      T5631(normal) ghc/alloc   586618202.7   577579960.0  -1.5%
                      T5642(normal) ghc/alloc   494923048.0   487927208.0  -1.4%
                      T5837(normal) ghc/alloc    37758061.3    39261608.0  +4.0%
                      T9020(optasm) ghc/alloc   257362077.3   254672416.0  -1.0%
                      T9198(normal) ghc/alloc    49313365.3    50603936.0  +2.6%  BAD
                      T9233(normal) ghc/alloc   704944258.7   685692712.0  -2.7% GOOD
                      T9630(normal) ghc/alloc  1476621560.0  1455192784.0  -1.5%
                      T9675(optasm) ghc/alloc   443183173.3   433859696.0  -2.1% GOOD
                     T9872a(normal) ghc/alloc  1720926653.3  1693190072.0  -1.6% GOOD
                     T9872b(normal) ghc/alloc  2185618061.3  2162277568.0  -1.1% GOOD
                     T9872c(normal) ghc/alloc  1765842405.3  1733618088.0  -1.8% GOOD
       TcPlugin_RewritePerf(normal) ghc/alloc  2388882730.7  2365504696.0  -1.0%
                      WWRec(normal) ghc/alloc   607073186.7   597512216.0  -1.6%
    
                      T9203(normal) run/alloc   107284064.0   102881832.0  -4.1%
              haddock.Cabal(normal) run/alloc 24025329589.3 23768382560.0  -1.1%
               haddock.base(normal) run/alloc 25660521653.3 25370321824.0  -1.1%
           haddock.compiler(normal) run/alloc 74064171706.7 73358712280.0  -1.0%
    ```
    The biggest exception to the rule is T13701 which seems to fluctuate as usual
    (not unlike T12545). T14697 has a similar quality, being a generated
    multi-module test. T5837 is small enough that it similarly doesn't measure
    anything significant besides module loading overhead.
    T13253 simply does one additional round of Simplification due to Nested CPR.
    
    There are also some apparent regressions in T9198, T12234 and PmSeriesG that we
    (@mpickering and I) were simply unable to reproduce locally. @mpickering tried
    to run the CI script in a local Docker container and actually found that T9198
    and PmSeriesG *improved*. In MRs that were rebased on top this one, like !4229,
    I did not experience such increases. Let's not get hung up on these regression
    tests, they were meant to test for asymptotic regressions.
    
    The build-cabal test improves by 1.2% in -O0.
    
    Metric Increase:
        T10421
        T12234
        T12545
        T13035
        T13056
        T13701
        T14697
        T18923
        T5837
        T9198
    Metric Decrease:
        ManyConstructors
        T12545
        T12707
        T13056
        T14683
        T16577
        T18223
        T1969
        T3294
        T9203
        T9233
        T9675
        T9872a
        T9872b
        T9872c
        T9961
        TcPlugin_RewritePerf
Code owners
Assign users and groups as approvers for specific file changes. Learn more.