Nested CPR light unleashed (#18174)
Nested CPR light unleashed (#18174 (closed))
This patch enables worker/wrapper for nested constructed products, as described
in Note [Nested CPR]
. The machinery for expressing Nested CPR was already
there, since !5054 (closed). Worker/wrapper is equipped to exploit Nested CPR annotations
since !5338 (merged). CPR analysis already handles applications in batches since !5753 (closed).
This patch just needs to flip a few more switches:
- In
cprTransformDataConWork
, we need to look at the field expressions and theirCprType
s to see whether the evaluation of the expressions terminates quickly (= is in HNF) or if they are put in strict fields. If that is the case, then we retain their CPR info and may unbox nestedly later on. More details inNote [Nested CPR]
. - Enable nested
ConCPR
signatures inGHC.Types.Cpr
. - In the
asConCpr
call inGHC.Core.Opt.WorkWrap.Utils
, pass CPR info of fields to theUnbox
. - Instead of giving CPR signatures to DataCon workers and wrappers, we now have
cprTransformDataConWork
for workers and treat wrappers by analysing their unfolding. As a result, the code from GHC.Types.Id.Make went away completely. - I deactivated worker/wrappering for recursive DataCons and wrote a function
isRecDataCon
to detect them. We really don't want to giverepeat
orreplicate
the Nested CPR property. See Note [CPR for recursive data structures] for which kind of recursive DataCons we target. - Fix a couple of tests and their outputs.
I also documented that CPR can destroy sharing and lead to asymptotic increase
in allocations (which is tracked by #13331 (closed)/#19326) in
Note [CPR for data structures can destroy sharing]
.
Nofib results:
--------------------------------------------------------------------------------
Program Allocs Instrs
--------------------------------------------------------------------------------
ben-raytrace -3.1% -0.4%
binary-trees +0.8% -2.9%
digits-of-e2 +5.8% +1.2%
event +0.8% -2.1%
fannkuch-redux +0.0% -1.4%
fish 0.0% -1.5%
gamteb -1.4% -0.3%
mkhprog +1.4% +0.8%
multiplier +0.0% -1.9%
pic -0.6% -0.1%
reptile -20.9% -17.8%
wave4main +4.8% +0.4%
x2n1 -100.0% -7.6%
--------------------------------------------------------------------------------
Min -95.0% -17.8%
Max +5.8% +1.2%
Geometric Mean -2.9% -0.4%
The huge wins in x2n1 (loopy list) and reptile (see #19970) are due to refraining from unboxing (:). Other benchmarks like digits-of-e2 or wave4main regress because of that. Ultimately there are no great improvements due to Nested CPR alone, but at least it's a win. Binary sizes decrease by 0.6%.
There are a significant number of metric decreases. The most notable ones (>1%):
ManyAlternatives(normal) ghc/alloc 771656002.7 762187472.0 -1.2%
ManyConstructors(normal) ghc/alloc 4191073418.7 4114369216.0 -1.8%
MultiLayerModules(normal) ghc/alloc 3095678333.3 3128720704.0 +1.1%
PmSeriesG(normal) ghc/alloc 50096429.3 51495664.0 +2.8%
PmSeriesS(normal) ghc/alloc 63512989.3 64681600.0 +1.8%
PmSeriesV(normal) ghc/alloc 62575424.0 63767208.0 +1.9%
T10547(normal) ghc/alloc 29347469.3 29944240.0 +2.0%
T11303b(normal) ghc/alloc 46018752.0 47367576.0 +2.9%
T12150(optasm) ghc/alloc 81660890.7 82547696.0 +1.1%
T12234(optasm) ghc/alloc 59451253.3 60357952.0 +1.5%
T12545(normal) ghc/alloc 1705216250.7 1751278952.0 +2.7%
T12707(normal) ghc/alloc 981000472.0 968489800.0 -1.3% GOOD
T13056(optasm) ghc/alloc 389322664.0 372495160.0 -4.3% GOOD
T13253(normal) ghc/alloc 337174229.3 341954576.0 +1.4%
T13701(normal) ghc/alloc 2381455173.3 2439790328.0 +2.4% BAD
T14052(ghci) ghc/alloc 2162530642.7 2139108784.0 -1.1%
T14683(normal) ghc/alloc 3049744728.0 2977535064.0 -2.4% GOOD
T14697(normal) ghc/alloc 362980213.3 369304512.0 +1.7%
T15164(normal) ghc/alloc 1323102752.0 1307480600.0 -1.2%
T15304(normal) ghc/alloc 1304607429.3 1291024568.0 -1.0%
T16190(normal) ghc/alloc 281450410.7 284878048.0 +1.2%
T16577(normal) ghc/alloc 7984960789.3 7811668768.0 -2.2% GOOD
T17516(normal) ghc/alloc 1171051192.0 1153649664.0 -1.5%
T17836(normal) ghc/alloc 1115569746.7 1098197592.0 -1.6%
T17836b(normal) ghc/alloc 54322597.3 55518216.0 +2.2%
T17977(normal) ghc/alloc 47071754.7 48403408.0 +2.8%
T17977b(normal) ghc/alloc 42579133.3 43977392.0 +3.3%
T18923(normal) ghc/alloc 71764237.3 72566240.0 +1.1%
T1969(normal) ghc/alloc 784821002.7 773971776.0 -1.4% GOOD
T3294(normal) ghc/alloc 1634913973.3 1614323584.0 -1.3% GOOD
T4801(normal) ghc/alloc 295619648.0 292776440.0 -1.0%
T5321FD(normal) ghc/alloc 278827858.7 276067280.0 -1.0%
T5631(normal) ghc/alloc 586618202.7 577579960.0 -1.5%
T5642(normal) ghc/alloc 494923048.0 487927208.0 -1.4%
T5837(normal) ghc/alloc 37758061.3 39261608.0 +4.0%
T9020(optasm) ghc/alloc 257362077.3 254672416.0 -1.0%
T9198(normal) ghc/alloc 49313365.3 50603936.0 +2.6% BAD
T9233(normal) ghc/alloc 704944258.7 685692712.0 -2.7% GOOD
T9630(normal) ghc/alloc 1476621560.0 1455192784.0 -1.5%
T9675(optasm) ghc/alloc 443183173.3 433859696.0 -2.1% GOOD
T9872a(normal) ghc/alloc 1720926653.3 1693190072.0 -1.6% GOOD
T9872b(normal) ghc/alloc 2185618061.3 2162277568.0 -1.1% GOOD
T9872c(normal) ghc/alloc 1765842405.3 1733618088.0 -1.8% GOOD
TcPlugin_RewritePerf(normal) ghc/alloc 2388882730.7 2365504696.0 -1.0%
WWRec(normal) ghc/alloc 607073186.7 597512216.0 -1.6%
T9203(normal) run/alloc 107284064.0 102881832.0 -4.1%
haddock.Cabal(normal) run/alloc 24025329589.3 23768382560.0 -1.1%
haddock.base(normal) run/alloc 25660521653.3 25370321824.0 -1.1%
haddock.compiler(normal) run/alloc 74064171706.7 73358712280.0 -1.0%
The biggest exception to the rule is T13701 which seems to fluctuate as usual (not unlike T12545). T14697 has a similar quality, being a generated multi-module test. T5837 is small enough that it similarly doesn't measure anything significant besides module loading overhead. T13253 simply does one additional round of Simplification due to Nested CPR.
There are also some apparent regressions in T9198 and PmSeriesG that we (@mpickering and I) were simply unable to reproduce locally. @mpickering tried to run the CI script in a local Docker container and actually found that T9198 and PmSeriesG improved. Let's not get hung up on these regression tests, they were meant to test for asymptotic regressions.