Code generation should not use the pretty-printer
Currently, GHC uses SDoc
to serialize assembly generated by the backend during codegen. This is needlessly expensive: SDoc
is an implementation of Hughes–Jones pretty-printing combinators, which are designed primarily to nicely print things that need complex two-dimensional layout (such as Haskell code). It is not designed for high throughput serialization of megabytes’ worth of data intended exclusively for machine consumption, but that is what the backend is using it for.
This choice has a real and significant performance cost, particularly on programs that generate a lot of code. Recently, I’ve been using the following program as an extreme test case:
module M where
data T
= C000 Int
| C001 Int
| C002 Int
| C003 Int
| C004 Int
| C005 Int
... -- etc.
| C997 Int
| C998 Int
| C999 Int
deriving (Read)
This seems to be a pretty good test of the code generator, as it takes fairly little time to typecheck but generates a very large amount of code (and code size is not significantly improved by -O
). On my machine, using GHC HEAD, it generates a 4.7M .o
file but only a 90K .hi
file!
Building GHC with profiling (and inserting all cost centres by hand to avoid optimization changes) reveals that GHC spends roughly 15% of its time and 17% of its allocations inside pprNativeCode
. Now, sure, at the end of the day, there’s a lot of code to print, so any implementation is going to take a decent chunk of time, but the current implementation is very much not I/O-bound: only about 0.2% of the total execution is spent inside hPutBuf
(measured via the eventlog, not cost centre profiling). Can we do better?
The answer is yes. I have built a proof-of-concept reimplementation of the subset of the SDoc
API used by the x86 pretty-printer, which I have christened FDoc
(for fast). Its definition is extremely simple:
newtype FDoc = FDoc' { runFDoc :: SDocContext -> BufHandle -> IO () }
pattern FDoc :: (SDocContext -> BufHandle -> IO ()) -> FDoc
pattern FDoc f <- FDoc' f
where FDoc f = FDoc' (oneShot (\ctx -> oneShot (\h -> f ctx h)))
The x86 pretty-printer does use some Outputable
instances and other existing functions for printing simple datatypes, and it would be unfortunate not to continue to share code with those types. For that reason, I have also created a simple IsDoc
typeclass that allows those functions to be parameterized over the choice of implementation. Since specialization of those definitions to FDoc
is crucial for performance, I have inserted SPECIALIZE
pragmas in the appropriate places to ensure it occurs, and I’ve also implemented a simple workaround for #21851 (closed) to ensure they actually work.
The results speak for themselves: compiling the above program with my branch results in an 8% reduction in compile time and a 14% reduction in allocations. Compare the relevant lines from the top of the .prof
output:
COST CENTRE MODULE %time %alloc ticks bytes
pprNativeCode GHC.CmmToAsm 15.0 16.7 336 633278872
pprNativeCodeF GHC.CmmToAsm 8.5 3.6 176 116213856
Hard to argue with that!
Perf test results
Maybe you think my example program above is too artificial to be meaningful—which is fair. In that case, I direct your attention to the compile-time perf test results:
Metrics: compile_time/bytes allocated
-------------------------------------
Baseline
Test Metric value New value Change
----------------------------------------------------------------------------------------------------
CoOpt_Read(normal) ghc/alloc 834,316,696 798,903,488 -4.2% GOOD
CoOpt_Singletons(normal) ghc/alloc 940,293,640 920,246,816 -2.1% GOOD
InstanceMatching(normal) ghc/alloc 5,067,193,136 5,067,297,040 +0.0%
InstanceMatching1(normal) ghc/alloc 27,413,499,160 27,414,265,712 +0.0%
LargeRecord(normal) ghc/alloc 6,046,198,760 6,017,371,936 -0.5%
ManyAlternatives(normal) ghc/alloc 730,153,424 652,233,864 -10.7% GOOD
ManyConstructors(normal) ghc/alloc 3,916,818,240 3,358,774,128 -14.2% GOOD
MultiComponentModules(normal) ghc/alloc 1,823,975,360 1,825,185,360 +0.1%
MultiComponentModulesRecomp(normal) ghc/alloc 778,488,720 779,331,952 +0.1%
MultiLayerModules(normal) ghc/alloc 3,106,772,728 3,090,409,616 -0.5%
MultiLayerModulesRecomp(normal) ghc/alloc 862,765,832 862,793,560 +0.0%
PmSeriesG(normal) ghc/alloc 43,850,400 43,743,536 -0.2%
PmSeriesS(normal) ghc/alloc 53,321,568 51,700,112 -3.0%
PmSeriesT(normal) ghc/alloc 75,761,432 73,342,400 -3.2%
PmSeriesV(normal) ghc/alloc 52,610,360 51,223,784 -2.6%
T10421(normal) ghc/alloc 112,318,600 109,044,520 -2.9% GOOD
T10421a(normal) ghc/alloc 78,632,528 76,369,312 -2.9%
T10547(normal) ghc/alloc 28,120,208 28,205,632 +0.3%
T10858(normal) ghc/alloc 131,654,520 128,436,176 -2.4%
T11195(normal) ghc/alloc 236,392,160 220,246,248 -6.8%
T11276(normal) ghc/alloc 96,422,064 91,458,728 -5.1%
T11303b(normal) ghc/alloc 40,033,248 38,983,120 -2.6%
T11374(normal) ghc/alloc 191,151,152 173,878,208 -9.0%
T11545(normal) ghc/alloc 85,428,920 83,892,488 -1.8%
T11822(normal) ghc/alloc 105,481,088 100,198,504 -5.0%
T12150(optasm) ghc/alloc 80,091,192 79,574,328 -0.6%
T12227(normal) ghc/alloc 480,801,096 478,119,288 -0.6%
T12234(optasm) ghc/alloc 56,239,272 55,211,352 -1.8%
T12425(optasm) ghc/alloc 89,000,008 86,835,496 -2.4% GOOD
T12545(normal) ghc/alloc 1,614,836,736 1,581,737,536 -2.0%
T12707(normal) ghc/alloc 943,198,512 835,765,336 -11.4% GOOD
T13035(normal) ghc/alloc 101,145,384 93,610,384 -7.4% GOOD
T13056(optasm) ghc/alloc 348,738,104 337,820,976 -3.1% GOOD
T13253(normal) ghc/alloc 344,001,760 333,103,016 -3.2% GOOD
T13253-spj(normal) ghc/alloc 124,757,880 122,754,856 -1.6%
T13379(normal) ghc/alloc 344,059,528 297,105,760 -13.6% GOOD
T13386(normal) ghc/alloc 867,246,488 867,340,512 +0.0%
T13701(normal) ghc/alloc 2,401,607,832 2,379,588,816 -0.9%
T13719(normal) ghc/alloc 5,133,368,464 5,068,763,048 -1.3%
T14052(ghci) ghc/alloc 3,623,898,856 3,623,608,696 -0.0%
T14052Type(ghci) ghc/alloc 6,318,686,368 6,318,689,368 +0.0%
T14683(normal) ghc/alloc 2,822,724,976 2,741,515,224 -2.9% GOOD
T14697(normal) ghc/alloc 348,999,856 346,627,920 -0.7%
T14766(normal) ghc/alloc 968,033,672 964,491,704 -0.4%
T15164(normal) ghc/alloc 1,296,077,648 1,259,272,608 -2.8% GOOD
T15304(normal) ghc/alloc 1,291,102,856 1,264,968,520 -2.0%
T15630(normal) ghc/alloc 160,882,768 155,745,896 -3.2%
T15703(normal) ghc/alloc 517,521,464 509,034,712 -1.6% GOOD
T16190(normal) ghc/alloc 276,731,064 275,985,352 -0.3%
T16577(normal) ghc/alloc 8,008,824,896 7,805,268,512 -2.5% GOOD
T16875(normal) ghc/alloc 34,507,936 34,493,032 -0.0%
T17096(normal) ghc/alloc 216,684,928 199,022,608 -8.2%
T17516(normal) ghc/alloc 1,814,902,000 1,746,721,712 -3.8%
T17836(normal) ghc/alloc 831,324,296 814,822,504 -2.0%
T17836b(normal) ghc/alloc 45,100,928 44,602,128 -1.1%
T17977(normal) ghc/alloc 39,992,784 39,109,248 -2.2%
T17977b(normal) ghc/alloc 37,763,552 37,021,296 -2.0%
T18140(normal) ghc/alloc 75,919,896 73,776,360 -2.8% GOOD
T18223(normal) ghc/alloc 917,679,264 917,794,784 +0.0%
T18282(normal) ghc/alloc 149,694,848 144,274,816 -3.6% GOOD
T18304(normal) ghc/alloc 77,386,584 75,588,928 -2.3% GOOD
T18478(normal) ghc/alloc 497,442,784 462,063,968 -7.1%
T18698a(normal) ghc/alloc 200,411,024 195,535,728 -2.4% GOOD
T18698b(normal) ghc/alloc 221,748,200 216,613,120 -2.3% GOOD
T18923(normal) ghc/alloc 66,822,240 65,274,848 -2.3% GOOD
T1969(normal) ghc/alloc 757,023,352 709,342,728 -6.3% GOOD
T19695(normal) ghc/alloc 1,447,007,976 1,417,865,384 -2.0% GOOD
T20049(normal) ghc/alloc 90,615,712 87,490,976 -3.4% GOOD
T20261(normal) ghc/alloc 599,778,488 571,749,264 -4.7%
T3064(normal) ghc/alloc 179,712,856 171,276,088 -4.7% GOOD
T3294(normal) ghc/alloc 1,613,420,072 1,238,737,800 -23.2% GOOD
T4801(normal) ghc/alloc 300,937,032 260,468,904 -13.4% GOOD
T5030(normal) ghc/alloc 350,472,056 344,260,232 -1.8%
T5321FD(normal) ghc/alloc 249,406,896 226,267,976 -9.3% GOOD
T5321Fun(normal) ghc/alloc 277,355,216 254,469,384 -8.3% GOOD
T5631(normal) ghc/alloc 530,485,000 513,169,032 -3.3% GOOD
T5642(normal) ghc/alloc 469,336,744 454,260,192 -3.2% GOOD
T5837(normal) ghc/alloc 35,715,208 35,533,408 -0.5%
T6048(optasm) ghc/alloc 102,030,544 99,036,056 -2.9% GOOD
T783(normal) ghc/alloc 383,823,952 349,182,760 -9.0% GOOD
T8095(normal) ghc/alloc 3,158,851,176 3,157,478,568 -0.0%
T9020(optasm) ghc/alloc 240,738,200 240,763,272 +0.0%
T9198(normal) ghc/alloc 45,983,824 43,555,504 -5.3% GOOD
T9233(normal) ghc/alloc 719,275,416 704,235,328 -2.1% GOOD
T9630(normal) ghc/alloc 1,523,816,720 1,502,597,208 -1.4%
T9675(optasm) ghc/alloc 437,959,392 431,927,408 -1.4%
T9872a(normal) ghc/alloc 1,667,045,816 1,667,142,368 +0.0%
T9872b(normal) ghc/alloc 1,920,713,560 1,920,802,464 +0.0%
T9872b_defer(normal) ghc/alloc 2,954,095,232 2,952,954,960 -0.0%
T9872c(normal) ghc/alloc 1,625,876,288 1,625,975,944 +0.0%
T9872d(normal) ghc/alloc 398,157,568 394,765,728 -0.9%
T9961(normal) ghc/alloc 358,331,496 346,361,232 -3.3% GOOD
TcPlugin_RewritePerf(normal) ghc/alloc 2,119,899,416 208,657,088 -90.2% GOOD
WWRec(normal) ghc/alloc 620,009,784 604,621,528 -2.5% GOOD
hard_hole_fits(normal) ghc/alloc 431,498,744 425,808,584 -1.3%
hie002(normal) ghc/alloc 9,043,362,224 9,045,110,440 +0.0%
parsing001(normal) ghc/alloc 451,463,344 451,518,952 +0.0%
geo. mean -5.5%
minimum -90.2%
maximum +0.3%
* All baselines were measured in the same environment as this test run
* All baseline commits are 460505345e
Metrics: compile_time/max_bytes_used
------------------------------------
Baseline
Test Metric value New value Change
---------------------------------------------------------------------------
MultiLayerModulesDefsGhci(ghci) ghc/max 79,437,944 79,455,504 +0.0%
MultiLayerModulesNoCode(ghci) ghc/max 44,084,768 44,097,016 +0.0%
T10370(optasm) ghc/max 23,804,688 23,813,968 +0.0%
T11545(normal) ghc/max 10,688,272 10,711,464 +0.2%
T15304(normal) ghc/max 27,649,584 27,661,816 +0.0%
T15630(normal) ghc/max 8,919,080 8,913,808 -0.1%
T18698a(normal) ghc/max 13,604,312 13,645,072 +0.3%
T18698b(normal) ghc/max 13,705,760 13,737,560 +0.2%
T1969(normal) ghc/max 19,613,520 19,610,096 -0.0%
T20261(normal) ghc/max 38,491,744 38,529,448 +0.1%
T3064(normal) ghc/max 14,013,440 13,430,664 -4.2%
T3294(normal) ghc/max 33,041,704 33,043,208 +0.0%
T9630(normal) ghc/max 43,353,416 43,253,488 -0.2%
T9675(optasm) ghc/max 16,309,680 16,147,640 -1.0%
geo. mean -0.3%
minimum -4.2%
maximum +0.3%
* All baselines were measured in the same environment as this test run
* All baseline commits are 460505345e
Metrics: compile_time/peak_megabytes_allocated
----------------------------------------------
Baseline
Test Metric value New value Change
----------------------------------------------------------------------------
MultiLayerModulesDefsGhci(ghci) ghc/peak 210 210 +0.0%
MultiLayerModulesNoCode(ghci) ghc/peak 116 116 +0.0%
T10370(optasm) ghc/peak 60 60 +0.0%
T11545(normal) ghc/peak 28 28 +0.0%
T15304(normal) ghc/peak 73 73 +0.0%
T15630(normal) ghc/peak 25 26 +4.0%
T18698a(normal) ghc/peak 36 36 +0.0%
T18698b(normal) ghc/peak 37 37 +0.0%
T1969(normal) ghc/peak 50 50 +0.0%
T20261(normal) ghc/peak 92 93 +1.1%
T3064(normal) ghc/peak 36 36 +0.0%
T3294(normal) ghc/peak 89 89 +0.0%
T9630(normal) ghc/peak 111 110 -0.9%
T9675(optasm) ghc/peak 41 43 +4.9%
geo. mean +0.6%
minimum -0.9%
maximum +4.9%
* All baselines were measured in the same environment as this test run
* All baseline commits are 460505345e
The -90.2%
result on TcPlugin_RewritePerf
seems probably too good to be true—I haven’t investigated it yet—but if we ignore that outlier, there are still a lot of tests with savings of 10–20%!
Try it yourself
My branch is available as lexi.lambda/ghc:codegen-without-sdoc. The test suite results are from hadrian/build-cabal --flavour=perf
and the profiling results are from --flavour=perf+profiled_ghc+no_dynamic_ghc
. My branch includes some minor patches to Hadrian to disable -fprof-late
(which I didn’t find helpful) and to enable compiling GHC with -fsimpl-after-spec
(which enables my workaround for #21851 (closed)).
The implementation is not at all polished or documented and currently only works on x86. I am not even sure if the implementation approach I’ve taken is a particularly good one! But it nevertheless proves that GHC is leaving a lot of performance on the table.