Code generation should not use the pretty-printer

Currently, GHC uses SDoc to serialize assembly generated by the backend during codegen. This is needlessly expensive: SDoc is an implementation of Hughes–Jones pretty-printing combinators, which are designed primarily to nicely print things that need complex two-dimensional layout (such as Haskell code). It is not designed for high throughput serialization of megabytes’ worth of data intended exclusively for machine consumption, but that is what the backend is using it for.

This choice has a real and significant performance cost, particularly on programs that generate a lot of code. Recently, I’ve been using the following program as an extreme test case:

module M where

data T
  = C000 Int
  | C001 Int
  | C002 Int
  | C003 Int
  | C004 Int
  | C005 Int
  ... -- etc.
  | C997 Int
  | C998 Int
  | C999 Int
  deriving (Read)

This seems to be a pretty good test of the code generator, as it takes fairly little time to typecheck but generates a very large amount of code (and code size is not significantly improved by -O). On my machine, using GHC HEAD, it generates a 4.7M .o file but only a 90K .hi file!

Building GHC with profiling (and inserting all cost centres by hand to avoid optimization changes) reveals that GHC spends roughly 15% of its time and 17% of its allocations inside pprNativeCode. Now, sure, at the end of the day, there’s a lot of code to print, so any implementation is going to take a decent chunk of time, but the current implementation is very much not I/O-bound: only about 0.2% of the total execution is spent inside hPutBuf (measured via the eventlog, not cost centre profiling). Can we do better?

The answer is yes. I have built a proof-of-concept reimplementation of the subset of the SDoc API used by the x86 pretty-printer, which I have christened FDoc (for fast). Its definition is extremely simple:

newtype FDoc = FDoc' { runFDoc :: SDocContext -> BufHandle -> IO () }

pattern FDoc :: (SDocContext -> BufHandle -> IO ()) -> FDoc
pattern FDoc f <- FDoc' f
  where FDoc f = FDoc' (oneShot (\ctx -> oneShot (\h -> f ctx h)))

The x86 pretty-printer does use some Outputable instances and other existing functions for printing simple datatypes, and it would be unfortunate not to continue to share code with those types. For that reason, I have also created a simple IsDoc typeclass that allows those functions to be parameterized over the choice of implementation. Since specialization of those definitions to FDoc is crucial for performance, I have inserted SPECIALIZE pragmas in the appropriate places to ensure it occurs, and I’ve also implemented a simple workaround for #21851 (closed) to ensure they actually work.

The results speak for themselves: compiling the above program with my branch results in an 8% reduction in compile time and a 14% reduction in allocations. Compare the relevant lines from the top of the .prof output:

COST CENTRE      MODULE        %time %alloc  ticks     bytes
pprNativeCode    GHC.CmmToAsm   15.0   16.7    336 633278872
pprNativeCodeF   GHC.CmmToAsm    8.5    3.6    176 116213856

Hard to argue with that!

Perf test results

Maybe you think my example program above is too artificial to be meaningful—which is fair. In that case, I direct your attention to the compile-time perf test results:

Metrics: compile_time/bytes allocated
-------------------------------------

                                                    Baseline                                        
                               Test    Metric          value      New value Change                  
----------------------------------------------------------------------------------------------------
                 CoOpt_Read(normal) ghc/alloc    834,316,696    798,903,488  -4.2% GOOD
           CoOpt_Singletons(normal) ghc/alloc    940,293,640    920,246,816  -2.1% GOOD
           InstanceMatching(normal) ghc/alloc  5,067,193,136  5,067,297,040  +0.0%                  
          InstanceMatching1(normal) ghc/alloc 27,413,499,160 27,414,265,712  +0.0%                  
                LargeRecord(normal) ghc/alloc  6,046,198,760  6,017,371,936  -0.5%                  
           ManyAlternatives(normal) ghc/alloc    730,153,424    652,233,864 -10.7% GOOD
           ManyConstructors(normal) ghc/alloc  3,916,818,240  3,358,774,128 -14.2% GOOD
      MultiComponentModules(normal) ghc/alloc  1,823,975,360  1,825,185,360  +0.1%                  
MultiComponentModulesRecomp(normal) ghc/alloc    778,488,720    779,331,952  +0.1%                  
          MultiLayerModules(normal) ghc/alloc  3,106,772,728  3,090,409,616  -0.5%                  
    MultiLayerModulesRecomp(normal) ghc/alloc    862,765,832    862,793,560  +0.0%                  
                  PmSeriesG(normal) ghc/alloc     43,850,400     43,743,536  -0.2%                  
                  PmSeriesS(normal) ghc/alloc     53,321,568     51,700,112  -3.0%                  
                  PmSeriesT(normal) ghc/alloc     75,761,432     73,342,400  -3.2%                  
                  PmSeriesV(normal) ghc/alloc     52,610,360     51,223,784  -2.6%                  
                     T10421(normal) ghc/alloc    112,318,600    109,044,520  -2.9% GOOD
                    T10421a(normal) ghc/alloc     78,632,528     76,369,312  -2.9%                  
                     T10547(normal) ghc/alloc     28,120,208     28,205,632  +0.3%                  
                     T10858(normal) ghc/alloc    131,654,520    128,436,176  -2.4%                  
                     T11195(normal) ghc/alloc    236,392,160    220,246,248  -6.8%                  
                     T11276(normal) ghc/alloc     96,422,064     91,458,728  -5.1%                  
                    T11303b(normal) ghc/alloc     40,033,248     38,983,120  -2.6%                  
                     T11374(normal) ghc/alloc    191,151,152    173,878,208  -9.0%                  
                     T11545(normal) ghc/alloc     85,428,920     83,892,488  -1.8%                  
                     T11822(normal) ghc/alloc    105,481,088    100,198,504  -5.0%                  
                     T12150(optasm) ghc/alloc     80,091,192     79,574,328  -0.6%                  
                     T12227(normal) ghc/alloc    480,801,096    478,119,288  -0.6%                  
                     T12234(optasm) ghc/alloc     56,239,272     55,211,352  -1.8%                  
                     T12425(optasm) ghc/alloc     89,000,008     86,835,496  -2.4% GOOD
                     T12545(normal) ghc/alloc  1,614,836,736  1,581,737,536  -2.0%                  
                     T12707(normal) ghc/alloc    943,198,512    835,765,336 -11.4% GOOD
                     T13035(normal) ghc/alloc    101,145,384     93,610,384  -7.4% GOOD
                     T13056(optasm) ghc/alloc    348,738,104    337,820,976  -3.1% GOOD
                     T13253(normal) ghc/alloc    344,001,760    333,103,016  -3.2% GOOD
                 T13253-spj(normal) ghc/alloc    124,757,880    122,754,856  -1.6%                  
                     T13379(normal) ghc/alloc    344,059,528    297,105,760 -13.6% GOOD
                     T13386(normal) ghc/alloc    867,246,488    867,340,512  +0.0%                  
                     T13701(normal) ghc/alloc  2,401,607,832  2,379,588,816  -0.9%                  
                     T13719(normal) ghc/alloc  5,133,368,464  5,068,763,048  -1.3%                  
                       T14052(ghci) ghc/alloc  3,623,898,856  3,623,608,696  -0.0%                  
                   T14052Type(ghci) ghc/alloc  6,318,686,368  6,318,689,368  +0.0%                  
                     T14683(normal) ghc/alloc  2,822,724,976  2,741,515,224  -2.9% GOOD
                     T14697(normal) ghc/alloc    348,999,856    346,627,920  -0.7%                  
                     T14766(normal) ghc/alloc    968,033,672    964,491,704  -0.4%                  
                     T15164(normal) ghc/alloc  1,296,077,648  1,259,272,608  -2.8% GOOD
                     T15304(normal) ghc/alloc  1,291,102,856  1,264,968,520  -2.0%                  
                     T15630(normal) ghc/alloc    160,882,768    155,745,896  -3.2%                  
                     T15703(normal) ghc/alloc    517,521,464    509,034,712  -1.6% GOOD
                     T16190(normal) ghc/alloc    276,731,064    275,985,352  -0.3%                  
                     T16577(normal) ghc/alloc  8,008,824,896  7,805,268,512  -2.5% GOOD
                     T16875(normal) ghc/alloc     34,507,936     34,493,032  -0.0%                  
                     T17096(normal) ghc/alloc    216,684,928    199,022,608  -8.2%                  
                     T17516(normal) ghc/alloc  1,814,902,000  1,746,721,712  -3.8%                  
                     T17836(normal) ghc/alloc    831,324,296    814,822,504  -2.0%                  
                    T17836b(normal) ghc/alloc     45,100,928     44,602,128  -1.1%                  
                     T17977(normal) ghc/alloc     39,992,784     39,109,248  -2.2%                  
                    T17977b(normal) ghc/alloc     37,763,552     37,021,296  -2.0%                  
                     T18140(normal) ghc/alloc     75,919,896     73,776,360  -2.8% GOOD
                     T18223(normal) ghc/alloc    917,679,264    917,794,784  +0.0%                  
                     T18282(normal) ghc/alloc    149,694,848    144,274,816  -3.6% GOOD
                     T18304(normal) ghc/alloc     77,386,584     75,588,928  -2.3% GOOD
                     T18478(normal) ghc/alloc    497,442,784    462,063,968  -7.1%                  
                    T18698a(normal) ghc/alloc    200,411,024    195,535,728  -2.4% GOOD
                    T18698b(normal) ghc/alloc    221,748,200    216,613,120  -2.3% GOOD
                     T18923(normal) ghc/alloc     66,822,240     65,274,848  -2.3% GOOD
                      T1969(normal) ghc/alloc    757,023,352    709,342,728  -6.3% GOOD
                     T19695(normal) ghc/alloc  1,447,007,976  1,417,865,384  -2.0% GOOD
                     T20049(normal) ghc/alloc     90,615,712     87,490,976  -3.4% GOOD
                     T20261(normal) ghc/alloc    599,778,488    571,749,264  -4.7%                  
                      T3064(normal) ghc/alloc    179,712,856    171,276,088  -4.7% GOOD
                      T3294(normal) ghc/alloc  1,613,420,072  1,238,737,800 -23.2% GOOD
                      T4801(normal) ghc/alloc    300,937,032    260,468,904 -13.4% GOOD
                      T5030(normal) ghc/alloc    350,472,056    344,260,232  -1.8%                  
                    T5321FD(normal) ghc/alloc    249,406,896    226,267,976  -9.3% GOOD
                   T5321Fun(normal) ghc/alloc    277,355,216    254,469,384  -8.3% GOOD
                      T5631(normal) ghc/alloc    530,485,000    513,169,032  -3.3% GOOD
                      T5642(normal) ghc/alloc    469,336,744    454,260,192  -3.2% GOOD
                      T5837(normal) ghc/alloc     35,715,208     35,533,408  -0.5%                  
                      T6048(optasm) ghc/alloc    102,030,544     99,036,056  -2.9% GOOD
                       T783(normal) ghc/alloc    383,823,952    349,182,760  -9.0% GOOD
                      T8095(normal) ghc/alloc  3,158,851,176  3,157,478,568  -0.0%                  
                      T9020(optasm) ghc/alloc    240,738,200    240,763,272  +0.0%                  
                      T9198(normal) ghc/alloc     45,983,824     43,555,504  -5.3% GOOD
                      T9233(normal) ghc/alloc    719,275,416    704,235,328  -2.1% GOOD
                      T9630(normal) ghc/alloc  1,523,816,720  1,502,597,208  -1.4%                  
                      T9675(optasm) ghc/alloc    437,959,392    431,927,408  -1.4%                  
                     T9872a(normal) ghc/alloc  1,667,045,816  1,667,142,368  +0.0%                  
                     T9872b(normal) ghc/alloc  1,920,713,560  1,920,802,464  +0.0%                  
               T9872b_defer(normal) ghc/alloc  2,954,095,232  2,952,954,960  -0.0%                  
                     T9872c(normal) ghc/alloc  1,625,876,288  1,625,975,944  +0.0%                  
                     T9872d(normal) ghc/alloc    398,157,568    394,765,728  -0.9%                  
                      T9961(normal) ghc/alloc    358,331,496    346,361,232  -3.3% GOOD
       TcPlugin_RewritePerf(normal) ghc/alloc  2,119,899,416    208,657,088 -90.2% GOOD
                      WWRec(normal) ghc/alloc    620,009,784    604,621,528  -2.5% GOOD
             hard_hole_fits(normal) ghc/alloc    431,498,744    425,808,584  -1.3%                  
                     hie002(normal) ghc/alloc  9,043,362,224  9,045,110,440  +0.0%                  
                 parsing001(normal) ghc/alloc    451,463,344    451,518,952  +0.0%                  
                                                                                                    
                          geo. mean                                          -5.5%                  
                          minimum                                           -90.2%                  
                          maximum                                            +0.3%                  

* All baselines were measured in the same environment as this test run
* All baseline commits are 460505345e

Metrics: compile_time/max_bytes_used
------------------------------------

                                             Baseline                      
                           Test  Metric         value     New value Change 
---------------------------------------------------------------------------
MultiLayerModulesDefsGhci(ghci) ghc/max    79,437,944    79,455,504  +0.0% 
  MultiLayerModulesNoCode(ghci) ghc/max    44,084,768    44,097,016  +0.0% 
                 T10370(optasm) ghc/max    23,804,688    23,813,968  +0.0% 
                 T11545(normal) ghc/max    10,688,272    10,711,464  +0.2% 
                 T15304(normal) ghc/max    27,649,584    27,661,816  +0.0% 
                 T15630(normal) ghc/max     8,919,080     8,913,808  -0.1% 
                T18698a(normal) ghc/max    13,604,312    13,645,072  +0.3% 
                T18698b(normal) ghc/max    13,705,760    13,737,560  +0.2% 
                  T1969(normal) ghc/max    19,613,520    19,610,096  -0.0% 
                 T20261(normal) ghc/max    38,491,744    38,529,448  +0.1% 
                  T3064(normal) ghc/max    14,013,440    13,430,664  -4.2% 
                  T3294(normal) ghc/max    33,041,704    33,043,208  +0.0% 
                  T9630(normal) ghc/max    43,353,416    43,253,488  -0.2% 
                  T9675(optasm) ghc/max    16,309,680    16,147,640  -1.0% 
                                                                           
                      geo. mean                                      -0.3% 
                      minimum                                        -4.2% 
                      maximum                                        +0.3% 

* All baselines were measured in the same environment as this test run
* All baseline commits are 460505345e

Metrics: compile_time/peak_megabytes_allocated
----------------------------------------------

                                              Baseline                      
                           Test   Metric         value     New value Change 
----------------------------------------------------------------------------
MultiLayerModulesDefsGhci(ghci) ghc/peak           210           210  +0.0% 
  MultiLayerModulesNoCode(ghci) ghc/peak           116           116  +0.0% 
                 T10370(optasm) ghc/peak            60            60  +0.0% 
                 T11545(normal) ghc/peak            28            28  +0.0% 
                 T15304(normal) ghc/peak            73            73  +0.0% 
                 T15630(normal) ghc/peak            25            26  +4.0% 
                T18698a(normal) ghc/peak            36            36  +0.0% 
                T18698b(normal) ghc/peak            37            37  +0.0% 
                  T1969(normal) ghc/peak            50            50  +0.0% 
                 T20261(normal) ghc/peak            92            93  +1.1% 
                  T3064(normal) ghc/peak            36            36  +0.0% 
                  T3294(normal) ghc/peak            89            89  +0.0% 
                  T9630(normal) ghc/peak           111           110  -0.9% 
                  T9675(optasm) ghc/peak            41            43  +4.9% 
                                                                            
                      geo. mean                                       +0.6% 
                      minimum                                         -0.9% 
                      maximum                                         +4.9% 

* All baselines were measured in the same environment as this test run
* All baseline commits are 460505345e

The -90.2% result on TcPlugin_RewritePerf seems probably too good to be true—I haven’t investigated it yet—but if we ignore that outlier, there are still a lot of tests with savings of 10–20%!

Try it yourself

My branch is available as lexi.lambda/ghc:codegen-without-sdoc. The test suite results are from hadrian/build-cabal --flavour=perf and the profiling results are from --flavour=perf+profiled_ghc+no_dynamic_ghc. My branch includes some minor patches to Hadrian to disable -fprof-late (which I didn’t find helpful) and to enable compiling GHC with -fsimpl-after-spec (which enables my workaround for #21851 (closed)).

The implementation is not at all polished or documented and currently only works on x86. I am not even sure if the implementation approach I’ve taken is a particularly good one! But it nevertheless proves that GHC is leaving a lot of performance on the table.

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information