!2: OccAnal: rewrite imp_rule_edges to strict left fold · Merge requests · jeffrey young / GHC

jeffrey young requested to merge wip/occanal-unrolled-intmap into master Oct 14, 2021

Please take a few moments to verify that your commits fulfill the following:

are either individually buildable or squashed
have commit messages which describe what they do (referring to [Notes][notes] and tickets using #NNNN syntax when appropriate)
have added source comments describing your change. For larger changes you likely should add a [Note][notes] and cross-reference it from the relevant places.

The MR

This is a small MR that does two things:

Avoids a lot of small IntMaps in OccAnal.hs by using a list and then building an IntMap once a threshold is reached.
Converts most of the lazy foldrs to strict left folds. Changing the foldr in occAnalRecBind creates a compiler panic so I've left this as a right fold but made the accumulator strict. The idea here is that foldl's can be tail call optimized while foldr' can't, furthermore using foldr over a list is fine if we are going to take advantage of laziness however I didn't see any of that occurring in the source so I opted for foldl' instead.

Evidence

I've experimented with several thresholds and set the threshold value to 100 based on my coarse grained perf profiling. The threshold is the number of conss onto a temporary list, when the list becomes greater than the threshold I convert to an intmap and merge to the intmap accumulator. Somewhat similar to a banker's queue. Here is some perf data, all data represents building cabal 5 times. I ensured a quiet machine (but did not begin to disable systemd units):

master

      Performance counter stats for '_build/stage1/bin/ghc -package parsec -fforce-recomp -O2 -XHaskell2010 -hidir _cabal_out -odir _cabal_out -ilibraries/Cabal/Cabal -ilibraries/Cabal/Cabal/src libraries/Cabal/Cabal/Setup.hs +RTS -s -RTS' (5 runs):

             159,662.54 msec task-clock:u              #    1.001 CPUs utilized            ( +-  0.14% )
                      0      context-switches:u        #    0.000 K/sec
                      0      cpu-migrations:u          #    0.000 K/sec
              1,532,566      page-faults:u             #    0.010 M/sec                    ( +-  0.10% )
        625,276,440,487      cycles:u                  #    3.916 GHz                      ( +-  0.14% )
         42,627,090,249      stalled-cycles-frontend:u #    6.82% frontend cycles idle     ( +-  0.54% )
        103,217,939,258      stalled-cycles-backend:u  #   16.51% backend cycles idle      ( +-  0.16% )
        619,944,486,999      instructions:u            #    0.99  insn per cycle
                                                       #    0.17  stalled cycles per insn  ( +-  0.05% )
        114,745,551,865      branches:u                #  718.675 M/sec                    ( +-  0.05% )
          7,596,432,854      branch-misses:u           #    6.62% of all branches          ( +-  0.24% )

                159.580 +- 0.307 seconds time elapsed  ( +-  0.19% )

Notice the 619 Billion instructions and 625 Billion cycles

Threshold == 1k

       Performance counter stats for '_build/stage1/bin/ghc -package parsec -fforce-recomp -XHaskell2010 -O2 -hidir _cabal_out -odir _cabal_out -ilibraries/Cabal/Cabal -ilibraries/Cabal/Cabal/src libraries/Cabal/Cabal/Setup.hs +RTS -s -RTS' (5 runs):

              157,875.69 msec task-clock:u              #    1.000 CPUs utilized            ( +-  0.29% )
                       0      context-switches:u        #    0.000 K/sec
                       0      cpu-migrations:u          #    0.000 K/sec
               1,520,663      page-faults:u             #    0.010 M/sec                    ( +-  0.27% )
         618,000,123,600      cycles:u                  #    3.914 GHz                      ( +-  0.29% )
          37,215,328,497      stalled-cycles-frontend:u #    6.02% frontend cycles idle     ( +-  0.41% )
         108,047,052,325      stalled-cycles-backend:u  #   17.48% backend cycles idle      ( +-  0.40% )
         609,482,076,883      instructions:u            #    0.99  insn per cycle
                                                        #    0.18  stalled cycles per insn  ( +-  0.26% )
         112,938,114,482      branches:u                #  715.361 M/sec                    ( +-  0.26% )
           7,524,732,889      branch-misses:u           #    6.66% of all branches          ( +-  0.33% )

                 157.808 +- 0.462 seconds time elapsed ( +- 0.29% )

618B cycles and 609B instructions

threshold == 100 (This is what the MR is set to)

      Performance counter stats for '_build/stage1/bin/ghc -package parsec -fforce-recomp -XHaskell2010 -O2 -hidir _cabal_out -odir _cabal_out -ilibraries/Cabal/Cabal -ilibraries/Cabal/Cabal/src libraries/Cabal/Cabal/Setup.hs +RTS -s -RTS' (5 runs):

             157,733.47 msec task-clock:u              #    1.001 CPUs utilized            ( +-  0.04% )
                      0      context-switches:u        #    0.000 K/sec
                      0      cpu-migrations:u          #    0.000 K/sec
              1,520,975      page-faults:u             #    0.010 M/sec                    ( +-  0.25% )
        617,612,103,395      cycles:u                  #    3.916 GHz                      ( +-  0.04% )
         39,081,768,629      stalled-cycles-frontend:u #    6.33% frontend cycles idle     ( +-  0.50% )
        105,424,311,382      stalled-cycles-backend:u  #   17.07% backend cycles idle      ( +-  0.13% )
        608,298,554,062      instructions:u            #    0.98  insn per cycle
                                                       #    0.17  stalled cycles per insn  ( +-  0.12% )
        112,703,271,951      branches:u                #  714.517 M/sec                    ( +-  0.12% )
          7,499,504,558      branch-misses:u           #    6.65% of all branches          ( +-  0.02% )

               157.5636 +- 0.0694 seconds time elapsed  ( +-  0.04% )

617B cycles and 608B instructions

threshold == 10

      Performance counter stats for '_build/stage1/bin/ghc -package parsec -fforce-recomp -XHaskell2010 -O2 -hidir _cabal_out -odir _cabal_out -ilibraries/Cabal/Cabal -ilibraries/Cabal/Cabal/src libraries/Cabal/Cabal/Setup.hs +RTS -s -RTS' (5 runs):

             158,364.96 msec task-clock:u              #    1.001 CPUs utilized            ( +-  0.11% )
                      0      context-switches:u        #    0.000 K/sec
                      0      cpu-migrations:u          #    0.000 K/sec
              1,514,619      page-faults:u             #    0.010 M/sec                    ( +-  0.20% )
        619,934,326,827      cycles:u                  #    3.915 GHz                      ( +-  0.12% )
         38,944,131,119      stalled-cycles-frontend:u #    6.28% frontend cycles idle     ( +-  0.63% )
        105,754,507,663      stalled-cycles-backend:u  #   17.06% backend cycles idle      ( +-  0.17% )
        610,812,808,013      instructions:u            #    0.99  insn per cycle
                                                       #    0.17  stalled cycles per insn  ( +-  0.13% )
        113,187,126,373      branches:u                #  714.723 M/sec                    ( +-  0.12% )
          7,520,629,013      branch-misses:u           #    6.64% of all branches          ( +-  0.14% )

                158.188 +- 0.190 seconds time elapsed ( +- 0.12% )

619B cycles and 610B instructions

Apparently shaving off 2 billion cycles yields a reduction of 2 seconds when compiling cabal. Somewhat disappointing but less instructions and cycles are better in my opinion.

Please let me know if any other changes are requested, or if anyone is aware of other cites where this sort of small optimization could be useful.

OccAnal: rewrite imp_rule_edges to strict left fold