OccAnal: rewrite imp_rule_edges to strict left fold
Please take a few moments to verify that your commits fulfill the following:
-
are either individually buildable or squashed -
have commit messages which describe what they do (referring to [Notes][notes] and tickets using #NNNN
syntax when appropriate) -
have added source comments describing your change. For larger changes you likely should add a [Note][notes] and cross-reference it from the relevant places.
The MR
This is a small MR that does two things:
- Avoids a lot of small
IntMap
s inOccAnal.hs
by using a list and then building anIntMap
once a threshold is reached. - Converts most of the lazy
foldr
s to strict left folds. Changing thefoldr
inoccAnalRecBind
creates a compiler panic so I've left this as a right fold but made the accumulator strict. The idea here is thatfoldl'
s can be tail call optimized whilefoldr'
can't, furthermore usingfoldr
over a list is fine if we are going to take advantage of laziness however I didn't see any of that occurring in the source so I opted forfoldl'
instead.
Evidence
I've experimented with several thresholds and set the threshold value to 100 based on my coarse grained perf
profiling. The threshold is the number of cons
s onto a temporary list, when the list becomes greater than the threshold I convert to an intmap and merge to the intmap accumulator. Somewhat similar to a banker's queue. Here is some perf
data, all data represents building cabal 5 times. I ensured a quiet machine (but did not begin to disable systemd units):
master
Performance counter stats for '_build/stage1/bin/ghc -package parsec -fforce-recomp -O2 -XHaskell2010 -hidir _cabal_out -odir _cabal_out -ilibraries/Cabal/Cabal -ilibraries/Cabal/Cabal/src libraries/Cabal/Cabal/Setup.hs +RTS -s -RTS' (5 runs):
159,662.54 msec task-clock:u # 1.001 CPUs utilized ( +- 0.14% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
1,532,566 page-faults:u # 0.010 M/sec ( +- 0.10% )
625,276,440,487 cycles:u # 3.916 GHz ( +- 0.14% )
42,627,090,249 stalled-cycles-frontend:u # 6.82% frontend cycles idle ( +- 0.54% )
103,217,939,258 stalled-cycles-backend:u # 16.51% backend cycles idle ( +- 0.16% )
619,944,486,999 instructions:u # 0.99 insn per cycle
# 0.17 stalled cycles per insn ( +- 0.05% )
114,745,551,865 branches:u # 718.675 M/sec ( +- 0.05% )
7,596,432,854 branch-misses:u # 6.62% of all branches ( +- 0.24% )
159.580 +- 0.307 seconds time elapsed ( +- 0.19% )
Notice the 619 Billion instructions and 625 Billion cycles
Threshold == 1k
Performance counter stats for '_build/stage1/bin/ghc -package parsec -fforce-recomp -XHaskell2010 -O2 -hidir _cabal_out -odir _cabal_out -ilibraries/Cabal/Cabal -ilibraries/Cabal/Cabal/src libraries/Cabal/Cabal/Setup.hs +RTS -s -RTS' (5 runs):
157,875.69 msec task-clock:u # 1.000 CPUs utilized ( +- 0.29% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
1,520,663 page-faults:u # 0.010 M/sec ( +- 0.27% )
618,000,123,600 cycles:u # 3.914 GHz ( +- 0.29% )
37,215,328,497 stalled-cycles-frontend:u # 6.02% frontend cycles idle ( +- 0.41% )
108,047,052,325 stalled-cycles-backend:u # 17.48% backend cycles idle ( +- 0.40% )
609,482,076,883 instructions:u # 0.99 insn per cycle
# 0.18 stalled cycles per insn ( +- 0.26% )
112,938,114,482 branches:u # 715.361 M/sec ( +- 0.26% )
7,524,732,889 branch-misses:u # 6.66% of all branches ( +- 0.33% )
157.808 +- 0.462 seconds time elapsed ( +- 0.29% )
618B cycles and 609B instructions
threshold == 100 (This is what the MR is set to)
Performance counter stats for '_build/stage1/bin/ghc -package parsec -fforce-recomp -XHaskell2010 -O2 -hidir _cabal_out -odir _cabal_out -ilibraries/Cabal/Cabal -ilibraries/Cabal/Cabal/src libraries/Cabal/Cabal/Setup.hs +RTS -s -RTS' (5 runs):
157,733.47 msec task-clock:u # 1.001 CPUs utilized ( +- 0.04% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
1,520,975 page-faults:u # 0.010 M/sec ( +- 0.25% )
617,612,103,395 cycles:u # 3.916 GHz ( +- 0.04% )
39,081,768,629 stalled-cycles-frontend:u # 6.33% frontend cycles idle ( +- 0.50% )
105,424,311,382 stalled-cycles-backend:u # 17.07% backend cycles idle ( +- 0.13% )
608,298,554,062 instructions:u # 0.98 insn per cycle
# 0.17 stalled cycles per insn ( +- 0.12% )
112,703,271,951 branches:u # 714.517 M/sec ( +- 0.12% )
7,499,504,558 branch-misses:u # 6.65% of all branches ( +- 0.02% )
157.5636 +- 0.0694 seconds time elapsed ( +- 0.04% )
617B cycles and 608B instructions
threshold == 10
Performance counter stats for '_build/stage1/bin/ghc -package parsec -fforce-recomp -XHaskell2010 -O2 -hidir _cabal_out -odir _cabal_out -ilibraries/Cabal/Cabal -ilibraries/Cabal/Cabal/src libraries/Cabal/Cabal/Setup.hs +RTS -s -RTS' (5 runs):
158,364.96 msec task-clock:u # 1.001 CPUs utilized ( +- 0.11% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
1,514,619 page-faults:u # 0.010 M/sec ( +- 0.20% )
619,934,326,827 cycles:u # 3.915 GHz ( +- 0.12% )
38,944,131,119 stalled-cycles-frontend:u # 6.28% frontend cycles idle ( +- 0.63% )
105,754,507,663 stalled-cycles-backend:u # 17.06% backend cycles idle ( +- 0.17% )
610,812,808,013 instructions:u # 0.99 insn per cycle
# 0.17 stalled cycles per insn ( +- 0.13% )
113,187,126,373 branches:u # 714.723 M/sec ( +- 0.12% )
7,520,629,013 branch-misses:u # 6.64% of all branches ( +- 0.14% )
158.188 +- 0.190 seconds time elapsed ( +- 0.12% )
619B cycles and 610B instructions
Apparently shaving off 2 billion cycles yields a reduction of 2 seconds when compiling cabal. Somewhat disappointing but less instructions and cycles are better in my opinion.
Please let me know if any other changes are requested, or if anyone is aware of other cites where this sort of small optimization could be useful.