WIP: Make bdescr more dense
Currently the block descriptor (`bdescr`) struct is sized to take a single
cache line. This is nice as it eliminates any possibility of false sharing
between GC threads. On the other hand, there is a lot of redundancy in this
representation:

- The `start` field is completely redundant and can be computed from the block descriptor's address (see the sketch below)
- The `gen` and `gen_no` fields are redundant: `bd->gen == &generations[bd->gen_no]` (#15414)
- The `free` field is a pointer but can only range over a single megablock
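To see why `start` is redundant: block descriptors live in an array at the base of each megablock, so a descriptor's own address already encodes which block it describes. A minimal sketch of the recovery, assuming 1 MB megablocks, 4 kB blocks, and 64-byte descriptors (the constants and function name here are illustrative, not the actual code in this patch):

```c
#include <stdint.h>

#define MBLOCK_SHIFT  20                          /* 1 MB megablocks (assumed) */
#define MBLOCK_MASK   ((1UL << MBLOCK_SHIFT) - 1)
#define BLOCK_SHIFT   12                          /* 4 kB blocks (assumed)     */
#define BDESCR_SHIFT  6                           /* 64-byte descriptors       */

/* Recover the start address of the block described by the descriptor at
 * `bd`.  The descriptor's offset within its megablock, scaled up from
 * descriptor-sized units to block-sized units, gives the block's offset
 * within the same megablock. */
static inline void *block_start(const void *bd)
{
    uintptr_t mblock = (uintptr_t)bd & ~MBLOCK_MASK;   /* megablock base */
    uintptr_t off    = ((uintptr_t)bd & MBLOCK_MASK)
                           << (BLOCK_SHIFT - BDESCR_SHIFT);
    return (void *)(mblock | off);
}
```

This is just the inverse of the pointer-to-descriptor mapping the block allocator already performs, so it costs only a mask and a shift.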
Given how often we look at block descriptors and how brutal GHC's runtime tends to be on caches, I hypothesized that we would be better off with a denser (but slightly more computationally intensive) representation. I test this hypothesis here. In short I:
- eliminate the `start` field entirely
- drop the `gen` field in favor of the much smaller `gen_no`
- turn the `free` field into a 32-bit offset
- rearrange the struct to pack nicely into half a cache line (sketched below)
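Concretely, the resulting descriptor can fit in 32 bytes, half of a 64-byte cache line. The layout and accessor names below are an illustrative sketch of the idea rather than the patch itself (`generation` is stubbed and the field order is assumed):

```c
#include <stdint.h>

#define MBLOCK_MASK ((1UL << 20) - 1)    /* 1 MB megablocks (assumed) */

typedef struct { int n; } generation;    /* stub for the RTS type     */
extern generation generations[];

typedef struct bdescr_ {
    struct bdescr_ *link;      /* chain of block groups          8 bytes */
    void           *u;         /* back/bitmap/scan union         8 bytes */
    uint32_t        free_off;  /* `free`, as a megablock offset  4 bytes */
    uint32_t        blocks;    /* blocks in this group           4 bytes */
    uint16_t        gen_no;    /* replaces the `gen` pointer     2 bytes */
    uint16_t        dest_no;   /* destination generation         2 bytes */
    uint16_t        node;      /* NUMA node                      2 bytes */
    uint16_t        flags;     /* block flags                    2 bytes */
} bdescr;                      /* = 32 bytes on LP64                     */

/* The dropped fields are recomputed on demand: */
static inline generation *bdescr_gen(const bdescr *bd)
{
    return &generations[bd->gen_no];     /* the #15414 observation */
}

static inline void *bdescr_free(const bdescr *bd)
{
    /* A descriptor lives inside the megablock it describes, so its own
     * address yields the megablock base; `free` can then be a 32-bit
     * offset because it never leaves that megablock. */
    return (void *)(((uintptr_t)bd & ~MBLOCK_MASK) + bd->free_off);
}
```

The extra masks and shifts in these accessors are the "slightly more computationally intensive" part of the trade-off; the win is that twice as many descriptors fit in a given amount of cache.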
## Performance impact
On the whole this makes no difference for programs with only a small
residency, despite the small bump in instruction count that the extra
computation requires. For instance, the mean cycle count and time across the
nofib tests (even when run under the `slow` speed) are statistically
unchanged. Some representative results (`shrink` here refers to the last
patch in this branch; the sample size is 5 and the σ columns give the
standard error of the mean):
test | metric | baseline | shrink | σ_baseline | σ_shrink |
---|---|---|---|---|---|
gc/treejoin | run//perf//cache-misses | 1.21e+06 | 1.14e+06 | 2.96e+04 | 7.57e+03 |
gc/treejoin | run//perf//cycles | 4.53e+08 | 4.44e+08 | 3.99e+06 | 7.05e+06 |
gc/treejoin | run//perf//instructions | 7.97e+08 | 8.06e+08 | 7.85e+04 | 4.23e+05 |
parallel/matmult | run//perf//cache-misses | 3.18e+05 | 3.5e+05 | 5.17e+04 | 4.71e+04 |
parallel/matmult | run//perf//cycles | 1.01e+10 | 1.1e+10 | 4.95e+07 | 3.47e+08 |
parallel/matmult | run//perf//instructions | 1.19e+10 | 1.19e+10 | 8.27e+04 | 3e+05 |
real/fem | run//perf//cache-misses | 4.15e+04 | 3.57e+04 | 3.18e+03 | 7.38e+03 |
real/fem | run//perf//cycles | 3.38e+09 | 3.3e+09 | 1.35e+07 | 5.41e+06 |
real/fem | run//perf//instructions | 6.83e+09 | 6.85e+09 | 1.05e+05 | 1.43e+05 |
spectral/fibheaps | run//perf//cache-misses | 4.6e+04 | 4.46e+04 | 3.31e+03 | 3.6e+03 |
spectral/fibheaps | run//perf//cycles | 5.61e+09 | 5.58e+09 | 1.38e+07 | 1.59e+07 |
spectral/fibheaps | run//perf//instructions | 1.05e+10 | 1.06e+10 | 1e+05 | 8.76e+04 |
spectral/treejoin | run//perf//cache-misses | 2.26e+05 | 2.24e+05 | 1.72e+04 | 3.21e+04 |
spectral/treejoin | run//perf//cycles | 9.4e+09 | 9.39e+09 | 3.57e+07 | 3.26e+07 |
spectral/treejoin | run//perf//instructions | 2.1e+10 | 2.12e+10 | 9.67e+06 | 1.19e+07 |
However, for programs with large heaps and significant GC time (e.g. `ghc`)
this can improve performance by a few percent:
test | metric | baseline | shrink | σ_baseline | σ_shrink |
---|---|---|---|---|---|
build-Cabal | run//perf//cache-misses | 5.14e+09 | 5.08e+09 | 1.02e+07 | 1.54e+07 |
build-Cabal | run//perf//cycles | 8.23e+11 | 8.07e+11 | 9.13e+09 | 9.37e+09 |
build-Cabal | run//perf//instructions | 7.07e+11 | 7.11e+11 | 1.58e+09 | 6.32e+08 |
build-Cabal | run//rts//average_bytes_used | 3.12e+08 | 3.14e+08 | 1.05e+06 | 2.97e+06 |
build-containers | run//perf//cache-misses | 1.11e+08 | 1.08e+08 | 4.9e+05 | 5.4e+05 |
build-containers | run//perf//cycles | 2.47e+10 | 2.49e+10 | 4.83e+08 | 5.34e+08 |
build-containers | run//perf//instructions | 2.25e+10 | 2.25e+10 | 2.69e+07 | 2.99e+07 |
build-containers | run//rts//average_bytes_used | 3.85e+07 | 3.85e+07 | 1.46e+05 | 1.19e+05 |
These tests examine building `Cabal` (namely its `Setup.hs`, which pulls in a
good fraction of the package's modules) and `containers` (namely `Data.Map`),
both with `-O`. Here we see that the `containers` build, which has an average
residency of approximately 30 MB, is unchanged, whereas the `Cabal` build,
whose average residency is an order of magnitude larger, improves in cycle
count by roughly 2% (with 0.5% error bars) despite an overall increase in
instruction count. This reduction in cycle count also manifests in the
overall runtime.