WIP: Make bdescr more dense
Currently the block descriptor (`bdescr`) struct is sized to take a single
cache line. This is nice as it eliminates any possibility of false sharing
between GC threads. On the other hand, there is a lot of redundancy in this
representation:
- The `start` field is completely redundant and can be computed from the block descriptor's address (sketched below)
- The `gen` and `gen_no` fields are redundant: `bd->gen == &generations[bd->gen_no]` (#15414)
- The `free` field is a pointer but can only range over a single megablock
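To make the first point concrete, here is a minimal sketch of the address arithmetic, assuming GHC's standard block layout (4 KB blocks, 1 MB megablocks, 64-byte descriptors packed at the base of each megablock, so the i-th descriptor describes the i-th block). The constants and the helper name are illustrative, not the RTS's own:

```c
#include <stdint.h>

#define BLOCK_SHIFT   12  /* 4 KB blocks */
#define MBLOCK_SHIFT  20  /* 1 MB megablocks */
#define BDESCR_SHIFT   6  /* 64-byte block descriptors */
#define MBLOCK_MASK   (((uintptr_t)1 << MBLOCK_SHIFT) - 1)

/* Recover a block's start address from its descriptor's address.
 * Descriptor i lives at mblock_base + i*64 and describes the block
 * at mblock_base + i*4096, so this is just a shift and an OR: the
 * inverse of the RTS's Bdescr() mapping. */
static inline uintptr_t block_start_from_bdescr(uintptr_t bd)
{
    uintptr_t mblock = bd & ~MBLOCK_MASK;                   /* megablock base */
    uintptr_t index  = (bd & MBLOCK_MASK) >> BDESCR_SHIFT;  /* block number   */
    return mblock | (index << BLOCK_SHIFT);                 /* block start    */
}
```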
Given how often we look at block descriptors and how brutal GHC's runtime tends to be on caches, I hypothesized that we would be better off with a denser (but slightly more computationally intensive) representation. I test this hypothesis here. In short I:
- eliminate the `start` field entirely
- drop the `gen` field in favor of the much smaller `gen_no`
- turn the `free` field into a 32-bit offset
- rearrange the struct to pack nicely into half a cache line (see the sketch after this list)
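For illustration, a half-cache-line layout along these lines might look as follows. This is a hypothetical sketch of the resulting shape, not the actual patch: the real `bdescr`'s union is stood in for by a single pointer, and the exact field ordering here is my own.

```c
#include <stdint.h>

/* Hypothetical dense descriptor, for illustration only.  Because a
 * megablock is 1 MB, any offset within it fits easily in 32 bits,
 * so `free` can shrink from a full pointer to a 32-bit offset that
 * is added back onto the (recomputed) block start on access. */
typedef struct bdescr_ {
    struct bdescr_ *link;     /* chain of block groups                 */
    struct bdescr_ *u_back;   /* stand-in for the real union u         */
    uint32_t        free_off; /* was `free`: offset of first free word */
    uint32_t        blocks;   /* number of blocks in this group        */
    uint16_t        gen_no;   /* was `gen`: recomputed on demand as
                                 &generations[gen_no]                  */
    uint16_t        dest_no;  /* destination generation number         */
    uint16_t        node;     /* NUMA node                             */
    uint16_t        flags;    /* block flags                           */
} bdescr;

/* 8+8+4+4+2+2+2+2 = 32 bytes on 64-bit: half of a 64-byte cache
 * line, versus the current descriptor's full cache line. */
```

Reading `free` or `gen` from such a layout costs a couple of extra arithmetic instructions per access, which is the instruction-count bump discussed in the next section.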
## Performance impact
On the whole this makes no difference for programs with only a small
residency, despite the small bump in instruction count that the extra
computation requires. For instance, the mean cycle count and runtime across
the nofib tests (even when run in slow mode) are statistically unchanged.
Some representative results (`shrink` here refers to the last patch in this
branch; the sample size is 5 and the σ columns give the standard error of the
mean):
| test | metric | baseline | shrink | σ_baseline | σ_shrink |
|---|---|---|---|---|---|
| gc/treejoin | run//perf//cache-misses | 1.21e+06 | 1.14e+06 | 2.96e+04 | 7.57e+03 |
| gc/treejoin | run//perf//cycles | 4.53e+08 | 4.44e+08 | 3.99e+06 | 7.05e+06 |
| gc/treejoin | run//perf//instructions | 7.97e+08 | 8.06e+08 | 7.85e+04 | 4.23e+05 |
| parallel/matmult | run//perf//cache-misses | 3.18e+05 | 3.5e+05 | 5.17e+04 | 4.71e+04 |
| parallel/matmult | run//perf//cycles | 1.01e+10 | 1.1e+10 | 4.95e+07 | 3.47e+08 |
| parallel/matmult | run//perf//instructions | 1.19e+10 | 1.19e+10 | 8.27e+04 | 3e+05 |
| real/fem | run//perf//cache-misses | 4.15e+04 | 3.57e+04 | 3.18e+03 | 7.38e+03 |
| real/fem | run//perf//cycles | 3.38e+09 | 3.3e+09 | 1.35e+07 | 5.41e+06 |
| real/fem | run//perf//instructions | 6.83e+09 | 6.85e+09 | 1.05e+05 | 1.43e+05 |
| spectral/fibheaps | run//perf//cache-misses | 4.6e+04 | 4.46e+04 | 3.31e+03 | 3.6e+03 |
| spectral/fibheaps | run//perf//cycles | 5.61e+09 | 5.58e+09 | 1.38e+07 | 1.59e+07 |
| spectral/fibheaps | run//perf//instructions | 1.05e+10 | 1.06e+10 | 1e+05 | 8.76e+04 |
| spectral/treejoin | run//perf//cache-misses | 2.26e+05 | 2.24e+05 | 1.72e+04 | 3.21e+04 |
| spectral/treejoin | run//perf//cycles | 9.4e+09 | 9.39e+09 | 3.57e+07 | 3.26e+07 |
| spectral/treejoin | run//perf//instructions | 2.1e+10 | 2.12e+10 | 9.67e+06 | 1.19e+07 |
However, for programs with large heaps and significant GC time (e.g. ghc)
this can improve performance by a few percent:
| test | metric | baseline | shrink | σ_baseline | σ_shrink |
|---|---|---|---|---|---|
| build-Cabal | run//perf//cache-misses | 5.14e+09 | 5.08e+09 | 1.02e+07 | 1.54e+07 |
| build-Cabal | run//perf//cycles | 8.23e+11 | 8.07e+11 | 9.13e+09 | 9.37e+09 |
| build-Cabal | run//perf//instructions | 7.07e+11 | 7.11e+11 | 1.58e+09 | 6.32e+08 |
| build-Cabal | run//rts//average_bytes_used | 3.12e+08 | 3.14e+08 | 1.05e+06 | 2.97e+06 |
| build-containers | run//perf//cache-misses | 1.11e+08 | 1.08e+08 | 4.9e+05 | 5.4e+05 |
| build-containers | run//perf//cycles | 2.47e+10 | 2.49e+10 | 4.83e+08 | 5.34e+08 |
| build-containers | run//perf//instructions | 2.25e+10 | 2.25e+10 | 2.69e+07 | 2.99e+07 |
| build-containers | run//rts//average_bytes_used | 3.85e+07 | 3.85e+07 | 1.46e+05 | 1.19e+05 |
These tests examine building Cabal (namely `Setup.hs`, which pulls in a
good fraction of the package's modules) and containers (namely `Data.Map`),
both with `-O`. Here we see that the containers build, which has an average
residency of approximately 30 MB, is unchanged, whereas the Cabal build, which
has an average residency an order of magnitude larger, improves in cycle count
by roughly 2% (with 0.5% error bars) despite an overall increase in
instruction count. This reduction in cycle count also manifests in the overall
runtime.