
WIP: Make bdescr more dense

Ben Gamari requested to merge wip/drop-bdescr-start into master

Currently the block descriptor (bdescr) struct is sized to take a single cache line. This is nice as it eliminates any possibility of false sharing between GC threads. On the other hand, there is a lot of redundancy in this representation:
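
For reference, the descriptor currently looks roughly like this (abridged from the RTS's Block.h; StgPtr, StgWord16, etc. are the RTS word types, and the exact field set drifts a little between GHC versions):

```c
typedef struct bdescr_ {
    StgPtr start;              // start address of the block group
    StgPtr free;               // first free byte in the group
    struct bdescr_ *link;      // chains blocks into lists
    union {
        struct bdescr_ *back;  // occasional doubly-linked lists
        StgWord *bitmap;       // mark bitmap
        StgPtr scan;           // scan pointer for the copying GC
    } u;
    struct generation_ *gen;   // generation this block belongs to
    StgWord16 gen_no;          // gen->no, cached
    StgWord16 dest_no;         // destination generation number
    StgWord16 node;            // NUMA node
    StgWord16 flags;           // block flags
    StgWord32 blocks;          // number of blocks in this group
    StgWord32 _padding[3];     // pad out to 64 bytes on 64-bit platforms
} bdescr;
```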

  • The start field is completely redundant: it can be computed from the block descriptor's own address (see the sketch after this list)
  • The gen and gen_no fields are redundant: bd->gen == &generations[bd->gen_no] (#15414)
  • The free field is a pointer but can only range over a single megablock
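
To illustrate the first point: block descriptors live at a fixed, size-indexed offset from the base of their megablock, so a block's start address falls out of a couple of shifts and masks on the descriptor's own address. A minimal sketch, assuming the usual 4 kB blocks, 1 MB megablocks, and 64-byte descriptors (the helper name is illustrative, not necessarily what this patch does):

```c
#include <stdint.h>

#define MBLOCK_SHIFT 20                            /* 1 MB megablocks    */
#define MBLOCK_MASK  ((1UL << MBLOCK_SHIFT) - 1)
#define BLOCK_SHIFT  12                            /* 4 kB blocks        */
#define BDESCR_SHIFT 6                             /* 64-byte descriptors */

/* Hypothetical helper: recover a block's start address from the address
 * of its descriptor.  This inverts the existing Bdescr() mapping: the
 * descriptor's offset within the megablock, scaled up by
 * BLOCK_SHIFT - BDESCR_SHIFT bits, is the block's offset within the
 * same megablock. */
static inline void *bdescr_start(const void *bd)
{
    uintptr_t addr   = (uintptr_t)bd;
    uintptr_t mblock = addr & ~MBLOCK_MASK;   /* megablock base          */
    uintptr_t offset = (addr & MBLOCK_MASK) << (BLOCK_SHIFT - BDESCR_SHIFT);
    return (void *)(mblock | offset);
}
```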

Given how often we look at block descriptors and how brutal GHC's runtime tends to be on caches, I hypothesized that we would be better off with a denser (but slightly more computationally intensive) representation. I test this hypothesis here. In short I:

  • eliminate the start field entirely
  • drop the gen field in favor of the much smaller gen_no
  • turn the free field into a 32-bit offset
  • rearrange the struct to pack nicely into half a cache line (see the sketch below)
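
The packed descriptor comes out to something like the following (assuming 64-bit pointers; the exact field order and the free_off name here are illustrative rather than lifted from the patch):

```c
/* One possible 32-byte layout: half of a 64-byte cache line. */
typedef struct bdescr_ {
    struct bdescr_ *link;      // chains blocks into lists            (8 bytes)
    union {                    //                                     (8 bytes)
        struct bdescr_ *back;  // occasional doubly-linked lists
        StgWord *bitmap;       // mark bitmap
        StgPtr scan;           // scan pointer for the copying GC
    } u;
    StgWord32 free_off;        // free pointer as a byte offset within
                               // the megablock; reconstructed as
                               // ((W_)bd & ~MBLOCK_MASK) + free_off  (4 bytes)
    StgWord32 blocks;          // number of blocks in this group      (4 bytes)
    StgWord16 gen_no;          // the generation itself is now always
                               // &generations[gen_no]                (2 bytes)
    StgWord16 dest_no;         // destination generation number       (2 bytes)
    StgWord16 node;            // NUMA node                           (2 bytes)
    StgWord16 flags;           // block flags                         (2 bytes)
} bdescr;                      // total: 32 bytes
```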

On the whole this makes no difference for programs with only a small residency, despite the small bump in instruction count that the extra computation requires: the mean cycle count and run time across the nofib tests (even when run at the slow speed setting) are statistically unchanged. For instance (shrink here refers to the last patch in this branch; the sample size is 5 and the σ columns give the standard error of the mean):

| test | metric | baseline | shrink | σ_baseline | σ_shrink |
| --- | --- | ---: | ---: | ---: | ---: |
| gc/treejoin | run//perf//cache-misses | 1.21e+06 | 1.14e+06 | 2.96e+04 | 7.57e+03 |
| gc/treejoin | run//perf//cycles | 4.53e+08 | 4.44e+08 | 3.99e+06 | 7.05e+06 |
| gc/treejoin | run//perf//instructions | 7.97e+08 | 8.06e+08 | 7.85e+04 | 4.23e+05 |
| parallel/matmult | run//perf//cache-misses | 3.18e+05 | 3.5e+05 | 5.17e+04 | 4.71e+04 |
| parallel/matmult | run//perf//cycles | 1.01e+10 | 1.1e+10 | 4.95e+07 | 3.47e+08 |
| parallel/matmult | run//perf//instructions | 1.19e+10 | 1.19e+10 | 8.27e+04 | 3e+05 |
| real/fem | run//perf//cache-misses | 4.15e+04 | 3.57e+04 | 3.18e+03 | 7.38e+03 |
| real/fem | run//perf//cycles | 3.38e+09 | 3.3e+09 | 1.35e+07 | 5.41e+06 |
| real/fem | run//perf//instructions | 6.83e+09 | 6.85e+09 | 1.05e+05 | 1.43e+05 |
| spectral/fibheaps | run//perf//cache-misses | 4.6e+04 | 4.46e+04 | 3.31e+03 | 3.6e+03 |
| spectral/fibheaps | run//perf//cycles | 5.61e+09 | 5.58e+09 | 1.38e+07 | 1.59e+07 |
| spectral/fibheaps | run//perf//instructions | 1.05e+10 | 1.06e+10 | 1e+05 | 8.76e+04 |
| spectral/treejoin | run//perf//cache-misses | 2.26e+05 | 2.24e+05 | 1.72e+04 | 3.21e+04 |
| spectral/treejoin | run//perf//cycles | 9.4e+09 | 9.39e+09 | 3.57e+07 | 3.26e+07 |
| spectral/treejoin | run//perf//instructions | 2.1e+10 | 2.12e+10 | 9.67e+06 | 1.19e+07 |

However, for programs with large heaps and significant GC time (e.g. ghc) this can improve performance by a few percent:

| test | metric | baseline | shrink | σ_baseline | σ_shrink |
| --- | --- | ---: | ---: | ---: | ---: |
| build-Cabal | run//perf//cache-misses | 5.14e+09 | 5.08e+09 | 1.02e+07 | 1.54e+07 |
| build-Cabal | run//perf//cycles | 8.23e+11 | 8.07e+11 | 9.13e+09 | 9.37e+09 |
| build-Cabal | run//perf//instructions | 7.07e+11 | 7.11e+11 | 1.58e+09 | 6.32e+08 |
| build-Cabal | run//rts//average_bytes_used | 3.12e+08 | 3.14e+08 | 1.05e+06 | 2.97e+06 |
| build-containers | run//perf//cache-misses | 1.11e+08 | 1.08e+08 | 4.9e+05 | 5.4e+05 |
| build-containers | run//perf//cycles | 2.47e+10 | 2.49e+10 | 4.83e+08 | 5.34e+08 |
| build-containers | run//perf//instructions | 2.25e+10 | 2.25e+10 | 2.69e+07 | 2.99e+07 |
| build-containers | run//rts//average_bytes_used | 3.85e+07 | 3.85e+07 | 1.46e+05 | 1.19e+05 |

These tests measure building Cabal (namely its Setup.hs, which pulls in a good fraction of the package's modules) and containers (namely Data.Map), both with -O. Here we see that the containers build, whose average residency is roughly 40 MBytes (see average_bytes_used above), is unchanged, whereas the Cabal build, whose average residency is an order of magnitude larger, improves in cycle count by about 2% (8.23e+11 → 8.07e+11, with error bars of around 0.5%) despite an overall increase in instruction count. This reduction in cycle count also shows up in the overall runtime.

