WIP: Make bdescr more dense
Currently the block descriptor (`bdescr`) struct is sized to take a single
cache line. This is nice as it eliminates any possibility of false sharing
between GC threads. On the other hand, there is a lot of redundancy in this
representation:
- The `start` field is completely redundant and can be computed from the block descriptor's address (sketched below)
- The `gen` and `gen_no` fields are redundant: `bd->gen == &generations[bd->gen_no]` (#15414)
- The `free` field is a pointer but can only range over a single megablock
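To make the first point concrete, here is a minimal sketch of the address arithmetic, assuming GHC's standard block layout (4 KB blocks, 1 MB megablocks, 64-byte descriptors packed at the base of each megablock, so the i-th descriptor describes the i-th block). The constants and the helper name are illustrative, not the RTS's own:

```c
#include <stdint.h>

#define BLOCK_SHIFT   12  /* 4 KB blocks */
#define MBLOCK_SHIFT  20  /* 1 MB megablocks */
#define BDESCR_SHIFT   6  /* 64-byte block descriptors */
#define MBLOCK_MASK   (((uintptr_t)1 << MBLOCK_SHIFT) - 1)

/* Recover a block's start address from its descriptor's address.
 * Descriptor i lives at mblock_base + i*64 and describes the block
 * at mblock_base + i*4096, so this is just a shift and an OR: the
 * inverse of the RTS's Bdescr() mapping. */
static inline uintptr_t block_start_from_bdescr(uintptr_t bd)
{
    uintptr_t mblock = bd & ~MBLOCK_MASK;                   /* megablock base */
    uintptr_t index  = (bd & MBLOCK_MASK) >> BDESCR_SHIFT;  /* block number   */
    return mblock | (index << BLOCK_SHIFT);                 /* block start    */
}
```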
Given how often we look at block descriptors and how brutal GHC's runtime tends to be on caches, I hypothesized that we would be better off with a denser (but slightly more computationally intensive) representation. I test this hypothesis here. In short I:
- eliminate the `start` field entirely
- drop the `gen` field in favor of the much smaller `gen_no`
- turn the `free` field into a 32-bit offset
- rearrange the struct to pack nicely into half a cache line (see the sketch after this list)
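For illustration, a half-cache-line layout along these lines might look as follows. This is a hypothetical sketch of the resulting shape, not the actual patch: the real `bdescr`'s union is stood in for by a single pointer, and the exact field ordering here is my own.

```c
#include <stdint.h>

/* Hypothetical dense descriptor, for illustration only.  Because a
 * megablock is 1 MB, any offset within it fits easily in 32 bits,
 * so `free` can shrink from a full pointer to a 32-bit offset that
 * is added back onto the (recomputed) block start on access. */
typedef struct bdescr_ {
    struct bdescr_ *link;     /* chain of block groups                 */
    struct bdescr_ *u_back;   /* stand-in for the real union u         */
    uint32_t        free_off; /* was `free`: offset of first free word */
    uint32_t        blocks;   /* number of blocks in this group        */
    uint16_t        gen_no;   /* was `gen`: recomputed on demand as
                                 &generations[gen_no]                  */
    uint16_t        dest_no;  /* destination generation number         */
    uint16_t        node;     /* NUMA node                             */
    uint16_t        flags;    /* block flags                           */
} bdescr;

/* 8+8+4+4+2+2+2+2 = 32 bytes on 64-bit: half of a 64-byte cache
 * line, versus the current descriptor's full cache line. */
```

Reading `free` or `gen` from such a layout costs a couple of extra arithmetic instructions per access, which is the instruction-count bump discussed in the next section.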
## Performance impact
On the whole this makes no difference for programs with only a small
residency, despite the small bump in instruction count that the extra
computation requires. For instance, the mean cycle count and runtime across
the nofib tests (even when run in slow mode) are statistically unchanged.
Some representative results (`shrink` here refers to the last patch in this
branch; the sample size is 5 and the σ columns give the standard error of the
mean):
| test | metric | baseline | shrink | σ_baseline | σ_shrink |
|---|---|---|---|---|---|
| gc/treejoin | run//perf//cache-misses | 1.21e+06 | 1.14e+06 | 2.96e+04 | 7.57e+03 |
| gc/treejoin | run//perf//cycles | 4.53e+08 | 4.44e+08 | 3.99e+06 | 7.05e+06 |
| gc/treejoin | run//perf//instructions | 7.97e+08 | 8.06e+08 | 7.85e+04 | 4.23e+05 |
| parallel/matmult | run//perf//cache-misses | 3.18e+05 | 3.5e+05 | 5.17e+04 | 4.71e+04 |
| parallel/matmult | run//perf//cycles | 1.01e+10 | 1.1e+10 | 4.95e+07 | 3.47e+08 |
| parallel/matmult | run//perf//instructions | 1.19e+10 | 1.19e+10 | 8.27e+04 | 3e+05 |
| real/fem | run//perf//cache-misses | 4.15e+04 | 3.57e+04 | 3.18e+03 | 7.38e+03 |
| real/fem | run//perf//cycles | 3.38e+09 | 3.3e+09 | 1.35e+07 | 5.41e+06 |
| real/fem | run//perf//instructions | 6.83e+09 | 6.85e+09 | 1.05e+05 | 1.43e+05 |
| spectral/fibheaps | run//perf//cache-misses | 4.6e+04 | 4.46e+04 | 3.31e+03 | 3.6e+03 |
| spectral/fibheaps | run//perf//cycles | 5.61e+09 | 5.58e+09 | 1.38e+07 | 1.59e+07 |
| spectral/fibheaps | run//perf//instructions | 1.05e+10 | 1.06e+10 | 1e+05 | 8.76e+04 |
| spectral/treejoin | run//perf//cache-misses | 2.26e+05 | 2.24e+05 | 1.72e+04 | 3.21e+04 |
| spectral/treejoin | run//perf//cycles | 9.4e+09 | 9.39e+09 | 3.57e+07 | 3.26e+07 |
| spectral/treejoin | run//perf//instructions | 2.1e+10 | 2.12e+10 | 9.67e+06 | 1.19e+07 |
However, for programs with large heaps and significant GC time (e.g. ghc)
this can improve performance by a few percent:
| test | metric | baseline | shrink | σ_baseline | σ_shrink |
|---|---|---|---|---|---|
| build-Cabal | run//perf//cache-misses | 5.14e+09 | 5.08e+09 | 1.02e+07 | 1.54e+07 |
| build-Cabal | run//perf//cycles | 8.23e+11 | 8.07e+11 | 9.13e+09 | 9.37e+09 |
| build-Cabal | run//perf//instructions | 7.07e+11 | 7.11e+11 | 1.58e+09 | 6.32e+08 |
| build-Cabal | run//rts//average_bytes_used | 3.12e+08 | 3.14e+08 | 1.05e+06 | 2.97e+06 |
| build-containers | run//perf//cache-misses | 1.11e+08 | 1.08e+08 | 4.9e+05 | 5.4e+05 |
| build-containers | run//perf//cycles | 2.47e+10 | 2.49e+10 | 4.83e+08 | 5.34e+08 |
| build-containers | run//perf//instructions | 2.25e+10 | 2.25e+10 | 2.69e+07 | 2.99e+07 |
| build-containers | run//rts//average_bytes_used | 3.85e+07 | 3.85e+07 | 1.46e+05 | 1.19e+05 |
These tests examine building Cabal (namely `Setup.hs`, which pulls in a
good fraction of the package's modules) and containers (namely `Data.Map`),
both with `-O`. Here we see that the containers build, which has an average
residency of approximately 30 MB, is unchanged, whereas the Cabal build, which
has an average residency an order of magnitude larger, improves in cycle count
by roughly 2% (with 0.5% error bars) despite an overall increase in
instruction count. This reduction in cycle count also manifests in the overall
runtime.