WIP: Make bdescr more dense (June 2024 edition)
Currently the block descriptor (bdescr
) struct is sized to take a single
cache line. This is nice as it eliminates any possibility of false sharing
between GC threads. On the other hand, there is a lot of redundancy in this
representation:
- The
start
field is completely redundant and can be computed from the block descriptor address - The
gen
andgen_no
fields are redundant:bd->gen == &generations[bd->gen_no]
(#15414) - The
free
field is a pointer but can only range over a single megablock
Given how often we look at block descriptors and how brutal GHC's runtime tends to be on caches, I hypothesized that we would be better off with a denser (but slightly more computationally intensive) representation. I test this hypothesis here. In short I:
- eliminate the
start
field entirely as it can be easily computed - drop the
gen
field in favor of the much smallergen_no
- turn the
free
field into a 32-bit offset (given in bytes relative to the start of the block) - rearrange the struct to pack nicely into half a 64-byte cacheline
Therefore, a copying GC heap block looks something like this:
old new
┌──────────────────────────────────────┐ ┌──────────────────────────────────────┐
│ start │ │ link │
├──────────────────────────────────────┤ ├──────────────────────────────────────┤
│ free │ │ scan │
├──────────────────────────────────────┤ ├────────┬─────────┬─────────┬─────────┤
│ link │ │ gen_no │ dest_no │ node │ flags │
├──────────────────────────────────────┤ ├────────┴─────────┼─────────┴─────────┤
│ scan │ │ free_off │ blocks │
├──────────────────────────────────────┤ └──────────────────┴───────────────────┘
│ gen │
├────────┬─────────┬─────────┬─────────┤
│ gen_no │ dest_no │ node │ flags │
├────────┴─────────┼─────────┴─────────┤
│ blocks │ padding │
├──────────────────┴───────────────────┤
│ padding │
└──────────────────────────────────────┘
There is a bit of a trade-off going on here as the fact that two block descriptors now fit into each cache line means that we may see false-sharing where there was previously none. However, I have not observed this in practice.
Performance impact
TODO