Ben Gamari requested to merge wip/bdescr/june-2024 into master Jun 11, 2024

Currently the block descriptor (bdescr) struct is sized to take a single cache line. This is nice as it eliminates any possibility of false sharing between GC threads. On the other hand, there is a lot of redundancy in this representation:

The start field is completely redundant and can be computed from the block descriptor address
The gen and gen_no fields are redundant: bd->gen == &generations[bd->gen_no] (#15414)
The free field is a pointer but can only range over a single megablock

Given how often we look at block descriptors and how brutal GHC's runtime tends to be on caches, I hypothesized that we would be better off with a denser (but slightly more computationally intensive) representation. I test this hypothesis here. In short I:

eliminate the start field entirely as it can be easily computed
drop the gen field in favor of the much smaller gen_no
turn the free field into a 32-bit offset (given in bytes relative to the start of the block)
rearrange the struct to pack nicely into half a 64-byte cacheline

Therefore, a copying GC heap block looks something like this:

 old                                                        new                                           

 ┌──────────────────────────────────────┐                   ┌──────────────────────────────────────┐        
 │ start                                │                   │ link                                 │        
 ├──────────────────────────────────────┤                   ├──────────────────────────────────────┤        
 │ free                                 │                   │ scan                                 │        
 ├──────────────────────────────────────┤                   ├────────┬─────────┬─────────┬─────────┤        
 │ link                                 │                   │ gen_no │ dest_no │ node    │ flags   │        
 ├──────────────────────────────────────┤                   ├────────┴─────────┼─────────┴─────────┤        
 │ scan                                 │                   │ free_off         │ blocks            │        
 ├──────────────────────────────────────┤                   └──────────────────┴───────────────────┘        
 │ gen                                  │        
 ├────────┬─────────┬─────────┬─────────┤        
 │ gen_no │ dest_no │ node    │ flags   │        
 ├────────┴─────────┼─────────┴─────────┤        
 │ blocks           │ padding           │        
 ├──────────────────┴───────────────────┤        
 │ padding                              │        
 └──────────────────────────────────────┘

There is a bit of a trade-off going on here as the fact that two block descriptors now fit into each cache line means that we may see false-sharing where there was previously none. However, I have not observed this in practice.

Performance impact

TODO

Edited Jun 11, 2024 by Ben Gamari

WIP: Make bdescr more dense (June 2024 edition)

Performance impact

Merge request reports