A pathological large object allocation pattern that leads to fragmentation
Summary
This ticket describes a pathological allocation pattern for large objects that leads to quite high fragmentation, discovered while investigating #24150 (closed).
I think it's somewhat similar to #7831.
The issue
Let's first recall how the block allocator works.
The block allocator keeps free lists indexed by powers of 2: the free list at index x holds free block groups of between 2^x and 2^(x+1)-1 blocks.
So, if I wish to allocate X blocks, I will look for free groups on the free list of index log2_ceil(X); i.e., the request is rounded up to the next power of 2.
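As a sketch of that rounding (in Python, with hypothetical helper names; the real implementation lives in the RTS's C code):

```python
def log2_ceil(n):
    """Smallest x such that 2**x >= n."""
    return (n - 1).bit_length()

def free_list_index(blocks):
    # A request for `blocks` blocks searches the free list of index
    # log2_ceil(blocks), whose groups span 2**x .. 2**(x+1)-1 blocks,
    # i.e. the request is rounded up to the next power of 2.
    return log2_ceil(blocks)

# A 65-block request searches list 7 (groups of 128..255 blocks), so a
# free group of exactly 65 blocks, which sits on list 6, is never found.
```

Note the asymmetry: a free group of N blocks sits on the list indexed by log2_floor(N), while a request for N blocks searches from the list indexed by log2_ceil(N), which is one higher whenever N is not a power of 2.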
This means that when allocating quite large objects, we can fit fewer of them into a megablock than is optimally possible. For instance, a megablock holds at most 252 usable blocks (256, minus 4 used for block descriptors). Two 126-block objects could fill it perfectly, but in practice we can only hold one. After allocating the first, we have 126 blocks left, which go onto the 64-127 free list, yet a second 126-block request must allocate from the next free list, the 128-255 one. In this instance we waste half a megablock.
This also applies, to a lesser extent, to smaller allocation requests. For instance, the block allocator can service only two 65-block groups from one megablock, even though three could fit (3 × 65 = 195 ≤ 252).
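Both cases above fall out of a small simulation (Python, hypothetical names; a simplified model of the free-list search, assuming a fresh megablock of 252 usable blocks):

```python
MEGABLOCK_BLOCKS = 252  # 256 blocks per megablock, minus 4 for descriptors

def log2_ceil(n):
    return (n - 1).bit_length()

def log2_floor(n):
    return n.bit_length() - 1

def groups_per_megablock(n):
    """How many n-block groups one fresh megablock can service, in a
    simplified model where a request for n blocks only searches free
    lists of index >= log2_ceil(n), and a remaining free group of f
    blocks sits on the list of index log2_floor(f)."""
    free = MEGABLOCK_BLOCKS
    count = 0
    while free >= n and log2_floor(free) >= log2_ceil(n):
        free -= n
        count += 1
    return count

# groups_per_megablock(126) -> 1, though two would fit
# groups_per_megablock(65)  -> 2, though three would fit
```

In this model, power-of-2-sized requests suffer no such loss, since log2_floor and log2_ceil agree on them; the worst cases are sizes just above a power of 2.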
On the face of it this isn't too concerning. These gaps can be used to service other types of allocations and large objects normally account for a small percentage of total memory usage.
This behaviour only leads to high fragmentation when another factor comes into play. For me, this was the nonmoving heap. Because the nonmoving heap is made up of block groups that are never copied by the GC, its blocks tend to hem in these large objects.
The pattern I saw was a 65-block large object, some blocks used by the non-moving GC, then another 65-block large object. The issue arises when the two large objects are GC'd. Now we have two 65-block gaps, yet these gaps cannot be used to service new allocations of the same size as before, since, as noted above, requests are rounded up to the next power of 2. So the next allocation is quite likely to need a fresh megablock, and because the non-moving heap is quite likely to hold onto its blocks for a long time, we are quite unlikely to ever see the fragmented megablock deallocated.
So, what I was seeing is a pattern where an application would occasionally allocate 65-block large objects, these would get promoted to the oldest generation, and the space around them would be filled with non-moving heap blocks. Then at the next major GC they would get freed, but this space could not be used to service future requests for 65-block allocations, because the rounding up done by the free list means such a request needs a 128-block gap.
Solution
In my case, the solution was just to avoid allocating such large objects. This made the fragmentation disappear. FWIW I think this is a fine solution to something like this. It's unavoidable that any allocator will have some pathological edge-cases.
Perhaps we could do better by being less extreme with the rounding. Maybe we could have a free list for each block size; that would give us 252 free lists, which isn't that many, but maybe the performance cost would be too high. Or maybe a compromise between the two would work, e.g., having free lists for multiples of 8 blocks.
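As a rough comparison of the worst-case rounding slack under the two schemes (a hypothetical Python sketch of the arithmetic only, not of any real RTS code):

```python
def waste_pow2(n):
    # With power-of-2 free lists, a request for n blocks is effectively
    # rounded up to the next power of 2; this is the rounding slack,
    # which approaches n itself for sizes just above a power of 2.
    return (1 << (n - 1).bit_length()) - n

def waste_mult8(n):
    # Hypothetical scheme with a free list per multiple of 8 blocks:
    # the request rounds up to the next multiple of 8, so the slack
    # is bounded at 7 blocks regardless of n.
    return (-n) % 8

# waste_pow2(65) -> 63 blocks of slack; waste_mult8(65) -> 7
```

With free lists at a fixed granularity the slack is bounded by a constant rather than growing with the request size, at the cost of more lists to maintain and search.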
Now that I have a workaround, I'm not too motivated to investigate this further, but I thought I'd make a ticket to document this behaviour and maybe help someone who runs into it in the future.