rts: Support hugepages
See also:
Summary
Modern CPU architectures tend to have a virtual memory system. Memory is divided into pages, usually 4 KB in size. Programs deal with virtual addresses, and the CPU translates these into actual physical addresses behind the scenes. This necessitates a mapping from virtual pages to their physical addresses.
To avoid an extra memory lookup there is a cache called the TLB (translation lookaside buffer). It can hold a limited number of page entries and is much quicker to access than normal memory. Whenever an entry isn't present in the TLB, we call that a TLB miss, and we must fetch the page entry from memory before we can do the lookup.
It's most efficient if we can fit as much of the program's memory as possible into the number of pages the TLB can hold. This allows us to avoid TLB misses as much as possible.
For instance, an AMD Ryzen 9 5950X has 255 entries in the L1 cache's 4K-page TLB and 2048 entries in the L2 cache's TLB. This seems to be per core. That comes out to roughly 150 MB overall (16 cores with 255+2048 entries each, at 4 KB per entry). This is not a lot of memory for a modern program to use. The Chrome tab rendering this probably uses more.
Modern CPUs have a feature called huge pages (Linux) / superpages (Apple) / large pages (Windows). This allows using pages that are bigger than 4 KB, which increases the amount of memory whose translations we can hold in the TLB.
The size of huge pages is CPU/architecture dependent. 2 MB seems to be a common size; 1 GB is also supported on x86_64.
Hugepages and the RTS
The RTS deals with memory allocation/deallocation in MBlock-sized chunks. MBlocks are (currently) 1 MB-aligned chunks of memory. This raises an issue for hugepage support: most hugepage sizes are bigger than 1 MB, but it's not possible to allocate only part of a hugepage. This requires some changes to our memory subsystem.
There are multiple options:
- Use transparent huge pages (THP). This is a feature of the Linux kernel, where it scans a program's memory and replaces runs of pages with huge pages where possible. This has potential performance downsides, as it requires extra work from the kernel, both to find the memory to replace and to compact the virtual address space to allow usage of huge pages as much as possible. However, the design of the RTS OS memory allocator already makes for a relatively compact virtual address space, so it's a relatively good candidate. THP only supports 2 MB huge pages at present.
- Add another level to the RTS memory hierarchy. We could introduce an explicit notion of huge pages above MBlocks. We already track decommitted memory, so this wouldn't be a huge leap. It would entail tracking the page size of memory we have allocated, so we don't try to deallocate part of a hugepage. This would allow supporting multiple page sizes, e.g. 2 MB and 1 GB at the same time. We could have a system where 1 GB pages are allocated at application start and kept around for the lifetime of the application, controlled by a flag (since it's expensive to allocate/deallocate these).
- Ensure allocation/deallocation only happens in hugepage-sized chunks. We can introduce an invariant that we only allocate/deallocate memory in chunks that are multiples of the hugepage size. This only requires changes to the functions where we allocate/deallocate MBlocks. When allocating, we round up to the hugepage size and put any excess on the current NUMA node's mblock free list. When deallocating, we only deallocate hugepage-aligned runs of MBlocks. This is somewhat ad hoc, but a lot less effort than (2). It allows supporting both 2 MB and 1 GB huge pages, but can't distinguish between the smaller and larger sizes: we are forced to always round up to the larger one, even if a given allocation isn't backed by that size.
- Change the MBlock size. We could increase the MBlock size to the hugepage size. This seems feasible for 2 MB, but not for 1 GB. On balance, it's probably best not to do this, as it might make fragmentation worse for all users, including those not using huge pages. And the MBlock size has not been changed for a long time.
Currently I'm attempting (3). The MR for this is !4523.