Putting large objects in compact regions causes a segfault under the right conditions
## Summary
When putting a large object into a compact region, followed by several smaller ones, the program crashes with a segfault because the GC tries to follow an invalid pointer.
## Steps to reproduce
The following Haskell program reliably causes a segfault for me:
```haskell
import Data.Traversable (for)
import Data.Primitive.ByteArray
import GHC.Compact
import System.Mem (performGC)

main :: IO ()
main = do
  -- Make a fresh compact region. So far, it only contains a single, almost empty small block.
  c <- compact ()
  print =<< compactSize c -- prints 32768, as the default block size is 32K

  -- Now add a big object to the compact region. The size is chosen so that the compact region
  -- allocator will allocate two megablocks while actually leaving space in the first megablock.
  let
    totalSizeInWords = 129016
    arraySizeInWords = totalSizeInWords - 2 {- header words -}
  big <- newByteArray (arraySizeInWords * 8)
  bigFrozen <- unsafeFreezeByteArray big

  -- Add the big object to the compact region.
  -- It is larger than the existing block, and also larger than the big object threshold.
  -- The `ByteArray` wrapper is likely still put into the first block,
  -- but the `ByteArray#` is put in a new block.
  -- One can fit 129024 words into a megablock. Our object is 129016 words in size,
  -- but the allocator reserves a bit too much space, so it needs to allocate 129025 words,
  -- and therefore two megablocks.
  _ <- compactAdd c bigFrozen
  print =<< compactSize c -- prints 2113536, because it allocated a megablock group of 2 megablocks

  -- Now we have to fill the first block, which is currently the nursery.
  -- Eventually, one of the Ints will be added to the second megablock,
  -- causing the segfault.
  placeholders <- for [0 :: Int .. 10000] $ \i -> do
    r <- getCompact <$> compactAdd c i
    print i
    performGC
    pure r

  -- Keep the placeholders alive so the GC actually has a pointer to follow into the invalid
  -- parts of the compact region.
  print $ length placeholders
```
I dug a bit into the RTS source code, but I don't yet fully understand what is going on.
The segfault above is the result of an invariant about megablock groups that is violated by `allocateForCompact`.
In the example program, when we start adding the `Int`s in a loop, they are at first allocated from the nursery, which at that point is still the first block of the compact region. I think that those allocations happen in the `ALLOCATE` macro (https://gitlab.haskell.org/ghc/ghc/blob/ghc-8.6.3-release/rts/Compact.cmm#L25).
Then, once the nursery is full, `allocateForCompact` is called (https://gitlab.haskell.org/ghc/ghc/blob/ghc-8.6.3-release/rts/Compact.cmm#L29). There, it first does the same check on the nursery as the `ALLOCATE` macro (https://gitlab.haskell.org/ghc/ghc/blob/ghc-8.6.3-release/rts/sm/CNF.c#L479), which obviously fails. It then proceeds and finds that the `Int` is not a large object, therefore skipping the special logic there (https://gitlab.haskell.org/ghc/ghc/blob/ghc-8.6.3-release/rts/sm/CNF.c#L490). However, it then finds that the nursery is full (https://gitlab.haskell.org/ghc/ghc/blob/ghc-8.6.3-release/rts/sm/CNF.c#L501) and picks the next block that is not full as the new nursery.
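For illustration, here is a simplified Haskell model of the block selection just described. This is not the real RTS code (which is C, in `allocateForCompact`), and the `Block` record and `pickNursery` are invented names; the point is only that nothing in the selection rules out a block heading a multi-megablock group:

```haskell
-- A toy model of nursery selection: pick the first block with enough
-- free space for the requested number of words.
data Block = Block
  { blockSizeW :: Int  -- usable words in the block (or megablock group)
  , blockUsedW :: Int  -- words already allocated
  , megablocks :: Int  -- 1 for ordinary blocks, >1 for a megablock group
  }

-- Nothing here stops it from picking a block that heads a
-- multi-megablock group.
pickNursery :: Int -> [Block] -> Maybe Block
pickNursery sizeW = foldr step Nothing
  where
    step b acc
      | blockSizeW b - blockUsedW b >= sizeW = Just b
      | otherwise = acc

main :: IO ()
main = do
  let fullNursery = Block { blockSizeW = 4096, blockUsedW = 4096, megablocks = 1 }
      -- the 2-megablock group from the example: 2 * 129024 usable words,
      -- of which 129025 are taken by the large array plus slop
      bigGroup = Block { blockSizeW = 258048, blockUsedW = 129025, megablocks = 2 }
  -- The group is picked as the new nursery:
  print (megablocks <$> pickNursery 2 [fullNursery, bigGroup])  -- prints: Just 2
```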
In the example above, there is such a block: the one that was allocated for the large array. It consists of 2 megablocks, but its `free` pointer still points within the first megablock, and there's enough room, so it picks that block. But once that block has been chosen as the nursery, closures will be allocated even in the second megablock of that group. If I understand the comment in BlockAlloc.c (https://gitlab.haskell.org/ghc/ghc/blob/ghc-8.6.3-release/rts/sm/BlockAlloc.c#L120) correctly, this violates the invariants of megablock groups, as closures allocated there don't have a valid `bdescr` for the block they've been placed into.
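As far as I can tell from `Block.h`, the descriptor of a block is found by address arithmetic alone, and always lands at the start of the pointer's *own* megablock. The following sketch reproduces that computation (constants are the x86_64 defaults for 8.6; the addresses are made up) to show why a closure in the second megablock of a group ends up with a garbage descriptor:

```haskell
import Data.Bits (complement, shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word64)
import Numeric (showHex)

mblockShift, blockShift, bdescrShift :: Int
mblockShift = 20  -- 1 MiB megablocks
blockShift  = 12  -- 4 KiB blocks
bdescrShift = 6   -- 64-byte block descriptors

mblockMask, blockMask :: Word64
mblockMask = (1 `shiftL` mblockShift) - 1
blockMask  = (1 `shiftL` blockShift) - 1

-- The BLOCK_GET_BDESCR computation: the descriptor of the block
-- containing p is derived from p alone, at the start of p's megablock.
bdescrOf :: Word64 -> Word64
bdescrOf p =
  ((p .&. mblockMask .&. complement blockMask) `shiftR` (blockShift - bdescrShift))
    .|. (p .&. complement mblockMask)

main :: IO ()
main = do
  let groupStart = 0x4200000             -- made-up base of a 2-megablock group
      inFirst    = groupStart + 0x80000  -- a closure in the first megablock
      inSecond   = groupStart + 0x180000 -- a closure in the second megablock
  -- Lands in the descriptor area (first 16 KiB) of the first megablock,
  -- which the block allocator initialized:
  putStrLn ("0x" ++ showHex (bdescrOf inFirst) "")   -- 0x4202000
  -- Lands 8 KiB into the *second* megablock, i.e. in the middle of the
  -- large object's data, which was never initialized as a descriptor:
  putStrLn ("0x" ++ showHex (bdescrOf inSecond) "")  -- 0x4302000
```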
The remaining question is: why is there still space left in the first megablock when the allocator actually allocated two megablocks? I think this is because of the size computation for large objects at https://gitlab.haskell.org/ghc/ghc/blob/ghc-8.6.3-release/rts/sm/CNF.c#L491. It reserves room for a `StgCompactNFData` while only space for a `StgCompactNFDataBlock` is needed.
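To see why the over-reservation matters, here is the arithmetic as a small Haskell program. The struct sizes are assumptions on my part: `StgCompactNFDataBlock` has three pointer fields, and the 9 words for `StgCompactNFData` are a guess that matches the 129025 figure observed above.

```haskell
-- Megablock geometry on x86_64: 1 MiB megablocks, of which the first
-- 16 KiB hold block descriptors, leaving 129024 usable words.
wordsPerMegablock :: Int
wordsPerMegablock = (1024 * 1024 - 16 * 1024) `div` 8  -- 129024

objectSizeW :: Int
objectSizeW = 129016  -- the array from the example, header included

-- Assumed struct sizes in words (see the lead-in above).
sizeofNFDataBlock, sizeofNFData :: Int
sizeofNFDataBlock = 3
sizeofNFData      = 9

main :: IO ()
main = do
  -- Reserving only the block header would keep the allocation within a
  -- single megablock ...
  print (objectSizeW + sizeofNFDataBlock <= wordsPerMegablock)  -- True
  -- ... but reserving a whole StgCompactNFData pushes it to 129025 words,
  -- forcing a 2-megablock group with slack left in the first megablock.
  print (objectSizeW + sizeofNFData <= wordsPerMegablock)       -- False
```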
I tried to fix this by changing it not to pick megablock groups as the nursery, and that indeed fixed the issue above, but then we got other segfaults that I haven't yet been able to reproduce reliably, as they occurred as part of a large application.
It could be that the loop that tries to fit the allocation in subsequent blocks (https://gitlab.haskell.org/ghc/ghc/blob/ghc-8.6.3-release/rts/sm/CNF.c#L517) is at fault, as it would still put more than one object into a megablock allocation, while https://gitlab.haskell.org/ghc/ghc/blob/ghc-8.6.3-release/rts/sm/BlockAlloc.c#L117 seems to imply that only a single object should be put in those.
Additionally, the `bdescr` initialization in `compactAllocateBlockInternal` (https://gitlab.haskell.org/ghc/ghc/blob/ghc-8.6.3-release/rts/sm/CNF.c#L247) seems to write into the data area of the allocation when the allocated area is larger than one megablock.
## Expected behavior
I expect the program to print the sizes of the compact region and then all numbers from 0 to 10000, rather than segfaulting after printing 2046.
## Environment
- GHC version used: 8.6.3
Optional:
- Operating System: Ubuntu 18.04
- System Architecture: x86_64