Putting large objects in compact regions causes a segfault under the right conditions

changed the description

Here is a version of the program which has no external dependencies:

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE UnboxedTuples #-}

import Data.Traversable (for)
import GHC.Compact
import GHC.Exts
import GHC.IO
import System.Mem (performGC)

main :: IO ()
main = do
  -- Make a fresh compact region. So far, it only contains a single almost empty small block
  c <- compact ()
  print =<< compactSize c -- prints 32768, as the default block size is 32K
  -- Now add a big object to the compact region. The size is chosen so that the compact region
  -- allocator will allocate two megablocks while actually leaving space in the first megablock.
  let
    totalSizeInWords = 129016
    arraySizeInWords = totalSizeInWords - 2 {- header words -}
  big <- newByteArray (arraySizeInWords * 8)
  bigFrozen <- unsafeFreezeByteArray big

  -- Add the big object to the compact region.
  -- It is larger than the existing block, and also larger than the big object threshold.
  -- The `ByteArray` wrapper is likely put still into the first block,
  -- but the `ByteArray#` is put in a new block.
  -- One can fit 129024 words into a megablock. Our object is 129016 words in size,
  -- but the allocator reserves a bit too much space, so it needs to allocate 129025 words,
  -- and therefore two megablocks.
  _ <- compactAdd c bigFrozen
  print =<< compactSize c -- prints 2113536, because it allocated a megablock group of 2 megablocks

  -- Now we have to fill the first block, which is currently the nursery,
  -- and eventually, one of the Ints will be added to the second block,
  -- causing the segfault.
  placeholders <- for [0 :: Int ..10000] $ \i -> do
    r <- getCompact <$> compactAdd c i
    print i
    performGC
    pure r

  -- Keep the placeholders alive so the GC actually has a pointer to follow into the invalid
  -- parts of the compact region.
  print $ length placeholders

-----
-- primitive stuff
-----

class Monad m => PrimMonad m where
  type PrimState m
  primitive :: (State# (PrimState m) -> (# State# (PrimState m), a #)) -> m a

instance PrimMonad IO where
  type PrimState IO = RealWorld
  primitive = IO
  {-# INLINE primitive #-}

data ByteArray = ByteArray ByteArray#

data MutableByteArray s = MutableByteArray (MutableByteArray# s)

newByteArray :: PrimMonad m => Int -> m (MutableByteArray (PrimState m))
{-# INLINE newByteArray #-}
newByteArray (I# n#)
  = primitive (\s# -> case newByteArray# n# s# of
                        (# s'#, arr# #) -> (# s'#, MutableByteArray arr# #))

unsafeFreezeByteArray
  :: PrimMonad m => MutableByteArray (PrimState m) -> m ByteArray
{-# INLINE unsafeFreezeByteArray #-}
unsafeFreezeByteArray (MutableByteArray arr#)
  = primitive (\s# -> case unsafeFreezeByteArray# arr# s# of
                        (# s'#, arr'# #) -> (# s'#, ByteArray arr'# #))

added compact normal forms label

Confirmed on GHC HEAD as well. Some notes:

The program passes Core, STG, and Cmm lints.

With -debug I get this RTS panic:

Main: internal error: ASSERTION FAILED: file rts/sm/GCUtils.c, line 213

    (GHC version 8.9.0.20190807 for x86_64_unknown_linux)
    Please report this as a GHC bug:  https://www.haskell.org/ghc/reportabug
zsh: abort (core dumped)  ./Main

changed milestone to %8.8.1

added Phigh label and removed needs triage label

Unfortunately this certainly won't make %8.8.1.

changed milestone to %8.8.2

I debugged this a little. The problem is, as far as I can see, after copying the array to the compact region, one of the Int allocations in the region corrupts bdescr of another Int in the same region.

I used this version of the reproducer:

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE UnboxedTuples #-}

import Data.Traversable (for)
import GHC.Compact
import GHC.Exts
import GHC.IO
import System.Mem

main :: IO ()
main = do
  c <- compact ()
  big <- newByteArray 1032128
  bigFrozen <- unsafeFreezeByteArray big
  c <- compactAdd c bigFrozen
  placeholders <- for [0 :: Int ..2045] $ \i -> do
    getCompact <$> compactAdd c i

  return ()

class Monad m => PrimMonad m where
  type PrimState m
  primitive :: (State# (PrimState m) -> (# State# (PrimState m), a #)) -> m a

instance PrimMonad IO where
  type PrimState IO = RealWorld
  primitive = IO

data ByteArray = ByteArray ByteArray#

data MutableByteArray s = MutableByteArray (MutableByteArray# s)

newByteArray :: PrimMonad m => Int -> m (MutableByteArray (PrimState m))
newByteArray (I# n#)
  = primitive (\s# -> case newByteArray# n# s# of
                        (# s'#, arr# #) -> (# s'#, MutableByteArray arr# #))

unsafeFreezeByteArray
  :: PrimMonad m => MutableByteArray (PrimState m) -> m ByteArray
unsafeFreezeByteArray (MutableByteArray arr#)
  = primitive (\s# -> case unsafeFreezeByteArray# arr# s# of
                        (# s'#, arr'# #) -> (# s'#, ByteArray arr'# #))

build.mk:

BuildFlavour = quick

ifneq "$(BuildFlavour)" ""
include mk/flavours/$(BuildFlavour).mk
endif

GhcRtsHcOpts += -O0 -g3

BUILD_PROF_LIBS    = NO

HADDOCK_DOCS       = NO
BUILD_SPHINX_HTML  = NO
BUILD_SPHINX_PDF   = NO
BUILD_MAN          = NO

STRIP_CMD = :

I debugged this a little bit more. I have this printCompact function that I use for debugging:

void printCompact(StgCompactNFData *str)
{
    debugBelch("StgCompactNFData: %p\n", (void*)str);
    StgCompactNFDataBlock *first = compactGetFirstBlock(str);
    bdescr *bd = Bdescr((P_)first);

    while (bd) {
        StgPtr p = bd->start + sizeofW(StgCompactNFDataBlock);
        StgPtr end;
        if ((P_)str->nursery == bd->start) {
            end = str->hp;
        } else {
            end = bd->free;
        }
        while (p < end) {
            printClosure((StgClosure*)p);
            p += closure_sizeW((StgClosure*)p);
        }
        StgCompactNFDataBlock *cnf_block = (StgCompactNFDataBlock*)bd->start;
        bd = Bdescr((P_)cnf_block->next);
    }
}

Reproducer I use:

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE UnboxedTuples #-}

import Data.Traversable (for)
import GHC.Compact
import GHC.Exts
import GHC.IO
import System.Mem

main :: IO ()
main = do
  c <- compact ()
  big <- newByteArray 1032128
  bigFrozen <- unsafeFreezeByteArray big
  c <- compactAdd c bigFrozen

  placeholders <- for [0 :: Int .. 2044] $ \i -> do
    getCompact <$> compactAdd c i

  return ()

data ByteArray = ByteArray ByteArray#

data MutableByteArray s = MutableByteArray (MutableByteArray# s)

newByteArray :: Int -> IO (MutableByteArray RealWorld)
newByteArray (I# n#) = IO (\s# -> case newByteArray# n# s# of (# s'#, arr# #) -> (# s'#, MutableByteArray arr# #))

unsafeFreezeByteArray :: MutableByteArray RealWorld -> IO ByteArray
unsafeFreezeByteArray (MutableByteArray arr#) = IO (\s# -> case unsafeFreezeByteArray# arr# s# of (# s'#, arr'# #) -> (# s'#, ByteArray arr'# #))

There's a compact at 0x42001f5018:

>>> p printClosure(0x42001f5018)
0x42001f5018: COMPACT_NFDATA(size=2113536)

>>> p *Bdescr(0x42001f5018)
$6 = {
  start = 0x42001f5000,
  free = 0x42001fd000,
  link = 0x0,
  u = {
    back = 0x0,
    bitmap = 0x0,
    scan = 0x0
  },
  gen = 0x24ce408,
  gen_no = 1,
  dest_no = 1,
  node = 0,
  flags = 513, // BF_COMPACT | BF_EVACUATED
  blocks = 8,
  _padding = {[0] = 0, [1] = 0, [2] = 0}
}

This holds all the ints + the array. If I print this, the last objects in the compact are:

>>> p printCompact(0x42001f5018)
...
0x42001fcf90: ghc-prim:GHC.Types.I#(0x7f3#)
0x42001fcfa0: ghc-prim:GHC.Types.I#(0x7f4#)
0x42001fcfb0: ghc-prim:GHC.Types.I#(0x7f5#)
0x42001fcfc0: ghc-prim:GHC.Types.I#(0x7f6#)
0x42001fcfd0: ghc-prim:GHC.Types.I#(0x7f7#)
0x42001fcfe0: ghc-prim:GHC.Types.I#(0x7f8#)
0x42001fcff0: ghc-prim:GHC.Types.I#(0x7f9#)
0x4200304018: ARR_WORDS("")
0x42003fffe8: ghc-prim:GHC.Types.I#(0x7fa#)
0x42003ffff8: ghc-prim:GHC.Types.I#(0x7fb#)
0x4200400008: ghc-prim:GHC.Types.I#(0x7fc#)

bdescrs of these objects are fine, except the last one:

>>> p *Bdescr(0x4200400008)
$15 = {
  start = 0x7fb,
  free = 0x73c670 <ghczmprim_GHCziTypes_Izh_con_info>,
  link = 0x7fc,
  u = {
    back = 0xaaaaaaaaaaaaaaaa,
    bitmap = 0xaaaaaaaaaaaaaaaa,
    scan = 0xaaaaaaaaaaaaaaaa
  },
  gen = 0xaaaaaaaaaaaaaaaa,
  gen_no = 43690,
  dest_no = 43690,
  node = 43690,
  flags = 43690,
  blocks = 2863311530,
  _padding = {[0] = 2863311530, [1] = 2863311530, [2] = 2863311530}
}

It's also interesting that the array is that far away from the ByteArray wrapper, which is the second object (after the Compact) in the region. I wonder why that is.

Backtrace at the point of segfault:

>>> bt
#0  0x00000000007627cc in objectGetCompact (closure=0x4200400008) at rts/sm/CNF.h:67
#1  0x0000000000762c8d in evacuate_compact (p=0x4200400008) at rts/sm/Evac.c:373
#2  0x000000000076308d in evacuate (p=0x42001af200) at rts/sm/Evac.c:589
#3  0x0000000000784dca in scavenge_block (bd=0x4200102bc0) at rts/sm/Scav.c:578
#4  0x0000000000786ff8 in scavenge_find_work () at rts/sm/Scav.c:2047
#5  0x00000000007870dd in scavenge_loop () at rts/sm/Scav.c:2123
#6  0x000000000075e9f7 in scavenge_until_all_done () at rts/sm/GC.c:1100
#7  0x000000000075d697 in GarbageCollect (collect_gen=1, do_heap_census=false, gc_type=0, cap=0x924a40 <MainCapability>, idle_cap=0x0) at rts/sm/GC.c:425
#8  0x0000000000742cf4 in scheduleDoGC (pcap=0x7ffd6408ae18, task=0x24dfd50, force_major=false) at rts/Schedule.c:1806
#9  0x0000000000742227 in schedule (initialCapability=0x924a40 <MainCapability>, task=0x24dfd50) at rts/Schedule.c:550
#10 0x00000000007436e9 in scheduleWaitThread (tso=0x4200105388, ret=0x0, pcap=0x7ffd6408af20) at rts/Schedule.c:2547
#11 0x0000000000751c09 in rts_evalLazyIO (cap=0x7ffd6408af20, p=0x7e5b18, ret=0x0) at rts/RtsAPI.c:530
#12 0x0000000000752362 in hs_main (argc=1, argv=0x7ffd6408b118, main_closure=0x7e5b18, rts_config=...) at rts/RtsMain.c:72
#13 0x00000000004056b6 in main ()

bdescrs of these objects are fine, except the last one:

Its address is 0x4200400008, which is suspiciously close to the start of a megablock (0x4200400000). It lies in the region where the bdescr are normally located, I think that's why Bdescr(*0x4200400008) looks corrupted. AFAIK, 0x4200400000 is the second megablock of a 2 megablock large group. The group starts at 0x4200300000. Then the array itself is the first object in that megablock group, starting at 0x4200304018. Between the start of the megablock group and the array, we have 0x4000 bytes reserved for bdescrs and 0x18 bytes for the StgCompactNFDataBlock header.

(But I only started looking into the RTS code last week, so don't take my word for it)

It's also interesting that the array is that far away from the ByteArray wrapper, which is the second object (after the Compact) in the region. I wonder why that is.

It looks to me that this is because the ByteArray wrapper is a small object that still fits in the first compact region block, and the array itself is quite large (larger than the 32K block that is allocated initially), therefore it goes into a newly allocated block.

Its address is 0x4200400008, which is suspiciously close to the start of a megablock (0x4200400000).

How do you know this? As far as I can see this program doesn't allocate a megablock around 0x4200400000.

How do you know this?

I spent the better part of last week debugging this issue as well (see also my summary in the original issue description).

Megablocks seem to be aligned to 1MB boundaries, and 0x4200400000 happens to be on such aligned address. The megablock 0x4200400000 is not allocated individually, but is part of a megablock group consisting of two megablocks that is allocated at 0x4200300000.

mentioned in merge request !1575 (closed)

Hmm, reading your very detailed report again (thanks for that!), I think you already found the problem:

The remaining question is: why is there still space left in the first megablock, when the allocator actually allocated two megablocks? I think this is because the size computation for large objects at https://gitlab.haskell.org/ghc/ghc/blob/ghc-8.6.3-release/rts/sm/CNF.c#L491. It reserves room for an StgCompactNFData while only space for a StgCompactNFDataBlock is needed.

There should be more bugs in this code, but fixing this makes your reproducer work. I've created !1575 (closed) for this. @fatho could you take a look at my patch and let me know if it looks good?

The patch looks good to me, and takes care of this segfault indeed, thanks!

There should be more bugs in this code, but fixing this makes your reproducer work

One other thing that looks suspicious (but I think due to the order in which things are initialized it is also not problematic) is the loop that initializes bdescrs in compactAllocateBlockInternal (https://gitlab.haskell.org/ghc/ghc/blob/ghc-8.6.3-release/rts/sm/CNF.c#L247), due to the different invariants that must hold for megablock groups as opposed to normal sub-megablock allocations.

mentioned in commit 8a254d6b

closed via commit 8a254d6b

Putting large objects in compact regions causes a segfault under the right conditions

Summary

Steps to reproduce

Expected behavior

Environment

Child items ...

Activity