Closed
Opened May 17, 2013 by duncan@trac-duncan

Heap corruption (segfault) from large 'let' expression

The attached test program reliably triggers an assertion in the storage manager with the -debug rts.

    LargeUse: internal error: ASSERTION FAILED: file rts/sm/GCUtils.c, line 208
        (GHC version 7.6.3 for x86_64_unknown_linux)
        Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

This behaviour is reproducible with many recent GHC versions (7.6.3, 7.4.2, and 6.12.3 were tried), all of which fail at the same assertion when using the -debug RTS. (Without -debug we get a more random variety of segfaults and GC errors.)

It looks like a pretty clear case of heap corruption. I'll explain why...

The test program uses TH to generate a program that looks like this:

data Large = Large Int Int ... -- 512 non-strict Int fields

test =
  let step (Large i1 i2 ... i512) =
        let j1 = i1 + i4
            j2 = i2 + i7
            ...
            j511 = i511 + i510
            j512 = i512 + i1
         in (Large j1 j2 ... j512, j1)  -- also return one of the Ints, so the type matches runSteps below

   in runSteps step 100000 (Large 1 1 ... 1)

-- basically an unfoldr:
runSteps :: (state -> (state, Int)) -> Int -> state -> [Int]
runSteps f n i | n <= 0    = []
               | otherwise = case f i of
                               (i', r) -> r : runSteps f (n - 1) i'

We use TH to generate this program, with a "size" parameter that determines the size of the data constructor (and of the corresponding letrec). This makes it easy to find the size threshold at which it fails.
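The attached generator itself isn't reproduced in this description, but a minimal sketch of what it has to do might look as follows. (Module and function names are illustrative, the combinators are from a modern template-haskell rather than the 7.6-era API, and the mixing pattern j_k = i_k + i_{(3k mod n)+1} merely matches the examples above; the attachment may differ in detail.)

    {-# LANGUAGE TemplateHaskell #-}
    module Gen where

    import Language.Haskell.TH

    -- data Large = Large Int Int ... Int   (n non-strict Int fields)
    genLarge :: Int -> Q [Dec]
    genLarge n = do
      let field = bangType (bang noSourceUnpackedness noSourceStrictness)
                           [t| Int |]
      d <- dataD (pure []) (mkName "Large") [] Nothing
                 [normalC (mkName "Large") (replicate n field)] []
      pure [d]

    -- step (Large i1 ... iN) = let j1 = ...; ...; jN = ... in Large j1 ... jN
    genStep :: Int -> Q [Dec]
    genStep n = do
      is <- mapM (\k -> newName ("i" ++ show k)) [1 .. n]
      js <- mapM (\k -> newName ("j" ++ show k)) [1 .. n]
      let large = mkName "Large"
          -- j_k = i_k + i_{(3k `mod` n) + 1}: one non-strict 'let' per field
          def k j = valD (varP j)
                         (normalB [| $(varE (is !! (k - 1)))
                                   + $(varE (is !! ((3 * k) `mod` n))) |]) []
          body = letE (zipWith def [1 .. n] js)
                      (appsE (conE large : map varE js))
      d <- funD (mkName "step")
                [clause [conP large (map varP is)] (normalB body) []]
      pure [d]

Spliced from another module as $(genLarge 512) and $(genStep 512) this produces the shape above, and swapping noSourceStrictness for sourceStrict gives the strict-field variant discussed below.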

For small sizes this program works fine, and for larger values it triggers the assert. With ghc 7.6.3 on an x86-64 machine, the magic threshold is 511: that is, the program works fine with size 510 and hits the assertion at size 511. This is suspiciously close to 512. And of course on a 64bit machine 512 * 8 is 4k, which is the storage manager's block size. And the failing assertion is in a bit of the storage manager that is dealing with blocks...
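A back-of-envelope calculation makes the threshold even more suggestive, assuming GHC's standard closure layout of one header word in front of the payload (a detail the numbers above don't spell out):

    -- Why 511 is plausibly the magic threshold, assuming one header
    -- word followed by one word per field in a constructor closure.
    wordSize, blockSize :: Int
    wordSize  = 8        -- bytes per word on x86-64
    blockSize = 4096     -- the storage manager's block size (4 KB)

    closureBytes :: Int -> Int
    closureBytes fields = (1 + fields) * wordSize   -- header + payload

    -- closureBytes 510 == 4088   (still fits within one block)
    -- closureBytes 511 == 4096   (exactly fills a whole block)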

        // If this block does not have enough space to allocate the
        // current object, but it also doesn't have any work to push, then 
        // push it on to the scanned list.  It cannot be empty, because
        // then there would be enough room to copy the current object.
        if (bd->u.scan == bd->free)
        {
            ASSERT(bd->free != bd->start);
            push_scanned_block(bd, ws);
        }

So it looks very much like we have a situation where something is writing over the end of a block and messing up the SM's data structures.

But it is not nearly as simple as the data constructor being too big: we can demonstrate other programs that use much larger data constructors without any problem at all. Our suspicion falls on the big letrec.

Indeed, if we change the data constructor in this program to have strict fields then it no longer fails, and we can run it with much larger data constructor sizes. What would be different between strict and non-strict fields here? One observation is that with strict fields GHC can (and does) turn the code into a big cascade of case expressions, whereas with non-strict fields the STG code is all 'let's.

        case tpl_s6jQ of tpl_s6Ak {
          __DEFAULT ->
              case tpl_s6jS of tpl_s6Al {
                __DEFAULT ->
                    case tpl_s6jU of tpl_s6Am {
    -- etc for all 500+ elements

versus

              let {
                sat_s5UK :: GHC.Types.Int
                [LclId] =
                    \u [] GHC.Num.$fNumInt_$c+ i511_s5Ly i1_s5E9; } in
              let {
                sat_s62X :: GHC.Types.Int
                [LclId] =
                    \u [] GHC.Num.$fNumInt_$c+ i510_s5R2 i509_s5UG; } in
              let {
                sat_s62W :: GHC.Types.Int
                [LclId] =
                    \u [] GHC.Num.$fNumInt_$c+ i509_s5UG i506_s5UC; } in
    -- etc for all 500+ elements

Note also that it has nothing to do with the obvious space leak here. If we modify the code to generate an NFData instance and use deepseq at each iteration, then we eliminate the space leak; but we keep the big STG 'let', and it still fails.
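For reference, a sketch of that leak-free variant (assuming the deepseq package; a small hand-written type with a generically derived instance stands in here for the TH-generated 512-field one):

    {-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}

    import Control.DeepSeq (NFData, deepseq)
    import GHC.Generics (Generic)

    -- 3-field stand-in for the 512-field TH-generated type; in the real
    -- test the NFData instance is generated alongside the constructor.
    data Large = Large Int Int Int
      deriving (Generic, NFData)

    -- runSteps as before, but forcing the whole state each iteration so
    -- no chain of unevaluated thunks (the space leak) can build up.
    runSteps' :: NFData state => (state -> (state, Int)) -> Int -> state -> [Int]
    runSteps' f n i
      | n <= 0    = []
      | otherwise = case f i of
          (i', r) -> i' `deepseq` (r : runSteps' f (n - 1) i')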

Trac metadata

Trac field              Value
Version                 7.6.3
Type                    Bug
TypeOfFailure           OtherFailure
Priority                normal
Resolution              Unresolved
Component               Runtime System
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture
Milestone: 8.0.1 (Past due)
Reference: ghc/ghc#7919