Eager AP_STACK blackholing causes incorrect size info for sanity checks
While debugging #15508 (closed) I found a case where eager blackholing of an
AP_STACK causes closure_sizeW() to return an incorrect size, which in turn
causes incorrect slop zeroing by OVERWRITING_CLOSURE(), which breaks sanity
checks.
To reproduce, cd into
    $ ghc-stage2 Mult.hs -fforce-recomp -debug -rtsopts
    $ ./Mult +RTS -DS
    Mult: internal error: checkClosure: stack frame
        (GHC version 8.7.20180825 for x86_64_unknown_linux)
        Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
    zsh: abort (core dumped) ./Mult +RTS -DS
Here's how the problem occurs:
- Allocate an AP_STACK in a generation during a GC.
- Evaluate the AP_STACK. The entry code first WHITEHOLEs and then eagerly
BLACKHOLEs it. At this point the size reported for the AP_STACK becomes 2,
because that's the size of an (eager or not) BLACKHOLE.
- To start a GC the thread calls threadPaused, which at line 342 actually
BLACKHOLEs the eager blackhole (is this part really correct?) and zeros the
slop. But because the eager blackhole already has the same size as a BLACKHOLE,
there is no slop to zero, so the stack frames in the original AP_STACK's
payload are left in place.
- In the next GC, the pre-GC sanity check walks the whole heap. When checking
the generation that the BLACKHOLE resides in (the AP_STACK that became a
BLACKHOLE in the second step above), we check the closure and then move on to
closure + 2 (2 is the size of a BLACKHOLE) instead of
closure + <size of the original AP_STACK>, and so end up checking a stack
frame left over from the original AP_STACK's payload.
This causes the sanity check to fail because we don't expect to see a stack
frame outside of a stack.
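The heap walk described in the steps above can be sketched with a toy model. This is not RTS code: the `TOY_*` tags, `toy_closure_size` and `toy_check_heap` are hypothetical names, and the layout (one header word, one length word, then payload) is a simplifying assumption. It only illustrates how a size derived from the current header sends the walker into the stale payload.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only, not the GHC RTS. Each heap word is a closure header tag,
   a payload word, or zeroed slop. All names here are hypothetical. */
enum {
    TOY_SLOP = 0,        /* zeroed word that the walker may skip */
    TOY_AP_STACK = 1,    /* header; next word is the payload length */
    TOY_BLACKHOLE = 2,   /* header; always 2 words in this model */
    TOY_STACK_FRAME = 3, /* only valid inside a stack or AP_STACK payload */
    TOY_CONSTR = 4       /* some ordinary 1-word closure */
};

/* Size in words of the closure at heap[i], derived only from its current
   header, in the spirit of closure_sizeW(). Returns 0 for a word that is
   not a valid heap closure header. */
static size_t toy_closure_size(const int *heap, size_t i)
{
    switch (heap[i]) {
    case TOY_AP_STACK:  return 2 + (size_t)heap[i + 1];
    case TOY_BLACKHOLE: return 2;
    case TOY_CONSTR:    return 1;
    default:            return 0;
    }
}

/* Walk the heap closure by closure, skipping slop. Returns the index of the
   first word that is not a valid closure header, or -1 if the walk is clean. */
static long toy_check_heap(const int *heap, size_t n)
{
    assert(heap != NULL);
    size_t i = 0;
    while (i < n) {
        if (heap[i] == TOY_SLOP) { i++; continue; }
        size_t size = toy_closure_size(heap, i);
        if (size == 0) return (long)i; /* e.g. a stray stack frame */
        i += size;
    }
    return -1;
}
```

With a heap of `{TOY_AP_STACK, 3, frame, frame, frame, TOY_CONSTR}` the walk is clean. Changing only the first word to `TOY_BLACKHOLE` makes the walker step by 2 and land on the first leftover frame; zeroing the three payload words makes the walk clean again.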
In summary, normally when we blackhole an object we zero the space after the blackhole (i.e. the rest of the original object's payload) so that sanity checks can skip over that space. We can't do this when eagerly blackholing, because the payload of the original object is still going to be used, and the leftover payload then causes sanity check failures.
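To make the slop-zeroing side concrete, here is a second toy sketch under the same assumptions (hypothetical `TOY_*` names, a simplified header/length/payload layout, not the real OVERWRITING_CLOSURE() macro). The overwrite routine computes the old size from whatever header is currently in place, so it zeros the payload only if the header has not already been replaced.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only, not the GHC RTS. All names here are hypothetical. */
enum { TOY_SLOP = 0, TOY_AP_STACK = 1, TOY_BLACKHOLE = 2, TOY_STACK_FRAME = 3 };
enum { TOY_BLACKHOLE_WORDS = 2 };

/* Size in words, read from the closure's current header. */
static size_t toy_size_from_header(const int *c)
{
    return c[0] == TOY_AP_STACK ? 2 + (size_t)c[1] : TOY_BLACKHOLE_WORDS;
}

/* Overwrite closure c with a blackhole and zero everything between the end
   of the blackhole and the end of the old closure. The old size comes from
   the header as it is at the time of the call. Returns the number of words
   that were zeroed. */
static size_t toy_blackhole_and_zero_slop(int *c)
{
    size_t old_size = toy_size_from_header(c);
    assert(old_size >= TOY_BLACKHOLE_WORDS);
    c[0] = TOY_BLACKHOLE;
    for (size_t w = TOY_BLACKHOLE_WORDS; w < old_size; w++) {
        c[w] = TOY_SLOP;
    }
    return old_size - TOY_BLACKHOLE_WORDS;
}
```

Called on a closure whose header is still `TOY_AP_STACK` with a 3-word payload, the routine zeros 3 words. Called after the header has already been set to `TOY_BLACKHOLE` (the eager case in this model), it sees a 2-word closure, zeros nothing, and the frames stay behind.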