byte code interpreter segfault (9.6 branch) (text benchmarks with optimisation)

bytestring benchmarks also segfault with a similar backtrace.

0x00007fffef19e2d4 in stg_ap_pp_info () from /home/matt/ghc-9.6-backports/_clean/stage1/bin/../lib/x86_64-linux-ghc-9.6.0.20230201/libHSrts-1.0.2_thr_debug-ghc9.6.0.20230201.so
(gdb) bt
#0  0x00007fffef19e2d4 in stg_ap_pp_info () from /home/matt/ghc-9.6-backports/_clean/stage1/bin/../lib/x86_64-linux-ghc-9.6.0.20230201/libHSrts-1.0.2_thr_debug-ghc9.6.0.20230201.so
#1  0x0000000000000000 in ?? ()

Here is a simple reproducer which can trigger the segfault with just cabal build.

https://github.com/mpickering/bytestring-th-segfault

assigned to @AndreasK

If I turn off worker wrapper then I get a different error:

All
  mempty: ghc: internal error: scavenge_stack: weird activation record found on stack: 14594088
Stack trace:
            0x7f2841886f91    set_initial_registers (rts/Libdw.c:294.5)
            0x7f28414ba938    dwfl_thread_getframes (/nix/store/rz7yr9qv1z3p2pdjm3lbw5302zx2g66l-elfutils-0.188/lib/libdw-0.188.so)
            0x7f28414ba47b    get_one_thread_cb (/nix/store/rz7yr9qv1z3p2pdjm3lbw5302zx2g66l-elfutils-0.188/lib/libdw-0.188.so)
            0x7f28414ba7a2    dwfl_getthreads (/nix/store/rz7yr9qv1z3p2pdjm3lbw5302zx2g66l-elfutils-0.188/lib/libdw-0.188.so)
            0x7f28414bace7    dwfl_getthread_frames (/nix/store/rz7yr9qv1z3p2pdjm3lbw5302zx2g66l-elfutils-0.188/lib/libdw-0.188.so)
            0x7f2841886e5f    libdwGetBacktrace (rts/Libdw.c:263.15)
            0x7f2841897a57    rtsFatalInternalErrorFn (rts/RtsMessages.c:175.22)
            0x7f28418975d0    barf (rts/RtsMessages.c:49.3)
            0x7f28418e4cfe    scavenge_stack1 (rts/sm/Scav.c:1884.12)
            0x7f28418e4264    scavenge_one (rts/sm/Scav.c:1521.27)
            0x7f28418e4e5d    scavenge_large (rts/sm/Scav.c:2068.12)
            0x7f28418e4fd6    scavenge_find_work (rts/sm/Scav.c:2139.27)
            0x7f28418e50fc    scavenge_loop1 (rts/sm/Scav.c:2207.8)
            0x7f28418cf3f1    scavenge_until_all_done (rts/sm/GC.c:1335.9)
            0x7f28418cd45d    GarbageCollect (rts/sm/GC.c:557.26)
            0x7f28418a0477    scheduleDoGC (rts/Schedule.c:1881.5)
            0x7f284189e347    schedule (rts/Schedule.c:575.7)
            0x7f28418a1c05    scheduleWorker (rts/Schedule.c:2668.11)
            0x7f28418aa2bc    workerStart (rts/Task.c:446.12)
            0x7f28418ed800    start_thread (rts/posix/OSThreads.c:219.1)
            0x7f28415b3e86    start_thread (/nix/store/ayfr5l52xkqqjn3n4h9jfacgnchz1z7s-glibc-2.35-224/lib/libc.so.6)
            0x7f284163ace0    __clone3 (/nix/store/ayfr5l52xkqqjn3n4h9jfacgnchz1z7s-glibc-2.35-224/lib/libc.so.6)

    (GHC version 9.6.0.20230201 for x86_64_unknown_linux)
    Please report this as a GHC bug:  https://www.haskell.org/ghc/reportabug
Error: cabal: Failed to build exe:bytestring-th-segfault from
bytestring-th-segfault-0.1.0.0. The build process terminated with exit code -6

I'm minimizing.

Reduction:

Main.hs

{-# LANGUAGE TemplateHaskell #-}
module Main (main) where

import BenchTop

main :: IO ()
main = $(benchTop)

BenchTop.hs

{-# LANGUAGE TemplateHaskell #-}
module BenchTop where

import Control.Monad.IO.Class (liftIO)
import Control.Exception (evaluate)
import System.Mem (performGC)
import Language.Haskell.TH

runB :: (Int -> IO ()) -> IO ()
runB b = do
    b 1
    performGC
    b 2
    let x = x in x

benchTop :: Q Exp
benchTop = liftIO defaultMain >> [| () |]

defaultMain :: IO ()
defaultMain = runB (benchLoop g)

g :: () -> () -> ()
g _ x = x

benchLoop :: (() -> () -> ()) -> Int -> IO ()
benchLoop _ 0 = pure ()
benchLoop f n = evaluate (f ()) >> benchLoop f (n-1)

$ ./stage1/bin/ghc -fprefer-byte-code -fbyte-code-and-object-code -O -fforce-recomp Main
[1 of 3] Compiling BenchTop         ( BenchTop.hs, BenchTop.o, interpreted )
[2 of 3] Compiling Main             ( Main.hs, Main.o, interpreted )
Segmentation fault (core dumped)

Adding -fno-worker-wrapper gives scavenge_stack: weird activation record found on stack as in the original.

Reproduced on perf build of 3e18d493 (ghc-9.6 branch) and on devel2 build of 3e09cf82 with cherry-picked 75b87ef6 (master).

changed milestone to %9.6.2

Here's a version that causes a segfault in bare ghci - no optimisation or -fprefer-byte-code:

{-# LANGUAGE MagicHash, UnboxedTuples #-}
module BenchTop where

import GHC.Exts
import GHC.IO

unit :: ()
unit = ()

bad :: IO ()
bad = IO i

i :: State# RealWorld -> (# State# RealWorld, () #)
i s = case seq# unit s of (# s', a #) -> (# s', a #)

$ ghci BenchTop
GHCi, version 9.2.4: https://www.haskell.org/ghc/  :? for help
[1 of 1] Compiling BenchTop         ( BenchTop.hs, interpreted )
Ok, one module loaded.
ghci> bad
Segmentation fault (core dumped)

This causes a segfault on 9.2.4, 9.4.3 and master. Works on 8.10 and 9.0.2.

The file must be interpreted (i.e. rm *.hi *.o if the segfault does not happen).

great work with the minimization!

I believe I have tracked down the bug. It is twisty and subtle, and I think the incorrect code is arguably in the RTS, not in the bytecode compiler. To understand how things go wrong, it is important to understand several pieces of context.

Bytecode and RTS prerequisites

First, we must understand how the following STG binding is represented in bytecode:
```
i = \r [void]
  case seq# [unit void#] of
    Solo# a -> Solo# [a]
```
This is a very simple definition, so the BCO really only needs to do two things: set up a continuation to receive the result of the application of seq#, then make the call. The generated BCO that does those things essentially looks like this:
```
ProtoBCO i:
  PUSH_ALTS_UNLIFTED P
    ProtoBCO result_cont:
      PUSH_L 1
      SLIDE 1 4
      RETURN_UNLIFTED P
  PUSH_APPLY_V
  PUSH_G unit
  PUSH_APPLY_P
  PUSH_G GHC.PrimopWrappers.seq#
  ENTER
```
The real BCO is a little more complicated, with some extra indirection involving nested BCOs, but those complications are not relevant to the bug.

The portion of this BCO that makes the call to seq# is very simple: it just pushes stg_ap_v and stg_ap_p frames onto the stack to apply seq# to arguments, then enters GHC.PrimopWrappers.seq# (which is just a small native-code wrapper around the seq# primop).

The body of result_cont is a little less obvious, but it is crucial for understanding this bug. The PUSH_L and SLIDE instructions just rearrange the stack: PUSH_L 1 pushes the word at Sp[1] onto the top of the stack, and SLIDE 1 4 pops the first 4 words underneath the first word—that is, it “slides” the first word down 4 words. To understand where these numbers come from, we must consider RTS calling conventions.
The expression seq# [unit void#] returns a single pointer. Normally, the RTS calling convention for a function that returns a single word is straightforward and direct: it is always returned in R1. But bytecode does not use registers at all—it is a stack machine—so a bit of glue is necessary to push the returned value onto the stack before jumping back into the bytecode interpreter.

The PUSH_ALTS_UNLIFTED instruction handles setting up the appropriate glue, which for this expression takes the form of a stg_ctoi_R1unpt stack frame. That stack frame has the following Cmm definition:
```
INFO_TABLE_RET(stg_ctoi_R1unpt, RET_BCO)
{
    Sp_adj(-2);
    Sp(1) = R1;
    Sp(0) = stg_ret_p_info;
    jump stg_yield_to_interpreter [];
}
```
This stack frame is a RET_BCO closure, so its payload is a BCO. In the above example, PUSH_ALTS_UNLIFTED pushes a stg_ctoi_R1unpt frame with result_cont as its payload. This means the top of the stack looks like this when GHC.PrimopWrappers.seq# returns:
```
┌──────────────────────┐
│ stg_ctoi_R1unpt_info │
├──────────────────────┤
│     result_cont      │
├──────────────────────┤
```
When stg_ctoi_R1unpt_info is returned to, it does not pop itself off the stack. Instead, it pushes two new stack frames, leaving the stack like this:
```
┌──────────────────────┐
│    stg_ret_p_info    │
├──────────────────────┤
│      GHC.Prim.()     │
├──────────────────────┤
│ stg_ctoi_R1unpt_info │
├──────────────────────┤
│     result_cont      │
├──────────────────────┤
```
It then jumps into the bytecode interpreter. The bytecode interpreter looks at the frame at the top of the stack to decide what to do, and it knows that stg_ret_*_info always means “return to a BCO representing a case expression”. The frame immediately underneath the stg_ret_* frame is always a RET_BCO, so the bytecode interpreter starts executing the result_cont BCO in the stg_ctoi_R1unpt payload.

This calling convention explains the numbers in the PUSH_L and SLIDE instructions. The PUSH_L 1 instruction fetches the GHC.Prim.() closure from underneath the stg_ret_p frame and pushes it onto the top of the stack, and the SLIDE 1 4 instruction then drops the four words of stack underneath. At this point, the stack is in a good state, and the BCO can return to its caller.
Whenever the bytecode interpreter returns a value to native code, it must yield to the scheduler. In situations where the native code is one of these “glue” stack frames that just jumps back into the bytecode interpreter, this is needlessly wasteful. Therefore, the bytecode interpreter emulates certain known stack frames to avoid returning to native code as long as possible.

For example, suppose our definition of i didn’t call into native code at all, and instead it just matched on a value directly:
```
i = \r [void]
  case Solo# [unit] of
    Solo# a -> Solo# [a]
```
This generates similar bytecode as before, but now there’s no need to make a call, so unit is returned to result_cont directly:
```
ProtoBCO i:
  PUSH_ALTS_UNLIFTED P
    ProtoBCO result_cont:
      PUSH_L 1
      SLIDE 1 4
      RETURN_UNLIFTED P
  PUSH_G unit
  RETURN_UNLIFTED P
```
When the bytecode interpreter evaluates a RETURN_UNLIFTED P instruction, it just pushes a stg_ret_p frame onto the stack and jumps to the start of the interpreter loop. Since the stack is now set up in precisely the same way that stg_ctoi_R1unpt sets it up, there is no need to yield to native code at all, and the interpreter just executes result_cont directly.

Important to this bug is that the interpreter’s ability to take these shortcuts extends beyond interpreter-specific stack frames. In particular, the interpreter is able to emulate UPDATE_FRAMEs by updating the thunk directly and popping the stack, even if those update frames were pushed by native code. This behavior is essential to the cause of the bug.

The bug: a play by play

With all of that context out of the way, I can finally explain what actually goes wrong when executing the above bytecode. Let’s walk through it step by step:

First, the bytecode interpreter executes the PUSH_ALTS_UNLIFTED P instruction, which sets up the stack to receive the result of the call to seq#:

┌──────────────────────┐
│ stg_ctoi_R1unpt_info │
├──────────────────────┤
│     result_cont      │
├──────────────────────┤

Next, the bytecode interpreter sets up the stack to apply seq# to arguments:

┌──────────────────────┐
│    stg_ap_p_info     │
├──────────────────────┤
│         unit         │
├──────────────────────┤
│    stg_ap_v_info     │
├──────────────────────┤
│ stg_ctoi_R1unpt_info │
├──────────────────────┤
│     result_cont      │
├──────────────────────┤

The interpreter then executes the ENTER GHC.PrimopWrappers.seq# instruction. Its argument is a function defined in native code, so the interpreter sets up the top of the stack to perform the enter in native code, then yields to the scheduler.

GHC.PrimopWrappers.seq# has the following definition:

seq# :: a -> State# s -> (# State# s,a #)
seq# x s = GHC.Prim.seq# x s

The only reason it exists is for the bytecode compiler. Its implementation is trivial even at the Cmm level:

GHC.PrimopWrappers.seqzh_code() {
  R1 = R2;
  call stg_ap_0_fast(R1);
}

This is a tail call that immediately enters unit. unit happens to be defined in bytecode, so it is an AP closure with type ARG_BCO produced by mkApUpd0#. Its entry code pushes an update frame for the thunk and sets up the stack to jump back into the bytecode interpreter:

┌───────────────────────┐
│ stg_apply_interp_info │
├───────────────────────┤
│       unit_BCO        │
├───────────────────────┤
│  stg_upd_frame_info   │
├───────────────────────┤
│       unit_AP         │
├───────────────────────┤
│ stg_ctoi_R1unpt_info  │
├───────────────────────┤
│      result_cont      │
├───────────────────────┤

The bytecode interpreter inspects the top of the stack and finds the stg_apply_interp frame at the top. This is used to request that the bytecode interpreter apply a BCO as a function, so it pops the first two words off the top of the stack and applies unit_BCO, which has the following definition:
```
ProtoBCO unit:
  PACK GHC.Prim.() 0
  RETURN
```
The PACK instruction just allocates a fresh CONSTR closure with the given infotable, and RETURN returns it.
The bytecode interpreter consults the top of the stack to determine how to return its newly-allocated () constructor. The first stack frame is an update frame, which it knows how to handle itself, so it performs the thunk update and pops the stack frame. This leaves us with a familiar stack:
```
┌──────────────────────┐
│ stg_ctoi_R1unpt_info │
├──────────────────────┤
│     result_cont      │
├──────────────────────┤
```
Now the interpreter finds the stg_ctoi_R1unpt frame, which is a RET_BCO, and it knows how to handle that as well. However, it is at this point that the interpreter makes a critical error.

Recall that stg_ctoi_R1unpt sets up the stack by pushing both the returned pointer and a stg_ret_p frame. Also recall that RETURN_UNLIFTED P emulates this protocol by pushing an stg_ret_p frame at the top of the stack. However, note that we are not currently executing a RETURN_UNLIFTED instruction, we are executing a RETURN instruction. This is not fundamentally incorrect, as the representation of (# () #) is indeed identical to the representation of ().

Unfortunately, the calling convention for BCOs that receive values via PUSH_ALTS and PUSH_ALTS_UNLIFTED differs. When the bytecode interpreter returns a lifted value to a RET_BCO frame, it assumes the continuation was installed by a PUSH_ALTS instruction and sets up the top of the stack accordingly:
```
┌──────────────────────┐
│     GHC.Prim.()      │
├──────────────────────┤
│ stg_ctoi_R1unpt_info │
├──────────────────────┤
│     result_cont      │
├──────────────────────┤
```
Note that this has no stg_ret_* frame at the top!
At this point, we are thoroughly lost: the bytecode interpreter blindly executes result_cont with the wrong stack layout. It interprets the PUSH_L 1, SLIDE 1 4 sequence, but there are only three frames at the top of the stack, so we end up with the following:
```
┌──────────────────────┐
│ stg_ctoi_R1unpt_info │
├──────────────────────┤
│  <orphaned payload>  │
├──────────────────────┤
```
The interpreter now executes a RETURN_UNLIFTED instruction, but instead of a closure, we are returning a stg_ctoi_R1unpt_info pointer, and instead of returning to our caller, we’re now returning to our caller’s caller (if we’re lucky) or to the middle of nowhere (if we’re not).

What precisely happens next depends on the enclosing program, but it is usually only a matter of time before the erroneous program state leads to a page fault and the operating system mercifully puts our program out of its misery.

What do we do about it?

There are a few potential ways to solve this issue, but I think the nicest one would be to make the calling convention for PUSH_ALTS and PUSH_ALTS_UNLIFTED P BCOs the same. I think this should avoid this problem, and it is more consistent with the way the rest of the RTS works. The simplest way to go about this would be to make both calling conventions expect 3 words on the stack, which is what PUSH_ALTS continuations currently expect.

The bytecode compiler will have to change to produce BCOs with the uniform calling convention, and the bytecode interpreter will have to arrange to invoke BCOs with that calling convention, but I don’t think this should be a significant task. I will try to put together an MR sometime in the next few days, but I am not going to try right now: I am done thinking about this for today. :) I will report back if I run into any unforeseen challenges.

You should put up the whole of Bytecode and RTS prerequisites on the wiki, it might be the most documentation about the workings of the interpreter I've seen in a single place.

Excellent work.

added Phigh Tbug bytecode labels

mentioned in merge request !10040 (closed)

As I just explained in !10040 (comment 484886), I’ve been contemplating whether it makes sense to make any distinction between RETURN/PUSH_ALTS and RETURN_UNLIFTED P/PUSH_ALTS_UNLIFTED P at all.

I don’t think there is any distinction of this sort made at the Cmm level. To my knowledge, in Cmm, a pointer is a pointer, and there are only two differences between returning a lifted value versus returning an unlifted value:

When returning a lifted value that might not be fully evaluated, we must enter it. When returning an unlifted value, we must not enter it.
Lifted pointers must be tagged before we return them.

The first point is already handled explicitly in bytecode because we have a separate ENTER instruction. For the second point, the interpreter keeps closures tagged on the stack, so RETURN doesn’t do anything special in that regard. Similarly, on the PUSH_ALTS side, the returned value is left tagged on the stack, so the difference is all in the generated code.

I don’t think this is wildly surprising. After all, this source of this whole issue is really that, at the level of the RTS, we don’t make any distinction between these things, so it’s possible for a return of a lifted value to end up returning directly to a continuation that scrutinizes an “unlifted value” (which is actually an unboxed 1-tuple containing a lifted value, so from the runtime’s perspective, it is very much lifted). Therefore, I propose that we scrap this distinction altogether, since pretending it exists seems likely to only cause trouble.

Does that sound reasonable to everyone else? Do chime in if you think there is something I’ve missed.

mentioned in merge request !10061 (closed)

mentioned in commit 036e3cc3

mentioned in commit a0b037cb

mentioned in commit f48462e9

closed with commit d85ed900

mentioned in commit d85ed900

mentioned in commit cb8cd007

mentioned in commit 74b8e5bd

byte code interpreter segfault (9.6 branch) (text benchmarks with optimisation)

Child items ...

Activity

Bytecode and RTS prerequisites

The bug: a play by play

What do we do about it?