I modified the interpreter to allow optimised code by commenting out the part of Session.hs which forces -O0.
Then I built the benchmarks for text library and copied the command which builds the benchmark executable.
I replaced --make with --run so that the main function would be run immediately by bytecode interpreter.
The result is a segfault:
(gdb) bt#0 0x00007fffef19e2d4 in stg_ap_pp_info () from /home/matt/ghc-9.6-backports/_clean/stage1/bin/../lib/x86_64-linux-ghc-9.6.0.20230201/libHSrts-1.0.2_thr_debug-ghc9.6.0.20230201.so#1 0x0000000000000000 in ?? ()
This seems similar in flavour to #22956 but a bit easier to reproduce.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items
...
Show closed items
Linked items
0
Link issues together to show that they're related or that one is blocking others.
Learn more.
bytestring benchmarks also segfault with a similar backtrace.
0x00007fffef19e2d4 in stg_ap_pp_info () from /home/matt/ghc-9.6-backports/_clean/stage1/bin/../lib/x86_64-linux-ghc-9.6.0.20230201/libHSrts-1.0.2_thr_debug-ghc9.6.0.20230201.so(gdb) bt#0 0x00007fffef19e2d4 in stg_ap_pp_info () from /home/matt/ghc-9.6-backports/_clean/stage1/bin/../lib/x86_64-linux-ghc-9.6.0.20230201/libHSrts-1.0.2_thr_debug-ghc9.6.0.20230201.so#1 0x0000000000000000 in ?? ()
{-# LANGUAGE TemplateHaskell #-}moduleMain(main)whereimportBenchTopmain::IO()main=$(benchTop)
BenchTop.hs
{-# LANGUAGE TemplateHaskell #-}moduleBenchTopwhereimportControl.Monad.IO.Class(liftIO)importControl.Exception(evaluate)importSystem.Mem(performGC)importLanguage.Haskell.THrunB::(Int->IO())->IO()runBb=dob1performGCb2letx=xinxbenchTop::QExpbenchTop=liftIOdefaultMain>>[|()|]defaultMain::IO()defaultMain=runB(benchLoopg)g::()->()->()g_x=xbenchLoop::(()->()->())->Int->IO()benchLoop_0=pure()benchLoopfn=evaluate(f())>>benchLoopf(n-1)
$ ./stage1/bin/ghc -fprefer-byte-code -fbyte-code-and-object-code -O -fforce-recomp Main[1 of 3] Compiling BenchTop ( BenchTop.hs, BenchTop.o, interpreted )[2 of 3] Compiling Main ( Main.hs, Main.o, interpreted )Segmentation fault (core dumped)
Adding -fno-worker-wrapper gives scavenge_stack: weird activation record found on stack as in the original.
Reproduced on perf build of 3e18d493 (ghc-9.6 branch) and on devel2 build of 3e09cf82 with cherry-picked 75b87ef6 (master).
Here's a version that causes a segfault in bare ghci - no optimisation or -fprefer-byte-code:
{-# LANGUAGE MagicHash, UnboxedTuples #-}moduleBenchTopwhereimportGHC.ExtsimportGHC.IOunit::()unit=()bad::IO()bad=IOii::State#RealWorld->(#State#RealWorld,()#)is=caseseq#unitsof(#s',a#)->(#s',a#)
$ ghci BenchTopGHCi, version 9.2.4: https://www.haskell.org/ghc/ :? for help[1 of 1] Compiling BenchTop ( BenchTop.hs, interpreted )Ok, one module loaded.ghci> badSegmentation fault (core dumped)
This causes a segfault on 9.2.4, 9.4.3 and master. Works on 8.10 and 9.0.2.
The file must be interpreted (i.e. rm *.hi *.o if the segfault does not happen).
I believe I have tracked down the bug. It is twisty and subtle, and I think the incorrect code is arguably in the RTS, not in the bytecode compiler. To understand how things go wrong, it is important to understand several pieces of context.
Bytecode and RTS prerequisites
First, we must understand how the following STG binding is represented in bytecode:
i=\r[void]caseseq#[unitvoid#]ofSolo#a->Solo#[a]
This is a very simple definition, so the BCO really only needs to do two things: set up a continuation to receive the result of the application of seq#, then make the call. The generated BCO that does those things essentially looks like this:
ProtoBCO i: PUSH_ALTS_UNLIFTED P ProtoBCO result_cont: PUSH_L 1 SLIDE 1 4 RETURN_UNLIFTED P PUSH_APPLY_V PUSH_G unit PUSH_APPLY_P PUSH_G GHC.PrimopWrappers.seq# ENTER
The real BCO is a little more complicated, with some extra indirection involving nested BCOs, but those complications are not relevant to the bug.
The portion of this BCO that makes the call to seq# is very simple: it just pushes stg_ap_v and stg_ap_p frames onto the stack to apply seq# to arguments, then enters GHC.PrimopWrappers.seq# (which is just a small native-code wrapper around the seq# primop).
The body of result_cont is a little less obvious, but it is crucial for understanding this bug. The PUSH_L and SLIDE instructions just rearrange the stack: PUSH_L 1 pushes the word at Sp[1] onto the top of the stack, and SLIDE 1 4 pops the first 4 words underneath the first word—that is, it “slides” the first word down 4 words. To understand where these numbers come from, we must consider RTS calling conventions.
The expression seq# [unit void#] returns a single pointer. Normally, the RTS calling convention for a function that returns a single word is straightforward and direct: it is always returned in R1. But bytecode does not use registers at all—it is a stack machine—so a bit of glue is necessary to push the returned value onto the stack before jumping back into the bytecode interpreter.
The PUSH_ALTS_UNLIFTED instruction handles setting up the appropriate glue, which for this expression takes the form of a stg_ctoi_R1unpt stack frame. That stack frame has the following Cmm definition:
This stack frame is a RET_BCO closure, so its payload is a BCO. In the above example, PUSH_ALTS_UNLIFTED pushes a stg_ctoi_R1unpt frame with result_cont as its payload. This means the top of the stack looks like this when GHC.PrimopWrappers.seq# returns:
It then jumps into the bytecode interpreter. The bytecode interpreter looks at the frame at the top of the stack to decide what to do, and it knows that stg_ret_*_info always means “return to a BCO representing a case expression”. The frame immediately underneath the stg_ret_* frame is always a RET_BCO, so the bytecode interpreter starts executing the result_cont BCO in the stg_ctoi_R1unpt payload.
This calling convention explains the numbers in the PUSH_L and SLIDE instructions. The PUSH_L 1 instruction fetches the GHC.Prim.() closure from underneath the stg_ret_p frame and pushes it onto the top of the stack, and the SLIDE 1 4 instruction then drops the four words of stack underneath. At this point, the stack is in a good state, and the BCO can return to its caller.
Whenever the bytecode interpreter returns a value to native code, it must yield to the scheduler. In situations where the native code is one of these “glue” stack frames that just jumps back into the bytecode interpreter, this is needlessly wasteful. Therefore, the bytecode interpreter emulates certain known stack frames to avoid returning to native code as long as possible.
For example, suppose our definition of i didn’t call into native code at all, and instead it just matched on a value directly:
i=\r[void]caseSolo#[unit]ofSolo#a->Solo#[a]
This generates similar bytecode as before, but now there’s no need to make a call, so unit is returned to result_cont directly:
ProtoBCO i: PUSH_ALTS_UNLIFTED P ProtoBCO result_cont: PUSH_L 1 SLIDE 1 4 RETURN_UNLIFTED P PUSH_G unit RETURN_UNLIFTED P
When the bytecode interpreter evaluates a RETURN_UNLIFTED P instruction, it just pushes a stg_ret_p frame onto the stack and jumps to the start of the interpreter loop. Since the stack is now set up in precisely the same way that stg_ctoi_R1unpt sets it up, there is no need to yield to native code at all, and the interpreter just executes result_cont directly.
Important to this bug is that the interpreter’s ability to take these shortcuts extends beyond interpreter-specific stack frames. In particular, the interpreter is able to emulate UPDATE_FRAMEs by updating the thunk directly and popping the stack, even if those update frames were pushed by native code. This behavior is essential to the cause of the bug.
The bug: a play by play
With all of that context out of the way, I can finally explain what actually goes wrong when executing the above bytecode. Let’s walk through it step by step:
First, the bytecode interpreter executes the PUSH_ALTS_UNLIFTED P instruction, which sets up the stack to receive the result of the call to seq#:
The interpreter then executes the ENTER GHC.PrimopWrappers.seq# instruction. Its argument is a function defined in native code, so the interpreter sets up the top of the stack to perform the enter in native code, then yields to the scheduler.
GHC.PrimopWrappers.seq# has the following definition:
This is a tail call that immediately enters unit. unit happens to be defined in bytecode, so it is an AP closure with type ARG_BCO produced by mkApUpd0#. Its entry code pushes an update frame for the thunk and sets up the stack to jump back into the bytecode interpreter:
The bytecode interpreter inspects the top of the stack and finds the stg_apply_interp frame at the top. This is used to request that the bytecode interpreter apply a BCO as a function, so it pops the first two words off the top of the stack and applies unit_BCO, which has the following definition:
ProtoBCO unit: PACK GHC.Prim.() 0 RETURN
The PACK instruction just allocates a fresh CONSTR closure with the given infotable, and RETURN returns it.
The bytecode interpreter consults the top of the stack to determine how to return its newly-allocated () constructor. The first stack frame is an update frame, which it knows how to handle itself, so it performs the thunk update and pops the stack frame. This leaves us with a familiar stack:
Now the interpreter finds the stg_ctoi_R1unpt frame, which is a RET_BCO, and it knows how to handle that as well. However, it is at this point that the interpreter makes a critical error.
Recall that stg_ctoi_R1unpt sets up the stack by pushing both the returned pointer and a stg_ret_p frame. Also recall that RETURN_UNLIFTED P emulates this protocol by pushing an stg_ret_p frame at the top of the stack. However, note that we are not currently executing a RETURN_UNLIFTED instruction, we are executing a RETURN instruction. This is not fundamentally incorrect, as the representation of (# () #) is indeed identical to the representation of ().
Unfortunately, the calling convention for BCOs that receive values via PUSH_ALTS and PUSH_ALTS_UNLIFTED differs. When the bytecode interpreter returns a lifted value to a RET_BCO frame, it assumes the continuation was installed by a PUSH_ALTS instruction and sets up the top of the stack accordingly:
At this point, we are thoroughly lost: the bytecode interpreter blindly executes result_cont with the wrong stack layout. It interprets the PUSH_L 1, SLIDE 1 4 sequence, but there are only three frames at the top of the stack, so we end up with the following:
The interpreter now executes a RETURN_UNLIFTED instruction, but instead of a closure, we are returning a stg_ctoi_R1unpt_info pointer, and instead of returning to our caller, we’re now returning to our caller’s caller (if we’re lucky) or to the middle of nowhere (if we’re not).
What precisely happens next depends on the enclosing program, but it is usually only a matter of time before the erroneous program state leads to a page fault and the operating system mercifully puts our program out of its misery.
What do we do about it?
There are a few potential ways to solve this issue, but I think the nicest one would be to make the calling convention for PUSH_ALTS and PUSH_ALTS_UNLIFTED P BCOs the same. I think this should avoid this problem, and it is more consistent with the way the rest of the RTS works. The simplest way to go about this would be to make both calling conventions expect 3 words on the stack, which is what PUSH_ALTS continuations currently expect.
The bytecode compiler will have to change to produce BCOs with the uniform calling convention, and the bytecode interpreter will have to arrange to invoke BCOs with that calling convention, but I don’t think this should be a significant task. I will try to put together an MR sometime in the next few days, but I am not going to try right now: I am done thinking about this for today. :) I will report back if I run into any unforeseen challenges.
You should put up the whole of Bytecode and RTS prerequisites on the wiki, it might be the most documentation about the workings of the interpreter I've seen in a single place.
As I just explained in !10040 (comment 484886), I’ve been contemplating whether it makes sense to make any distinction between RETURN/PUSH_ALTS and RETURN_UNLIFTED P/PUSH_ALTS_UNLIFTED P at all.
I don’t think there is any distinction of this sort made at the Cmm level. To my knowledge, in Cmm, a pointer is a pointer, and there are only two differences between returning a lifted value versus returning an unlifted value:
When returning a lifted value that might not be fully evaluated, we must enter it. When returning an unlifted value, we must not enter it.
Lifted pointers must be tagged before we return them.
The first point is already handled explicitly in bytecode because we have a separate ENTER instruction. For the second point, the interpreter keeps closures tagged on the stack, so RETURN doesn’t do anything special in that regard. Similarly, on the PUSH_ALTS side, the returned value is left tagged on the stack, so the difference is all in the generated code.
I don’t think this is wildly surprising. After all, this source of this whole issue is really that, at the level of the RTS, we don’t make any distinction between these things, so it’s possible for a return of a lifted value to end up returning directly to a continuation that scrutinizes an “unlifted value” (which is actually an unboxed 1-tuple containing a lifted value, so from the runtime’s perspective, it is very much lifted). Therefore, I propose that we scrap this distinction altogether, since pretending it exists seems likely to only cause trouble.
Does that sound reasonable to everyone else? Do chime in if you think there is something I’ve missed.