Tag inference failure with -fprefer-byte-code

added Tbug pointer tagging prefer byte code labels

assigned to @AndreasK

added Phigh label

added compiler crash label

changed milestone to %9.6.1

added bytecode label

I slightly simplified the reproducer and added one way to reproduce this using ghci rather than TH.

changed the description

This seems to boil down to a difference in semantics between compiled and executed code.

We have this code in STG:

B.theA :: A.A
[GblId] =
    A.MkA! [theA1_iBH];

theA1_iBH :: GHC.Maybe.Maybe GHC.Types.Bool
[GblId] =
    GHC.Maybe.Just! [GHC.Types.True];

We expect the argument to MkA to be properly tagged but that should be fine. When we compile code then we will look at the RHS of theA1_iBH and produce a proper tagged pointer for theA1_iBH inside A.MkA! [theA1_iBH];.

For the interpreter this goes wrong. How does the interpreter work?`

For A.MkA! [theA1_iBH]; it will generate the code

   PUSH_G   theA1_iBH
   PACK     A.MkA 1
   RETURN

What this does is push the pointer for theA1_iBH onto the stack, allocating a MkA constructor which the argument from the stack as it's field and returning the constructor.

The issue here is that the pointer representing theA1_iBH isn't tagged. But the tag inference currently assumes it is based on the RHS of theA1_iBH. There are two ways to fix this:

Make tag inference aware of weither or not we compile for the interpreter and infer different taggedness based on that.
Ensure that looking up the pointer for theA1_iBH returns a properly tagged pointer in the interpreter.

The later seems better. The analysis remains simpler and it improves interpreter performance so I'm looking into how feasible this is. If this turns out to be unrealistic then changing the analysis wouldn't be hard either. But it seems like a worse way to fix this.

I'm still digging more into the code of the interpreter but I think we will have to go with changing the analysis.

By the looks of it the interpreter has no notion allocating top level constructors at all. That is if we have:

foo = Just True

Instead of compiling this into a closure right away. It results in a closure which has to be executed to get an evaluated value. This becomes evident when looking at the BCO we produce.

This binding:

theA1_iBH :: GHC.Maybe.Maybe GHC.Types.Bool
[GblId] =
    GHC.Maybe.Just! [GHC.Types.True];

Results in this BCO

ProtoBCO theA1_iBH#0 []:
   GHC.Maybe.Just! [GHC.Types.True]
   bitmap:  0 []
   PACK     GHC.Types.True 0
   PACK     GHC.Maybe.Just 1
   RETURN

Would it be nice if the interpreter would create constructor objects immediately? Obviously yes. But I'm not sure it's worth it so I will adjust the analysis for now.

I don’t know very much about how tag inference works or the invariants that it expects to be maintained. But in general I think it would be strongly preferable to make the bytecode compiler smarter rather than accommodating its inadequacies, especially now that we have -fprefer-byte-code. My understanding of this specific issue, based only on your comments so far, is roughly this:

We generate different STG (not just different Cmm) if tag inference suggests something should always be tagged (and therefore necessarily fully evaluated). That is, it allows us to optimize
```
case theA1_iBH of x
  __DEFAULT -> A.MkA! [x]
```
into just
```
A.MkA! theA1_iBH
```
and omit the evaluation altogether.
We rely on this optimization being done everywhere, so we expect to have the invariant that strict constructor fields are in fact always fully evaluated and always tagged. This means that when we have
```
case a of
  MkA b -> case b of b'
    __DEFAULT -> b'
```
then we can optimize this to simply
```
case a of
  MkA b -> b
```
since we know b must already be evaluated.
Currently, the analysis that drives the optimization described in point 1 makes certain assumptions about what closures the code generator will produce for certain definitions. These assumptions hold for the Cmm code generator, but they don’t hold for the bytecode compiler.

Normally, these differences wouldn’t matter, because the bytecode compiler would just generate more naïve code. But in this case, since the STG code itself is altered, the bytecode compiler happily does as it is told and applies MkA to its argument without evaluating it, which violates the (global, runtime) invariant described in point 2.

Assuming this summary is correct, it reveals that these sorts of optimizations on STG are in somewhat perilous territory, as they rely on certain invariants being upheld by all code generators. Is there something somewhere that concisely describes the invariants that are expected? I was trying to look into this myself, but I found it difficult to hunt down, especially since we now have various flavors of pointer tagging and (at least it seems to me) different invariants in different scenarios.

But in general I think it would be strongly preferable to make the bytecode compiler smarter rather than accommodating its inadequacies

I agree with the sentiment but for this particular case it seems like a decently sized feature to implement. Even if I overestimate the work involved it's not feasible for the timeline of the 9.6 release.

So for practical reasons our options are either adjust the analysis now. Or adjust the analysis now and change the interpreter later.

We generate different STG

This is correct. But tag inference only ever adds cases to the STG AST. Eliminating redundant evals happens during StgToCmm. See also Note [Tag inference passes] for a very brief overview.

In the case of

case a of
  MkA b -> case b of b'
    __DEFAULT -> b'

Tag inference would turn this into:

case a of
  MkA b -> case b<tagged> of b'
    __DEFAULT -> b'

Which makes the case a no-op during StgToCmm. It has the same end result as just removing the case but was simpler to implement.

Do note that in the general case something like:

case a of
  Just b -> b

This would still result in b being evaluated as we jump to b implicitly rather than simply returning it. This has confused people in the past but is quite crucial. However if we know b is evaluated we can pass to the code generator

case a of
  MkA b -> b<tagged>

Which tell's it that it's free to simply return b instead of jumping to it. Iirc it's exactly d6ea8356 which added this optimization. Before that we would still have jumped to b.

It's all a bit involved. If you have suggestions for removing the notes or further questions I'm happy to look at either.

I would say your conclusions in point two/three are broadly correct.

I was trying to look into this myself, but I found it difficult to hunt down, especially since we now have various flavors of pointer tagging and (at least it seems to me) different invariants in different scenarios.

The main notes are Note [Strict Field Invariant] and Note [CBV Function Ids]. I'm sure they could be improved. I wrote them after having had all the knowledge in my head for a long time so they might be short some details. So if you have questions because something is unclear feel free to ping me on irc or otherwise and I will happily answer.

mentioned in merge request !9863 (closed)

I agree with the sentiment but for this particular case it seems like a decently sized feature to implement. Even if I overestimate the work involved it's not feasible for the timeline of the 9.6 release.

Yes, that makes sense, of course. The workaround is certainly fine for now, but let’s keep this issue open (or open another one) to track the better approach.

This is correct. But tag inference only ever adds cases to the STG AST. Eliminating redundant evals happens during StgToCmm. See also Note [Tag inference passes] for a very brief overview.

I see, thank you for the clarification. Initially, it left me rather confused about how things were actually going wrong. After all, starting from desugared Core, we actually have:

$WMkA :: Maybe Bool -> A
$WMkA x = case x of y { __DEFAULT -> MkA y }

theA :: A
theA = $WMkA (Just True)

So it seems as though there ought to be no problem: the application of $WMkA will force the closure. However, I now realize my error: even with -O0, GHC will happily inline $WMkA and eliminate the case expression… and indeed, it should! Just True is already fully evaluated, so there is no need to evaluate it again.

Given the above, I think one way to view the discrepancy here is that the simplifier transforms programs based on a Core-level understanding of what an evaluated value is, and the runtime takes shortcuts based on an STG-level understanding of what an evaluated value is, and the latter introduces some distinctions that are not present in the former. The code generator must therefore take care to preserve evaluatedness, but the bytecode compiler currently does not, and that causes problems. (I am sure this is all quite obvious to you, but it was not immediately obvious to me.)

Do note that in the general case something like […] This would still result in b being evaluated as we jump to b implicitly rather than simply returning it.

Indeed, of course—I didn’t think it through, but it seems obvious now that you’ve pointed it out. But the optimization in the code generator you describe makes sense.

The main notes are Note [Strict Field Invariant] and Note [CBV Function Ids]. I'm sure they could be improved. I wrote them after having had all the knowledge in my head for a long time so they might be short some details. So if you have questions because something is unclear feel free to ping me on irc or otherwise and I will happily answer.

This was all quite helpful, so thank you! I think my confusion was mostly one of where to look. I’ll add a brief mention of tag inference to the wiki page on pointer tagging that mostly just points to the Notes in GHC.Stg.InferTag, as I think that would have been helpful for me.

Here is another thought. Suppose we do tag-inference, but do not do the rewriting that adds the extra evals.

Consider the STG->bytecode code generator. It carries an environment BCEnv that tells something about each in-scope variable:

type BCEnv = Map Id StackDepth -- To find vars on the stack

The STG->Cmm code generator has a more elaborate environment

type CgBindings = IdEnv CgIdInfo

It would be rather easy for these environments to keep track of which Ids were evaluated-and-properly-tagged: it's a property that comes straight from the binding site:

For Stg->Cmm. the binding let x = Just y in .. would say that x is evaluated-and-properly-tagged.
For Stg->Bytecode, that binding would not say that, beause (apparently) we build a thunk.

OK then, when we come to a strict data constructor, the code generator would be responsible for adding the extra evals, if the argument was not statically known to be evaluated-and-properly-tagged.

This is just an engineering shift: it's move the "rewriting" pass to the code generator, and indeed fuse it with the code generator. Advantantage: no need to anticipate the vagaries of the code generator.

This is just an engineering shift: it's move the "rewriting" pass to the code generator, and indeed fuse it with the code generator. Advantantage: no need to anticipate the vagaries of the code generator.

Unfortunately, with Andreas’s correction to my understanding, I now realize this would not actually help at all (and would actually make things worse). The example in my comment above illustrates why: the elimination of the case that forces the argument to MkA happens at the Core level, introduced by the simplifier! So we unfortunately cannot be sound but incomplete by doing nothing; the analysis is mandatory.

the analysis is mandatory.

Under my suggestion, the STG->Bytecode generator would say "oh, I can't statically see that this thing is evaluated-and-properly-tagged, so I'd better evaluate it". So no, I don't think it's mandatory. At worst, the code generator adds redundant evals.

So no, I don't think it's mandatory.

Okay, sure: I agree in that the analysis can always be replaced with an analysis that assumes nothing is evaluated. That is true. But this still requires every code generator to understand where the extra evaluations may need to be inserted, so it doesn’t seem like it’s an improvement. My understanding of Andreas’s proposed workaround is to switch to this approximation when the backend is the bytecode compiler, and then it gets handled at the STG level.

It's not a big deal either way. We can

Have a STG-rewrite phase, immediately before code gen, with a flag to say which code gen is about to be used. This completely rewrites the tree, adding evals.
Make each code gen add evals. Each code gen knows which pointers are evaluated-and-properly-tagged -- it generates the code! So it's easy to track that. It knows which call sites need evalutated-and-properly-tagged arguments (namely the strict constructors or, more generally, functions with CBV arguemnts)

I think it's a bit more direct, and certainly more efficient, to do the latter. But the former is also fine.

My understanding of Andreas’s proposed workaround is to switch to this approximation when the backend is the bytecode compiler, and then it gets handled at the STG level.

That's the plan. But we don't have to switch to a complete "never assume tagged" analysis.

Assuming tag presence for case binders case x of x' -> ..., strict fields case e of _ { MkStrict y' -> ...} and CBV arguments to functions should still be fine at the very least.

But every lifted binder introduced through let will have to be assumed untagged when targeting bytecode, even if the RHS is just a constructor. (Which is fairly easy to model in the analysis as-is).

There clearly would be benefits to doing some of this work in the code generator.
But I still think doing the analysis+ast rewrite on STG rather than in the backend makes for an easier to understand implementation as well as making it easier to reason about what a pass did since STG is far easier to read than Cmm.

It's clear now that the analysis needs to deal with more than one model of what a backend will generate.

If the code to model the backend behavior is in the backend the major drawback is that we will end up with different backends sharing or more likely duplicating a good deal of the logic.
If we put all of the analysis in one place the code that models a backend will be distinct from the backend implementation but in exchange we can trivially share a lot of the logic between backends.

To me personally not having to go down to Cmm to understand runtime behaviour is quite valuable so I plan to add a flag to the analysis which controls if we assume bytecode or cmm backend behaviour.

mentioned in commit d036785f

mentioned in commit 483e9c8b

mentioned in commit f59e3363

mentioned in commit 88580856

mentioned in commit 60241bfe

mentioned in commit eefc367d

mentioned in issue #22890

mentioned in commit b959ad0b

mentioned in commit 382bd7da

mentioned in commit bea77fc4

mentioned in commit 75b87ef6

mentioned in commit a2feb420

mentioned in commit 0de5c091

mentioned in commit f9a6a7e8

mentioned in commit bc9b1c9b

closed with commit d6411d6c

mentioned in commit d6411d6c

mentioned in commit cb11178d

mentioned in commit 50184e05

mentioned in commit cfab991e

mentioned in commit 7a9a6233

mentioned in issue #24870 (closed)

This was fixed in !9863 (closed)

commit 382bd7dad9cd53254204f418190368667a127f64
Author: Andreas Klebinger <klebinger.andreas@gmx.at>
Date:   Mon Jan 30 21:59:47 2023 +0100

    Fix some correctness issues around tag inference when targeting the bytecode generator.

    * Let binders are now always assumed untagged for bytecode.
    * Imported referenced are now always assumed to be untagged for bytecode.

    Fixes #22840

Tag inference failure with -fprefer-byte-code

Child items ...

Activity