Convey ABI/calling-convention between modules (even with -O0)

In #23848 we were investigating ways in which we could prevent tag inference from generating thunks to insert field seqs on strictly used binders. The workaround to pass -fworker-wrapper-cbv could generally lead to regressions because of rules #20364 and is generally an unsatisfying mechanism.

Furthermore, "Tag inference" is not some kind of analysis run with -O1, but at this point it is an integral part of the code generation pipeline, even with -O0. Compile a client module without running tag inference and you get crashes. In light of this, it should better be named StgPrep or some such.

See also

#21792: make everything BoxedRep have enter code

The situation today

When we export a function (or data value) in an interface file, we always serialise its type. Of course! But we do not currently always serialise

Its arity
Its CAFFy-ness
Whether it is a data constructor application
The CBV (call by value) properties of its arguments

And even if we do serialise this info, if we compile an importing module with -O0, or -fignore-iface-pragmas, we may simply skip the info.

Up to now that has been acceptable, but it has a big downside: an importing module must (be able to) make pessimistic assumptions about all of these things.

For arity, make a "slow call"; so there must be a version we can slow-call.
For CAFFY-ness, assume that it may have CAFs. That bloats the CAF-ref tables in the importing module; a bit annoying when -O0 is meant to compile fast.
For data constructor applications it means that the importing module may need to add a redundant eval. e.g.
```
module X where { x = Just True }
module A where { import X; data T = MkT !a; y = MkT x }
```
In the code for y we may need to add an eval for x unless we can "see" that it is evaluated and properly tagged.
For CBV-properties, see the next section.

Collectively, let's we call these properties of an export x the STG-ABI of x. The STG-ABI of a function is a bit like the standard definition of "application binary inerface"](https://en.wikipedia.org/wiki/Application_binary_interface), but STG-ABI is platform-independent. Even knowing the numbers and representations of each argument, there is a further step to map to the actual registers and calling conventoin of a particular machine architecture.

The central idea of this ticket is: let's always export the calling-convention/ABI.

Call-by-value properties

Consider

module X where
  f x y z = case x of { True -> e1; False -> e2 }

module A where
  import X( f )
  g1 ys p q = f (h ys) q p
  g2 x  p q = Just (f x q p)
  g3 ys     = map f ys

(we suppose that f is too big to be inlined.) With -O, today:

The code for f will evaluate x and branch on its tag. More precisely, it will
- test x's pointer tag for "I am a thunk". If so, evaluate, if not fall through.
- test x's pointer tag for 1 or 2, meaning True or False. The "evaluate" step involves saving y and z on the stack, and pushing a return address (which in turn means allocating an info table) and entering x; then e1 and e2 need to reload y and z from the stack.
The code for g1 will use call-by-value. It will push a return address, call (h ys), take the returned pointer, and pass it to f. Note that in this call, f's test for evaluated-ness is unnecessary, because the argument is always evaluated.
On the other hand, the call to f in g2 will pass a possibly-unevaluated x (note that g2 is lazy), so then the evaluated-ness test in f plays a role. Similarly, in g3, our function f will be called on possibly-unevaluated elements of ys

Idea: if we could expose f as a call-by-value function, that absolutely requires its argument(s) to be evaluated, then

The code for f could omit the evaluated-ness test, the info-table, and the stores and loads for y and z. Big win!
The code for g1 would be as now.
The code for g2 would need to add an eval for x, to satisfy f`'s calling convention:
```
g2 x  p q = Just (case x of x' -> f x q p)
```

Similarly g3 via eta-expansion:

g3 ys = map (\x p q -> case x of x' -> f x' p q) ys

This idea places more obligations on the caller in exchange for giving more guarantees to the callee.

Alternative: another plausible alternative is to express CBV-ness in the type: see #23890.

Guaranteeing the CBV properties

Background: Making a fast curry, If we have an unknown call (f a b) (where a and b are pointers), we put f, a and b in registers and call the RTS function stg_ap_pp. That stg_ap_pp evaluates f, looks at its arity, and calls its fast entry point passing exactly the right number of arguments.

Idea: currently in GHC the infer-tags-and-rewrite pass does eta-expansion to guarantee the CBV entry convention. This seems a bit inconsistent. A more consistent story might be:

Plan A: Always eta-expand every call of a known function, to guarantee its entry convention.
Plan B: Generate a "CBV wrapper" for every function, and call that when the caller can't statically guarantee to follow the funcion's CBV conventions.

It amounts to whether we generate one blob of wrapper code, or a per-call blob of wrapper code. The latter is more voluminous, but can take advantage of call-site knowledge. With -O we already largely do this, I think: e.g. we eta-expand under-saturated calls, I think, to avoid PAPs.

Notice that: Plan A requires us to always-expose the calling convention in the interface file; Plan B does not.

Side note: the existing "slow entry point". When function f doesn't have one of a fixed set of standard argument patterns (GHC.StgToCmm.Layout.stdPattern returns Nothing) GHC generates a "slow entry point" f_slow for f that seems to be used ony for the return from a call to the garbage collector. In the days of push/enter these slow entry points played a larger role. End of note

Implementation notes

The calling convention for an Id is pretty much expressed in its LambdaFormInfo.
Sadly, CAFFy-ness is handled separately right now.
The number of arguments in the calling convention might not be the same as the Core Arity, because of unarisation. See GHC.Type.Id.idFunRepArity. The current sitaution is delicate because it has to predict the choices made by unarisation.
A key function is GHC.StgToCmm.Closure.importedIdLFInfo.
Serialising key calling convention information would not be expensive:
- A few bits for arity or representation-arity
- A bit for CAFFy-ness
- A few bits for CBV-info (often empty)
- A bit for data-con shape Very cheap compared to the representation of the type, which we always serialise.

More implementation notes from Sebastian

@simonpj and I came to the conclusion (see #23848 (comment 531525)) that a holistic solution would be to treat call-by-value as the calling convention it really is; as part of GHC's ABI it must be manifest in interface files through exported LambdaFormInfo (not IdDetails like the current CbV mechanism) and any client must adhere to it (-O0 or not, GHC or ... the next version of GHC). Essentially, we want to do the same for levity mismatch of arguments as we do for arity mismatches: Whenever we encounter a mismatch, generate a slow call that will do the necessary PAP'ing/eval'ing.

From my very rough understanding of everything in StgTo* and below, this is the minimum number of steps required to integrate call-by-value/unlifted arg reps:

Change GHC.StgToCmm.ArgRep.ArgRep: Split P into L and U for lifted and unlifted GCd pointers, respectively.
That directly prompts a change for slowCallPattern, which is the interface between GHC and the generic apply functions generated by utils/genapply.
Change utils/genapply/Main.hs:ArgRep in the same way as (1). The entry code for a U should simply seq the argument.
Find the right adjustment to utils/genapply/Main.hs:applyTypes and utils/genapply/Main.hs:stackApplyTypes that catches 99% of all call sites in reasonable space. To get things running, no change should be needed (but the resulting code will either be huge or slow). Fine-tuning the exact set of apply patterns informed by benchmarks is the bulk of the change, I think, and deserves its own section below.
Fill in slowCallPattern so that it matches applyTypes. Also fix other functions in GHC.StgToCmm.ArgRep, the most important being toArgRep which turns a PrimRep into an ArgRep. We must split the BoxedRep case in two: One for Just Lifted and one for Just Unlifted. Furthermore, it is no longer enough to look at
Follow references of today's P and see which need fixes. GHC.StgToCmm.Layout: argBits (U is also GCd), stdPattern. Also some in the ByteCode interpreter.
Now I'd work from GHC.StgToCmm.Expr.cgIdApp downwards. Specifically, we need a new LambdaFormInfo concept of CbV to communicate the information to GHC.StgToCmm.Closure.getCallMethod. The latter needs to emit a slow call not only on arity mismatch, but also when one of the args is not unlifted but goes into an unlifted parameter of the callee. Slow call will go through genapply funs IIRC which will do the proper seqs. We can tell whether an arg is unlifted by looking at the idDemandInfo of the StgArg. We must be able to tell whether a parameter of the callee is expected unlifted by means of its ReEntrant :: ... -> LambdaFormInfo. Just as that constructor lists the callee's arity, we must augment it with the function's "CbV marks". A good place to do so would the ArgDescr; in fact, for usual apply patterns, the info is already present in ArgSpec. For the ArgGen case, instead of a type Liveness = [Bool] bitmap, we must embed data Stuff = LiftedP | UnliftedP | Dead; type Liveness = [Stuff] (TODO: better name).

As outlined in (5) above, we need to re-run benchmarks to find a good set of generic apply patterns. What a generic apply function is and the process to find good candidates has been described in Section 6.2 and 8.2 of https://www.microsoft.com/en-us/research/publication/make-fast-curry-pushenter-vs-evalapply/. Perhaps it's a good idea to extend the benchmark suite beyond NoFib for this purpose, as it displays a relative poverty of unboxed and unlifted types.

Edited Oct 31, 2023 by Simon Peyton Jones

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information