Implement Call-by-value convention via `ArgRep`/`LambdaFormInfo`

Summary

This ticket tracks a proposal to retire tag inference in favor of a mechanism integrating levity into the calling convention.

Motivation

In #23848 we were investigating ways in which we could prevent tag inference from generating thunks to insert field seqs on strictly used binders. The workaround to pass -fworker-wrapper-cbv could generally lead to regressions because of rules #20364 and is generally an unsatisfying mechanism.

Furthermore, "Tag inference" is not some kind of analysis run with -O1, but at this point it is an integral part of the code generation pipeline, even with -O0. Compile a client module without running tag inference and you get crashes. In light of this, it should better be named StgPrep or some such.

Clarification: CbV, levity and types

Note that when a function is call-by-value, that expresses a precondition: Callers of the function are required to evaluate lifted arguments before calling the function, and the callee is permitted to assume that the argument has been evaluated.

Evaluating a lifted argument before passing it is effectively like passing an unlifted pointer as argument. We ultimately hope to make this precondition precise in STG's and Core's type system (#19530, #19767, #23890), but it is not a concern of this issue. Rather what we discuss here is necessary ground work to work around the annoyances in our current tag inference situation, because unlifted pointers are always properly tagged.

Proposal

@simonpj and I came to the conclusion (see #23848 (comment 531525)) that a holistic solution would be to treat call-by-value as the calling convention it really is; as part of GHC's ABI it must be manifest in interface files through exported LambdaFormInfo (not IdDetails like the current CbV mechanism) and any client must adhere to it (-O0 or not, GHC or ... the next version of GHC). Essentially, we want to do the same for levity mismatch of arguments as we do for arity mismatches: Whenever we encounter a mismatch, generate a slow call that will do the necessary PAP'ing/eval'ing.

From my very rough understanding of everything in StgTo* and below, this is the minimum number of steps required to integrate call-by-value/unlifted arg reps:

Change GHC.StgToCmm.ArgRep.ArgRep: Split P into L and U for lifted and unlifted GCd pointers, respectively.
That directly prompts a change for slowCallPattern, which is the interface between GHC and the generic apply functions generated by utils/genapply.
Change utils/genapply/Main.hs:ArgRep in the same way as (1). The entry code for a U should simply seq the argument.
Find the right adjustment to utils/genapply/Main.hs:applyTypes and utils/genapply/Main.hs:stackApplyTypes that catches 99% of all call sites in reasonable space. To get things running, no change should be needed (but the resulting code will either be huge or slow). Fine-tuning the exact set of apply patterns informed by benchmarks is the bulk of the change, I think, and deserves its own section below.
Fill in slowCallPattern so that it matches applyTypes. Also fix other functions in GHC.StgToCmm.ArgRep, the most important being toArgRep which turns a PrimRep into an ArgRep. We must split the BoxedRep case in two: One for Just Lifted and one for Just Unlifted. Furthermore, it is no longer enough to look at
Follow references of today's P and see which need fixes. GHC.StgToCmm.Layout: argBits (U is also GCd), stdPattern. Also some in the ByteCode interpreter.
Now I'd work from GHC.StgToCmm.Expr.cgIdApp downwards. Specifically, we need a new LambdaFormInfo concept of CbV to communicate the information to GHC.StgToCmm.Closure.getCallMethod. The latter needs to emit a slow call not only on arity mismatch, but also when one of the args is not unlifted but goes into an unlifted parameter of the callee. Slow call will go through genapply funs IIRC which will do the proper seqs. We can tell whether an arg is unlifted by looking at the idDemandInfo of the StgArg. We must be able to tell whether a parameter of the callee is expected unlifted by means of its ReEntrant :: ... -> LambdaFormInfo. Just as that constructor lists the callee's arity, we must augment it with the function's "CbV marks". A good place to do so would the ArgDescr; in fact, for usual apply patterns, the info is already present in ArgSpec. For the ArgGen case, instead of a type Liveness = [Bool] bitmap, we must embed data Stuff = LiftedP | UnliftedP | Dead; type Liveness = [Stuff] (TODO: better name).

Effect

With these changes, we will generate a slow call for a function like this:

module A where
data T a = T !a
g :: T Int -> Int
g (T n) = n -- g is strict, hence expects its arg unlifted

module B where
f :: T Int -> Bool -> Int
f a True  = g a -- a is not used strictly, hence it is lifted. This will generate a slow call to g
f _ False = 0

Compiling A with -O1, g is detected strict and expects its argument unlifted.

Compiling B with -O0, we see that the caller f is not strict in a (even with -O1) but that it passes a on to g. Since a is lifted, there is an levity mismatch with the LambdaFormInfo of g (the exact info is in the imported ArgSpec ARG_U :: ArgDescr, presumably) -- just like an arity mismatch, this forces a slow call to g.

Now, tag inference would simply insert a seq prior to calling g, thus guaranteeing compatible unliftedness and eliminating the slow call. ~~Note that there are no thunks involved whatsoever! No #23848.~~ (Edit: This is actually untrue in case g is a strict DataCon worker like T and let bound. However there are no avoidable thunks such as in #23848.) In fact this transformation is so simple that it could be carried out in CorePrep already, or perhaps as part of cgIdApp. Kind of like eta expansion, but for levity (and without introducing new closures).

Generating generic apply functions

As outlined in (5) above, we need to re-run benchmarks to find a good set of generic apply patterns. What a generic apply function is and the process to find good candidates has been described in Section 6.2 and 8.2 of https://www.microsoft.com/en-us/research/publication/make-fast-curry-pushenter-vs-evalapply/. Perhaps it's a good idea to extend the benchmark suite beyond NoFib for this purpose, as it displays a relative poverty of unboxed and unlifted types.

Edited Oct 30, 2023 by Sebastian Graf

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information