Implement Call-by-value convention via `ArgRep`/`LambdaFormInfo`
Summary
This ticket tracks a proposal to retire tag inference in favor of a mechanism integrating levity into the calling convention.
Motivation
In #23848 we were investigating ways in which we could prevent tag inference from generating thunks to insert field seqs on strictly used binders. The workaround to pass -fworker-wrapper-cbv
could generally lead to regressions because of rules #20364 and is generally an unsatisfying mechanism.
Furthermore, "Tag inference" is not some kind of analysis run with -O1, but at this point it is an integral part of the code generation pipeline, even with -O0
. Compile a client module without running tag inference and you get crashes. In light of this, it should better be named StgPrep
or some such.
Clarification: CbV, levity and types
Note that when a function is call-by-value, that expresses a precondition: Callers of the function are required to evaluate lifted arguments before calling the function, and the callee is permitted to assume that the argument has been evaluated.
Evaluating a lifted argument before passing it is effectively like passing an unlifted pointer as argument. We ultimately hope to make this precondition precise in STG's and Core's type system (#19530, #19767, #23890), but it is not a concern of this issue. Rather what we discuss here is necessary ground work to work around the annoyances in our current tag inference situation, because unlifted pointers are always properly tagged.
Proposal
@simonpj and I came to the conclusion (see #23848 (comment 531525)) that a holistic solution would be to treat call-by-value as the calling convention it really is; as part of GHC's ABI it must be manifest in interface files through exported LambdaFormInfo
(not IdDetails
like the current CbV mechanism) and any client must adhere to it (-O0 or not, GHC or ... the next version of GHC). Essentially, we want to do the same for levity mismatch of arguments as we do for arity mismatches: Whenever we encounter a mismatch, generate a slow call that will do the necessary PAP'ing/eval'ing.
From my very rough understanding of everything in StgTo*
and below, this is the minimum number of steps required to integrate call-by-value/unlifted arg reps:
- Change
GHC.StgToCmm.ArgRep.ArgRep
: SplitP
intoL
andU
for lifted and unlifted GCd pointers, respectively. - That directly prompts a change for
slowCallPattern
, which is the interface between GHC and the generic apply functions generated byutils/genapply
. - Change
utils/genapply/Main.hs:ArgRep
in the same way as (1). The entry code for aU
should simply seq the argument. - Find the right adjustment to
utils/genapply/Main.hs:applyTypes
andutils/genapply/Main.hs:stackApplyTypes
that catches 99% of all call sites in reasonable space. To get things running, no change should be needed (but the resulting code will either be huge or slow). Fine-tuning the exact set of apply patterns informed by benchmarks is the bulk of the change, I think, and deserves its own section below. - Fill in
slowCallPattern
so that it matchesapplyTypes
. Also fix other functions inGHC.StgToCmm.ArgRep
, the most important beingtoArgRep
which turns aPrimRep
into anArgRep
. We must split theBoxedRep
case in two: One forJust Lifted
and one forJust Unlifted
. Furthermore, it is no longer enough to look at - Follow references of today's
P
and see which need fixes.GHC.StgToCmm.Layout
:argBits
(U
is also GCd),stdPattern
. Also some in the ByteCode interpreter. - Now I'd work from
GHC.StgToCmm.Expr.cgIdApp
downwards. Specifically, we need a newLambdaFormInfo
concept of CbV to communicate the information toGHC.StgToCmm.Closure.getCallMethod
. The latter needs to emit a slow call not only on arity mismatch, but also when one of the args is not unlifted but goes into an unlifted parameter of the callee. Slow call will go through genapply funs IIRC which will do the proper seqs. We can tell whether an arg is unlifted by looking at theidDemandInfo
of theStgArg
. We must be able to tell whether a parameter of the callee is expected unlifted by means of itsReEntrant :: ... -> LambdaFormInfo
. Just as that constructor lists the callee's arity, we must augment it with the function's "CbV marks". A good place to do so would theArgDescr
; in fact, for usual apply patterns, the info is already present inArgSpec
. For theArgGen
case, instead of atype Liveness = [Bool]
bitmap, we must embeddata Stuff = LiftedP | UnliftedP | Dead; type Liveness = [Stuff]
(TODO: better name).
Effect
With these changes, we will generate a slow call for a function like this:
module A where
data T a = T !a
g :: T Int -> Int
g (T n) = n -- g is strict, hence expects its arg unlifted
module B where
f :: T Int -> Bool -> Int
f a True = g a -- a is not used strictly, hence it is lifted. This will generate a slow call to g
f _ False = 0
Compiling A
with -O1, g
is detected strict and expects its argument unlifted.
Compiling B
with -O0, we see that the caller f
is not strict in a
(even with -O1
) but that it passes a
on to g
.
Since a
is lifted, there is an levity mismatch with the LambdaFormInfo
of g
(the exact info is in the imported ArgSpec ARG_U :: ArgDescr
, presumably) -- just like an arity mismatch, this forces a slow call to g
.
Now, tag inference would simply insert a seq prior to calling g
, thus guaranteeing compatible unliftedness and eliminating the slow call.
Note that there are no thunks involved whatsoever! No #23848. (Edit: This is actually untrue in case g
is a strict DataCon worker like T
and let bound. However there are no avoidable thunks such as in #23848.) In fact this transformation is so simple that it could be carried out in CorePrep already, or perhaps as part of cgIdApp
. Kind of like eta expansion, but for levity (and without introducing new closures).
Generating generic apply functions
As outlined in (5) above, we need to re-run benchmarks to find a good set of generic apply patterns. What a generic apply function is and the process to find good candidates has been described in Section 6.2 and 8.2 of https://www.microsoft.com/en-us/research/publication/make-fast-curry-pushenter-vs-evalapply/. Perhaps it's a good idea to extend the benchmark suite beyond NoFib for this purpose, as it displays a relative poverty of unboxed and unlifted types.