Convey ABI/calling-convention between modules (even with -O0)
Convey ABI/calling-convention between modules (even with -O0)
In #23848 we were investigating ways in which we could prevent tag inference from generating thunks to insert field seqs on strictly used binders. The workaround to pass -fworker-wrapper-cbv
could generally lead to regressions because of rules #20364 and is generally an unsatisfying mechanism.
Furthermore, "Tag inference" is not some kind of analysis run with -O1, but at this point it is an integral part of the code generation pipeline, even with -O0
. Compile a client module without running tag inference and you get crashes. In light of this, it should better be named StgPrep
or some such.
See also
- #21792: make everything BoxedRep have enter code
The situation today
When we export a function (or data value) in an interface file, we always serialise its type. Of course! But we do not currently always serialise
- Its arity
- Its CAFFy-ness
- Whether it is a data constructor application
- The CBV (call by value) properties of its arguments
And even if we do serialise this info, if we compile an importing module with -O0,
or -fignore-iface-pragmas
, we may simply skip the info.
Up to now that has been acceptable, but it has a big downside: an importing module must (be able to) make pessimistic assumptions about all of these things.
- For arity, make a "slow call"; so there must be a version we can slow-call.
- For CAFFY-ness, assume that it may have CAFs. That bloats the CAF-ref tables in the importing module; a bit annoying when -O0 is meant to compile fast.
- For data constructor applications it means that the importing module may need to add a redundant eval. e.g.
module X where { x = Just True } module A where { import X; data T = MkT !a; y = MkT x }
y
we may need to add an eval forx
unless we can "see" that it is evaluated and properly tagged. - For CBV-properties, see the next section.
Collectively, let's we call these properties of an export x
the STG-ABI of x
.
The STG-ABI of a function is a bit like the standard definition of "application binary inerface"](https://en.wikipedia.org/wiki/Application_binary_interface), but
STG-ABI is platform-independent. Even knowing the numbers and representations of each argument, there is a further step to map to the actual registers and calling conventoin of a particular machine architecture.
The central idea of this ticket is: let's always export the calling-convention/ABI.
Call-by-value properties
Consider
module X where
f x y z = case x of { True -> e1; False -> e2 }
module A where
import X( f )
g1 ys p q = f (h ys) q p
g2 x p q = Just (f x q p)
g3 ys = map f ys
(we suppose that f
is too big to be inlined.)
With -O
, today:
-
The code for
f
will evaluatex
and branch on its tag. More precisely, it will- test
x
's pointer tag for "I am a thunk". If so, evaluate, if not fall through. - test
x
's pointer tag for 1 or 2, meaning True or False. The "evaluate" step involves savingy
andz
on the stack, and pushing a return address (which in turn means allocating an info table) and enteringx
; thene1
ande2
need to reloady
andz
from the stack.
- test
-
The code for
g1
will use call-by-value. It will push a return address, call(h ys)
, take the returned pointer, and pass it tof
. Note that in this call,f
's test for evaluated-ness is unnecessary, because the argument is always evaluated. -
On the other hand, the call to
f
ing2
will pass a possibly-unevaluatedx
(note thatg2
is lazy), so then the evaluated-ness test inf
plays a role. Similarly, ing3
, our functionf
will be called on possibly-unevaluated elements ofys
Idea: if we could expose f
as a call-by-value function, that absolutely requires its argument(s)
to be evaluated, then
- The code for
f
could omit the evaluated-ness test, the info-table, and the stores and loads fory
andz
. Big win! - The code for
g1
would be as now. - The code for
g2
would need to add an eval forx
, to satisfy f`'s calling convention:g2 x p q = Just (case x of x' -> f x q p)
- Similarly
g3
via eta-expansion:g3 ys = map (\x p q -> case x of x' -> f x' p q) ys
This idea places more obligations on the caller in exchange for giving more guarantees to the callee.
Alternative: another plausible alternative is to express CBV-ness in the type: see #23890.
Guaranteeing the CBV properties
Background: Making a fast curry,
If we have an unknown call (f a b)
(where a
and b
are pointers),
we put f
, a
and b
in registers and call the RTS function stg_ap_pp
.
That stg_ap_pp
evaluates f
, looks at its arity, and calls its fast entry point passing exactly the right number of arguments.
Idea: currently in GHC the infer-tags-and-rewrite pass does eta-expansion to guarantee the CBV entry convention. This seems a bit inconsistent. A more consistent story might be:
- Plan A: Always eta-expand every call of a known function, to guarantee its entry convention.
- Plan B: Generate a "CBV wrapper" for every function, and call that when the caller can't statically guarantee to follow the funcion's CBV conventions.
It amounts to whether we generate one blob of wrapper code, or a per-call blob of
wrapper code.
The latter is more voluminous, but can take advantage of call-site knowledge. With -O
we already
largely do this, I think: e.g. we eta-expand under-saturated calls, I think, to avoid PAPs.
Notice that: Plan A requires us to always-expose the calling convention in the interface file; Plan B does not.
Side note: the existing "slow entry point".
When function f
doesn't have one of a fixed set of standard argument patterns
(GHC.StgToCmm.Layout.stdPattern
returns Nothing
)
GHC generates a "slow entry point" f_slow
for f
that seems to be used ony for the return from a call to the garbage collector.
In the days of push/enter these slow entry points played a larger role. End of note
Implementation notes
- The calling convention for an Id is pretty much expressed in its
LambdaFormInfo
. - Sadly, CAFFy-ness is handled separately right now.
- The number of arguments in the calling convention might not be the same as the Core
Arity
, because of unarisation. SeeGHC.Type.Id.idFunRepArity
. The current sitaution is delicate because it has to predict the choices made by unarisation. - A key function is
GHC.StgToCmm.Closure.importedIdLFInfo
. - Serialising key calling convention information would not be expensive:
- A few bits for arity or representation-arity
- A bit for CAFFy-ness
- A few bits for CBV-info (often empty)
- A bit for data-con shape Very cheap compared to the representation of the type, which we always serialise.
More implementation notes from Sebastian
@simonpj and I came to the conclusion (see #23848 (comment 531525)) that a holistic solution would be to treat call-by-value as the calling convention it really is; as part of GHC's ABI it must be manifest in interface files through exported LambdaFormInfo
(not IdDetails
like the current CbV mechanism) and any client must adhere to it (-O0 or not, GHC or ... the next version of GHC). Essentially, we want to do the same for levity mismatch of arguments as we do for arity mismatches: Whenever we encounter a mismatch, generate a slow call that will do the necessary PAP'ing/eval'ing.
From my very rough understanding of everything in StgTo*
and below, this is the minimum number of steps required to integrate call-by-value/unlifted arg reps:
- Change
GHC.StgToCmm.ArgRep.ArgRep
: SplitP
intoL
andU
for lifted and unlifted GCd pointers, respectively. - That directly prompts a change for
slowCallPattern
, which is the interface between GHC and the generic apply functions generated byutils/genapply
. - Change
utils/genapply/Main.hs:ArgRep
in the same way as (1). The entry code for aU
should simply seq the argument. - Find the right adjustment to
utils/genapply/Main.hs:applyTypes
andutils/genapply/Main.hs:stackApplyTypes
that catches 99% of all call sites in reasonable space. To get things running, no change should be needed (but the resulting code will either be huge or slow). Fine-tuning the exact set of apply patterns informed by benchmarks is the bulk of the change, I think, and deserves its own section below. - Fill in
slowCallPattern
so that it matchesapplyTypes
. Also fix other functions inGHC.StgToCmm.ArgRep
, the most important beingtoArgRep
which turns aPrimRep
into anArgRep
. We must split theBoxedRep
case in two: One forJust Lifted
and one forJust Unlifted
. Furthermore, it is no longer enough to look at - Follow references of today's
P
and see which need fixes.GHC.StgToCmm.Layout
:argBits
(U
is also GCd),stdPattern
. Also some in the ByteCode interpreter. - Now I'd work from
GHC.StgToCmm.Expr.cgIdApp
downwards. Specifically, we need a newLambdaFormInfo
concept of CbV to communicate the information toGHC.StgToCmm.Closure.getCallMethod
. The latter needs to emit a slow call not only on arity mismatch, but also when one of the args is not unlifted but goes into an unlifted parameter of the callee. Slow call will go through genapply funs IIRC which will do the proper seqs. We can tell whether an arg is unlifted by looking at theidDemandInfo
of theStgArg
. We must be able to tell whether a parameter of the callee is expected unlifted by means of itsReEntrant :: ... -> LambdaFormInfo
. Just as that constructor lists the callee's arity, we must augment it with the function's "CbV marks". A good place to do so would theArgDescr
; in fact, for usual apply patterns, the info is already present inArgSpec
. For theArgGen
case, instead of atype Liveness = [Bool]
bitmap, we must embeddata Stuff = LiftedP | UnliftedP | Dead; type Liveness = [Stuff]
(TODO: better name).
As outlined in (5) above, we need to re-run benchmarks to find a good set of generic apply patterns. What a generic apply function is and the process to find good candidates has been described in Section 6.2 and 8.2 of https://www.microsoft.com/en-us/research/publication/make-fast-curry-pushenter-vs-evalapply/. Perhaps it's a good idea to extend the benchmark suite beyond NoFib for this purpose, as it displays a relative poverty of unboxed and unlifted types.