WIP: Introduce with# primop (option B)
This introduces a new primop, with#
, which addresses the problems suffered by touch#
, as described in #17760 (closed) and https://gitlab.haskell.org/ghc/ghc/wikis/proposal/with-combinator.
The approach taken here is what is labelled as Option B on the Wiki page. Specifically, the plan is to retain touch#
but heavily discourage its use by user code. We then introduce with#
which gets rewritten into touch#
during Core Prep. Specifically, when we see an application of the form:
with# x k s
We rewrite it to
case k s of (# s', y #) ->
case touch# x s' of s'' ->
(# s'', y #)
This ensures that the simplifier cannot drop the continuation containing the touch#
if k
diverges (which was the cause of #14346 (closed) and #17746). Moreover, by simply lowering with#
to touch#
we avoid incurring any runtime overhead (unlike the other options presented on the Wiki page).
TODO
-
Verify that this doesn't incur any unexpected runtime costs -
Add tests -
Finish writing Note -
Add changelog entry -
Note change in release notes -
Discourage use of touch#
in Haddocks -
Document with#
Merge request reports
Activity
added primops label
assigned to @bgamari
changed milestone to %9.0.1
added 18 commits
-
85d077c3...abc02b40 - 17 commits from branch
master
- c68e753b - Introduce with#
-
85d077c3...abc02b40 - 17 commits from branch
mentioned in merge request !2933 (closed)
For the record, this is the Core generated by
Foreign.Marshal.Alloc.allocaBytes
Foreign.Marshal.Alloc.allocaBytes1 :: forall {a} {b}. GHC.Types.Int -> (GHC.Ptr.Ptr a -> GHC.Types.IO b) -> GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, b #) [GblId, Arity=3, Str=<S,1*U(U)><L,1*C1(U)><L,U>, Unf=OtherCon []] Foreign.Marshal.Alloc.allocaBytes1 = \ (@a_a2pq) (@b_a2pr) (ds_s34m [Occ=Once!] :: GHC.Types.Int) (action_s34n [Occ=Once!] :: GHC.Ptr.Ptr a_a2pq -> GHC.Types.IO b_a2pr) (eta_s34o [Occ=Once] :: GHC.Prim.State# GHC.Prim.RealWorld) -> case ds_s34m of { GHC.Types.I# size_s34q [Occ=Once] -> case GHC.Prim.newPinnedByteArray# @GHC.Prim.RealWorld size_s34q eta_s34o of { (# ipv_s34s [Occ=Once], ipv1_s34t [Occ=Once] #) -> case GHC.Prim.unsafeFreezeByteArray# @GHC.Prim.RealWorld ipv1_s34t ipv_s34s of { (# ipv2_s34v [Occ=Once], ipv3_s34w #) -> case GHC.Prim.byteArrayContents# ipv3_s34w of sat_s34B [Occ=Once] { __DEFAULT -> let { sat_s34C [Occ=Once] :: GHC.Ptr.Ptr a_a2pq [LclId] sat_s34C = GHC.Ptr.Ptr @a_a2pq sat_s34B } in case ((action_s34n sat_s34C) `cast` (...)) ipv2_s34v of { (# sat_s34E [Occ=Once], sat_s34F [Occ=Once] #) -> case GHC.Prim.touch# @'GHC.Types.UnliftedRep @GHC.Prim.ByteArray# ipv3_s34w sat_s34E of sat_s34G [Occ=Once] { __DEFAULT -> (# sat_s34G, sat_s34F #) } } } } } }
I believe this addresses @alexbiehl's point from !2933 (closed): we don't need to beta reduce ourselves.
Edited by Ben GamariSo after a few false starts, I finally have some ticky numbers that I trust. Here is a comparison of ticky profiles from GHC compiling
T5321FD
with-O
(comparing this branch, B, against its base commit, A):| Change | alloc A | alloc B | name | |------------|----------|----------|-------------------------------------------------------------------------------| | +4525568.0 | 0 | 4525568 | bytestring-0.10.9.0:Data.ByteString.Internal.$wcompareBytes (:<no module>) | | +238784.0 | 0 | 238784 | ghc:Pretty.$wlayLeft (:<no module>) | | +217952.0 | 108976 | 326928 | ghc:TysWiredIn.$wisBuiltInOcc_maybe (:<no module>) | | +40792.0 | 897424 | 938216 | $walexGetByte (ghc:Lexer) | | +37520.0 | 13680 | 51200 | $lgo5_gyzc (ghc:TcType) | | +26240.0 | 92792 | 119032 | ghc:GHC.Iface.Recomp.addFingerprints1 (:<no module>) | | +15816.0 | 1024 | 16840 | $lgo28_gZHQ (ghc:GHC.Core.Type) | | +13680.0 | 0 | 13680 | $lgo5_gyzf (ghc:TcType) | | +9168.0 | 0 | 9168 | ghc:StringBuffer.$wcurrentChar (:<no module>) | | +9160.0 | 192 | 9352 | ghc:Demand.$w$cget4 (:<no module>) | | +9088.0 | 5680 | 14768 | $lgo1_gfik (ghc:GHC.CmmToAsm.CFG.Dominators) | | +7192.0 | 512 | 7704 | ghc:Demand.$w$cget3 (:<no module>) | | +6144.0 | 847904 | 854048 | base:GHC.List.$wbreak (:<no module>) | | +3472.0 | 0 | 3472 | sat_sHyP (ghc:GHC.CmmToAsm.X86.Ppr) | | +3280.0 | 2312 | 5592 | ghc:GHC.Iface.Syntax.$w$cget9 (:<no module>) | | +2160.0 | 0 | 2160 | base:GHC.IO.FD.$fBufferedIOFD17 (:<no module>) | | +2152.0 | 48 | 2200 | sat_sttq (ghc:GHC.Rename.Pat) | | +1856.0 | 384 | 2240 | sat_s8sc (ghc:SysTools.Settings) | | +1344.0 | 3615360 | 3616704 | base:GHC.Base.++ (:<no module>) | | +1024.0 | 0 | 1024 | $lgo28_gZHP (ghc:GHC.Core.Type) | | +960.0 | 192 | 1152 | sat_sIdm (ghc:TcBinds) |
The fact that
compareBytes
suggests thatwithForeignPtr
isn't being optimised as I would expect. Investigating.Edited by Ben GamariI believe I see the issue here; the strictness signature of
with#
(to be renamed tokeepAlive#
) is far too loose. It needs to place a call demand on the continuation at very least.However, this poses an interesting conundrum: Consider this code which one would have written before
keepAlive#
(assume thatdoSomething
is strict in its argument):f x = do doSomething x touch x
Refactored to use
keepAlive#
this would look like,f x = do keepAlive x (doSomething x)
If
keepAlive#
only places a demand on its continuation argument then the strictness signature off
will have changed: previously GHC would be able to see thatx
is demanded. However, in the case of the refactored code this fact is hidden.I suspect the demand analyser will need a special case to ensure that a
keepAlive#
application has the same demands as its continuation.Unfortunately placing a call demand on the continuation isn't quite sufficient. Specifically, consider this program:
module Test where import Data.ByteString.Internal hiding (compareBytes) import Foreign.Storable import Foreign.ForeignPtr import Foreign.Ptr test :: ForeignPtr Word -> IO () test fp = do withForeignPtr fp $ \p -> do x <- peek p if (x > 0) then putStrLn "x" else putStrLn "y"
With
touch#
we end up with the following STG (the Core isn't appreciably different):Test.$wtest [InlPrag=NOUSERINLINE[2]] :: GHC.Prim.Addr# -> GHC.ForeignPtr.ForeignPtrContents -> GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #) [GblId, Arity=3, Str=<L,U><L,U><L,U>, Unf=OtherCon []] = \r [ww ww1 w] case readWordOffAddr# [ww 0# w] of { (#,#) ipv [Occ=Once*] ipv1 [Occ=Once] -> case gtWord# [ipv1 0##] of { __DEFAULT -> case GHC.IO.Handle.Text.hPutStr' GHC.IO.Handle.FD.stdout Test.test4 GHC.Types.True ipv of { (#,#) ipv2 [Occ=Once] ipv3 [Occ=Once] -> case touch# [ww1 ipv2] of s' [Occ=Once] { __DEFAULT -> (#,#) [s' ipv3]; }; }; 1# -> case hPutStr' GHC.IO.Handle.FD.stdout Test.test2 GHC.Types.True ipv of { (#,#) ipv2 [Occ=Once] ipv3 [Occ=Once] -> case touch# [ww1 ipv2] of s' [Occ=Once] { __DEFAULT -> (#,#) [s' ipv3]; }; }; }; };
Very nice; no allocation, the
touch#
s are pushed into the case analysis.However, with
keepAlive#
we see this simplified Core,Test.$wtest = \ (ww :: GHC.Prim.Addr#) (ww1 :: GHC.ForeignPtr.ForeignPtrContents) (w :: GHC.Prim.State# GHC.Prim.RealWorld) -> GHC.Prim.with# @'GHC.Types.LiftedRep @GHC.ForeignPtr.ForeignPtrContents @'GHC.Types.LiftedRep @() ww1 (\ (s [OS=OneShot] :: GHC.Prim.State# GHC.Prim.RealWorld) -> case GHC.Prim.readWordOffAddr# @GHC.Prim.RealWorld ww 0# s of { (# ipv, ipv1 #) -> case GHC.Prim.gtWord# ipv1 0## of { __DEFAULT -> ((hPutStr' stdout Test.test4 GHC.Types.True) `cast` ...) ipv; 1# -> ((hPutStr' stdout Test.test2 GHC.Types.True) `cast` ...) ipv } }) w
There is nothing terribly surprising about this yet, since
with#
is only lowered during Core Prep. Specifically, we lower applications of the formwith# @a @r @b x k s0 ~> case k s of { (# y, s1 #) -> case touch# @a x s1 of s2 -> (# y, s2 #)
However, this lowering is being transformed to rather allocate
k
:Test.$wtest = \ (ww [Occ=Once] :: GHC.Prim.Addr#) (ww1 [Occ=Once] :: GHC.ForeignPtr.ForeignPtrContents) (w [Occ=Once] :: GHC.Prim.State# GHC.Prim.RealWorld) -> let { sat [Occ=Once!] :: GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #) [LclId] sat = \ (s [Occ=Once, OS=OneShot] :: GHC.Prim.State# GHC.Prim.RealWorld) -> case GHC.Prim.readWordOffAddr# @GHC.Prim.RealWorld ww 0# s of { (# ipv [Occ=Once*], ipv1 [Occ=Once] #) -> case GHC.Prim.gtWord# ipv1 0## of { __DEFAULT -> ((hPutStr' stdout Test.test4 GHC.Types.True) `cast` ...) ipv; 1# -> ((hPutStr' stdout Test.test2 GHC.Types.True) `cast` ...) ipv } } } in case sat w of { (# sat [Occ=Once], sat [Occ=Once] #) -> case GHC.Prim.touch# @'GHC.Types.LiftedRep @GHC.ForeignPtr.ForeignPtrContents ww1 sat of sat [Occ=Once] { __DEFAULT -> (# sat, sat #) } }
This suggests that my original conclusion of not needing manual beta reduction was incorrect. I wonder how this case differs from
allocaBytes
.We agreed on the phone that the current code in
Prep
for runRW is thiscpe_app env (Var f) [CpeApp _runtimeRep@Type{}, CpeApp _type@Type{}, CpeApp arg] 1 | f `hasKey` runRWKey -- See Note [runRW magic] -- Replace (runRW# f) by (f realWorld#), beta reducing if possible (this -- is why we return a CorePrepEnv as well) = case arg of Lam s body -> cpe_app (extendCorePrepEnv env s realWorldPrimId) body [] 0 _ -> cpe_app env arg [CpeApp (Var realWorldPrimId)] 1
and its magic beta reduction only happens for
runRW
. We need to generalise this forkeepAlive#
(let's change the name!), but that probably isn't hard.I have augmented the wiki page with pointers to some relevant
runRW#
tickets.Indeed the beta reduction fixes many of the issues. Those issues are remain appear to be due to
with#
hiding CPR opportunities. For instance,Buffer.readWord8Buf
call-sites now tend to produce a boxed result (e.g.GHC.IO.Encoding.UTF8.mkUTF8
). I'm going to look into what can be done about this tomorrow.Edit: This comment is not quite the full story. Nested CPR is in fact not necessary. See the comment that follows.
So the CPR issues here are a bit hairy. First, recall the type of
with#
,with# :: forall (a_rep :: RuntimeRep) (a :: TYPE a_rep) (r_rep :: RuntimeRep) (r :: TYPE r_rep). (a -> (# State# RealWorld, r #)) -> a -> State# RealWorld -> (# State# RealWorld, r #)
Specifically, imagine you have something like the following (see $1597 for a complete example),
case with# @'LiftedPtrRep @Int @_ ba (\s -> (# s, I# 42# #)) _ _ of (# s', I# n #) -> ...
Ideally we would like to transform this to:
case with# @'IntRep @Int# @_ ba (\s -> (# s, 42 #)) _ _ of (# s', n# #) -> ...
Speaking with @sgraf812, we believe this should be within the capabilities of his on-going nested CPR work. He's currently in the process of rebasing his branch (!1866 (closed)); once he is done I will rebase this on top and test again. I believe with nested CPR this change should be performance-neutral.
Edited by Ben GamariAs noted above, nested CPR is only half of the story. In order for worker-wrapper to actually eliminate the allocation here we would need to do a non-trivial transformation to the program (incidentally, the same transformation described in #15127 (comment 211427)). That is, starting with:
let f :: State# RealWorld -> (# State# RealWorld, Int #) f = \s -> (# s, I# 42# #) in case with# @'LiftedPtrRep @Int @_ ba f _ _ of (# s', n #) -> case n of I# n# -> ...
We somehow need to bring the case analysis on the
I#
into contact with its allocation site inf
. This might look something like this:let f :: State# RealWorld -> (# State# RealWorld, Int #) f = \s -> (# s, I# 42# #) in case with# @'LiftedPtrRep @Int @_ ba (\s -> case f s of (# s', n #) -> case n of I# n# -> (# s', n# #) ) _ _ of (# s', n# #) -> ...
And then inlining
f
(or, iff
is large and we have nested CPR, the wrapper off
) would allow case-of-known-constructor to fire, eliminating theI#
. However, making this jump is non-trivial and would almost certainly require a special case in the simplifier just forwith#
.Assuming that we are stuck with the current performance characteristics of
with#
, we can mitigate much of the impact by introducing a set of interfaces as suggested by #4096, which combine anAddr#
read/write and the "keep-alive" behavior oftouch#
. These can even be written in terms ofwith#
.Unfortunately, this doesn't fix the problem of operations like
withForeignPtr
, which requires that the user produce a boxed result. IfwithForeignPtr
is implemented in terms ofwith#
such boxing cannot be avoided as noted above. This is very unfortunate aswithForeignPtr
is a remarkably common operation in user code.Edited by Ben Gamari