Skip to content
Snippets Groups Projects

WIP: Introduce with# primop (option B)

Closed Ben Gamari requested to merge wip/with2-primop into master

This introduces a new primop, with#, which addresses the problems suffered by touch#, as described in #17760 (closed) and https://gitlab.haskell.org/ghc/ghc/wikis/proposal/with-combinator.

The approach taken here is what is labelled as Option B on the Wiki page. Specifically, the plan is to retain touch# but heavily discourage its use by user code. We then introduce with# which gets rewritten into touch# during Core Prep. Specifically, when we see an application of the form:

with# x k s

We rewrite it to

case k s of (# s', y #) ->
case touch# x s' of s'' ->
(# s'', y #)

This ensures that the simplifier cannot drop the continuation containing the touch# if k diverges (which was the cause of #14346 (closed) and #17746). Moreover, by simply lowering with# to touch# we avoid incurring any runtime overhead (unlike the other options presented on the Wiki page).

TODO

  • Verify that this doesn't incur any unexpected runtime costs
  • Add tests
  • Finish writing Note
  • Add changelog entry
  • Note change in release notes
  • Discourage use of touch# in Haddocks
  • Document with#
Edited by Ben Gamari

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Ben Gamari added 1 commit

    added 1 commit

    Compare with previous version

  • Ben Gamari changed title from WIP: Wip/with2 primop to WIP: Introduce with# primop (option B)

    changed title from WIP: Wip/with2 primop to WIP: Introduce with# primop (option B)

  • Ben Gamari changed the description

    changed the description

  • added primops label

  • assigned to @bgamari

  • Ben Gamari changed milestone to %9.0.1

    changed milestone to %9.0.1

  • Ben Gamari added 18 commits

    added 18 commits

    Compare with previous version

  • Ben Gamari mentioned in merge request !2933 (closed)

    mentioned in merge request !2933 (closed)

  • Author Maintainer

    For the record, this is the Core generated by Foreign.Marshal.Alloc.allocaBytes

    Foreign.Marshal.Alloc.allocaBytes1
      :: forall {a} {b}.
         GHC.Types.Int
         -> (GHC.Ptr.Ptr a -> GHC.Types.IO b)
         -> GHC.Prim.State# GHC.Prim.RealWorld
         -> (# GHC.Prim.State# GHC.Prim.RealWorld, b #)
    [GblId, Arity=3, Str=<S,1*U(U)><L,1*C1(U)><L,U>, Unf=OtherCon []]
    Foreign.Marshal.Alloc.allocaBytes1
      = \ (@a_a2pq)
          (@b_a2pr)
          (ds_s34m [Occ=Once!] :: GHC.Types.Int)
          (action_s34n [Occ=Once!] :: GHC.Ptr.Ptr a_a2pq -> GHC.Types.IO b_a2pr)
          (eta_s34o [Occ=Once] :: GHC.Prim.State# GHC.Prim.RealWorld) ->
          case ds_s34m of { GHC.Types.I# size_s34q [Occ=Once] ->
          case GHC.Prim.newPinnedByteArray# @GHC.Prim.RealWorld size_s34q eta_s34o of
          { (# ipv_s34s [Occ=Once], ipv1_s34t [Occ=Once] #) ->
          case GHC.Prim.unsafeFreezeByteArray# @GHC.Prim.RealWorld ipv1_s34t ipv_s34s of
          { (# ipv2_s34v [Occ=Once], ipv3_s34w #) ->
          case GHC.Prim.byteArrayContents# ipv3_s34w of sat_s34B [Occ=Once]
          { __DEFAULT ->
          let {
            sat_s34C [Occ=Once] :: GHC.Ptr.Ptr a_a2pq
            [LclId]
            sat_s34C = GHC.Ptr.Ptr @a_a2pq sat_s34B } in
          case ((action_s34n sat_s34C) `cast` (...)) ipv2_s34v of
          { (# sat_s34E [Occ=Once], sat_s34F [Occ=Once] #) ->
          case GHC.Prim.touch# @'GHC.Types.UnliftedRep @GHC.Prim.ByteArray# ipv3_s34w sat_s34E of sat_s34G [Occ=Once]
          { __DEFAULT -> (# sat_s34G, sat_s34F #)
          } } } } } }

    I believe this addresses @alexbiehl's point from !2933 (closed): we don't need to beta reduce ourselves.

    Edited by Ben Gamari
  • Author Maintainer

    So after a few false starts, I finally have some ticky numbers that I trust. Here is a comparison of ticky profiles from GHC compiling T5321FD with -O (comparing this branch, B, against its base commit, A):

    | Change     | alloc A  | alloc B  | name                                                                          |
    |------------|----------|----------|-------------------------------------------------------------------------------|
    | +4525568.0 | 0        | 4525568  | bytestring-0.10.9.0:Data.ByteString.Internal.$wcompareBytes (:<no module>)    |
    | +238784.0  | 0        | 238784   | ghc:Pretty.$wlayLeft (:<no module>)                                           |
    | +217952.0  | 108976   | 326928   | ghc:TysWiredIn.$wisBuiltInOcc_maybe (:<no module>)                            |
    | +40792.0   | 897424   | 938216   | $walexGetByte (ghc:Lexer)                                                     |
    | +37520.0   | 13680    | 51200    | $lgo5_gyzc (ghc:TcType)                                                       |
    | +26240.0   | 92792    | 119032   | ghc:GHC.Iface.Recomp.addFingerprints1 (:<no module>)                          |
    | +15816.0   | 1024     | 16840    | $lgo28_gZHQ (ghc:GHC.Core.Type)                                               |
    | +13680.0   | 0        | 13680    | $lgo5_gyzf (ghc:TcType)                                                       |
    | +9168.0    | 0        | 9168     | ghc:StringBuffer.$wcurrentChar (:<no module>)                                 |
    | +9160.0    | 192      | 9352     | ghc:Demand.$w$cget4 (:<no module>)                                            |
    | +9088.0    | 5680     | 14768    | $lgo1_gfik (ghc:GHC.CmmToAsm.CFG.Dominators)                                  |
    | +7192.0    | 512      | 7704     | ghc:Demand.$w$cget3 (:<no module>)                                            |
    | +6144.0    | 847904   | 854048   | base:GHC.List.$wbreak (:<no module>)                                          |
    | +3472.0    | 0        | 3472     | sat_sHyP (ghc:GHC.CmmToAsm.X86.Ppr)                                           |
    | +3280.0    | 2312     | 5592     | ghc:GHC.Iface.Syntax.$w$cget9 (:<no module>)                                  |
    | +2160.0    | 0        | 2160     | base:GHC.IO.FD.$fBufferedIOFD17 (:<no module>)                                |
    | +2152.0    | 48       | 2200     | sat_sttq (ghc:GHC.Rename.Pat)                                                 |
    | +1856.0    | 384      | 2240     | sat_s8sc (ghc:SysTools.Settings)                                              |
    | +1344.0    | 3615360  | 3616704  | base:GHC.Base.++ (:<no module>)                                               |
    | +1024.0    | 0        | 1024     | $lgo28_gZHP (ghc:GHC.Core.Type)                                               |
    | +960.0     | 192      | 1152     | sat_sIdm (ghc:TcBinds)                                                        |

    The fact that compareBytes suggests that withForeignPtr isn't being optimised as I would expect. Investigating.

    Edited by Ben Gamari
  • Author Maintainer

    I believe I see the issue here; the strictness signature of with# (to be renamed to keepAlive#) is far too loose. It needs to place a call demand on the continuation at very least.

    However, this poses an interesting conundrum: Consider this code which one would have written before keepAlive# (assume that doSomething is strict in its argument):

    f x = do
        doSomething x
        touch x

    Refactored to use keepAlive# this would look like,

    f x = do
        keepAlive x (doSomething x)

    If keepAlive# only places a demand on its continuation argument then the strictness signature of f will have changed: previously GHC would be able to see that x is demanded. However, in the case of the refactored code this fact is hidden.

    I suspect the demand analyser will need a special case to ensure that a keepAlive# application has the same demands as its continuation.

  • I hope not a special case. I think you may just need a suitable strictness signature for keepAlive#. (With a call demand on its argument.) cf uses of strictApplyDmd in primoops.txt.pp. @sgraf may be able to help here.

  • Author Maintainer

    Ahh, sure; I suppose just a call demand should do.

  • Author Maintainer

    Unfortunately placing a call demand on the continuation isn't quite sufficient. Specifically, consider this program:

    module Test where
    
    import Data.ByteString.Internal hiding (compareBytes)
    import Foreign.Storable
    import Foreign.ForeignPtr
    import Foreign.Ptr
    
    test :: ForeignPtr Word -> IO ()
    test fp = do
      withForeignPtr fp $ \p ->  do
        x <- peek p
        if (x > 0) then putStrLn "x" else putStrLn "y"

    With touch# we end up with the following STG (the Core isn't appreciably different):

    Test.$wtest [InlPrag=NOUSERINLINE[2]]
      :: GHC.Prim.Addr#
         -> GHC.ForeignPtr.ForeignPtrContents
         -> GHC.Prim.State# GHC.Prim.RealWorld
         -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
    [GblId, Arity=3, Str=<L,U><L,U><L,U>, Unf=OtherCon []] =
        \r [ww ww1 w]
            case readWordOffAddr# [ww 0# w] of {
            (#,#) ipv [Occ=Once*] ipv1 [Occ=Once] ->
            case gtWord# [ipv1 0##] of {
              __DEFAULT ->
                  case GHC.IO.Handle.Text.hPutStr' GHC.IO.Handle.FD.stdout Test.test4 GHC.Types.True ipv of {
                  (#,#) ipv2 [Occ=Once] ipv3 [Occ=Once] ->
                    case touch# [ww1 ipv2] of s' [Occ=Once] { __DEFAULT -> (#,#) [s' ipv3]; };
                  };
              1# ->
                  case hPutStr' GHC.IO.Handle.FD.stdout Test.test2 GHC.Types.True ipv of {
                  (#,#) ipv2 [Occ=Once] ipv3 [Occ=Once] ->
                    case touch# [ww1 ipv2] of s' [Occ=Once] { __DEFAULT -> (#,#) [s' ipv3]; };
                  };
            }; 
            };

    Very nice; no allocation, the touch#s are pushed into the case analysis.

    However, with keepAlive# we see this simplified Core,

    Test.$wtest
      = \ (ww :: GHC.Prim.Addr#)
          (ww1 :: GHC.ForeignPtr.ForeignPtrContents)
          (w :: GHC.Prim.State# GHC.Prim.RealWorld) ->
          GHC.Prim.with#
            @'GHC.Types.LiftedRep
            @GHC.ForeignPtr.ForeignPtrContents
            @'GHC.Types.LiftedRep
            @()
            ww1
            (\ (s [OS=OneShot] :: GHC.Prim.State# GHC.Prim.RealWorld) ->
               case GHC.Prim.readWordOffAddr# @GHC.Prim.RealWorld ww 0# s of
               { (# ipv, ipv1 #) ->
               case GHC.Prim.gtWord# ipv1 0## of {
                 __DEFAULT -> ((hPutStr' stdout Test.test4 GHC.Types.True) `cast` ...) ipv;
                 1#        -> ((hPutStr' stdout Test.test2 GHC.Types.True) `cast` ...) ipv
               }
               })
            w

    There is nothing terribly surprising about this yet, since with# is only lowered during Core Prep. Specifically, we lower applications of the form

    with# @a @r @b x k s0
    
        ~>
    
    case k s of { (# y, s1 #) ->
      case touch# @a x s1 of s2 -> (# y, s2 #)

    However, this lowering is being transformed to rather allocate k:

    Test.$wtest
      = \ (ww [Occ=Once] :: GHC.Prim.Addr#)
          (ww1 [Occ=Once] :: GHC.ForeignPtr.ForeignPtrContents)
          (w [Occ=Once] :: GHC.Prim.State# GHC.Prim.RealWorld) ->
          let {
            sat [Occ=Once!]
              :: GHC.Prim.State# GHC.Prim.RealWorld
                 -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
            [LclId]
            sat
              = \ (s [Occ=Once, OS=OneShot]
                     :: GHC.Prim.State# GHC.Prim.RealWorld) ->
                  case GHC.Prim.readWordOffAddr# @GHC.Prim.RealWorld ww 0# s of
                  { (# ipv [Occ=Once*], ipv1 [Occ=Once] #) ->
                  case GHC.Prim.gtWord# ipv1 0## of {
                    __DEFAULT -> ((hPutStr' stdout Test.test4 GHC.Types.True) `cast` ...) ipv;
                    1# -> ((hPutStr' stdout Test.test2 GHC.Types.True) `cast` ...) ipv
                  }
                  } } in
          case sat w of { (# sat [Occ=Once], sat [Occ=Once] #) ->
          case GHC.Prim.touch# @'GHC.Types.LiftedRep @GHC.ForeignPtr.ForeignPtrContents ww1 sat
          of sat [Occ=Once]
          { __DEFAULT -> (# sat, sat #)
          }
          }

    This suggests that my original conclusion of not needing manual beta reduction was incorrect. I wonder how this case differs from allocaBytes.

  • We agreed on the phone that the current code in Prep for runRW is this

        cpe_app env (Var f) [CpeApp _runtimeRep@Type{}, CpeApp _type@Type{}, CpeApp arg] 1
            | f `hasKey` runRWKey
            -- See Note [runRW magic]
            -- Replace (runRW# f) by (f realWorld#), beta reducing if possible (this
            -- is why we return a CorePrepEnv as well)
            = case arg of
                Lam s body -> cpe_app (extendCorePrepEnv env s realWorldPrimId) body [] 0
                _          -> cpe_app env arg [CpeApp (Var realWorldPrimId)] 1

    and its magic beta reduction only happens for runRW. We need to generalise this for keepAlive# (let's change the name!), but that probably isn't hard.

  • I have augmented the wiki page with pointers to some relevant runRW# tickets.

  • Author Maintainer

    Indeed the beta reduction fixes many of the issues. Those issues are remain appear to be due to with# hiding CPR opportunities. For instance, Buffer.readWord8Buf call-sites now tend to produce a boxed result (e.g. GHC.IO.Encoding.UTF8.mkUTF8). I'm going to look into what can be done about this tomorrow.

  • Awesome, Ben! Keep going!

  • Author Maintainer

    Edit: This comment is not quite the full story. Nested CPR is in fact not necessary. See the comment that follows.

    So the CPR issues here are a bit hairy. First, recall the type of with#,

    with# :: forall (a_rep :: RuntimeRep) (a :: TYPE a_rep) 
                    (r_rep :: RuntimeRep) (r :: TYPE r_rep).
             (a -> (# State# RealWorld, r #))
          -> a
          -> State# RealWorld
          -> (# State# RealWorld, r #)

    Specifically, imagine you have something like the following (see $1597 for a complete example),

      case with# @'LiftedPtrRep @Int @_ ba (\s -> (# s, I# 42# #)) _ _ of
        (# s', I# n #) -> ...

    Ideally we would like to transform this to:

      case with# @'IntRep @Int# @_ ba (\s -> (# s, 42 #)) _ _ of
        (# s', n# #) -> ...

    Speaking with @sgraf812, we believe this should be within the capabilities of his on-going nested CPR work. He's currently in the process of rebasing his branch (!1866 (closed)); once he is done I will rebase this on top and test again. I believe with nested CPR this change should be performance-neutral.

    Edited by Ben Gamari
  • Author Maintainer

    As noted above, nested CPR is only half of the story. In order for worker-wrapper to actually eliminate the allocation here we would need to do a non-trivial transformation to the program (incidentally, the same transformation described in #15127 (comment 211427)). That is, starting with:

    let 
      f :: State# RealWorld -> (# State# RealWorld, Int #)
      f = \s -> (# s, I# 42# #)
    in case with# @'LiftedPtrRep @Int @_ ba f _ _ of
         (# s', n #) -> case n of I# n# -> ...

    We somehow need to bring the case analysis on the I# into contact with its allocation site in f. This might look something like this:

    let 
      f :: State# RealWorld -> (# State# RealWorld, Int #)
      f = \s -> (# s, I# 42# #)
    in case with# @'LiftedPtrRep @Int @_ ba 
              (\s -> case f s of (# s', n #) ->
                       case n of I# n# -> (# s', n# #)
              ) _ _ of
         (# s', n# #) -> ...

    And then inlining f (or, if f is large and we have nested CPR, the wrapper of f) would allow case-of-known-constructor to fire, eliminating the I#. However, making this jump is non-trivial and would almost certainly require a special case in the simplifier just for with#.

    Assuming that we are stuck with the current performance characteristics of with#, we can mitigate much of the impact by introducing a set of interfaces as suggested by #4096, which combine an Addr# read/write and the "keep-alive" behavior of touch#. These can even be written in terms of with#.

    Unfortunately, this doesn't fix the problem of operations like withForeignPtr, which requires that the user produce a boxed result. If withForeignPtr is implemented in terms of with# such boxing cannot be avoided as noted above. This is very unfortunate as withForeignPtr is a remarkably common operation in user code.

    Edited by Ben Gamari
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
Please register or sign in to reply
Loading