Skip to content

Failure to recognise join point leads to slow down

The following (rather contrived) example has a join point j that isn't recognised as such:

{-# OPTIONS_GHC -O2 -fforce-recomp #-}

module Main (main, f) where

import System.Environment

f :: Int -> Int -> Int -> Int
f x y = let
             j f 0 = f
             j f n = j (f . (y +)) (n-1 :: Int)
         in
         case x of
           0 -> \_ -> 0
           1 -> j id 2
           x -> j id x

main = do
  (a:b:_) <- map read <$> getArgs
  flip mapM_ [0..a::Int] $ \i -> do
    f b i 0 `seq` return ()

This is the optimised Core of the worker of f (which is also called from main, as I verified:

Main.$wf
  = \ (ww_s30j :: GHC.Prim.Int#) (w_s30g :: Int) ->
      letrec {
        $wj_s30e [InlPrag=NOUSERINLINE[2], Occ=LoopBreaker]
          :: forall {c}. (Int -> c) -> GHC.Prim.Int# -> Int -> c
        [LclId, Arity=2, Str=<L,U><S,1*U>, Unf=OtherCon []]
        $wj_s30e
          = \ (@c_s307)
              (w1_s308 :: Int -> c_s307)
              (ww1_s30c :: GHC.Prim.Int#) ->
              case ww1_s30c of ds_X1 {
                __DEFAULT ->
                  $wj_s30e
                    @c_s307
                    (\ (x_a1Ps :: Int) -> w1_s308 (GHC.Num.$fNumInt_$c+ w_s30g x_a1Ps))
                    (GHC.Prim.-# ds_X1 1#);
                0# -> w1_s308
              }; } in
      case ww_s30j of ds_X1 {
        __DEFAULT -> $wj_s30e @Int (id @Int) ds_X1;
        0# -> Main.f1;
        1# -> $wj_s30e @Int (id @Int) 2#
      }

Example run:

$ ./test 99999999 0 +RTS -s
   1,600,052,512 bytes allocated in the heap
          20,160 bytes copied during GC
          36,008 bytes maximum residency (1 sample(s))
          29,528 bytes maximum slop
               2 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1525 colls,     0 par    0.003s   0.004s     0.0000s    0.0001s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.331s  (  0.332s elapsed)
  GC      time    0.003s  (  0.004s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.335s  (  0.336s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    4,828,658,145 bytes per MUT second

  Productivity  98.9% of total user, 98.7% of total elapsed

Now remove the 1 -> j id 2, so that j can float inside the case branch and turn into a join point:

Main.$wf
  = \ (ww_s319 :: GHC.Prim.Int#) (w_s316 :: Int) ->
      case ww_s319 of ds_X1 {
        __DEFAULT ->
          joinrec {
            $wj_s314 [InlPrag=NOUSERINLINE[2], Occ=LoopBreaker]
              :: (Int -> Int) -> GHC.Prim.Int# -> Int -> Int
            [LclId[JoinId(2)], Arity=2, Str=<L,U><S,1*U>, Unf=OtherCon []]
            $wj_s314 (w1_s30Y :: Int -> Int) (ww1_s312 :: GHC.Prim.Int#)
              = case ww1_s312 of ds1_X3 {
                  __DEFAULT ->
                    jump $wj_s314
                      (\ (x_a1Pm :: Int) -> w1_s30Y (GHC.Num.$fNumInt_$c+ w_s316 x_a1Pm))
                      (GHC.Prim.-# ds1_X3 1#);
                  0# -> w1_s30Y
                }; } in
          jump $wj_s314 (id @Int) ds_X1;
        0# -> Main.f1
      }

Observed performance is much faster, basically no allocations (I verified that main still calls $wf and you can see that it still scales linearly in the first command-line parameter):

$ ./test 99999999 0 +RTS -s
          52,528 bytes allocated in the heap
           3,168 bytes copied during GC
          36,008 bytes maximum residency (1 sample(s))
          21,336 bytes maximum slop
               2 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.057s  (  0.058s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.058s  (  0.058s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    914,157 bytes per MUT second

  Productivity  99.5% of total user, 99.5% of total elapsed
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information