Failure to recognise join point leads to slow down
The following (rather contrived) example has a join point j
that isn't recognised as such:
{-# OPTIONS_GHC -O2 -fforce-recomp #-}
module Main (main, f) where
import System.Environment
f :: Int -> Int -> Int -> Int
f x y = let
j f 0 = f
j f n = j (f . (y +)) (n-1 :: Int)
in
case x of
0 -> \_ -> 0
1 -> j id 2
x -> j id x
main = do
(a:b:_) <- map read <$> getArgs
flip mapM_ [0..a::Int] $ \i -> do
f b i 0 `seq` return ()
This is the optimised Core of the worker of f
(which is also called from main
, as I verified:
Main.$wf
= \ (ww_s30j :: GHC.Prim.Int#) (w_s30g :: Int) ->
letrec {
$wj_s30e [InlPrag=NOUSERINLINE[2], Occ=LoopBreaker]
:: forall {c}. (Int -> c) -> GHC.Prim.Int# -> Int -> c
[LclId, Arity=2, Str=<L,U><S,1*U>, Unf=OtherCon []]
$wj_s30e
= \ (@c_s307)
(w1_s308 :: Int -> c_s307)
(ww1_s30c :: GHC.Prim.Int#) ->
case ww1_s30c of ds_X1 {
__DEFAULT ->
$wj_s30e
@c_s307
(\ (x_a1Ps :: Int) -> w1_s308 (GHC.Num.$fNumInt_$c+ w_s30g x_a1Ps))
(GHC.Prim.-# ds_X1 1#);
0# -> w1_s308
}; } in
case ww_s30j of ds_X1 {
__DEFAULT -> $wj_s30e @Int (id @Int) ds_X1;
0# -> Main.f1;
1# -> $wj_s30e @Int (id @Int) 2#
}
Example run:
$ ./test 99999999 0 +RTS -s
1,600,052,512 bytes allocated in the heap
20,160 bytes copied during GC
36,008 bytes maximum residency (1 sample(s))
29,528 bytes maximum slop
2 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1525 colls, 0 par 0.003s 0.004s 0.0000s 0.0001s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0002s 0.0002s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.331s ( 0.332s elapsed)
GC time 0.003s ( 0.004s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.335s ( 0.336s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 4,828,658,145 bytes per MUT second
Productivity 98.9% of total user, 98.7% of total elapsed
Now remove the 1 -> j id 2
, so that j
can float inside the case branch and turn into a join point:
Main.$wf
= \ (ww_s319 :: GHC.Prim.Int#) (w_s316 :: Int) ->
case ww_s319 of ds_X1 {
__DEFAULT ->
joinrec {
$wj_s314 [InlPrag=NOUSERINLINE[2], Occ=LoopBreaker]
:: (Int -> Int) -> GHC.Prim.Int# -> Int -> Int
[LclId[JoinId(2)], Arity=2, Str=<L,U><S,1*U>, Unf=OtherCon []]
$wj_s314 (w1_s30Y :: Int -> Int) (ww1_s312 :: GHC.Prim.Int#)
= case ww1_s312 of ds1_X3 {
__DEFAULT ->
jump $wj_s314
(\ (x_a1Pm :: Int) -> w1_s30Y (GHC.Num.$fNumInt_$c+ w_s316 x_a1Pm))
(GHC.Prim.-# ds1_X3 1#);
0# -> w1_s30Y
}; } in
jump $wj_s314 (id @Int) ds_X1;
0# -> Main.f1
}
Observed performance is much faster, basically no allocations (I verified that main
still calls $wf
and you can see that it still scales linearly in the first command-line parameter):
$ ./test 99999999 0 +RTS -s
52,528 bytes allocated in the heap
3,168 bytes copied during GC
36,008 bytes maximum residency (1 sample(s))
21,336 bytes maximum slop
2 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0002s 0.0002s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.057s ( 0.058s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.058s ( 0.058s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 914,157 bytes per MUT second
Productivity 99.5% of total user, 99.5% of total elapsed