Strict enum fields + case-of-case lead to huge code bloat
I was testing this on GHC 8.10.2, but it appears there have been some improvements in the wake of last year's bug fixes; see below.
I'm sure this has come up before in one form or another, but I really don't like the code bloat that results from making a field strict. Example:
data B = T | F
data S = S !B !B !B !B !B
data L = L B B B B B
(&) :: B -> B -> B
T & T = T
_ & _ = F
{-# INLINE (&) #-} -- with or without, it doesn't matter much. Funnily enough, if I *omit* it, then it will be inlined!
s :: B -> B -> B -> B -> B -> B -> B -> B -> B -> B -> S
s a b c d e f g h i j = S (a & b) (c & d) (e & f) (g & h) (i & j)
l :: B -> B -> B -> B -> B -> B -> B -> B -> B -> B -> L
l a b c d e f g h i j = L (a & b) (c & d) (e & f) (g & h) (i & j)
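(For context, the cases discussed below exist because a strict field must be evaluated before the constructor is allocated, while a lazy field is stored as-is. A small runnable sketch of that semantics; the `evaluate`/`try` harness and the printed messages are mine, not part of the report:)

```haskell
import Control.Exception (SomeException, evaluate, try)

data B = T | F
data S = S !B !B !B !B !B
data L = L B B B B B

main :: IO ()
main = do
  -- Lazy fields are not forced on construction: a bottom field is fine.
  _ <- evaluate (L undefined T T T T)
  putStrLn "lazy constructor built"
  -- Strict fields are forced by the constructor wrapper, so the
  -- same bottom now throws when the construction is evaluated.
  r <- try (evaluate (S undefined T T T T)) :: IO (Either SomeException S)
  putStrLn $ case r of
    Left _  -> "strict constructor forced the bottom"
    Right _ -> "no exception"
```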
Here's the Core GHC generates:
s = \ (w_sFL :: B)
      (w1_sFM :: B)
      (w2_sFN :: B)
      (w3_sFO :: B)
      (w4_sFP :: B)
      (w5_sFQ :: B)
      (w6_sFR :: B)
      (w7_sFS :: B)
      (w8_sFT :: B)
      (w9_sFU :: B) ->
    case w_sFL of {
      T ->
        case w2_sFN of {
          T ->
            case w4_sFP of {
              T ->
                case w6_sFR of {
                  T ->
                    case w8_sFT of {
                      T ->
                        case w1_sFM of dt_XhM { __DEFAULT ->
                          case w3_sFO of dt1_XhO { __DEFAULT ->
                            case w5_sFQ of dt2_XhQ { __DEFAULT ->
                              case w7_sFS of dt3_XhS { __DEFAULT ->
                                case w9_sFU of dt4_XhU { __DEFAULT ->
                                  Lib.S dt_XhM dt1_XhO dt2_XhQ dt3_XhS dt4_XhU
                                }
                              }
                            }
                          }
                        };
                      F ->
                        case w1_sFM of dt_XhM { __DEFAULT ->
                          case w3_sFO of dt1_XhO { __DEFAULT ->
                            case w5_sFQ of dt2_XhQ { __DEFAULT ->
                              case w7_sFS of dt3_XhS { __DEFAULT ->
                                Lib.S dt_XhM dt1_XhO dt2_XhQ dt3_XhS Lib.F
                              }
                            }
                          }
                        }
                    };
                  F ->
                    ... and so on, 30 more cases ...
l = \ (a_agT :: B)
      (b_agU :: B)
      (c_agV :: B)
      (d_agW :: B)
      (e_agX :: B)
      (f_agY :: B)
      (g_agZ :: B)
      (h_ah0 :: B)
      (i_ah1 :: B)
      (j_ah2 :: B) ->
    Lib.L
      (& a_agT b_agU)
      (& c_agV d_agW)
      (& e_agX f_agY)
      (& g_agZ h_ah0)
      (& i_ah1 j_ah2)
The code for s severely suffers from combinatorial explosion. I'm sure there is a safeguard at some constant, but it all seems so pointless: we just store the case binder in S's field anyway, so the cases enable no further simplification whatsoever; it's pure bloat. In the context of a bigger function, this bloat may prevent other functions, or the function itself, from inlining.
Compare that to l, which generates (to me) optimal code. It's much smaller and thus much more likely to inline. I'd much rather have
s = \ (a_agT :: B)
      (b_agU :: B)
      (c_agV :: B)
      (d_agW :: B)
      (e_agX :: B)
      (f_agY :: B)
      (g_agZ :: B)
      (h_ah0 :: B)
      (i_ah1 :: B)
      (j_ah2 :: B) ->
    case (& a_agT b_agU) of dt_XhM { __DEFAULT ->
      case (& c_agV d_agW) of dt1_XhO { __DEFAULT ->
        case (& e_agX f_agY) of dt2_XhQ { __DEFAULT ->
          case (& g_agZ h_ah0) of dt3_XhS { __DEFAULT ->
            case (& i_ah1 j_ah2) of dt4_XhU { __DEFAULT ->
              Lib.S dt_XhM dt1_XhO dt2_XhQ dt3_XhS dt4_XhU
            }
          }
        }
      }
    };
as the final code. Maybe inline (&), but then don't do case-of-case if we're only going to fruitlessly inline the join points afterwards. I'm sure the branch predictor makes short work of the function calls.
Maybe this is something that can improve once boxity information is available? Not sure. I think an important bit here is that the code operates exclusively on enums, with no nested fields.
Huh, with GHC HEAD, s looks like this:
Lib.$ws
  = \ (a_syt :: B)
      (b_syu :: B)
      (c_syv :: B)
      (d_syw :: B)
      (e_syx :: B)
      (f_syy :: B)
      (g_syz :: B)
      (h_syA :: B)
      (i_syB :: B)
      (j_syC :: B) ->
      join {
        $j_sye [Dmd=1C1(L)] :: B %1 -> (# B, B, B, B, B #)
        [LclId[JoinId(1)], Arity=1, Str=<1L>, Unf=OtherCon []]
        $j_sye (arg_sy5 [OS=OneShot] :: B)
          = join {
              $j1_syd [Dmd=1C1(L)] :: B %1 -> (# B, B, B, B, B #)
              [LclId[JoinId(1)], Arity=1, Str=<1L>, Unf=OtherCon []]
              $j1_syd (arg1_sy6 [OS=OneShot] :: B)
                = join {
                    $j2_syc [Dmd=1C1(L)] :: B %1 -> (# B, B, B, B, B #)
                    [LclId[JoinId(1)], Arity=1, Str=<1L>, Unf=OtherCon []]
                    $j2_syc (arg2_sy7 [OS=OneShot] :: B)
                      = case g_syz of {
                          T ->
                            case i_syB of {
                              T ->
                                case arg_sy5 of conrep_X0 { __DEFAULT ->
                                  case arg1_sy6 of conrep1_X3 { __DEFAULT ->
                                    case arg2_sy7 of conrep2_X4 { __DEFAULT ->
                                      case h_syA of conrep3_X5 { __DEFAULT ->
                                        case j_syC of conrep4_X6 { __DEFAULT ->
                                          (# conrep_X0, conrep1_X3, conrep2_X4, conrep3_X5, conrep4_X6 #)
                                        }
                                      }
                                    }
                                  }
                                };
                              F ->
                                case arg_sy5 of conrep_X0 { __DEFAULT ->
                                  case arg1_sy6 of conrep1_X3 { __DEFAULT ->
                                    case arg2_sy7 of conrep2_X4 { __DEFAULT ->
                                      case h_syA of conrep3_X5 { __DEFAULT ->
                                        (# conrep_X0, conrep1_X3, conrep2_X4, conrep3_X5, Lib.F #)
                                      }
                                    }
                                  }
                                }
                            };
                          ...
        ... } in
      case a_syt of {
        T -> jump $j_sye b_syu;
        F -> jump $j_sye Lib.F
      }
So case-of-case still happens, but we no longer inline all the join points. That is an improvement, but it is still pretty clear to me that we shouldn't have inlined any of the join points at all.
I still don't see much of a point in applying case-of-case in the first place, but I might not completely understand the operational difference between
case (case ... of { T -> b; F -> F }) of a { __DEFAULT -> S a }
and
join j fld = case fld of fld' { __DEFAULT -> S fld' }
in case ... of { T -> jump j b; F -> jump j F }
I think the latter performs a jump to a known location, whereas the former makes a tail call to an unknown location on the stack. Fair enough... (Although I think the former could be optimised to a known jump as well, and it is definitely the clearer code.) Alright, then do case-of-case, but don't inline the join points, which is obviously fruitless; the case binder that ends up in a field will never be cancelled away.
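For what it's worth, the two forms are observationally equivalent; only the jumps in the generated code differ. A tiny executable sketch of that claim (the names B, S1, former, latter are mine, and I've simplified to a single strict field):

```haskell
data B = T | F deriving Eq
data S1 = S1 !B deriving Eq

-- the "former" shape: the outer case scrutinises the inner case's result
former :: B -> B -> S1
former scrut b = case (case scrut of { T -> b; F -> F }) of a -> S1 a

-- the "latter" shape: one shared continuation, entered from both branches
latter :: B -> B -> S1
latter scrut b =
  let j fld = case fld of fld' -> S1 fld'  -- the would-be join point
  in case scrut of { T -> j b; F -> j F }

-- both functions agree on every input
main :: IO ()
main = print [ former x y == latter x y | x <- [T, F], y <- [T, F] ]
```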