Draft: Improve switch lowering in cmm
Rewrites switch lowering to make use of smaller and sometimes faster patterns. Imo it also simplifies the algorithm to create switch plans as it breaks it up into different phases.
Some of the patterns improved by this mr:
Ranges:
f :: Int -> Bool
f a = case a of
1 -> True
2 -> True
3 -> True
4 -> True
5 -> True
6 -> True
_ -> False
Generates a single range check (u_> for explicitly unsigned comparisons)
if a - 1 u_> 5 then False else True
Previously ghc did:
if a >= 7
then False
else if a < 1 then False else True
BitTests
f :: Int -> Bool
f a = case a of
1 -> True
5 -> True
8 -> True
9 -> True
11 -> True
19 -> True
_ -> False
Generates:
if a u_> 19
then False
else if (1 << a) & 527138 == 0 then False else True
-- 527138 = 1 << 1 + 1 << 5 + 1 << 8 + 1 << 9 + 1 << 11 + 1 << 19
Previously generated:
if a < 8
then if a >= 6
then False
else if a >= 5
then True
else if a == 1
then True
else False
else if a < 12
then if a < 9
then True
else if a == 10
then False
else True
else if a == 19
then True
else False
And in general search trees should be smaller, due linear search at the leaf nodes, and better balanced.
New (slightly simplified for readability) switch plan:
data SwitchPlan =
LinearSearch [Range] -- See Note [Linear search vs binary search tree] on how this usually looks
| BitTest [(Label, Mask)] -- See Note [BitTest]
| JumpTable (Map Integer Label)
| BinarySearch Pivot SwitchPlan SwitchPlan
type Range = (Integer, Integer, Label)
type Mask = Integer
type Pivot = Integer
The balancing of the tree can also be changed, currently it uses the number of cases (not nodes, so ranges, bit-tests and jumptables are heavier). See Note [Balancing the search tree]
.
The algorithm first combines ranges, then searches for BitTest or JumpTable opportunities and at last assembles everything into a balanced tree. All steps are decoupled thus adding other special patterns or changing the existing ones is much easier than before, especially because the balancing is done after these patterns are found. It also removes some warts of the previous implementation as it no longer needs special cases to produce good plans.
Generating the nodes from this new plan got a little more complex because the plan is much more high level now. (Previously it was directly mapped to less-than, eq, jump tables or unconditional jumps)
I would not expect the new code to be faster in general. There should be fewer branches and less code, but the resulting performance heavily depends on how predictable these branches are, so I would not be surprised to see no significant changes. I redid the benchmark for myIsSpace
from #10124 and saw no changes. With more alternatives in the case statement, such that a bit test is generated, at least on my machine (ryzen 7950x) there is a clear win for the bit test.
The benchmark from #10124:
This is with added alternatives to myIsSpace
, just comment them out and remove the elements from spaces
to check the original benchmark.
module Main (main) where
import Test.Tasty.Bench
import Test.QuickCheck
import Test.QuickCheck.Gen (Gen(MkGen))
import Test.QuickCheck.Random (mkQCGen)
import Control.DeepSeq
import Control.Exception (evaluate)
main :: IO ()
main = defaultMain [
env (gen balanced) $ \inp -> bench "balanced" (whnf test inp)
, env (gen lessWhiteSpace) $ \inp -> bench "less whitespace" (whnf test inp)
, env (gen moreWhiteSpace) $ \inp -> bench "more whitespace" (whnf test inp)
]
seed = 1470
size = 3000
gen (MkGen g) = evaluate . force $ g (mkQCGen seed) size
balanced = listOf $ frequency [(1, elements elems), (1, chooseAny `suchThat` (not . myIsSpace))]
lessWhiteSpace = listOf $ frequency [(1, elements elems), (5, chooseAny `suchThat` (not . myIsSpace))]
moreWhiteSpace = listOf $ frequency [(5, elements elems), (1, chooseAny `suchThat` (not . myIsSpace))]
test = length . filter myIsSpace
elems = [' ', '\n', '\t', '#', '&']
myIsSpace ' ' = True
myIsSpace '\n' = True
myIsSpace '\t' = True
myIsSpace '#' = True
myIsSpace '&' = True
myIsSpace _ = False
{-# NOINLINE myIsSpace #-}
-- Yes this is necessary because otherwise ghc generates a block for each case and BitTests cannot be generated
With the bit test:
All
balanced: OK (0.45s)
3.49 μs ± 101 ns
less whitespace: OK (0.20s)
3.03 μs ± 165 ns
more whitespace: OK (0.21s)
3.24 μs ± 273 ns
With ghc-head:
All
balanced: OK (0.25s)
3.89 μs ± 164 ns
less whitespace: OK (0.11s)
3.43 μs ± 331 ns
more whitespace: OK (0.13s)
3.99 μs ± 352 ns
The code generated with this mr
==================== Output Cmm ====================
[$wmyIsSpace_r4g7_entry() { // [R2]
{ info_tbls: [(c4oL,
label: $wmyIsSpace_r4g7_info
rep: HeapRep static { Fun {arity: 1 fun_type: ArgSpec 4} }
srt: Nothing)]
stack_info: arg_space: 8
}
{offset
c4oL: // global
if (R2 > 38) goto c4oF; else goto u4oO;
u4oO: // global
if ((1 << R2) & 313532614144 == 0) goto c4oF; else goto c4oG;
c4oF: // global
R1 = GHC.Types.False_closure+1;
call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
c4oG: // global
R1 = GHC.Types.True_closure+2;
call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
}
},
section ""data" . $wmyIsSpace_r4g7_closure" {
$wmyIsSpace_r4g7_closure:
const $wmyIsSpace_r4g7_info;
}]
Current ghc
==================== Output Cmm ====================
[$wmyIsSpace_r4g7_entry() { // [R2]
{ info_tbls: [(c4oL,
label: $wmyIsSpace_r4g7_info
rep: HeapRep static { Fun {arity: 1 fun_type: ArgSpec 4} }
srt: Nothing)]
stack_info: arg_space: 8
}
{offset
c4oL: // global
if (R2 < 11) goto u4oO; else goto u4oQ;
u4oO: // global
if (R2 >= 10) goto c4oG; else goto u4oP;
u4oP: // global
if (R2 < 9) goto c4oF; else goto c4oG;
u4oQ: // global
if (R2 < 36) goto u4oR; else goto u4oT;
u4oR: // global
if (R2 >= 35) goto c4oG; else goto u4oS;
u4oS: // global
if (R2 != 32) goto c4oF; else goto c4oG;
u4oT: // global
if (R2 != 38) goto c4oF; else goto c4oG;
c4oF: // global
R1 = GHC.Types.False_closure+1;
call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
c4oG: // global
R1 = GHC.Types.True_closure+2;
call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
}
},
section ""data" . $wmyIsSpace_r4g7_closure" {
$wmyIsSpace_r4g7_closure:
const $wmyIsSpace_r4g7_info;
}]
Related issues:
- #10124 With range checks and bit tests being generated, this could be closed I think. Although it still won't be branchless but with this in place a future transformation to branchless should be easier, so maybe keep it open?
-
#19290 In the first example given, ghc generates a case statement with no alternative. The ranges here are
[1 -> False, 2..5 -> True]
, this mr generatesLinearSearch [(1,1,labelFalse),(2,5,labelTrue)]
but since we have no fallback it recognises that ifscrut != 1
any value forscrut
other than2,3,4,5
leads to undefined behavior and it drops the range check. Other problems described in this ticket are usually covered by either range checks or bit tests (for sparse versions). So this can be closed. - #20271 This still persists, although the specific example no longer compiles this way. (Mainly here as a reminder to add a new example for it if this is merged)