Skip to content

Draft: Improve switch lowering in cmm

Jannis requested to merge 1Jajen1/ghc:jo-switches into master

Rewrites switch lowering to make use of smaller and sometimes faster patterns. Imo it also simplifies the algorithm to create switch plans as it breaks it up into different phases.

Some of the patterns improved by this mr:

Ranges:
f :: Int -> Bool
f a = case a of
  1 -> True
  2 -> True
  3 -> True
  4 -> True
  5 -> True
  6 -> True
  _ -> False

Generates a single range check (u_> for explicitly unsigned comparisons)

if a - 1 u_> 5 then False else True

Previously ghc did:

if a >= 7
  then False
  else if a < 1 then False else True
BitTests
f :: Int -> Bool
f a = case a of
  1  -> True
  5  -> True
  8  -> True
  9  -> True
  11 -> True
  19 -> True
  _  -> False

Generates:

if a u_> 19
  then False
  else if (1 << a) & 527138 == 0 then False else True
-- 527138 = 1 << 1 + 1 << 5 + 1 << 8 + 1 << 9 + 1 << 11 + 1 << 19

Previously generated:

if a < 8
  then if a >= 6
    then False
    else if a >= 5
      then True
      else if a == 1
        then True
        else False
  else if a < 12
    then if a < 9
      then True
      else if a == 10
        then False
        else True
    else if a == 19
      then True
      else False

And in general search trees should be smaller, due linear search at the leaf nodes, and better balanced.

New (slightly simplified for readability) switch plan:

data SwitchPlan =
    LinearSearch [Range]    -- See Note [Linear search vs binary search tree] on how this usually looks
  | BitTest [(Label, Mask)] -- See Note [BitTest]
  | JumpTable (Map Integer Label)
  | BinarySearch Pivot SwitchPlan SwitchPlan
type Range = (Integer, Integer, Label)
type Mask = Integer
type Pivot = Integer

The balancing of the tree can also be changed, currently it uses the number of cases (not nodes, so ranges, bit-tests and jumptables are heavier). See Note [Balancing the search tree].

The algorithm first combines ranges, then searches for BitTest or JumpTable opportunities and at last assembles everything into a balanced tree. All steps are decoupled thus adding other special patterns or changing the existing ones is much easier than before, especially because the balancing is done after these patterns are found. It also removes some warts of the previous implementation as it no longer needs special cases to produce good plans.

Generating the nodes from this new plan got a little more complex because the plan is much more high level now. (Previously it was directly mapped to less-than, eq, jump tables or unconditional jumps)

I would not expect the new code to be faster in general. There should be fewer branches and less code, but the resulting performance heavily depends on how predictable these branches are, so I would not be surprised to see no significant changes. I redid the benchmark for myIsSpace from #10124 and saw no changes. With more alternatives in the case statement, such that a bit test is generated, at least on my machine (ryzen 7950x) there is a clear win for the bit test.

The benchmark from #10124:

This is with added alternatives to myIsSpace, just comment them out and remove the elements from spaces to check the original benchmark.

module Main (main) where

import Test.Tasty.Bench

import Test.QuickCheck
import Test.QuickCheck.Gen (Gen(MkGen))
import Test.QuickCheck.Random (mkQCGen)

import Control.DeepSeq
import Control.Exception (evaluate)

main :: IO ()
main = defaultMain [
    env (gen balanced)       $ \inp -> bench "balanced"        (whnf test inp)
  , env (gen lessWhiteSpace) $ \inp -> bench "less whitespace" (whnf test inp)
  , env (gen moreWhiteSpace) $ \inp -> bench "more whitespace" (whnf test inp)
  ]

seed = 1470
size = 3000

gen (MkGen g) = evaluate . force $ g (mkQCGen seed) size

balanced       = listOf $ frequency [(1, elements elems), (1, chooseAny `suchThat` (not . myIsSpace))]
lessWhiteSpace = listOf $ frequency [(1, elements elems), (5, chooseAny `suchThat` (not . myIsSpace))]
moreWhiteSpace = listOf $ frequency [(5, elements elems), (1, chooseAny `suchThat` (not . myIsSpace))]

test = length . filter myIsSpace

elems = [' ', '\n', '\t', '#', '&']

myIsSpace ' '  = True
myIsSpace '\n' = True
myIsSpace '\t' = True
myIsSpace '#'  = True
myIsSpace '&'  = True
myIsSpace _    = False
{-# NOINLINE myIsSpace #-}
-- Yes this is necessary because otherwise ghc generates a block for each case and BitTests cannot be generated 

With the bit test:

All
  balanced:        OK (0.45s)
    3.49 μs ± 101 ns
  less whitespace: OK (0.20s)
    3.03 μs ± 165 ns
  more whitespace: OK (0.21s)
    3.24 μs ± 273 ns

With ghc-head:

All
  balanced:        OK (0.25s)
    3.89 μs ± 164 ns
  less whitespace: OK (0.11s)
    3.43 μs ± 331 ns
  more whitespace: OK (0.13s)
    3.99 μs ± 352 ns
The code generated with this mr
==================== Output Cmm ====================
[$wmyIsSpace_r4g7_entry() { //  [R2]
         { info_tbls: [(c4oL,
                        label: $wmyIsSpace_r4g7_info
                        rep: HeapRep static { Fun {arity: 1 fun_type: ArgSpec 4} }
                        srt: Nothing)]
           stack_info: arg_space: 8
         }
     {offset
       c4oL: // global
           if (R2 > 38) goto c4oF; else goto u4oO;
       u4oO: // global
           if ((1 << R2) & 313532614144 == 0) goto c4oF; else goto c4oG;
       c4oF: // global
           R1 = GHC.Types.False_closure+1;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       c4oG: // global
           R1 = GHC.Types.True_closure+2;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
     }
 },
 section ""data" . $wmyIsSpace_r4g7_closure" {
     $wmyIsSpace_r4g7_closure:
         const $wmyIsSpace_r4g7_info;
 }]
Current ghc
==================== Output Cmm ====================
[$wmyIsSpace_r4g7_entry() { //  [R2]
         { info_tbls: [(c4oL,
                        label: $wmyIsSpace_r4g7_info
                        rep: HeapRep static { Fun {arity: 1 fun_type: ArgSpec 4} }
                        srt: Nothing)]
           stack_info: arg_space: 8
         }
     {offset
       c4oL: // global
           if (R2 < 11) goto u4oO; else goto u4oQ;
       u4oO: // global
           if (R2 >= 10) goto c4oG; else goto u4oP;
       u4oP: // global
           if (R2 < 9) goto c4oF; else goto c4oG;
       u4oQ: // global
           if (R2 < 36) goto u4oR; else goto u4oT;
       u4oR: // global
           if (R2 >= 35) goto c4oG; else goto u4oS;
       u4oS: // global
           if (R2 != 32) goto c4oF; else goto c4oG;
       u4oT: // global
           if (R2 != 38) goto c4oF; else goto c4oG;
       c4oF: // global
           R1 = GHC.Types.False_closure+1;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       c4oG: // global
           R1 = GHC.Types.True_closure+2;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
     }
 },
 section ""data" . $wmyIsSpace_r4g7_closure" {
     $wmyIsSpace_r4g7_closure:
         const $wmyIsSpace_r4g7_info;
 }]

Related issues:

  • #10124 With range checks and bit tests being generated, this could be closed I think. Although it still won't be branchless but with this in place a future transformation to branchless should be easier, so maybe keep it open?
  • #19290 In the first example given, ghc generates a case statement with no alternative. The ranges here are [1 -> False, 2..5 -> True], this mr generates LinearSearch [(1,1,labelFalse),(2,5,labelTrue)] but since we have no fallback it recognises that if scrut != 1 any value for scrut other than 2,3,4,5 leads to undefined behavior and it drops the range check. Other problems described in this ticket are usually covered by either range checks or bit tests (for sparse versions). So this can be closed.
  • #20271 This still persists, although the specific example no longer compiles this way. (Mainly here as a reminder to add a new example for it if this is merged)

Merge request reports