Changes

jstolarek · 1421b14e
--- a/prim-bool.md
+++ b/prim-bool.md
-# Implementing primitive Bool\#
+# Implementing new primitive comparisons to allow branchless algorithms


 This page gathers the notes about implementing new primitive logical operations and thus resolving ticket [\#6135](https://gitlab.haskell.org//ghc/ghc/issues/6135).
@@ -89,9 +89,16 @@ There are five possible branches to take, although four of them have the same re
 ## Solution


-The idea behind the solution is to modify comparison primops to return unboxed unlifted `Int#` instead of `Bool` (which is lifted and thus is returned as a thunk that needs to be evaluated). This will be implemented in the following way:
+This problem was solved by modifying comparison primops to return unboxed unlifted `Int#` instead of `Bool` (which is lifted and thus is returned as a thunk that needs to be evaluated). Having `Int#` returned as a result of logical comparison will allow to use branchless bitwise logical operators instead of branching logical operators defined by Haskell.

- existing comparison primops will have their return type changed to `Int#`. Also, their names will be changed. Operators will have `$` added before `#`, others will have `I` added before the `#` (this is a mnemonic denoting that this primop returns and `Int#`). Examples:
+## Implementation details
+
+
+Below is a summary of implementation details and decisions:
+
+- the new comparison primops return a value of type `Int#`: `1#` represents `True` and `0#` represents `False`. The `Int#` type was chosen because on Haskell it is more common to use signed Int type insetad of unsigned Word. By using `Int#` the users can easily convert unboxed result into a boxed value, without need to use `word2Int#` and `int2word#` primops.
+- as a small side-task, four new logical bitwise primops have been implemented: `andI#`, `orI#`, `xorI#` and `negI#` ([\#7689](https://gitlab.haskell.org//ghc/ghc/issues/7689)). These operate on values of type `Int#`. Earlier we had only bitwise logical primops operating on values of type `Word#`.
+- names of the existing comparison primops were changed. Operators will have `$` added before `#`, others will have `I` added before the `#` (this is a mnemonic denoting that this primop returns and `Int#`). Examples:

 ```wiki
 >=$#      :: Int#    -> Int#    -> Int#
@@ -102,7 +109,7 @@ ltFloatI# :: Float#  -> Float#  -> Int#
 leAddrI#  :: Addr#   -> Addr#   -> Int#
 ```

- a new module `GHC.PrimWrappers` will be added to ghc-prim library. This module will contain wrappers for comparison primops. These wrappers will have names identical to removed primops and will return a `Bool`. Examples:
+- a new module `GHC.PrimWrappers` was added to ghc-prim library. This module contains wrappers for comparison primops. These wrappers have names identical to removed primops and return a `Bool`. Examples:

 ```wiki
 gtChar# :: Char# -> Char# -> Bool
@@ -125,26 +132,28 @@ leAddr# a b = tagToEnum# (a `leAddrI#` b)
 ```


-Thanks to these wrappers the change will be almost backwards compatible. The only thing primop users will need to change in their existing code to make it work again is adding import of !GHC.PrimWrappers module.
+Thanks to these wrappers the change is almost backwards compatible. The only thing primop users need to change in their existing code to make it work again is adding import of !GHC.PrimWrappers module.

- The following boot libraries require modification in order to work with the new primops: base, ghc-prim and integer-gmp. The only required modifications are imports of the !GHC.PrimWrappers module in modules that use the primops.
+- functions for comparing `Integer` type, implemented in integer-gmp and integer-simple libraries, received a similar treatment. Technically they are not primops, because they are implemented in Haskell (in case of integer-gmp also with FFI), but they pretend to be ones. There are six primops for comparing `Integer` values:

-## Proof of concept
+  ```wiki
+  eqInteger#  :: Integer -> Integer -> Int#
+  neqInteger# :: Integer -> Integer -> Int#
+  leInteger#  :: Integer -> Integer -> Int#
+  ltInteger#  :: Integer -> Integer -> Int#
+  gtInteger#  :: Integer -> Integer -> Int#
+  geInteger#  :: Integer -> Integer -> Int#
+  ```


-The prototype patch posted on the trac on 13th of March implemented six new prototype comparison primops:
+Each of these functions has a wrapper that calls `tagToEnum#` and returns a `Bool`. These wrappers are: `eqInteger`, `neqInteger`, `leInteger`, `ltInteger`, `gtInteger` and `geInteger`.

-```wiki
-.>#  :: Int# -> Int# -> Int#
-.<#  :: Int# -> Int# -> Int#
-.>=# :: Int# -> Int# -> Int#
-.<=# :: Int# -> Int# -> Int#
-.==# :: Int# -> Int# -> Int#
-./=# :: Int# -> Int# -> Int#
-```
+- Other libraries that were modified to work with the new primops are: base, ghc-prim and primitive. The only required modifications were imports of the !GHC.PrimWrappers module in modules that use the primops.

+## Eliminating branches using new primops

-Each of these new primops takes two `Int#`s that are to be compared. The result is also an `Int#`: `0#` if the relation between the operands does not hold and `1#` when it does hold. For example `5# .># 3#` returns `1#` and `3# .># 3#` returns `0#`. With the new primops we can rewrite the original expression that motivated the problem:
+
+With the new primops we can rewrite the original expression that motivated the problem:

 ```wiki
 case (x <# 0#) || (x >=# width) || (y <# 0#) || (y >=# height) of
@@ -156,52 +165,53 @@ case (x <# 0#) || (x >=# width) || (y <# 0#) || (y >=# height) of
 as

 ```wiki
-case (x .<# 0#) `orI#` (x .>=# width) `orI#` (y .<# 0#) `orI#` (y .>=# height) of
+case (x <$# 0#) `orI#` (x >=$# width) `orI#` (y <$# 0#) `orI#` (y >=$# height) of
  True  -> E1
  False -> E2
 ```


-(Note: `orI#` is a bitwise OR operation on operands of type `Int#`. It was introduced together with `andI#`, `notI#` and `xor#` in [\#7689](https://gitlab.haskell.org//ghc/ghc/issues/7689)). Using the LLVM backend this compiles to:
+Using the LLVM backend this compiles to:

 ```wiki
-# BB#0:                                 # %c1nK
-  movq  %rsi, %rax
-  orq %r14, %rax
-  shrq  $63, %rax
-  cmpq  %rdi, %r14
-  setge %cl
-  movzbl  %cl, %ecx
-  orq %rax, %rcx
-  cmpq  %r8, %rsi
-  setge %al
-  movzbl  %al, %eax
-  orq %rcx, %rax
-  jne .LBB2_1
-# BB#3:                                 # %c1ol
-  movq  (%rbp), %rax
-  movl  $r1m6_closure+1, %ebx
-  jmpq  *%rax  # TAILCALL
-.LBB2_1:                                # %c1nK
-  cmpq  $1, %rax
-  jne .LBB2_2
-# BB#4:                                 # %c1ov
-  movq  (%rbp), %rax
-  movl  $r1m7_closure+1, %ebx
-  jmpq  *%rax  # TAILCALL
-.LBB2_2:                                # %c1ob
-  movq  r1m5_closure(%rip), %rax
-  movl  $r1m5_closure, %ebx
-  jmpq  *%rax  # TAILCALL
+# BB#0:                                 # %c1oe
+  movq   %rsi, %rax
+  orq    %r14, %rax
+  shrq   $63, %rax
+  cmpq   %rdi, %r14
+  setge  %cl
+  movzbl %cl, %ecx
+  orq    %rax, %rcx
+  cmpq   %r8, %rsi
+  setge  %al
+  movzbl %al, %eax
+  orq    %rcx, %rax
+  jne    .LBB2_1
+# BB#3:                                 # %c1oP
+  movq   (%rbp), %rax
+  movl   $r1mu_closure+1, %ebx
+  jmpq   *%rax  # TAILCALL
+.LBB2_1:                                # %c1oe
+  cmpq   $1, %rax
+  jne    .LBB2_2
+# BB#4:                                 # %c1oZ
+  movq   (%rbp), %rax
+  movl   $r1mv_closure+1, %ebx
+  jmpq   *%rax  # TAILCALL
+.LBB2_2:                                # %c1oF
+  movq   r1mt_closure(%rip), %rax
+  movl   $r1mt_closure, %ebx
+  jmpq   *%rax  # TAILCALL
+
 ```


-The assembly does not contain comparisons and jumps in the scrutinee of the case expression, but still does jumps for selecting an appropriate branch of the case expression.
+The assembly does not contain comparisons and branches in the scrutinee of the case expression, but still uses jumps to select an appropriate branch of the case expression.

-### Benchmarks for the proposed patch
+### Benchmarks


-Below is a benchmark for the proof-of-concept filter function that demonstrates performance gains possible with the new primops:
+Below is a benchmark for the proof-of-concept branchless filter function that demonstrates performance gains possible with the new primops:

 ```wiki
 {-# LANGUAGE BangPatterns, MagicHash #-}
@@ -216,7 +226,7 @@ import Criterion.Main
 import Data.Vector.Unboxed.Mutable       (unsafeNew, unsafeSlice, unsafeWrite)
 import Data.Vector.Unboxed               as U (Vector, filter, foldM',
                                               fromList, length, unsafeFreeze)
-import GHC.Exts                          (Int (I#), (.>=#))
+import GHC.Exts                          (Int (I#), (>=$#))
 import System.Random                     (RandomGen, mkStdGen, randoms)
 import Prelude                    hiding (filter, length)

@@ -227,7 +237,7 @@ filterN vec = runST $ do
  fVec <- unsafeNew size
  let put i x = do
        let !(I# v) = x
-            inc     = I# (v .>=# 0#)
+            inc     = I# (v >=$# 0#)
        unsafeWrite fVec i x
        return $ i + inc
  fSize <- foldM' put 0 vec
@@ -258,7 +268,6 @@ benchConfig :: Config
 benchConfig = defaultConfig {
             cfgPerformGC = ljust True
           }
-
 ```


@@ -270,4 +279,6 @@ ghc -O2 -fllvm -optlo-O3 Main.hs
 ```


-Benchmarking shows that `filterN` function is 60% faster than the `filter` function based on stream fusion (tested for unboxed vectors containing 10 thousand and 10 million elements).
+Benchmarking shows that `filterN` function is about 55-65% faster than the `filter` function based on stream fusion (tested for unboxed vectors containing 10 thousand and 10 million elements). Below is an example benchmarking report from criterion:
+
+[](http://ics.p.lodz.pl/~stolarek/ghc/prim-bool-criterion.png)
\ No newline at end of file