Add Aarch64 clz, ctz and brev primops

added aarch64 runtime perf labels

marked the checklist item if your MR may break existing programs (e.g. touches base or causes the as completed

marked the checklist item ensure that your commits are either individually buildable or squashed as completed

marked the checklist item ensure that your commit messages describe what they do as completed

marked the checklist item have added source comments describing your change. For larger changes you as completed

changed title from WIP: Add Aarch64 clz, ctz and brev primops to Add Aarch64 clz, ctz and brev primops

added Blocked on Review label

We have to be careful with ctz/clz and 0: we return the word size, not 0.

Could you add the following test in testsuite/tests/codeGen/should_run/:

all.T:

...
test('CtzClz0', normal, compile_and_run, [''])

CtzClz0.hs:

{-# LANGUAGE CPP #-}
{-# LANGUAGE MagicHash #-}

module Main where

import GHC.Exts
import Control.Monad

#include <MachDeps.h>

{-# OPAQUE x #-} -- needed to avoid triggering constant folding
x :: Word
x = 0

main :: IO ()
main = do
  let !(W# w) = x

  guard (W# (ctz# w) == WORD_SIZE_IN_BITS)
  guard (W# (ctz8# w) == 8)
  guard (W# (ctz16# w) == 16)
  guard (W# (ctz32# w) == 32)

  guard (W# (clz# w) == WORD_SIZE_IN_BITS)
  guard (W# (clz8# w) == 8)
  guard (W# (clz16# w) == 16)
  guard (W# (clz32# w) == 32)

I couldn't find what AArch64's CLZ/CTZ instructions do in this case. So we'd better add a test.

Thanks @hsyl20, I'll try and find some time this week to add that.

I have confirmed that the clz instruction does return 32/64 for 32/64-bit zeros, according to https://www.scs.stanford.edu/~zyedidia/arm64/clz_int.html (then here and then here - the pseudo-code for HighestSetBit returns -1 when passed zero, so you end up with N - (-1 + 1) = N). I've also tested in C just to sanity check.
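For readers following along, the zero case can be modelled in portable C. This is only a sketch of the pseudo-code as described in the comment above, not the Arm pseudo-code itself; the names `highest_set_bit`, `count_leading_zero_bits` and `check_zero_case` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the most significant set bit among the low n bits of x,
 * or -1 if none of them is set (the zero case discussed above). */
static int highest_set_bit(uint64_t x, int n)
{
    for (int i = n - 1; i >= 0; i--) {
        if ((x >> i) & 1u) {
            return i;
        }
    }
    return -1;
}

/* Leading zeros of an n-bit value, computed as N - (HighestSetBit(x) + 1). */
static int count_leading_zero_bits(uint64_t x, int n)
{
    return n - (highest_set_bit(x, n) + 1);
}

/* Hypothetical self-check: a zero input yields the operand width. */
static void check_zero_case(void)
{
    assert(count_leading_zero_bits(0, 32) == 32);
    assert(count_leading_zero_bits(0, 64) == 64);
}
```

Calling `check_zero_case()` from a `main` exercises the same property the new CtzClz0 test checks for `clz#`, but only against this model, not against the hardware instruction.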

Thanks for taking a look, and prompting me to question myself!

These tests have been added, and pass on my machine.

I also wanted to propose a change to the C implementations of clz/ctz based off this MR, which would look like the following:

diff --git a/libraries/ghc-prim/cbits/clz.c b/libraries/ghc-prim/cbits/clz.c
index b0637b5dfe0..9b20307ba03 100644
--- a/libraries/ghc-prim/cbits/clz.c
+++ b/libraries/ghc-prim/cbits/clz.c
@@ -10,13 +10,15 @@
 StgWord
 hs_clz8(StgWord x)
 {
-  return (uint8_t)x ? __builtin_clz((uint8_t)x)-24 : 8;
+  return __builtin_clz((x << 24) | (1 << 23));
 }
 
 StgWord
 hs_clz16(StgWord x)
 {
-  return (uint16_t)x ? __builtin_clz((uint16_t)x)-16 : 16;
+  return __builtin_clz((x << 16) | (1 << 15));
 }
 
 StgWord
diff --git a/libraries/ghc-prim/cbits/ctz.c b/libraries/ghc-prim/cbits/ctz.c
index 755ad6e0b35..2c0fc2acdfc 100644
--- a/libraries/ghc-prim/cbits/ctz.c
+++ b/libraries/ghc-prim/cbits/ctz.c
@@ -10,19 +10,22 @@
 StgWord
 hs_ctz8(StgWord x)
 {
-  return (uint8_t)x ? __builtin_ctz(x) : 8;
+  return __builtin_ctz(x | 0x100);
 }
 
 StgWord
 hs_ctz16(StgWord x)
 {
-  return (uint16_t)x ? __builtin_ctz(x) : 16;
+  return __builtin_ctz(x | 0x10000);
 }
 
 StgWord
 hs_ctz32(StgWord x)
 {
-  return (uint32_t)x ? __builtin_ctz(x) : 32;
+  return hs_ctz64((StgWord64)x | 0x100000000ULL);
 }
 #else
 # error no suitable __builtin_ctz() found

It looks like GCC and Clang aren't smart enough to compile the existing implementations to branch-free code, at least on aarch64. I'm wondering if this is sufficiently related to the other changes in the MR to include as well.
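A quick way to convince oneself that the sentinel-bit forms in the diff agree with the branching ones is to compare them over every 8-bit input. The sketch below assumes GCC or Clang builtins and a 32-bit `unsigned int`; the function names are hypothetical, `uint64_t` stands in for `StgWord`, and the casts are written out explicitly rather than copied from the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Branching forms, mirroring the lines removed in the diff. */
static unsigned clz8_branch(uint64_t x)
{
    return (uint8_t)x ? (unsigned)__builtin_clz((uint8_t)x) - 24u : 8u;
}

static unsigned ctz8_branch(uint64_t x)
{
    return (uint8_t)x ? (unsigned)__builtin_ctz((unsigned)x) : 8u;
}

/* Sentinel forms, mirroring the lines added in the diff: a guard bit just
 * outside the 8-bit range keeps the builtin's argument non-zero, so the
 * zero input naturally produces 8. */
static unsigned clz8_sentinel(uint64_t x)
{
    return (unsigned)__builtin_clz((unsigned)(x << 24) | (1u << 23));
}

static unsigned ctz8_sentinel(uint64_t x)
{
    return (unsigned)__builtin_ctz((unsigned)x | 0x100u);
}

/* Counts the 8-bit inputs on which the two forms disagree. */
static unsigned mismatches_8bit(void)
{
    unsigned bad = 0;
    for (uint64_t x = 0; x < 256; x++) {
        if (clz8_branch(x) != clz8_sentinel(x)) bad++;
        if (ctz8_branch(x) != ctz8_sentinel(x)) bad++;
    }
    return bad;
}
```

`mismatches_8bit()` should return 0 if the two forms are equivalent on 8-bit inputs. It says nothing about the generated code, so whether either form is branch-free still has to be checked in the compiler output.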

Pinging @hvr as the author of clz.c & ctz.c.

I would suggest creating another ticket/MR for these changes. I'm not sure about the StgWord64 cast; maybe we could use __builtin_ffs instead, which is defined for a 0 argument.
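One way the `__builtin_ffs` idea might look for the 32-bit case is sketched below. It assumes GCC or Clang; `ctz32_ffs` is a hypothetical name, and `uint64_t` stands in for `StgWord`. `__builtin_ffs` returns one plus the index of the least significant set bit, or 0 for a zero argument, so it never hits the undefined case of `__builtin_ctz`.

```c
#include <assert.h>
#include <stdint.h>

/* Trailing zeros of the low 32 bits of x, returning 32 when they are all zero. */
static unsigned ctz32_ffs(uint64_t x)
{
    /* r is 0 when the low 32 bits are zero, else 1 + index of the lowest set bit. */
    int r = __builtin_ffs((int)(uint32_t)x);
    return r ? (unsigned)(r - 1) : 32u;
}
```

This still needs a select for the zero case, so whether it ends up any less branchy than the current code would have to be checked in the generated assembly.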

And thanks for adding the test!

added 1 commit

52a03fdf - Add tests from @hsyl20

added 4084 commits

52a03fdf...98597ad5 - 4079 commits from branch ghc:master
0d8b75a2 - Add CLZ and RBIT instructions
b5929ddb - Add implementation for Clz
b31c5128 - Add implementation for Ctz
e302e08b - Add implementation for BRev
57eb5cb1 - Add tests from @hsyl20

marked the checklist item add a [testcase to the testsuite][adding test]. as completed

LGTM! Please squash the commits (I believe they can all be merged into one).

added test-primops label

Please write a commit message before merging. I applied the test-primops label so that the test-primops job will run on this branch.

Thanks Matthew, I have squashed the commits and written a more explanatory commit message - let me know if it's missing anything. I wasn't aware of that label; I feel like it might have been useful on previous MRs of mine - good to know!

I noticed that the pipeline failed on aarch64-darwin, but it doesn't fail on my Mac, and the failing tests look unrelated to my changes; I'm hoping this is just the macOS runner being temperamental and that this pipeline runs fine.