Add Aarch64 clz, ctz and brev primops
Adds assembly implementations of the count leading zeros, count trailing zeros and bit reverse primops for Aarch64 for W8-W64 sizes. The code produced here appears to be better than what the C compiler produces for hs_clz8
and hs_clz16
, which uses conditional moves. I'll make a PR changing the implementations in clz.c
etc. so they can be a little faster on all platforms that use them too.
I also made some of the case statements in compiler/GHC/CmmToAsm/AArch64/Instr.hs
explicitly match all constructors so editor tools can let you know when an instruction is added but the necessary cases aren't handled - it's bitten me a few times already.
Please take a few moments to address the following points:
-
if your MR may break existing programs (e.g. touches base
or causes the compiler to reject programs), please describe the expected breakage and add the user-facing label. This will run ghc/head.hackage> to characterise the effect of your change on Hackage. -
ensure that your commits are either individually buildable or squashed -
ensure that your commit messages describe what they do (referring to tickets using #NNNN
syntax when appropriate) -
have added source comments describing your change. For larger changes you likely should add a [Note][notes] and cross-reference it from the relevant places. -
add a [testcase to the testsuite][adding test]. -
updates the users guide if applicable -
mentions new features in the release notes for the next release
Merge request reports
Activity
added aarch64 runtime perf labels
added Blocked on Review label
We have to be careful with ctz/clz and 0: we return the word size, not 0.
Could you add the following test in
testsuite/tests/codeGen/should_run/
:all.T
:... test('CtzClz0', normal, compile_and_run, [''])
CtzClz0.hs
:{-# LANGUAGE CPP #-} {-# LANGUAGE MagicHash #-} module Main where import GHC.Exts import Control.Monad #include <MachDeps.h> {-# OPAQUE x #-} -- needed to avoid triggering constant folding x :: Word x = 0 main :: IO () main = do let !(W# w) = x guard (W# (ctz# w) == WORD_SIZE_IN_BITS) guard (W# (ctz8# w) == 8) guard (W# (ctz16# w) == 16) guard (W# (ctz32# w) == 32) guard (W# (clz# w) == WORD_SIZE_IN_BITS) guard (W# (clz8# w) == 8) guard (W# (clz16# w) == 16) guard (W# (clz32# w) == 32)
Edited by Sylvain HenryThanks @hsyl20, I'll try and find some time this week to add that.
I have confirmed that the
clz
instruction does return 32/64 for 32/64 bit zero's, according to https://www.scs.stanford.edu/~zyedidia/arm64/clz_int.html (then here and then here - the pseudo code forHighestSetBit
returns -1 when passed zero, so you end up withN (-1 + 1) = N
). I've also tested in C just to sanity check.Thanks for taking a look, and prompting me to question myself!
These tests have been added, and pass on my machine.
I also wanted to propose a change to the C implementations of clz/ctz based off this MR, which would look like the following:
diff --git a/libraries/ghc-prim/cbits/clz.c b/libraries/ghc-prim/cbits/clz.c index b0637b5dfe0..9b20307ba03 100644 --- a/libraries/ghc-prim/cbits/clz.c +++ b/libraries/ghc-prim/cbits/clz.c @@ -10,13 +10,15 @@ StgWord hs_clz8(StgWord x) { - return (uint8_t)x ? __builtin_clz((uint8_t)x)-24 : 8; + return __builtin_clz((x << 24) | (1 << 23)); } StgWord hs_clz16(StgWord x) { - return (uint16_t)x ? __builtin_clz((uint16_t)x)-16 : 16; + return __builtin_clz((x << 16) | (1 << 15)); } StgWord diff --git a/libraries/ghc-prim/cbits/ctz.c b/libraries/ghc-prim/cbits/ctz.c index 755ad6e0b35..2c0fc2acdfc 100644 --- a/libraries/ghc-prim/cbits/ctz.c +++ b/libraries/ghc-prim/cbits/ctz.c @@ -10,19 +10,22 @@ StgWord hs_ctz8(StgWord x) { - return (uint8_t)x ? __builtin_ctz(x) : 8; + return __builtin_ctz(x | 0x100); } StgWord hs_ctz16(StgWord x) { - return (uint16_t)x ? __builtin_ctz(x) : 16; + return __builtin_ctz(x | 0x10000); } StgWord hs_ctz32(StgWord x) { - return (uint32_t)x ? __builtin_ctz(x) : 32; + return hs_ctz64((StgWord64)x | 0x100000000ULL); } #else # error no suitable __builtin_ctz() found
It looks like GCC and Clang aren't smart enough to make this non-branching, at least on aarch64. I'm wondering if this is sufficiently related to the other changes in the MR to include as well.
Pinging @hvr as the author of clz.c & ctz.c.
I would suggest creating another ticket/MR for these changes. I'm not sure about the
StgWord64
cast, maybe we could use__builtin_ffs
instead which is defined for a0
argument.And thanks for adding the test!
Edited by Sylvain Henry
added test-primops label
Please write a commit message before merging, I applied the test-primops label so the test-primops job will run on this branch.
Edited by Matthew PickeringThanks Matthew, I have squashed the commits and written a more explanatory commit message - let me know if it's missing anything. I wasn't aware of that tag, I feel like it might have been useful on previous MRs of mine - good to know!
I noticed that the pipeline failed on aarch64-darwin, but it doesn't fail on my Mac, and the tests look unrelated to my changes; hoping this is just an issue with the macOS running being temperamental. Hopefully this pipeline runs fine.
added 1 commit
- 60c9598a - Add AArch64 CLZ, CTZ, RBIT primop implementations.
assigned to @marge-bot
removed Blocked on Review label
added 43 commits
-
60c9598a...9f614270 - 42 commits from branch
ghc:master
- d3af571a - Add AArch64 CLZ, CTZ, RBIT primop implementations.
-
60c9598a...9f614270 - 42 commits from branch