Potential optimization when quotRem divisor is a power of 2
Reported by @cocreature on Twitter: https://twitter.com/cocreature/status/1158643061121015808
- Original reproducer: https://gist.github.com/cocreature/822114257227473ecff1638a88f07788
- Actual change in a project: https://github.com/digital-asset/daml/pull/2350/commits/fbde3f500aaaccb86c12bf249730d56aefc891c7
Here's a smaller reproducer: https://gist.github.com/osa1/391be0dfdf0f291468f750a0d2859e50
Comment out the
quotRem line and uncomment the next two lines. Results:
-- quotRem: benchmarking mangling new time 918.8 ns (916.1 ns .. 922.3 ns) 1.000 R² (1.000 R² .. 1.000 R²) mean 919.6 ns (916.8 ns .. 923.1 ns) std dev 10.39 ns (7.976 ns .. 13.56 ns) -- shift + mask benchmarking mangling new time 405.3 ns (404.0 ns .. 406.6 ns) 1.000 R² (1.000 R² .. 1.000 R²) mean 405.7 ns (404.4 ns .. 407.1 ns) std dev 4.365 ns (3.488 ns .. 5.456 ns)
So the shift + mask implementation is twice as fast.
As far as I can see we don't have any optimizations when quotRem divisor is a power of 2, and the implementation on
Int uses a primop, which is lowered to a
div instruction on x86.
It may worth adding a rewrite rule for this case and generate a shift + mask.
- GHC version used: 8.6.5