Bitwise operations on unboxed array are slow when one output byte depends on two or three input bytes
Summary
This code does four operations on an unboxed array of bytes:
- Partition it into three equal parts (with 0-2 bytes left over) and mix them
- Look up every byte in an s-box, which is another unboxed array
- Rotate the block by its population count
- Add the round constant to each byte. Steps 1 and 3 take 41.2% and 38.8% of the time, as of 2023-09-01. The other two steps combined take only 16.5% of the time. (The profile on that date was of a decryption, so the steps were in reverse order; they are numbered here for an encryption.) The Haskell program takes 6 to 7 times as long as the Rust program. Geekosaur or int-e told me on IRC after reading Core that it's boxing and unboxing too much.
If you look back to 2023-08-25 in the Git log, you'll see that I precomputed the array of round constants because xorn, which does bitwise operations on the bytes of a single number, was taking too long.
Steps to reproduce
Wring.cabal Wring.hs is the boiled-down code of WringTwistree, which includes also a Rust version.
The attached, boiled-down version: time cabal new-run wring
The complete version: Create a 1 MiB file, say /tmp/1meg. Run time stack run -- -k aerate -e /tmp/1meg -o /tmp/1meg.crypt. If you have Rust, do the same with cargo instead of stack.
Expected behavior
The Haskell program should take at most twice as long as the Rust program, about five seconds on my box. It actually takes about 16 s when run with Stack and about a minute when run with Cabal.
Environment
- GHC version used: 8.6.5 with Cabal, 9.4.5 with Stack.
Optional:
- Operating System: Focal Fossa
- System Architecture: AMD Ryzen 5 3600X 6-Core Processor 3.8 or 3.9 GHz