Bitwise operations on unboxed array are slow when one output byte depends on two or three input bytes
Summary
This code does four operations on an unboxed array of bytes:
- Partition it into three equal parts (with 0-2 bytes left over) and mix them
- Look up every byte in an s-box, which is another unboxed array
- Rotate the block by its population count
- Add the round constant to each byte. Steps 1 and 3 take 41.2% and 38.8% of the time, as of 2023-09-01. The other two steps combined take only 16.5% of the time. (The profile on that date was of a decryption, so the steps were in reverse order; they are numbered here for an encryption.) The Haskell program takes 6 to 7 times as long as the Rust program. Geekosaur or int-e told me on IRC after reading Core that it's boxing and unboxing too much.
If you look back to 2023-08-25 in the Git log, you'll see that I precomputed the array of round constants because xorn
, which does bitwise operations on the bytes of a single number, was taking too long.
Steps to reproduce
Wring.cabal Wring.hs is the boiled-down code of WringTwistree, which includes also a Rust version.
The attached, boiled-down version: time cabal new-run wring
The complete version: Create a 1 MiB file, say /tmp/1meg
. Run time stack run -- -k aerate -e /tmp/1meg -o /tmp/1meg.crypt
. If you have Rust, do the same with cargo
instead of stack
.
Expected behavior
The Haskell program should take at most twice as long as the Rust program, about five seconds on my box. It actually takes about 16 s when run with Stack and about a minute when run with Cabal.
Environment
- GHC version used: 8.6.5 with Cabal, 9.4.5 with Stack.
Optional:
- Operating System: Focal Fossa
- System Architecture: AMD Ryzen 5 3600X 6-Core Processor 3.8 or 3.9 GHz