Skip to content

Bitwise operations on unboxed array are slow when one output byte depends on two or three input bytes

Summary

This code does four operations on an unboxed array of bytes:

  1. Partition it into three equal parts (with 0-2 bytes left over) and mix them
  2. Look up every byte in an s-box, which is another unboxed array
  3. Rotate the block by its population count
  4. Add the round constant to each byte. Steps 1 and 3 take 41.2% and 38.8% of the time, as of 2023-09-01. The other two steps combined take only 16.5% of the time. (The profile on that date was of a decryption, so the steps were in reverse order; they are numbered here for an encryption.) The Haskell program takes 6 to 7 times as long as the Rust program. Geekosaur or int-e told me on IRC after reading Core that it's boxing and unboxing too much.

If you look back to 2023-08-25 in the Git log, you'll see that I precomputed the array of round constants because xorn, which does bitwise operations on the bytes of a single number, was taking too long.

Steps to reproduce

Wring.cabal Wring.hs is the boiled-down code of WringTwistree, which includes also a Rust version.

The attached, boiled-down version: time cabal new-run wring

The complete version: Create a 1 MiB file, say /tmp/1meg. Run time stack run -- -k aerate -e /tmp/1meg -o /tmp/1meg.crypt. If you have Rust, do the same with cargo instead of stack.

Expected behavior

The Haskell program should take at most twice as long as the Rust program, about five seconds on my box. It actually takes about 16 s when run with Stack and about a minute when run with Cabal.

Environment

  • GHC version used: 8.6.5 with Cabal, 9.4.5 with Stack.

Optional:

  • Operating System: Focal Fossa
  • System Architecture: AMD Ryzen 5 3600X 6-Core Processor 3.8 or 3.9 GHz
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information