MutableByteArray# is slower than Addr#
I've filed this against the native code generator, even thought it still occurs in my experience with the via-C path. It is significantly more pronounced with the NCG path, and so, I imagine, is likely to be fixed by working on that. However, feel free to reclassify if appropriate.
When attempting to rewrite the fannkuch benchmark of the computer language benchmarks game to use more realistic code, I discovered that I am currently unable to do so. At least one piece of the puzzle is that the current entry uses Ptr and Addr# for arrays, while the current mutable array libraries (ST arrays and uvector) use MutableByteArray# underneath, and it happens that reads/writes on the latter are slower than the former.
After some discussion on the glasgow-haskell-users list, I believe I've worked the kinks out of my (even more micro) benchmark programs, and will attach them here. They accept two numbers, k and n, and reverse an array of size-n k times. The larger fannkuch benchmark works on small arrays (reversing, shifting and copying), so using small arrays with a very large number of iterations is more representative of it. However, my benchmark doesn't show terribly much variation when one varies the parameters by corresponding factors (although there is a drop in performance when the size of the array nears the size of the CPU cache, naturally).
I've test machine is an AMD Athlon 64 3000+ running in 64-bit mode with 2 GB of ram. My operating system is Linux, kernel 2.6.24.
I've mainly done runs where n * k = 2.5 billion, as that takes approximately as much time as the MutableByteArray# version of the fannkuch benchmark, around 11 seconds for 250 million iterations with an array size of 10. The Addr# version of the benchmark runs with the same parameters in around 7 seconds (which is faster than the actual fannkuch benchmark, but I digress).
I'll attach my two micro benchmarks. ByteArr.hs is the MutableByteArray# reversal benchmark. Ptr.hs is the Addr#/Ptr version. I may also attach full fannkuch implementations as another data point (the original is already Addr#/Ptr, and I've written a MutableByteArray# implementation; I just need to do some proofreading of the latter to make sure it doesn't have any discrepancies).
I've also not done much investigation as to the source of the problem, besides looking at the core of various higher level programs to track things down this far. I'll attempt to delve further into the issue, investigating the C-- and assembly if my abilities allow, and update here as I find things.