random benchmarks are up to 65x slower in GHC 9.0.1
Reproduction
git clone https://github.com/haskell/random
cd random
git checkout b55aaa4
cabal bench -w ghc-8.10.4 --ghc-options '-fproc-alignment=64' --benchmark-options '--csv 8.10.4.csv --hide-successes' random:bench
cabal bench -w ghc-9.0.1 --ghc-options '-fproc-alignment=64' --benchmark-options '--baseline 8.10.4.csv --csv 9.0.1.csv --hide-successes --fail-if-slower 50' random:bench
Expected results
No benchmarks become slower.
Actual results
All
pure
uniformR
full
Word8: FAIL (2.59s)
10 ms ± 499 μs, 4576% slower than baseline
Word16: FAIL (3.64s)
14 ms ± 1.3 ms, 6536% slower than baseline
Word32: FAIL (4.86s)
9.5 ms ± 476 μs, 5057% slower than baseline
Int8: FAIL (2.96s)
12 ms ± 624 μs, 1756% slower than baseline
Int16: FAIL (1.50s)
358 μs ± 33 μs, 56% slower than baseline
Char: FAIL (1.94s)
15 ms ± 1.5 ms, 2690% slower than baseline
CChar: FAIL (1.51s)
12 ms ± 1.1 ms, 1688% slower than baseline
CSChar: FAIL (3.10s)
12 ms ± 430 μs, 1822% slower than baseline
CUChar: FAIL (2.68s)
11 ms ± 464 μs, 4793% slower than baseline
CUShort: FAIL (1.81s)
14 ms ± 858 μs, 6366% slower than baseline
CUInt: FAIL (1.27s)
10 ms ± 929 μs, 5263% slower than baseline
excludeMax
Word8: FAIL (2.61s)
10 ms ± 903 μs, 4358% slower than baseline
Word16: FAIL (1.81s)
14 ms ± 1.2 ms, 5920% slower than baseline
Word32: FAIL (7.41s)
15 ms ± 233 μs, 6141% slower than baseline
Word64: FAIL (1.30s)
318 μs ± 29 μs, 58% slower than baseline
Word: FAIL (2.60s)
316 μs ± 29 μs, 63% slower than baseline
Int8: FAIL (3.08s)
12 ms ± 947 μs, 1760% slower than baseline
Int16: FAIL (1.44s)
352 μs ± 28 μs, 51% slower than baseline
Int64: FAIL (2.71s)
333 μs ± 16 μs, 52% slower than baseline
Int: FAIL (1.34s)
331 μs ± 32 μs, 51% slower than baseline
Char: FAIL (1.86s)
15 ms ± 1.1 ms, 2634% slower than baseline
CChar: FAIL (3.00s)
12 ms ± 568 μs, 1702% slower than baseline
CSChar: FAIL (3.02s)
12 ms ± 452 μs, 1677% slower than baseline
CUChar: FAIL (1.31s)
10 ms ± 875 μs, 4354% slower than baseline
CShort: FAIL (2.87s)
350 μs ± 24 μs, 51% slower than baseline
CUShort: FAIL (1.81s)
14 ms ± 1.3 ms, 2385% slower than baseline
CUInt: FAIL (1.78s)
14 ms ± 913 μs, 2441% slower than baseline
CULong: FAIL (2.56s)
315 μs ± 13 μs, 59% slower than baseline
CSize: FAIL (2.57s)
315 μs ± 22 μs, 60% slower than baseline
CSigAtomic: FAIL (3.07s)
377 μs ± 24 μs, 62% slower than baseline
CULLong: FAIL (2.55s)
313 μs ± 20 μs, 58% slower than baseline
CUIntPtr: FAIL (2.55s)
313 μs ± 22 μs, 57% slower than baseline
CUIntMax: FAIL (1.26s)
306 μs ± 27 μs, 53% slower than baseline
includeHalf
Word8: FAIL (2.63s)
10 ms ± 448 μs, 4244% slower than baseline
Word16: FAIL (1.78s)
14 ms ± 1.2 ms, 5732% slower than baseline
Word32: FAIL (3.57s)
14 ms ± 893 μs, 5551% slower than baseline
Word64: FAIL (2.69s)
329 μs ± 21 μs, 54% slower than baseline
Word: FAIL (1.34s)
333 μs ± 32 μs, 56% slower than baseline
Int8: FAIL (3.41s)
13 ms ± 659 μs, 1063% slower than baseline
Char: FAIL (7.60s)
15 ms ± 288 μs, 2455% slower than baseline
CChar: FAIL (1.73s)
14 ms ± 915 μs, 1093% slower than baseline
CSChar: FAIL (1.75s)
14 ms ± 951 μs, 1066% slower than baseline
CUChar: FAIL (1.26s)
9.9 ms ± 859 μs, 3963% slower than baseline
CUShort: FAIL (1.78s)
14 ms ± 840 μs, 5768% slower than baseline
CUInt: FAIL (1.81s)
14 ms ± 1.3 ms, 5454% slower than baseline
CULong: FAIL (2.71s)
330 μs ± 20 μs, 57% slower than baseline
CSize: FAIL (2.72s)
336 μs ± 27 μs, 57% slower than baseline
CULLong: FAIL (2.66s)
327 μs ± 18 μs, 54% slower than baseline
CUIntPtr: FAIL (2.72s)
330 μs ± 17 μs, 56% slower than baseline
CUIntMax: FAIL (2.84s)
349 μs ± 25 μs, 65% slower than baseline
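To relate the percentages above to the "65x" in the title, here is a minimal sketch of the arithmetic. It assumes that "N% slower than baseline" means the measured time is (1 + N/100) times the baseline time. The function names are hypothetical and are not part of tasty-bench or random.

```haskell
module Main (main) where

-- Convert an "N% slower than baseline" figure into a multiplicative factor,
-- assuming measured = baseline * (1 + N / 100).
slowdownFactor :: Double -> Double
slowdownFactor pct = 1 + pct / 100

-- Estimate the baseline time from a measured time and the reported percentage.
-- Both times are in the same unit (microseconds in the usage below).
estimatedBaseline :: Double -> Double -> Double
estimatedBaseline measured pct = measured / slowdownFactor pct

main :: IO ()
main = do
  -- uniformR/full/Word16 above: 14 ms measured, 6536% slower than baseline.
  print (slowdownFactor 6536)
  print (estimatedBaseline 14000 6536)
```

Under that assumption, the worst case above (6536% slower) corresponds to roughly 66 times the baseline time, while the entries in the 50-65% range correspond to a factor of about 1.5 to 1.65.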
It seems that inlining in GHC 9.0.1 behaves differently from how it did in GHC 8.10.4. I wonder whether this is the same issue as #19557 (closed).
Enforcing more inlining via pragmas (c9471d4) mitigates the most outrageous regressions, but still does not bring performance back to baseline levels:
All
pure
uniformR
full
Int8: FAIL (1.57s)
380 μs ± 34 μs, 58% slower than baseline
Int16: FAIL (1.57s)
387 μs ± 31 μs, 62% slower than baseline
CChar: FAIL (1.54s)
376 μs ± 27 μs, 59% slower than baseline
CSChar: FAIL (1.56s)
384 μs ± 27 μs, 60% slower than baseline
CWchar: FAIL (3.11s)
378 μs ± 21 μs, 54% slower than baseline
excludeMax
Word64: FAIL (1.29s)
315 μs ± 30 μs, 53% slower than baseline
Word: FAIL (1.38s)
340 μs ± 33 μs, 67% slower than baseline
Int8: FAIL (1.61s)
389 μs ± 28 μs, 57% slower than baseline
Int64: FAIL (1.45s)
353 μs ± 33 μs, 56% slower than baseline
CULong: FAIL (1.27s)
313 μs ± 26 μs, 54% slower than baseline
CSize: FAIL (2.60s)
317 μs ± 19 μs, 53% slower than baseline
CULLong: FAIL (1.28s)
314 μs ± 29 μs, 53% slower than baseline
CUIntPtr: FAIL (2.60s)
317 μs ± 28 μs, 54% slower than baseline
CIntMax: FAIL (1.42s)
350 μs ± 32 μs, 52% slower than baseline
CUIntMax: FAIL (1.28s)
316 μs ± 28 μs, 56% slower than baseline
includeHalf
Word64: FAIL (2.72s)
331 μs ± 24 μs, 54% slower than baseline
CULong: FAIL (2.66s)
324 μs ± 19 μs, 56% slower than baseline
CSize: FAIL (2.68s)
326 μs ± 21 μs, 54% slower than baseline
CULLong: FAIL (2.74s)
335 μs ± 24 μs, 59% slower than baseline
CUIntPtr: FAIL (1.36s)
329 μs ± 28 μs, 56% slower than baseline
CUIntMax: FAIL (2.88s)
352 μs ± 14 μs, 73% slower than baseline
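For readers unfamiliar with the kind of pragma-based change mentioned above, here is a minimal, self-contained sketch. It is not the content of c9471d4: the generator, the class `UniformR`, and the helper `boundedIntegral` are hypothetical stand-ins, and the range mapping uses `Integer` arithmetic for simplicity rather than speed. The point is only to show where `INLINE` pragmas go on a polymorphic helper and on an instance method, so that GHC can specialise the helper at the concrete type at each call site.

```haskell
module Main (main) where

import Data.Bits (shiftL, shiftR, xor)
import Data.Word (Word64, Word8)

-- Hypothetical pure generator state; not the type used by the random package.
newtype Gen = Gen Word64

-- One xorshift-style step (illustrative only; seed must be non-zero).
next :: Gen -> (Word64, Gen)
next (Gen s0) = (s3, Gen s3)
  where
    s1 = s0 `xor` (s0 `shiftL` 13)
    s2 = s1 `xor` (s1 `shiftR` 7)
    s3 = s2 `xor` (s2 `shiftL` 17)
{-# INLINE next #-}

-- Polymorphic helper. Without inlining or specialisation, calls go through
-- the Integral dictionary. Assumes lo <= hi; has modulo bias.
boundedIntegral :: Integral a => (a, a) -> Gen -> (a, Gen)
boundedIntegral (lo, hi) g = (fromInteger (toInteger lo + offset), g')
  where
    (w, g') = next g
    range = toInteger hi - toInteger lo + 1
    offset = toInteger w `mod` range
{-# INLINE boundedIntegral #-}

-- Hypothetical class standing in for a range-sampling interface.
class UniformR a where
  uniformR' :: (a, a) -> Gen -> (a, Gen)

instance UniformR Word8 where
  uniformR' = boundedIntegral
  -- Ask GHC to inline the instance method at call sites.
  {-# INLINE uniformR' #-}

main :: IO ()
main = print (go (1000 :: Int) (Gen 88172645463325252) 0)
  where
    -- Strict accumulation loop, similar in spirit to a benchmark body.
    go :: Int -> Gen -> Word64 -> Word64
    go 0 _ acc = acc
    go n g acc =
      let (x, g') = uniformR' (10 :: Word8, 20) g
          acc' = acc + fromIntegral x
      in acc' `seq` go (n - 1) g' acc'
```

Whether such pragmas actually take effect can be checked by compiling with `-O2 -ddump-simpl` under both compilers and comparing the resulting Core for the loop.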
CC @lehins