Some bytestring benchmarks got slower in GHC 9.0.1

Reproduction

Check out bytestring at the current master, then run:
cabal bench --benchmark-options '--csv bench-8.10.4.csv' -w ghc-8.10.4
cabal bench --benchmark-options '--baseline bench-8.10.4.csv --fail-if-slower=3 --hide-successes' -w ghc-9.0.1
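
To inspect the slowdowns outside of the test runner, the comparison can be approximated by hand. The following is a minimal sketch, not tasty-bench's actual implementation: it assumes the CSV files have a `Name` column and a `Mean (ps)` column (adjust the column names if your files differ), it ignores the standard deviation, and the function names are hypothetical.

```python
import csv
import io


def read_means(csv_text):
    """Map benchmark name -> mean time, read from CSV text.

    Assumes columns named 'Name' and 'Mean (ps)'; this is an assumption
    about the CSV layout, not something stated in this report.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Name"]: float(row["Mean (ps)"]) for row in reader}


def slowdown_percent(baseline, candidate):
    """Relative slowdown of candidate vs. baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


def regressions(baseline_csv, candidate_csv, threshold_percent=3.0):
    """List (name, percent) for benchmarks slower than the threshold.

    Benchmarks missing from the candidate run are skipped. The result is
    sorted with the largest slowdown first.
    """
    base = read_means(baseline_csv)
    cand = read_means(candidate_csv)
    result = []
    for name, old in base.items():
        new = cand.get(name)
        if new is None or old <= 0:
            continue
        pct = slowdown_percent(old, new)
        if pct > threshold_percent:
            result.append((name, pct))
    return sorted(result, key=lambda item: item[1], reverse=True)
```

This needs a second CSV for the newer compiler, for example by also passing `--csv bench-9.0.1.csv` (a hypothetical file name) to the ghc-9.0.1 run, and then calling `regressions(open("bench-8.10.4.csv").read(), open("bench-9.0.1.csv").read())`.
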

Expected results

No benchmarks have become slower.

Actual results

All
  folds
    mapAccumL
      8:                                                      FAIL (0.15s)
        116 ns ±  11 ns, 13% slower than baseline
      16:                                                     FAIL (0.23s)
        193 ns ±  11 ns, 15% slower than baseline
      32:                                                     FAIL (0.23s)
        361 ns ±  21 ns, 18% slower than baseline
      64:                                                     FAIL (0.21s)
        681 ns ±  43 ns, 18% slower than baseline
      128:                                                    FAIL (0.20s)
        1.3 μs ±  87 ns, 22% slower than baseline
      256:                                                    FAIL (0.20s)
        2.6 μs ± 169 ns, 23% slower than baseline
      512:                                                    FAIL (0.19s)
        5.1 μs ± 359 ns, 23% slower than baseline
      1024:                                                   FAIL (0.20s)
         10 μs ± 820 ns, 23% slower than baseline
      2048:                                                   FAIL (0.20s)
         21 μs ± 1.4 μs, 25% slower than baseline
      4096:                                                   FAIL (0.18s)
         41 μs ± 2.7 μs, 25% slower than baseline
      8192:                                                   FAIL (0.19s)
         82 μs ± 5.4 μs, 25% slower than baseline
      16384:                                                  FAIL (0.18s)
        165 μs ±  11 μs, 25% slower than baseline
      32768:                                                  FAIL (0.19s)
        329 μs ±  21 μs, 25% slower than baseline
      65536:                                                  FAIL (0.18s)
        716 μs ±  48 μs, 22% slower than baseline
    mapAccumR
      8:                                                      FAIL (0.15s)
        112 ns ±  11 ns, 11% slower than baseline
      16:                                                     FAIL (0.24s)
        187 ns ±  10 ns, 14% slower than baseline
      32:                                                     FAIL (0.22s)
        350 ns ±  23 ns, 21% slower than baseline
      64:                                                     FAIL (0.21s)
        651 ns ±  51 ns, 18% slower than baseline
      128:                                                    FAIL (0.19s)
        1.3 μs ±  84 ns, 18% slower than baseline
      256:                                                    FAIL (0.19s)
        2.5 μs ± 174 ns, 18% slower than baseline
      512:                                                    FAIL (0.19s)
        4.9 μs ± 368 ns, 20% slower than baseline
      1024:                                                   FAIL (0.19s)
        9.7 μs ± 671 ns, 18% slower than baseline
      2048:                                                   FAIL (0.19s)
         20 μs ± 1.9 μs, 21% slower than baseline
      4096:                                                   FAIL (0.18s)
         39 μs ± 2.6 μs, 20% slower than baseline
      8192:                                                   FAIL (0.17s)
         79 μs ± 6.2 μs, 21% slower than baseline
      16384:                                                  FAIL (0.17s)
        156 μs ±  11 μs, 20% slower than baseline
      32768:                                                  FAIL (0.17s)
        312 μs ±  23 μs, 21% slower than baseline
      65536:                                                  FAIL (0.18s)
        690 μs ±  50 μs, 18% slower than baseline
    scanr
      8192:                                                   FAIL (0.29s)
         64 μs ± 3.6 μs,  8% slower than baseline
  findIndexOrLength
    takeWhile:                                                FAIL (0.21s)
       92 μs ± 5.4 μs, 23% slower than baseline
    break:                                                    FAIL (0.22s)
       97 μs ± 5.3 μs, 28% slower than baseline
    group zero-one:                                           FAIL (0.28s)
      2.2 ms ±  89 μs, 18% slower than baseline
    groupBy (>=):                                             FAIL (0.16s)
      309 μs ±  22 μs, 12% slower than baseline
    groupBy (>):                                              FAIL (0.33s)
      2.5 ms ± 125 μs, 11% slower than baseline
  findIndex_
    findIndices:                                              FAIL (0.23s)
       24 μs ± 1.4 μs, 23% slower than baseline
  findIndexEnd
    findIndexEnd:                                             FAIL (0.23s)
       21 ns ± 1.4 ns, 10% slower than baseline
  traversals
    map (+1):                                                 FAIL (0.25s)
      472 μs ±  23 μs, 402121% slower than baseline
  BoundsCheckFusion
    Data.ByteString.Builder
      foldMap (left-assoc) (10000):                           FAIL (0.19s)
        330 μs ±  28 μs, 17% slower than baseline
      foldMap (right-assoc) (10000):                          FAIL (0.16s)
        276 μs ±  27 μs, 18% slower than baseline
      foldMap [manually fused, right-assoc] (10000):          FAIL (0.27s)
        244 μs ±  11 μs,  5% slower than baseline
  Indices
    ByteString index equality inlining
      FindIndices/non-inlined:                                FAIL (0.31s)
        9.7 ms ± 358 μs, 12% slower than baseline

41 out of 263 tests failed (56.86s)
(Ignore the 402121% regression in traversals.map (+1). This seems to be a buglet in tasty-bench.)
Edited Mar 15, 2021 by Simon Jakobi