Some text benchmarks appear to be slower in HEAD (vs 9.2.1) - partial register stall with x86 byte operations

Reproduction

Check out text:

git clone https://github.com/Bodigrim/text
cd text
git checkout ff31f1451fb3ce56b82ceb7e7379647d6f70a90e

git clone https://github.com/bos/text-test-data benchmarks/text-test-data
cd benchmarks/text-test-data
make
cd ../..

Run ascii-small benchmarks with GHC 9.2.1 (it will take around five minutes):

cabal bench -w ghc-9.2.1 --ghc-options '-fproc-alignment=64' --benchmark-options '--csv 9.2.csv --stdev 2 --timeout 100 --hide-successes -p ascii-small'

Compare against benchmarks with GHC HEAD (it will take around five minutes):

cabal run -w /path/to/head benchmarks --ghc-options="-fproc-alignment=64" --allow-newer -- -p '/ascii-small/' --baseline 9.2.csv --csv head.csv --fail-if-slower 20 --hide-successes

Two benchmarks regress by more than 50%:

[nix-shell:~/t19661/text]$ cabal run -w /home/matt/ghc-clean/_build-perf/stage1/bin/ghc benchmarks --allow-newer -- -p '/ascii-small/' --baseline 9.2.csv --csv head.csv --fail-if-slower 50 --hide-successes
Up to date
All
  Pure
    ascii-small
      length
        filter
          Text:         FAIL (0.25s)
            210  μs ±  16 μs, 81% slower than baseline
            Use -p '/ascii-small/&&/length.filter.Text/' to rerun this test only.
        filter.filter
          Text:         FAIL (0.23s)
            206  μs ±  12 μs, 67% slower than baseline
            Use -p '/ascii-small/&&/length.filter.filter.Text/' to rerun this test only.
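
Once both 9.2.csv and head.csv exist, the slowdowns can also be inspected offline without rerunning anything. The following is only a rough sketch: the function name slower_than is made up for illustration, and it assumes the files written by --csv have a header row followed by rows whose first field is the benchmark name and whose second field is the mean time, with no commas inside benchmark names. It is not meant to mirror how --fail-if-slower is implemented.

```shell
# Sketch: list benchmarks whose mean time in NEW exceeds BASE by more than PCT percent.
# Assumption: both files have one header line, then "name,mean,..." rows,
# and benchmark names contain no commas.
# Usage: slower_than 9.2.csv head.csv 20
slower_than() {
  base=$1 new=$2 pct=$3
  awk -F, -v pct="$pct" '
    FNR == 1 { next }                      # skip the header line of each file
    NR == FNR { mean[$1] = $2; next }      # first file: remember baseline means
    ($1 in mean) && mean[$1] > 0 {
      change = ($2 / mean[$1] - 1) * 100   # percent change relative to baseline
      if (change > pct) printf "%s: %.0f%% slower\n", $1, change
    }
  ' "$base" "$new"
}
```

Benchmarks that appear in only one of the two files are silently skipped, which is usually what you want when the benchmark set differs between compilers.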

Edited Sep 24, 2021 by Matthew Pickering
