Runtime Perf: Allocations increased by 15x in GHC-9 vs GHC-8

Summary

See #19790 (closed) and #19861 (closed) for background. @simonpj looked at those issues.

I updated the repo (https://github.com/composewell/streamly-ghc9-regression) with reproduction code for another regression. This one is in the splitOnSeq operation, and the repro code is somewhat more complicated than in the previous issues.

Steps to reproduce

Pull the repo (the splitOnSeq branch, which is currently identical to master) and build it with the following command:

$ cabal build --with-compiler <compiler>

This will build the Main executable and also produce .dump-simpl files which can be found in the dist-newstyle directory tree.
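The dump files end up several directories deep, so a small helper can be used to list them. This is only a sketch: the function name `list_core_dumps` is made up for illustration, and it assumes the default `dist-newstyle` build directory unless another directory is passed.

```shell
# Hypothetical helper (not part of the repo): list every .dump-simpl file
# under a build directory. Defaults to cabal's dist-newstyle directory.
list_core_dumps() {
  find "${1:-dist-newstyle}" -type f -name '*.dump-simpl'
}
```

Running `list_core_dumps` from the repo root after the build should print the path of each generated core dump.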

After this step you can also build it directly with ghc, but make sure to pass the same optimization options as in the cabal file. For example:

$ ghc --make -O2 -ddump-to-file -ddump-simpl Main.hs

I have uploaded the core generated by ghc-8 and ghc-9 for quick review.

To run the program, first generate a 100MB input file by running this script:

$ ./mkInput.sh

Then run it:

$ time ./Main +RTS -s
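The figure to compare between compilers is the "bytes allocated in the heap" line of the `+RTS -s` summary, which the RTS writes to stderr. A minimal sketch for extracting just that number follows; the function name `allocated_bytes` is hypothetical and not part of the repo.

```shell
# Hypothetical helper: read a `+RTS -s` summary on stdin and print the
# "bytes allocated in the heap" figure with the thousands separators removed.
allocated_bytes() {
  awk '/bytes allocated in the heap/ { gsub(",", "", $1); print $1 }'
}

# Usage (stderr goes to the pipe, the program's stdout is discarded):
#   ./Main +RTS -s 2>&1 >/dev/null | allocated_bytes
```

Running this once per compiler build gives two plain numbers that can be compared directly.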

The allocations with GHC-8 are 123MB, while with GHC-9 they are around 2GB. The code processes each byte of the file in a loop to find the given substring. I have not yet been able to figure out what is being allocated with GHC-9; perhaps some parameter in the loop is being passed boxed in GHC-9?

Let me know if I can help in any further investigation.

Expected behavior

GHC-9 should produce code as efficient as GHC-8.

Environment

GHC version used: ghc-9.3+!5658 (closed)
