Perf regression when TypeFamilies extension is enabled
Summary
Enabling the TypeFamilies extension, without actually using it, changes how the core is optimized. In some cases this leads to severe performance degradation; see this issue for some numbers. Surprisingly, in other cases it leads to performance improvements. Ideally we would like to keep those improvements (even without TypeFamilies) but not have the regressions. In this issue I describe a minimal example and the differences in the generated core for the performance degradation case.
Steps to reproduce
1. git clone https://github.com/composewell/streamly.git && cd streamly
2. git checkout type-families-perf-regression
3. cabal build --write-ghc-environment-files=always
4. ghc -O2 -fspec-constr-recursive=16 -fmax-worker-args=16 -ddump-simpl -ddump-to-file product.hs
5. ./product
This generates the core for the degraded case, i.e. with the TypeFamilies extension enabled.
For the baseline, i.e. without the TypeFamilies extension:
- git checkout HEAD~2
- repeat steps 3 to 5 above
Core-2-Core pass outputs for good and bad cases
For convenient viewing, I have committed the core output for the good case and the bad case in the following files:
- good case core passes with -dsuppress-all
- good case core passes full
- bad case core passes with -dsuppress-all
- bad case core passes full
Preliminary Observations
The function of interest in the core is $wgo. Running diff product.good/product.dump-simpl product.bad/product.dump-simpl shows the following difference (the first one is the good case and the second one is the bad case):
< $s$wgo_sloM sc_sloL sc1_sloK
< = case ># sc1_sloK (+# ww2_sll9 100000#) of {
< __DEFAULT -> jump $s$wgo_sloM sc_sloL (+# sc1_sloK 1#);
< 1# -> jump exit_XU sc_sloL
---
> $s$wgo_slqH sc_slqG sc1_slqF sc2_slqE
> = case ># sc1_slqF (+# ww2_slkM 100000#) of {
> __DEFAULT ->
> jump $s$wgo_slqH sc_slqG (+# sc1_slqF 1#) (*# sc2_slqE sc1_slqF);
> 1# -> $wgo (-# ww_sll6 1#) sc_slqG
In the bad case we see a redundant (*# sc2_slqE sc1_slqF) computation being passed to $s$wgo_slqH. sc2_slqE is not used anywhere except in this argument.
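A quick way to check whether a given dump still carries the extra loop argument is to count the arguments bound on the $s$wgo definition line. The following is only a minimal sketch and not part of the original investigation: the helper name wgo_arity is hypothetical, and it assumes the definition appears on a single line shaped like the excerpt above ($s$wgo_<unique> followed by its arguments), and that the first line starting with that binder is the definition rather than a call.

```shell
#!/bin/sh
# wgo_arity FILE
# Hypothetical helper: print the number of arguments bound on the first line
# that starts with a specialised worker binder, i.e. a line shaped like
#   $s$wgo_<unique> arg1 arg2 ...
# Assumes the compact layout shown in the diff excerpt.
wgo_arity() {
    awk '
        # Strip diff markers ("< " or "> ") and any leading indentation.
        { sub(/^[<>] /, ""); sub(/^[ \t]+/, "") }
        # Call sites written as "jump $s$wgo_..." have "jump" as the first
        # field, so only lines beginning with the binder itself match here.
        index($1, "$s$wgo_") == 1 { print NF - 1; exit }
    ' "$1"
}
```

It could be used as wgo_arity product.good/product.dump-simpl and wgo_arity product.bad/product.dump-simpl. For the two definition lines quoted in the diff above it reports 2 for the good case and 3 for the bad case.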
What's causing this difference in the core? Let's look at the core-2-core passes. The first difference appears in the second simplifier phase (use vimdiff product.good.full/product.05-* product.bad.full/product.05-* in the repo). We observe here that in the function go (called from main):
- in the good case the type of step contains a forall p, whereas in the bad case it is specialized to the actual type.
- in the good case we see a joinrec inside a let, whereas in the bad case we see a letrec.
This difference carries over into the subsequent passes. It can be viewed by changing the core pass numbers and diffing the committed core output, e.g. vimdiff product.good/product.06-* product.bad/product.06-*
After pass 12 (vimdiff product.good/product.12-* product.bad/product.12-*), i.e. the simplifier pass after worker wrapper binds, we can see that the redundant argument performing the multiply operation is gone in the good case (also, joins 2/2) but remains in the bad case (joins 0/3). That is what we see in the final core output as well.
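The pass-by-pass comparison described above can be partly automated. The sketch below is not from the original investigation: the function name first_diff_pass is hypothetical, and it assumes that both directories contain per-pass files with identical names (for example product.05-...), which is how the committed core outputs appear to be laid out. It walks the files in sorted order and prints the first one whose contents differ.

```shell
#!/bin/sh
# first_diff_pass GOOD_DIR BAD_DIR
# Hypothetical helper: print the name of the first file (in glob/sorted order)
# that exists in both directories but whose contents differ.
# Prints nothing and returns non-zero if every shared file is identical.
first_diff_pass() {
    good=$1
    bad=$2
    for f in "$good"/*; do
        name=$(basename "$f")
        # Skip files that have no counterpart in the other directory.
        [ -f "$bad/$name" ] || continue
        # cmp -s is silent and exits non-zero when the files differ.
        if ! cmp -s "$f" "$bad/$name"; then
            printf '%s\n' "$name"
            return 0
        fi
    done
    return 1
}
```

For example, first_diff_pass product.good.full product.bad.full would name the file to open first in vimdiff. The comparison is byte-wise, so it also flags passes that differ only in generated unique names; treat the result as a starting point for manual inspection rather than as proof of a semantic difference.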
Expected behavior
Performance is expected to remain the same when the TypeFamilies extension is enabled but not used.
Environment
- GHC version used:
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 8.8.3
Optional:
- Operating System: Mac OS X
- System Architecture: x86_64