GHC issues
https://gitlab.haskell.org/ghc/ghc/-/issues

Improve join point inlining
https://gitlab.haskell.org/ghc/ghc/-/issues/3755
Simon Peyton Jones

This ticket relates to the paper "Optimising generics is easy" http://www.dreixel.net/research/pdf/ogie.pdf. Jose writes (about a case where inlining doesn't work well):

> I put a minimal source code for this test and resulting Core
> (with GHC 6.13.20091115) in https://subversion.cs.uu.nl/repos/staff.jpm.public/Inline/.
> `UpdateInt` behaves great: in `UpdateInt.core.O1.hs`, there are no traces of
> generic representations in testOne, testTwo, testThree and testFour.
>
> `UpdateString`, however, does not behave so well. In `UpdateString.core.O1.hs`,
> `Test.testLogic_updateString` still has loads of generic representations.
It's easy to see what is happening. Compile `UpdateString` (which I've attached to this ticket) with `-ddump-simpl`, and look at the core. You see stuff like
```
Rec { $j1_rx32 = \x. <big nested case expression>
; f = \y. ....($j1_rx32 <big constructor expression>)
      ...($j1_rx32 <big constructor expression>)....
}
```
So here the `$j` (which is a "join point") isn't inlined because it's big, although if it *were* inlined there would be much goodness because the case expressions would cancel with the explicit constructors.
Why did this happen despite lots of INLINE pragmas? I have not followed all the details, but I'm guessing that if we have, say
```
{-# INLINE from #-}
from = \x. case x of from_alts
{-# INLINE to #-}
to = \x. case x of to_alts
```
and we try to optimize this call:
```
from (mapT f (to x))
```
then after inlining `from`, `mapT`, and `to` we'll get
```
case (case (case x of to_alts) of map_alts) of from_alts
```
And now the case-of-case transform happens, which creates the join points to avoid duplicating map_alts and from_alts into every branch of to_alts. You may say that we've already said it's OK to duplicate `from` (and hence from_alts), but we duplicated it once when we inlined it, and then we forgot the origin of the code. And indeed, in the worst case you could get a quadratic blow-up, and there are only two functions involved. So I'm not unhappy with that.
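Schematically (a hand-written sketch, not actual `-ddump-simpl` output), case-of-case pushes the outer cases into each alternative of to_alts, and the outer alternatives are shared via a join point rather than being duplicated textually:

```
-- before: case (case (case x of to_alts) of map_alts) of from_alts
-- after case-of-case (sketch):
let $j y = case (case y of map_alts) of from_alts   -- the big outer cases, shared
in case x of {
     T1 ... -> $j rhs1;   -- each branch of to_alts now calls $j
     T2 ... -> $j rhs2 }
```

If `$j` is big, as in the `$j1_rx32` example above, the inliner then declines to inline it, even though inlining against the explicit constructors in `rhs1`/`rhs2` would let the cases cancel.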
However, it does make me wonder whether we could not do a better job on the above Rec {..}. Two thoughts occur.
1. We could beef up `SpecConstr`. It doesn't fire at the moment for some reason, even with -O2
1. If we have
```
f = \x. case x of { C1 x11..x1n -> e1; ... Ck xk1 ... xkm -> ek }
```
maybe we should worker-wrapper it to
```
f1 x1 .. x1n = e1
...
fn xk1 .. xkm = en
f = \x. case x of pi -> fi xi
```
and now inline f. The net effect is very similar to the way join points work right now, but it would make it multi-level. In fact, doing this might simplify and generalise the way that join points are currently done, where (rather arbitrarily) we duplicate the outer layer of a single case.
Simon
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 6.12.1 |
| Type | Task |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
Make bindings with multiple occurrences a join point instead of duplicating code during inlining
https://gitlab.haskell.org/ghc/ghc/-/issues/15573
Andreas Klebinger

I have some intermediate core of the form:
```
-- RHS size: {terms: 9, types: 2, coercions: 0, joins: 0/0}
cseAlts_s1dD [Occ=Once!T[1]] :: T -> Int#
[LclId, CallArity=1, Str=<S,1*U>]
cseAlts_s1dD
= \ (lamVar_s1dw [Occ=Once!, Dmd=<S,1*U>, OS=OneShot] :: T) ->
case lamVar_s1dw of wild_Xc [Dmd=<L,A>] {
__DEFAULT -> 1#;
B -> 2#;
C -> 3#
}
-- RHS size: {terms: 14, types: 3, coercions: 0, joins: 0/0}
$wfmerge_s1cZ [InlPrag=NOUSERINLINE[0]] :: T -> T -> Int#
[LclId, Arity=2, CallArity=2, Str=<S,1*U><L,1*U>]
$wfmerge_s1cZ
= \ (w_s1cU [Occ=Once!, Dmd=<S,1*U>] :: T)
(w_s1cV [Occ=Once*, Dmd=<L,1*U>] :: T) ->
case w_s1cU of wild_XA [Dmd=<L,A>] {
__DEFAULT -> -1#;
A -> 2#;
B -> cseAlts_s1dD w_s1cV;
C -> cseAlts_s1dD w_s1cV
}
```
After the simplifier ran, this got inlined into the branches, giving us:
```
fmerge
= \ (w_s1cU :: T) (w_s1cV :: T) ->
case w_s1cU of {
__DEFAULT -> GHC.Types.I# -1#;
A -> GHC.Types.I# 2#;
B ->
case w_s1cV of {
__DEFAULT -> GHC.Types.I# 1#;
B -> GHC.Types.I# 2#;
C -> GHC.Types.I# 3#
};
C ->
case w_s1cV of {
__DEFAULT -> GHC.Types.I# 1#;
B -> GHC.Types.I# 2#;
C -> GHC.Types.I# 3#
}
}
```
What I would really like GHC to do instead though is to make `cseAlts_s1dD` a join point when possible.
This would eliminate both the call overhead AND the call duplication.
The current behavior seems fine when we can't make it a join point. But when we can we should try to take advantage of that opportunity.
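For concreteness, the hoped-for output would look something like this (a hand-written sketch, not actual GHC output; since both call sites pass the same argument `w_s1cV`, it can be captured rather than passed):

```
fmerge
  = \ (w_s1cU :: T) (w_s1cV :: T) ->
      join {
        cseAlts = case w_s1cV of {
                    __DEFAULT -> GHC.Types.I# 1#;
                    B -> GHC.Types.I# 2#;
                    C -> GHC.Types.I# 3# }
      } in
      case w_s1cU of {
        __DEFAULT -> GHC.Types.I# -1#;
        A -> GHC.Types.I# 2#;
        B -> jump cseAlts;
        C -> jump cseAlts
      }
```

This keeps one copy of the alternatives and compiles the `jump`s to direct branches, avoiding both the duplication and the call overhead.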
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.4.3 |
| Type | Task |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
Performance regression (~34%) in 8.2.1, poor register allocation(?) in an inner loop over an array
https://gitlab.haskell.org/ghc/ghc/-/issues/13763
jberryman

Testing GHC 8.0.1 against the RC 8.2.0.20170507
I've distilled a smallish test-case from a much larger case in my 'hashabler' library, and validated that the same modifications also make that regression disappear in the real case. It's probably possible to get this smaller but I don't know if I'll have time to work on it more for a while:
repro3.hs:
```hs
{-# LANGUAGE BangPatterns #-}
module Main(main) where
import Criterion.Main
import qualified Data.Primitive as P
import Data.Bits
import Data.Word
import Control.DeepSeq
main :: IO ()
main = do
defaultMain
[ env (newByteArr64 5) $ \ ~bs ->
bench "ByteArray 5" $ nf (hashTextSip 99) bs
, env (newByteArr64 8) $ \ ~bs ->
bench "ByteArray 8" $ nf (hashTextSip 99) bs
, env (newByteArr64 512) $ \ ~bs ->
bench "ByteArray 512" $ nf (hashTextSip 99) bs
, env (newByteArr64 1000000) $ \ ~bs ->
bench "ByteArray 1000000" $ nf (hashTextSip 99) bs
]
instance NFData P.ByteArray where rnf _ = ()
newByteArr64 n = P.newAlignedPinnedByteArray (8*n) 8 >>= P.unsafeFreezeByteArray
sipRound :: Word64 -> Word64 -> Word64 -> Word64 -> (Word64, Word64, Word64, Word64)
{-# INLINE sipRound #-}
sipRound v0 v1 v2 v3 = (v3 `xor` v0, v0 `xor` v1, v1 `xor` v2, v2 `xor` v3)
hashTextSip :: Word64 -> P.ByteArray -> Word64
{-# INLINE hashTextSip #-}
hashTextSip h = \ ba ->
let !lenWord16 = P.sizeofByteArray ba `unsafeShiftR` 1
!word16sRem = lenWord16 .&. 3
!word16sIx = lenWord16-word16sRem
!ixFinal = lenWord16-1
!word16sIxWd = word16sIx `unsafeShiftR` 2 -- `div` 4
hash4Word16sLoop hAcc@(!w0,!w1,!w2,!w3) !ix
| ix == word16sIxWd = hashRemainingWord16s hAcc word16sIx
| otherwise =
let w64Dirty = P.indexByteArray ba ix
w64 = clean4xWord16ChunkLE w64Dirty
in hash4Word16sLoop (sipRound (w0 `xor` w64) w1 w2 w3) (ix + 1)
-- NOTE: Removing this causes regression to disappear as well.
hashRemainingWord16s (!w0,!w1,!w2,!w3) !ix
| ix > ixFinal = w0
| otherwise =
let w16 = P.indexByteArray ba ix
in hashRemainingWord16s (sipRound (w0 `xor` (fromIntegral (w16 :: Word16))) w1 w2 w3) (ix+1)
in hash4Word16sLoop (h,1,2,3) 0
clean4xWord16ChunkLE :: Word64 -> Word64
{-# INLINE clean4xWord16ChunkLE #-}
clean4xWord16ChunkLE w64Dirty =
-- NOTE: no regression when just this (8.2 is faster)
-- (((byteSwap64 w64Dirty) `unsafeShiftR` 8) .&. 0x00FF00FF00FF00FF)
-- ...but this is a big regression:
(((byteSwap64 w64Dirty) `unsafeShiftR` 8) .&. 0x00FF00FF00FF00FF)
.|.
(((byteSwap64 w64Dirty) `unsafeShiftL` 8) .&. 0xFF00FF00FF00FF00)
```
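For intuition, `clean4xWord16ChunkLE` swaps the two bytes within each 16-bit lane (undoing the lane-internal reordering of `byteSwap64`) while keeping both bytes of every lane. A worked example on one input (my own arithmetic, not taken from the report):

```
w64Dirty                                  = 0x1122334455667788
byteSwap64 w64Dirty                       = 0x8877665544332211
(.. `unsafeShiftR` 8) .&. 0x00FF00FF00FF00FF = 0x0088006600440022
(.. `unsafeShiftL` 8) .&. 0xFF00FF00FF00FF00 = 0x7700550033001100
or of the last two                        = 0x7788556633441122
```

So the one-line version keeps only half the bytes, while the regressing two-line version needs both shifted copies of the byte-swapped word live at once, which plausibly increases register pressure in the loop.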
Here are the results of the benchmark above on my machine:
On GHC \*\*8.0.1\*\*:
```
benchmarking ByteArray 5
time 24.70 ns (24.00 ns .. 26.25 ns)
0.987 R² (0.967 R² .. 1.000 R²)
mean 24.44 ns (24.13 ns .. 25.80 ns)
std dev 1.859 ns (318.3 ps .. 4.227 ns)
variance introduced by outliers: 86% (severely inflated)
benchmarking ByteArray 8
time 32.66 ns (32.58 ns .. 32.76 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 32.79 ns (32.64 ns .. 33.09 ns)
std dev 683.7 ps (365.4 ps .. 1.175 ns)
variance introduced by outliers: 31% (moderately inflated)
benchmarking ByteArray 512
time 1.428 μs (1.382 μs .. 1.522 μs)
0.986 R² (0.970 R² .. 1.000 R²)
mean 1.398 μs (1.384 μs .. 1.454 μs)
std dev 91.12 ns (4.475 ns .. 193.9 ns)
variance introduced by outliers: 76% (severely inflated)
benchmarking ByteArray 1000000
time 2.658 ms (2.653 ms .. 2.663 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 2.672 ms (2.665 ms .. 2.691 ms)
std dev 35.00 μs (10.88 μs .. 59.58 μs)
```
And on \*\*GHC 8.2\*\* RC:
```
benchmarking ByteArray 5
time 23.78 ns (23.68 ns .. 23.88 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 23.83 ns (23.76 ns .. 23.95 ns)
std dev 298.8 ps (183.2 ps .. 482.5 ps)
variance introduced by outliers: 14% (moderately inflated)
benchmarking ByteArray 8
time 35.81 ns (35.44 ns .. 36.27 ns)
0.999 R² (0.998 R² .. 1.000 R²)
mean 35.56 ns (35.45 ns .. 35.94 ns)
std dev 596.8 ps (134.5 ps .. 1.184 ns)
variance introduced by outliers: 22% (moderately inflated)
benchmarking ByteArray 512
time 1.706 μs (1.698 μs .. 1.716 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.701 μs (1.698 μs .. 1.707 μs)
std dev 13.27 ns (5.825 ns .. 24.41 ns)
benchmarking ByteArray 1000000
time 3.322 ms (3.284 ms .. 3.377 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 3.296 ms (3.287 ms .. 3.332 ms)
std dev 44.62 μs (20.55 μs .. 87.29 μs)
```
Looking at the core wasn't fruitful, but I think dumping the asm shows that this is a case of bad (or worse) register allocation. I've attached two screenshots showing the instructions added (in blue), when moving from the one-line `clean4xWord16ChunkLE` to the two-line version, for both 8.0 and 8.2 (there wasn't anything in the diff besides instances of this change).
It looks in the 8.2 version like we've decided we're out of registers and need to use the stack.
In my real code I'm seeing a 35% regression on very long Text, as well as a 21% regression on very long ByteString; the latter implementation is structured similarly to `hashTextSip`, but doesn't call `clean4xWord16ChunkLE` (though it does do a byteswap).
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | -------------- |
| Version | 8.2.1-rc2 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler (NCG) |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
OccurAnal is stricter on tail call contexts than CoreLint
https://gitlab.haskell.org/ghc/ghc/-/issues/22243
Sebastian Graf

In CoreLint, I see
```hs
markAllJoinsBadIf block_joins $ lintCoreExpr expr
where
block_joins = not (tickish `tickishScopesLike` SoftScope)
-- TODO Consider whether this is the correct rule. It is consistent with
-- the simplifier's behaviour - cost-centre-scoped ticks become part of
-- the continuation, and thus they behave like part of an evaluation
-- context, but soft-scoped and non-scoped ticks simply wrap the result
-- (see Simplify.simplTick).
```
whereas in OccurAnal I see
```hs
occAnal env (Tick tickish body)
| SourceNote{} <- tickish
= ...
| tickish `tickishScopesLike` SoftScope
= WithUsageDetails (markAllNonTail usage) (Tick tickish body')
...
```
So CoreLint says every Tick with scope like `SoftScope` retains tail call context, whereas OccurAnal says it doesn't.
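Schematically (an illustrative sketch, not real output from either pass), the disagreement is about whether a jump under a soft-scoped tick still counts as being in tail position:

```hs
-- join { j x = <rhs> } in
--   tick<soft-scoped> (jump j 1)
--
-- CoreLint:  tickish `tickishScopesLike` SoftScope holds, so block_joins is
--            False and `jump j 1` lints as a valid tail call.
-- OccurAnal: the SoftScope branch applies markAllNonTail to the body's usage
--            details, so `j` is not recorded as tail-called there.
```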
Which one is correct?

Missed loop fusion optimisation
https://gitlab.haskell.org/ghc/ghc/-/issues/22227
Jaro Reinders

## Summary
I've written a boxed and unboxed version of the same program, but for some reason GHC is only able to properly optimize the boxed version. In particular the program has two loops which fuse in the boxed version but stay separate in the unboxed version.
For more background info, see [this GitHub issue](https://github.com/haskell/vector/issues/156).
## Steps to reproduce
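The Core dumps below can be reproduced with an invocation along these lines (flags assumed, mirroring the other reports in this file):

```
ghc -O2 -fforce-recomp -ddump-simpl -dsuppress-all Boxed.hs
ghc -O2 -fforce-recomp -ddump-simpl -dsuppress-all Unboxed.hs
```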
Boxed:
```haskell
module Boxed (test) where
import Data.Primitive.Array
import System.IO.Unsafe (unsafePerformIO)
data Step s a = Yield a s | Done
uninitialised = undefined
test :: Int -> Int -> Array Double -> (Int, Int, Array Double)
test off n oldArr = unsafePerformIO $ do
newArr <- newArray n uninitialised
let
step' i
| i >= n = Done
| otherwise =
let x = indexArray oldArr (off + i) in
if x > 10
then Yield x (i + 1)
else step' (i + 1)
loop i j = do
case step' i of
Yield x s' -> do
writeArray newArr j (x + 1)
loop s' (j + 1)
Done -> do
out <- unsafeFreezeArray newArr
return (0, j, out)
loop 0 0
```
Unboxed:
```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
module Unboxed (test) where
import GHC.Exts
import GHC.IO
data Step s a = Yield a s | Done
uninitialised = undefined
test :: Int# -> Int# -> Array# Double -> (# Int#, Int#, Array# Double #)
test off n oldArr = runRW# $ \s0 ->
case newArray# n uninitialised s0
of { (# s1, newArr #) ->
let
step' i
| isTrue# (i >=# n) = Done
| otherwise =
let (# D# x #) = indexArray# oldArr (off +# i) in
if isTrue# (x >## 10.0##)
then Yield (D# x) (I# (i +# 1#))
else step' (i +# 1#)
loop i j s2 =
case step' i of
Yield x (I# s') ->
case writeArray# newArr j (x + 1) s2
of { s3 ->
loop s' (j +# 1#) s3
}
Done ->
case unsafeFreezeArray# newArr s2
of { (# s3, out #) ->
(# 0#, j, out #)
}
in
loop 0# 0# s1
}
```
## Expected behavior
I expect both programs to produce very similar core where the two loops are fused and the intermediate `Yield` and `Done` constructors are eliminated.
That does happen for the boxed version:
```haskell
$wtest :: Int -> Int# -> Array Double -> (Int, Int, Array Double)
$wtest
= \ (w :: Int) (ww :: Int#) (w1 :: Array Double) ->
runRW#
(\ (s :: State# RealWorld) ->
case noDuplicate# s of s' { __DEFAULT ->
case newArray# ww uninitialised (s' `cast` <Co:3>) of
{ (# ipv, ipv1 #) ->
joinrec {
$s$wloop
:: State# RealWorld -> Int# -> Int# -> (Int, Int, Array Double)
$s$wloop (sc :: State# RealWorld) (sc1 :: Int#) (sc2 :: Int#)
= join {
$j :: (Int, Int, Array Double)
$j
= case unsafeFreezeArray# ipv1 (sc `cast` <Co:3>) of
{ (# ipv2, ipv3 #) ->
lazy (test1, I# sc1, Array ipv3)
} } in
case >=# sc2 ww of {
__DEFAULT ->
case w of { I# x ->
case w1 of { Array ds2 ->
case indexArray# ds2 (+# x sc2) of { (# ipv2 #) ->
case ipv2 of { D# x1 ->
case >## x1 10.0## of {
__DEFAULT ->
joinrec {
$wstep' :: Int# -> (Int, Int, Array Double)
$wstep' (ww1 :: Int#)
= case >=# ww1 ww of {
__DEFAULT ->
case indexArray# ds2 (+# x ww1) of { (# ipv3 #) ->
case ipv3 of { D# x2 ->
case >## x2 10.0## of {
__DEFAULT -> jump $wstep' (+# ww1 1#);
1# ->
case writeArray#
ipv1 sc1 (D# (+## x2 1.0##)) (sc `cast` <Co:3>)
of s'#
{ __DEFAULT ->
jump $s$wloop (s'# `cast` <Co:2>) (+# sc1 1#) (+# ww1 1#)
}
} } };
1# -> jump $j
}; } in
jump $wstep' (+# sc2 1#);
1# ->
case writeArray# ipv1 sc1 (D# (+## x1 1.0##)) (sc `cast` <Co:3>)
of s'#
{ __DEFAULT ->
jump $s$wloop (s'# `cast` <Co:2>) (+# sc1 1#) (+# sc2 1#)
}
} } } } };
1# -> jump $j
}; } in
jump $s$wloop (ipv `cast` <Co:2>) 0# 0#
} })
```
But the unboxed version keeps the loops separate and doesn't eliminate `Yield` and `Done`:
```haskell
test
= \ off n oldArr ->
runRW#
(\ s0 ->
case newArray# n uninitialised s0 of { (# ipv, ipv1 #) ->
letrec {
step'
= \ i ->
case >=# i n of {
__DEFAULT ->
case indexArray# oldArr (+# off i) of { (# ipv2 #) ->
case ipv2 of wild { D# x ->
case >## x 10.0## of {
__DEFAULT -> step' (+# i 1#);
1# -> Yield wild (I# (+# i 1#))
}
}
};
1# -> Done
}; } in
join {
exit j s2
= case unsafeFreezeArray# ipv1 s2 of { (# ipv2, ipv3 #) ->
(# 0#, j, ipv3 #)
} } in
joinrec {
loop i j s2
= case step' i of {
Yield x ds1 ->
case ds1 of { I# s' ->
case writeArray#
ipv1 j (case x of { D# x1 -> D# (+## x1 1.0##) }) s2
of s3
{ __DEFAULT ->
jump loop s' (+# j 1#) s3
}
};
Done -> jump exit j s2
}; } in
jump loop 0# 0# ipv
})
```
## Environment
* GHC version used: 9.2.3 and 9.4.2

Inefficient code generated from matching with pattern synonyms
https://gitlab.haskell.org/ghc/ghc/-/issues/19643
Olle Fredriksson

## Summary
Pattern matching using pattern synonyms sometimes results in generated code that uses multiple case expressions where one would suffice. This can result in code that's linear time in the number of patterns, instead of constant time.
See also #3755, #3781
## Steps to reproduce
Compile the following module with `ghc PatternSynonyms.hs -ddump-simpl -dsuppress-all -O2`.
```haskell
{-# language PatternSynonyms #-}
module PatternSynonyms where
pattern FirstZero :: a -> (Int, a)
pattern FirstZero a = (0, a)
pattern FirstOne :: a -> (Int, a)
pattern FirstOne a = (1, a)
pattern FirstTwo :: a -> (Int, a)
pattern FirstTwo a = (2, a)
synonym :: (Int, Either () Int) -> Bool
synonym (FirstZero (Right 0)) = True
synonym (FirstOne (Right 1)) = True
synonym (FirstTwo (Right 2)) = True
synonym _ = False
nonSynonym :: (Int, Either () Int) -> Bool
nonSynonym (0, Right 0) = True
nonSynonym (1, Right 1) = True
nonSynonym (2, Right 2) = True
nonSynonym _ = False
```
Part of the dumped core is this:
```
$wsynonym
= \ ww_sUY ww1_sV0 ->
join {
fail_sUN _
= case ww_sUY of {
__DEFAULT -> False;
1# ->
case ww1_sV0 of {
Left ipv_sUy -> False;
Right ds2_dUj ->
case ds2_dUj of { I# ds3_dUk ->
case ds3_dUk of {
__DEFAULT -> False;
1# -> True
}
}
};
2# ->
case ww1_sV0 of {
Left ipv_sUw -> False;
Right ds2_dUm ->
case ds2_dUm of { I# ds3_dUn ->
case ds3_dUn of {
__DEFAULT -> False;
2# -> True
}
}
}
} } in
case ww_sUY of {
__DEFAULT -> jump fail_sUN (##);
0# ->
case ww1_sV0 of {
Left ipv_sUA -> jump fail_sUN (##);
Right ds1_dUg ->
case ds1_dUg of { I# ds2_dUh ->
case ds2_dUh of {
__DEFAULT -> jump fail_sUN (##);
0# -> True
}
}
}
}
$wnonSynonym
= \ ww_sV9 ww1_sVb ->
case ww_sV9 of {
__DEFAULT -> False;
0# ->
case ww1_sVb of {
Left ipv_sUC -> False;
Right ds1_dTq ->
case ds1_dTq of { I# ds2_dTr ->
case ds2_dTr of {
__DEFAULT -> False;
0# -> True
}
}
};
1# ->
case ww1_sVb of {
Left ipv_sUE -> False;
Right ds1_dTs ->
case ds1_dTs of { I# ds2_dTt ->
case ds2_dTt of {
__DEFAULT -> False;
1# -> True
}
}
};
2# ->
case ww1_sVb of {
Left ipv_sUG -> False;
Right ds1_dTu ->
case ds1_dTu of { I# ds2_dTv ->
case ds2_dTv of {
__DEFAULT -> False;
2# -> True
}
}
}
}
```
## Expected behavior
Ideally, `synonym` and `nonSynonym` would perform the same.
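The two functions are extensionally equal, so any divergence is purely in the generated code; that equivalence can be sanity-checked with a small self-contained program (a hypothetical harness, duplicating the report's definitions in abridged form):

```haskell
{-# LANGUAGE PatternSynonyms #-}
module Main where

pattern FirstZero :: a -> (Int, a)
pattern FirstZero a = (0, a)

-- abridged versions of the two functions from the report
synonym, nonSynonym :: (Int, Either () Int) -> Bool
synonym (FirstZero (Right 0)) = True
synonym _                     = False

nonSynonym (0, Right 0) = True
nonSynonym _            = False

-- check pointwise agreement on a small input space; prints True
main :: IO ()
main = print (and [ synonym p == nonSynonym p
                  | f <- [0 .. 3]
                  , s <- Left () : map Right [0 .. 3]
                  , let p = (f, s) ])
```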
## Environment
* GHC version used: Latest GHC HEAD (ce706faeef3964116c6e1dd0e6ae2f2e77fde57d).

INLINE Pragma prevents join point from being floated out
https://gitlab.haskell.org/ghc/ghc/-/issues/19555
Tarmean

## Summary
The [Data.Text.Internal.Search.indices](https://github.com/haskell/text/blob/d5118aa5a8301cfff4ff55b2e7325900c34ebdeb/src/Data/Text/Internal/Search.hs) function can become significantly slower in some cases when compiled with -O2. Reproducing this depends on the calling context, the INLINE pragma, and strictness, making the underlying issue fairly hard to pin down. I have tried to create a minimal test but cannot claim to understand why this happens.
[This comment](https://github.com/haskell/text/pull/219#issuecomment-800693669) mentions some (probably) relevant issues.
## Steps to reproduce
Compile the following code with -O1 or -O2:
```haskell
module Foo (wrapper) where

-- INLINE or variants like INLINE[1], but not INLINABLE
{-# INLINE loop #-}
-- inlined function contains a nested joinrec loop
loop :: Int -> Int
loop x0 = go x0 0
  where
    go x acc
      | x >= 1000000 = acc
      | otherwise = go (x+1) (acc+step)
      where
        -- joinrec loop has a nested joinrec loop that could be floated out
        step = buildTable 0 x0
        -- the buildTable type signature must be missing
        -- buildTable :: Int -> Int -> Int
        buildTable a i
          | a == i = i
          | otherwise = buildTable (a+1) i

-- Nontrivial call site
wrapper :: Int -> Int
wrapper i
  | i <= 10 = i
  | otherwise = loop i
```
## Expected behavior
`buildTable` should be floated out of the `go` loop. This happens if `buildTable` has an explicit type signature, if `wrapper` is simpler, or if the INLINE pragma is removed:
```
Rec {
-- RHS size: {terms: 14, types: 3, coercions: 0, joins: 0/0}
$wbuildTable
  = \ ww ww1 ->
      case ==# ww ww1 of {
        __DEFAULT -> $wbuildTable (+# ww 1#) ww1;
        1# -> ww1
      }
end Rec }

-- RHS size: {terms: 44, types: 11, coercions: 0, joins: 1/1}
wrapper
  = \ x0 ->
      case x0 of ww { I# ww ->
      case >=# ww 1000000# of {
        __DEFAULT ->
          case $wbuildTable 0# ww of ww { __DEFAULT ->
          joinrec {
            $wgo ww ww
              = case >=# ww 1000000# of {
                  __DEFAULT -> jump $wgo (+# ww 1#) (*# (+# ww ww) 3#);
                  1# -> I# ww
                }; } in
          jump $wgo (+# ww 1#) (*# ww 3#)
          };
        1# -> I# 0#
      }
      }
```
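For reference, the signature workaround mentioned above amounts to restoring the commented-out type signature from the reproducer; shown here as a standalone top-level definition for illustration:

```haskell
-- The workaround from the report: give buildTable the explicit monomorphic
-- signature that is commented out in the reproducer. With it in place, GHC
-- floats buildTable out of go, as in the expected Core above.
buildTable :: Int -> Int -> Int
buildTable a i
  | a == i    = i
  | otherwise = buildTable (a + 1) i
```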
## Actual behavior
The loop remains nested:
```
$wwrapper
  = \ ww ->
      case <=# ww 10# of {
        __DEFAULT ->
          joinrec {
            $wgo ww ww
              = case >=# ww 1000000# of {
                  __DEFAULT ->
                    joinrec {
                      $wbuildTable ww ww
                        = case ==# ww ww of {
                            __DEFAULT -> jump $wbuildTable (+# ww 1#) ww;
                            1# -> jump $wgo (+# ww 1#) (*# (+# ww ww) 3#)
                          }; } in
                    jump $wbuildTable 0# ww;
                  1# -> ww
                }; } in
          jump $wgo ww 0#;
        1# -> ww
      }
```
## Environment
* GHC version used: 8.10.4
Optional:
* Operating System: Windows
* System Architecture: x86-64

# [Improve interaction of exitification with spec constr](https://gitlab.haskell.org/ghc/ghc/-/issues/18837) (harendra, 2020-11-09)

## Summary
Exitification pass happens before spec constr pass. It moves exit path code out of a recursive function. However, this impacts the efficacy of the later spec constr pass.
See https://github.com/composewell/streamly/issues/703 for details.
## Steps to reproduce
1. `git clone https://github.com/composewell/streamly.git`
2. `git checkout ghc-spec-constr-keen`
3. `cabal build streamly`
4. `ghc -ddump-simpl -ddump-to-file -dsuppress-all -O2 -fmax-worker-args=16 -fspec-constr-recursive=16 -funfolding-use-threshold=600 fold-bench.hs`
5. `time ./fold-bench +RTS -s`
6. Examine `fold-bench.dump-simpl` and look for `W8#`.
We can see a `W8#` being passed to the join point `$wstep3_s5F2`:
```
jump $wstep3_s5F2
sc9_s5W3
sc8_s5W4
(PlainPtr sc7_s5W5)
sc6_s5W6
()
(W8# ipv8_a4F5)
sc_s5Wc
```
This join point then passes it to an exit point:
```
1# ->
jump exit1_X19
ww_s5EQ ww1_s5EV ww2_s5EW ww3_s5EX ww4_s5F0 w_s5EG w1_s5EH
```
The exit point examines this `W8#`:
```
exit1_X19 ww_s5EQ
ww1_s5EV
ww2_s5EW
ww3_s5EX
ww4_s5F0
w_s5EG
w1_s5EH
= case w_s5EG of { W8# x_a4tc ->
```
The `W8#` constructor is not removed by the spec-constr pass because it is not scrutinised by `$wstep3_s5F2` itself. The code that examines the `W8#` has been moved to the exit point by the exitification pass, which runs before the spec-constr pass, so the scrutiny is opaque to spec-constr and the constructor is not eliminated.
We can add the `-fspec-constr-keen` option to the compilation step above and see that the `W8#` is then removed. However, that option causes regressions in other benchmarks, so it is not always beneficial.
Another option is to do exitification pass after the spec-constr pass. See [this note](https://gitlab.haskell.org/ghc/ghc/-/blob/a9ce159ba58ca7e8946b46e19b1361588b677a26/compiler/GHC/Core/Opt/Exitify.hs#L451). We would not have this issue if we use the option D suggested in this note i.e. put exitification before the final simplifier pass. However, this is also not the best possible option because exitify before spec-constr + -fspec-constr-keen generates much better code and has 2x better performance.
We can checkout the branch `ghc-spec-constr-before-exitify` and repeat the steps 3-6 above to see that the W8# constructor is gone when exitify is done after spec-constr.
## Expected behavior
The point of this issue is to explore whether we can keep exitification before spec-constr but do what `-fspec-constr-keen` does automatically, not by enabling it everywhere but only surgically in cases like this. Can we make spec-constr keen through exit join points only?
The effect of removing the `W8#` in the above example is a 4x improvement, because it is in the fast path, executed for every element of the stream. Depending on the use case we can get very significant gains if we can do spec-constr smartly to get the best results.
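For readers less familiar with SpecConstr, here is a minimal, self-contained sketch of the kind of call-pattern specialisation at stake; the function and its name are hypothetical and unrelated to the streamly code above.

```haskell
-- drop' rebuilds a Just box on every recursive call, which the callee
-- immediately takes apart again. SpecConstr can generate a worker
-- specialised on the (Just n) call pattern, so the loop passes a raw Int
-- instead of allocating a Just per iteration. In the streamly case the
-- scrutiny lives in an exit join point, so plain SpecConstr misses it;
-- -fspec-constr-keen specialises on the call pattern even without seeing
-- the scrutiny.
drop' :: Maybe Int -> [a] -> [a]
drop' Nothing xs = xs
drop' (Just n) xs
  | n <= 0 = xs
drop' (Just n) (_:xs) = drop' (Just (n - 1)) xs
drop' (Just _) [] = []
```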
## Environment
* GHC version used:
8.10.2 branch
Optional:
* Operating System: macOS
* System Architecture: x86_64

# [Full laziness destroys opportunities for join points](https://gitlab.haskell.org/ghc/ghc/-/issues/15560) (Andreas Klebinger, 2024-02-27)

Even if we already know a binding is a join point we STILL float it to the top and turn it into a function.
The simple example below results in a join point after the first simplifier run. Then we run the float out pass immediately undoing this by making it a top level binding.
It then stays at the top till we are done resulting in the core I've put in the comments.
```haskell
data T = A | B | C | D | E | F | G
{-# NOINLINE n #-}
n :: T -> T
n A = B
n B = C
n _ = A
f :: Int -> T -> T -> T
f sel x y =
-- function large enough to avoid being simply inlined
let j z = n . n . n . n . n . n $ z
in case sel of
-- j is always tailcalled
0 -> j x
_ -> j y
-- j is floated to top level instead of ending up as joinpoint.
-- T.f_j
-- = \ (eta_B1 [OS=OneShot] :: T) -> n (n (n (n (n (n eta_B1)))))
-- -- RHS size: {terms: 14, types: 6, coercions: 0, joins: 0/0}
-- f :: Int -> T -> T -> T
-- f = \ (sel_aYP :: Int) (x_aYQ :: T) (y_aYR :: T) ->
-- case sel_aYP of { GHC.Types.I# ds_d2fL ->
-- case ds_d2fL of {
-- __DEFAULT -> T.f_j y_aYR;
-- 0# -> T.f_j x_aYQ
-- }
-- }
```
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------------ |
| Version | 8.4.3 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler (CodeGen) |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | #14287 |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
## Current plan (as of Dec. 19th 2023)
### A brief summary
"destroys" is an apt description but the mechanism is this:
- some join points get lifted to the top level
- because they get lifted they no longer are join points and instead become top level functions with all the consequences of top level functions, as [this](https://gitlab.haskell.org/ghc/ghc/-/issues/15560#note_159111) comment points out.
- But these previous join point functions have some nice properties:
- They are never exported because they were floated out
- All call sites to them are known and saturated, again because they began life as a join point
### The Conceptual Plan
So the plan is to _not_ focus on join points but _instead_ optimize _all_ top level functions that are local (i.e., not exported) and whose call sites are all known. Optimizing these functions will, in effect, also optimize the previously-join-point-now-top-level functions.
### The Optimizations
SPJ provides a nice overview in [this](https://gitlab.haskell.org/ghc/ghc/-/issues/15560#note_159253) comment:
1) Elide the info-table and closure generation for top level functions that are local and whose call sites are all known and saturated. This should reduce the code size increase that occurs from the `join point -> top level function` conversion.
2) Elide the stack overflow check for top level functions that have the properties of (1) _and_ are tail recursive, call other top level functions with the same properties, or are not recursive.
3) Eliminate Heap Checks by absorbing the checks into the caller of the function. This is an orthogonal optimization to this ticket and is currently not done for join points nor top level functions. Please see Simon's comment linked above.
### The Implementation Plan
Implement the optimizations in order:
For (1):
- Add a phase to Stg called `CgPrep` for `code gen prep`.
- Absorb the `InferTags` pass into `CgPrep`. InferTags already defines a CgPrep pass and a `CgInfo` record that is passed to the code generator for each backend, so this item is just a refactoring.
- Add a pass to `CgPrep` that detects and records all `Id`s that are functions and whose call sites are all known and saturated. Pass this set of `Id`s to `CgInfo`.
- For a backend `b`, use the new field in `CgInfo` to elide the code generation for info-tables and closures for top-level local functions whose `Id`s are elements of the new field.
The strategy here is to keep inspection of call sites at `Stg` instead of `StgToCmm` so that different backends (for example the JS backend) can implement (1) in their own code generator.
For (2):
- Todo
For (3):
- Should be tracked in another ticket (Todo)

# [Better occurrence analysis for join points](https://gitlab.haskell.org/ghc/ghc/-/issues/15091) (Simon Peyton Jones, 2021-03-22)

Consider this
```
let x = ... in
join j y = x+y in
case v of
A -> x
B -> j 1
C -> j 2
D -> 3
```
What does `OccAnal` say about `x`'s usage? It says `ManyOccs`!
But it's plain as a pikestaff that `x` is used at most once:
- Either in the `A` branch,
- or in the `B` or `C` branches via `j`.
If instead we had inlined `j` we'd have
```
let x = ... in
join j y = x+y in
case v of
A -> x
B -> x + 1
C -> x + 2
D -> 3
```
and now it's all more obvious: `x`'s occurrence info should be `OneOcc { occ_one_br = False }`, not `ManyOccs`.
Does this matter? Not a great deal, but there is a reason for having `OneOcc` with `occ_one_br = False`, and it seems a shame not to take advantage.
One case in point is the definition of `SimplUtils.isExitJoinId`:
```
isExitJoinId :: Var -> Bool
isExitJoinId id = isJoinId id && isOneOcc (idOccInfo id) && occ_in_lam (idOccInfo id)
```
Something does not cease to be an exit-join-point if it is mentioned in multiple places
as above.
Another use for this info is `postInlineUnconditionally`.
Could we improve this situation? I think it'd be quite easy. For non-recursive join points `j = rhs`
* Gather occurrence info from the RHS
* Bind `j` to its occurrence info
* Unleash that occurrence info at each jump-site for `j`, just as if it had been inlined.
See Trac #14152, #15091.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.2.2 |
| Type | Task |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
# [Recognize when inlining would create a join point](https://gitlab.haskell.org/ghc/ghc/-/issues/14827) (ersetzen, 2019-07-07)

[This discussion](https://www.reddit.com/r/haskell/comments/7yh8ar/i_wrote_a_program_that_runs_about_10_times_faster/) revolved around a program that runs 10x faster under ghci.
One way to solve this is to remove a superfluous inline pragma which allows the following transformation to happen:
```
letrec {
f a = case e of {
p1 -> f a';
p2 -> (# l, r #);
}} in case f e2 of { (# l, r #) -> e3; }
```
into
```
joinrec {
f a = case e of {
p1 -> jump f a';
p2 -> e3;
}} in jump f e2
```
More generally, consider a recursive let binding that is called exactly once from the outside. If all recursive calls are tail calls and the outside one isn't, then we could safely replace the call with the binding and end up with join points. In this case it means a 10x speedup, so it might be worth doing generally.
```
letrec { fi = ei; } in ... (fj e) ... => ... (joinrec { fi = ei; } in jump fj e) ...
```
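A hypothetical source-level analogue of the sketch above (not taken from the linked self-contained example):

```haskell
-- f is recursive, all recursive calls are tail calls, and f is entered
-- exactly once from the outside, as the case scrutinee in consume. Moving
-- the binding inside the case as a joinrec would let the (l, r) pair
-- cancel with the case, so the pair need never be allocated.
f :: Int -> (Int, Int)
f a
  | a >= 100  = (a, a * 2)
  | otherwise = f (a + 1)

consume :: Int -> Int
consume x = case f x of (l, r) -> l + r
```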
[Self contained example](https://gist.github.com/AndreasPK/8e6f0cbf253f0930f4cda81e685ac136)

# [Polymorphic functions not easily recognized as join points](https://gitlab.haskell.org/ghc/ghc/-/issues/14620) (David Feuer, 2020-12-23)

This grew out of [ticket:14610#comment:146389](https://gitlab.haskell.org//ghc/ghc/issues/14610#note_146389). If I write
```hs
foo :: forall a. (Int -> Bool) -> Int -> a -> a
foo p = go
where
go :: Int -> a -> a
go !n a
| p n = a
| otherwise = go (n + 1) a
```
then I get
```
foo
= \ (@ a_aYZ)
(p_aWO :: Int -> Bool)
(eta_B2 :: Int)
(eta1_B1 :: a_aYZ) ->
case eta_B2 of { GHC.Types.I# ww1_s1bZ ->
joinrec {
$wgo_s1c1 [InlPrag=NOUSERINLINE[0], Occ=LoopBreaker]
:: GHC.Prim.Int# -> a_aYZ -> a_aYZ
[LclId[JoinId(2)], Arity=2, Str=<L,U><S,1*U>, Unf=OtherCon []]
$wgo_s1c1 (ww2_X1cu :: GHC.Prim.Int#) (w_s1bW :: a_aYZ)
= case p_aWO (GHC.Types.I# ww2_X1cu) of {
False -> jump $wgo_s1c1 (GHC.Prim.+# ww2_X1cu 1#) w_s1bW;
True -> w_s1bW
}; } in
jump $wgo_s1c1 ww1_s1bZ eta1_B1
}
```
But if I make `go` polymorphic,
```hs
foo :: (Int -> Bool) -> Int -> a -> a
foo p = go
where
go :: Int -> b -> b
go !n a
| p n = a
| otherwise = go (n + 1) a
```
I get a wrapper and this worker:
```hs
T14610.$wfoo
= \ (@ a_s1cm)
(w_s1cn :: Int -> Bool)
(ww_s1cs :: GHC.Prim.Int#)
(w1_s1cp :: a_s1cm) ->
letrec {
$wgo_s1cl [InlPrag=NOUSERINLINE[0], Occ=LoopBreaker]
:: forall b. GHC.Prim.Int# -> b -> b
[LclId, Arity=2, Str=<L,U><S,1*U>, Unf=OtherCon []]
$wgo_s1cl
= \ (@ b_s1ce) (ww1_s1cj :: GHC.Prim.Int#) (w2_s1cg :: b_s1ce) ->
case w_s1cn (GHC.Types.I# ww1_s1cj) of {
False -> $wgo_s1cl @ b_s1ce (GHC.Prim.+# ww1_s1cj 1#) w2_s1cg;
True -> w2_s1cg
}; } in
$wgo_s1cl @ a_s1cm ww_s1cs w1_s1cp
```
This distinction remains as `let` vs. `let-no-escape` in STG. As Joachim Breitner's comments seem to suggest, we could probably recognize this by applying the static argument transformation to the type argument of `go`. But we don't currently have any machinery for doing that, I don't think. Furthermore, that would fail with polymorphic recursion even if the only type changes are from `newtype`. That said, the SAT approach would presumably help when the worker has a non-essential type signature for clarity.
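A hand-applied version of that static argument transformation, as a sketch: fix the type argument with an inner monomorphic binding (`goMono` is a hypothetical name), so the recursive calls no longer pass `@b` and the inner loop can be recognised as a join point.

```haskell
{-# LANGUAGE BangPatterns, ScopedTypeVariables #-}

foo :: (Int -> Bool) -> Int -> a -> a
foo p = go
  where
    go :: forall b. Int -> b -> b
    go = goMono
      where
        -- b is now a static argument: goMono recurses without re-passing it
        goMono :: Int -> b -> b
        goMono !n a
          | p n       = a
          | otherwise = goMono (n + 1) a
```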
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.2.2 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | nomeata |
| Operating system | |
| Architecture | |
</details>
# [newtype wrapping of a monadic stack kills performance](https://gitlab.haskell.org/ghc/ghc/-/issues/14610) (mrkkrp, 2022-06-10)

I have a project where in one module (A) I decided to build something like
a minimal framework for the other bigger module (B) to use. One of the main
points of the framework is that the monadic stacks are hidden behind
`newtype`s and in those monads you can only use the functions that the
module A provides.
The module A can be found here:
https://github.com/mrkkrp/mmark/blob/ghc-bug-newtypes/Text/MMark/Parser/Internal.hs
There are two monadic stack wrapped with newtypes: `BParser` and `IParser`.
The module B is this:
https://github.com/mrkkrp/mmark/blob/ghc-bug-newtypes/Text/MMark/Parser.hs
But it's not really relevant.
Now if I change `newtype`s to `type` synonyms like so:
```
type BParser a = ParsecT MMarkErr Text (State BlockState) a
type IParser a = StateT InlineState (Parsec MMarkErr Text) a
```
and make the corresponding minor corrections, I get 2x fewer allocations and almost 2x faster code (before):
```
Case Allocated GCs Max
with file: data/bench-yaml-block.md 119,080 0 11,088
with file: data/bench-thematic-break.md 74,368 0 10,224
with file: data/bench-heading.md 901,928 0 9,432
with file: data/bench-fenced-code-block.md 145,744 0 9,368
with file: data/bench-indented-code-block.md 124,312 0 9,368
with file: data/bench-unordered-list.md 2,010,496 1 10,784
with file: data/bench-ordered-list.md 2,025,016 1 10,728
with file: data/bench-blockquote.md 1,961,672 1 42,648
with file: data/bench-paragraph.md 2,084,104 1 42,648
with file: data/comprehensive.md 25,899,496 24 44,200
benchmarking with file: data/comprehensive.md
time 3.590 ms (3.578 ms .. 3.601 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 3.555 ms (3.546 ms .. 3.565 ms)
std dev 31.07 μs (24.63 μs .. 39.90 μs)
```
After:
```
Case Allocated GCs Max
with file: data/bench-yaml-block.md 116,864 0 11,088
with file: data/bench-thematic-break.md 64,776 0 10,392
with file: data/bench-heading.md 615,672 0 9,432
with file: data/bench-fenced-code-block.md 144,736 0 9,368
with file: data/bench-indented-code-block.md 123,352 0 9,368
with file: data/bench-unordered-list.md 795,072 0 41,568
with file: data/bench-ordered-list.md 808,808 0 41,512
with file: data/bench-blockquote.md 826,392 0 41,568
with file: data/bench-paragraph.md 881,432 0 41,568
with file: data/comprehensive.md 10,945,104 10 44,440
benchmarking with file: data/comprehensive.md
time 2.451 ms (2.448 ms .. 2.456 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 2.432 ms (2.427 ms .. 2.437 ms)
std dev 15.86 μs (13.16 μs .. 19.01 μs)
```
I'm not great at reading non-trivial core, but I gave it a shot and dumped
some core. One thing I noticed that core for the module A is the same in
both cases. Then the problem is an inter-module problem, probably like when
you don't dump definitions into interface files with `INLINEABLE` and end up
without specialization, but here it perhaps has to do with the fact that B
has no information about internals of the monadic stacks from A, but I'm not
sure.
Here is the beginnings of core dumps (before):
```
==================== Tidy Core ====================
2017-12-23 04:55:17.967944505 UTC
Result size of Tidy Core
= {terms: 210,169,
types: 209,675,
coercions: 82,492,
joins: 973/3,753}
```
After:
```
==================== Tidy Core ====================
2017-12-23 04:58:46.386265108 UTC
Result size of Tidy Core
= {terms: 301,256,
types: 263,104,
coercions: 28,560,
joins: 1,726/5,393}
```
So it looks like `newtype` wrapping reduces GHC's ability to create join
points SPJ talked about recently.
I do understand that you expect a minimal reproducing example but I could
not produce it even though I spent several hours in futile attempts. I
started by creating a small project, defining a similar stack with a
`newtype` wrapper and using it in a simple parser in another module. Then I
benchmarked the parser. There is no difference between the `newtype`d code
and the code that uses just type synonyms. I tried different variations, no difference.
The core output is too big for me to analyze, it's like 25 Mb and 33 Mb, and I have no idea what's going on there.
You're welcome to experiment with the repo, there are benchmarks for memory usage and criterion benchmarks for speed of execution:
https://github.com/mrkkrp/mmark
I have created two branches I won't touch, one with newtypes and another one with type synonyms:
- https://github.com/mrkkrp/mmark/tree/ghc-bug-newtypes
- https://github.com/mrkkrp/mmark/tree/ghc-bug-type-synonyms
Just checkout one of these and run `stack bench`.
This is the commit that changes newtypes to type synonyms:
https://github.com/mrkkrp/mmark/commit/759d8d4aa52dd57a393299c63e8c9b70d0d43290
I'm submitting this because my friend convinced me that it's better to let you know (even though I could not create a minimal reproducing example on my own) than to let it go completely unnoticed.

# [Early inlining causes potential join points to be missed](https://gitlab.haskell.org/ghc/ghc/-/issues/14287) (jheek, 2022-10-17)

While trying to make stream fusion work with recursive step functions I noticed that the following filter implementation did not fuse nicely.
```haskell
data Stream s a = Stream (s -> Step s a) s
data Step s a = Done | Yield a s
sfilter :: (a -> Bool) -> Stream s a -> Stream s a
sfilter pred (Stream step s0) = Stream filterStep s0 where
filterStep s = case step s of
Done -> Done
Yield x ns
| pred x -> Yield x ns
| otherwise -> filterStep ns
fromTo :: Int -> Int -> Stream Int Int
{-# INLINE fromTo #-}
fromTo from to = Stream step from where
step i
| i > to = Done
| otherwise = Yield i (i + 1)
sfoldl :: (b -> a -> b) -> b -> Stream s a -> b
{-# INLINE sfoldl #-}
sfoldl acc z (Stream !step s0) = oneShot go z s0 where
go !y s = case step s of
Done -> y
Yield x ns -> go (acc y x) ns
ssum :: (Num a) => Stream s a -> a
ssum = sfoldl (+) 0
filterTest :: Int
filterTest = ssum $ sfilter even (fromTo 1 101)
```
For this code to work nicely, GHC should detect that `filterStep` is a join point. However, in the definition of `sfilter` it is not, because not all references are tail-called and saturated.
After inlining of `sfilter` and some trivial case-of-case transformations, `filterStep` should become a join point. But it seems the simplifier never gets the chance to do this, because the float-out optimization makes `filterStep` a top-level binding. With `-fno-full-laziness`, `filterStep` does become a join point at the call site, but of course this is not really a solution.
Then I found that the following also works:
```haskell
sfilter :: (a -> Bool) -> Stream s a -> Stream s a
sfilter pred (Stream step s0) = Stream filterStep s0 where
{-# INLINE [2] filterStep #-}
filterStep s = case step s of
Done -> Done
Yield x ns
| pred x -> Yield x ns
| otherwise -> filterStep ns
```
Simply adding an INLINE [2] pragma disables the inlining in the early run of the simplifier. Therefore, the float-out pass does not get the chance to float `filterStep` out before it is recognized as a join point.
Or at least that is my interpretation of what is going on.
What surprises me about this issue is that the gentle run seems to perform inlining while the wiki mentions that inlining is not performed in this stage: https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/Core2CorePipeline
Intuitively, I would think that floating out is sub-optimal when the simplifier has not yet used all its tricks, because inlining typically opens up possibilities for simplification while floating out typically reduces them.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.2.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
# [Casts get in the way of join points](https://gitlab.haskell.org/ghc/ghc/-/issues/14223) (Joachim Breitner, 2019-07-07)

I was checking which of HLint's rewrite rules are actually already known to the compiler in the sense that the simplifier can do them (using [ghc-proofs](http://hackage.haskell.org/package/ghc-proofs)). I stumbled on this interesting case.
`foldr @[] (&&) True xs` compiles down to this nice core
```
joinrec {
go [Occ=LoopBreaker] :: [Bool] -> Bool
[LclId[JoinId(1)],
Arity=1,
Unf=Unf{Src=<vanilla>, TopLvl=False, Value=True, ConLike=True,
WorkFree=True, Expandable=True, Guidance=IF_ARGS [30] 44 20}]
go (ds :: [Bool])
= case ds of {
[] -> GHC.Types.True;
: y ys ->
case y of {
False -> GHC.Types.False;
True -> jump go ys
}
}; } in
jump go xs
```
But `and @[] xs` (which HLint suggests using instead, and which we surely want people to use!) compiles down to
```
go [Occ=LoopBreaker] :: [Bool] -> All
[LclId,
Arity=1,
Unf=Unf{Src=<vanilla>, TopLvl=False, Value=True, ConLike=True,
WorkFree=True, Expandable=True, Guidance=IF_ARGS [30] 60 20}]
go
= \ (ds :: [Bool]) ->
case ds of {
[] ->
GHC.Types.True
`cast` (Sym Data.Monoid.N:All[0] :: Coercible Bool All);
: y ys ->
case y of {
False ->
GHC.Types.False
`cast` (Nth:3
(Nth:3
((Sym Data.Monoid.N:All[0]
-> Sym Data.Monoid.N:All[0] -> <Bool>_R)
; (<All>_R -> <All>_R -> Sym Data.Monoid.N:All[0])))
:: Coercible Bool All);
True -> go ys
}
}; } in
(go xs) `cast` (Data.Monoid.N:All[0] :: Coercible All Bool)
```
If you squint and ignore all the casts, these two expressions are the same. So it would be very desirable to get a `joinrec` for `and xs` as well.
Note that all “exit paths” of the recursive function cast from `Bool` to `All`, and the return value of the whole recursion casts back. That looks as if some smart floating of coercions could do the job. Maybe the common context transformation (https://mail.haskell.org/pipermail/ghc-devs/2013-December/003481.html)?
Is this tied to GHC’s deviation from the paper in having a fixed return type for a join point?
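For concreteness, the cast-floated form that such a transformation would produce might look roughly like this (a hand-written sketch in Core notation, not GHC output):

```
joinrec {
  go :: [Bool] -> Bool
  -- the per-branch casts to All have been floated to the exits and
  -- cancelled against the final cast back to Bool
  go (ds :: [Bool])
    = case ds of {
        [] -> GHC.Types.True;
        : y ys ->
          case y of {
            False -> GHC.Types.False;
            True -> jump go ys
          }
      }; } in
jump go xs
```

which is exactly the `joinrec` obtained for `foldr @[] (&&) True xs` above.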
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.2.1 |
| Type | Task |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Casts get in the way of join points","status":"New","operating_system":"","component":"Compiler","related":[],"milestone":"","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"8.2.1","keywords":["JoinPoints"],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Task","description":"I was checking which of HLint’s rewirte rules are actually already known to the compiler in the sense that the simplifier can do them (using [ghc-proofs](http://hackage.haskell.org/package/ghc-proofs). I stumbled on this interesting case.\r\n\r\n`foldr @[] (&&) True xs` compiles down to this nice core\r\n{{{\r\n joinrec {\r\n go [Occ=LoopBreaker] :: [Bool] -> Bool\r\n [LclId[JoinId(1)],\r\n Arity=1,\r\n Unf=Unf{Src=<vanilla>, TopLvl=False, Value=True, ConLike=True,\r\n WorkFree=True, Expandable=True, Guidance=IF_ARGS [30] 44 20}]\r\n go (ds :: [Bool])\r\n = case ds of {\r\n [] -> GHC.Types.True;\r\n : y ys ->\r\n case y of {\r\n False -> GHC.Types.False;\r\n True -> jump go ys\r\n }\r\n }; } in\r\n jump go xs\r\n}}}\r\n\r\n\r\nBut `and @[] xs` (which HLint suggests to use instead, and we surely want people to use!) compiles down to\r\n{{{\r\n go [Occ=LoopBreaker] :: [Bool] -> All\r\n [LclId,\r\n Arity=1,\r\n Unf=Unf{Src=<vanilla>, TopLvl=False, Value=True, ConLike=True,\r\n WorkFree=True, Expandable=True, Guidance=IF_ARGS [30] 60 20}]\r\n go\r\n = \\ (ds :: [Bool]) ->\r\n case ds of {\r\n [] ->\r\n GHC.Types.True\r\n `cast` (Sym Data.Monoid.N:All[0] :: Coercible Bool All);\r\n : y ys ->\r\n case y of {\r\n False ->\r\n GHC.Types.False\r\n `cast` (Nth:3\r\n (Nth:3\r\n ((Sym Data.Monoid.N:All[0]\r\n -> Sym Data.Monoid.N:All[0] -> <Bool>_R)\r\n ; (<All>_R -> <All>_R -> Sym Data.Monoid.N:All[0])))\r\n :: Coercible Bool All);\r\n True -> go ys\r\n }\r\n }; } in\r\n (go xs) `cast` (Data.Monoid.N:All[0] :: Coercible All Bool)\r\n}}}\r\n\r\nIf you squint and ignore all the casts, these two expressions are the same. 
So it would be very desirable to get a `joinrec` for `and xs` as well.\r\n\r\nNote that all “exit paths” of the return function cast from `Bool` to `All`, and the return value of the whole recursion casts back. That looks as if some smart floating of coercions could do the job. Maybe the common context transformation (https://mail.haskell.org/pipermail/ghc-devs/2013-December/003481.html)?\r\n\r\nIs this tied to GHC’s deviation of the paper to have a fixed return type for a join point?","type_of_failure":"OtherFailure","blocking":[]} -->https://gitlab.haskell.org/ghc/ghc/-/issues/14137Do more inlining into non-recursive join points2021-10-18T13:22:55ZSimon Peyton JonesDo more inlining into non-recursive join pointsIn pursuing #14068, [Joachim says](https://phabricator.haskell.org/D3811#107745) found this code:
```
let {
ds5 :: [[Int]]
ds5 = case ==# x1 ww1 of {
__DEFAULT -> go1 (+# x1 1#);
1# -> n
} } in
join {
lvl6 :: [[Int]]
lvl6 = (ds4 : y) : ds5} in
joinrec {
safe :: Int -> Int -> [Int] -> [[Int]]
safe (x2 :: Int) (d :: Int) (ds6 :: [Int])
= case ds6 of {
[] -> jump lvl6;
: q l ->
case x2 of wild6 { I# x3 ->
case q of { I# y1 ->
case /=# x3 y1 of {
__DEFAULT -> ds5;
1# ->
case d of { I# y2 ->
case /=# x3 (+# y1 y2) of {
__DEFAULT -> ds5;
1# ->
case /=# x3 (-# y1 y2) of {
__DEFAULT -> ds5;
1# ->
jump safe
wild6 (I# (+# y2 1#)) l
}}}}}}
}; } in
jump safe ds4 lvl y; } in ...
```
This is terrible! `lvl6` is a join point, but `ds5` is not. And it's easy to fix: we should simply float `ds5` into `lvl6`'s rhs.
**Background**. In general, if GHC sees this
```
-- Version A
x = f x : g y
```
it'll float out the `(f x)` and `(g y)` parts thus (B):
```
-- Version B
a1 = f x
a2 = g y
x = a1 : a2
```
Reasons:
1. In the scope of this binding we might have `case x of (a:b) -> rhs`, and the new form allows us to eliminate the case.
1. Version A builds a thunk (which, when eval'd) builds thunks for `(f x)` and `(g y)` and returns a cons; whereas Version B builds the two thunks ahead of time and allocates a cons directly. If `x` gets evaluated this is a win, by avoiding an extra thunk. We measured this in our let-floating paper, and B is much better on average.
But if `x` is a join point, all this is wrong wrong wrong:
1. A join point is never scrutinised by a nested case, so there is no benefit in floating.
1. A join point is not implemented by a thunk, so there is no thunk alloc/update to avoid.
In fact, for join points we want to turn Version B into Version A!
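Spelled out for a join point (a sketch reusing the names above, with the binder renamed to `j` to keep it apart from the free `x`):

```
-- Version B, join-point variant: a1 and a2 are floated out, but nothing
-- is saved, because a join point is not a thunk, so there is no
-- allocation or update to avoid
a1 = f x
a2 = g y
join j = a1 : a2 in ...

-- Version A, join-point variant: what we want, with a1 and a2 inlined
-- straight back
join j = f x : g y in ...
```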
Specifically:
- In `Simplify`, do not call `prepareRhs` on join RHSs; nor do floating (see `tick LetFloatFromLet`) from the RHS of a join. In fact this seems to be already the case.
- In `OccurAnal`, do not use `rhsCtxt` for the RHS of a non-recursive join point (see `occAnalNonRecRhs`). Setting this context makes `a1` and `a2` get "inside-lam" occurrences (see `occAnalApp`), and that in turn stops `a1` and `a2` getting inlined straight back into `x`'s RHS in Version B. But for join points we **want** them to be inlined, to get back to Version A.
For a recursive join point we probably do not want to inline `a1` and `a2` because then they'd be allocated each time round the loop. But in fact we can't have a recursive join point whose RHS is just a cons, so it doesn't really arise. The point is: we only need to take care with `rhsCtxt` for non-recursive join points.
Bottom line: just that one fix to `OccurAnal` should do the job.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.2.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Do more inlining into non-recursive join points","status":"New","operating_system":"","component":"Compiler","related":[],"milestone":"","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"8.2.1","keywords":[],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Bug","description":"In pursuing #14068, [https://phabricator.haskell.org/D3811#107745 Joahim says] found this code:\r\n{{{\r\nlet { \r\n ds5 :: [[Int]] \r\n ds5 = case ==# x1 ww1 of { \r\n __DEFAULT -> go1 (+# x1 1#); \r\n 1# -> n \r\n } } in \r\njoin { \r\n lvl6 :: [[Int]] \r\n lvl6 = (ds4 : y) : ds5} in \r\njoinrec { \r\n safe :: Int -> Int -> [Int] -> [[Int]] \r\n safe (x2 :: Int) (d :: Int) (ds6 :: [Int]) \r\n = case ds6 of { \r\n [] -> jump lvl6; \r\n : q l -> \r\n case x2 of wild6 { I# x3 -> \r\n case q of { I# y1 -> \r\n case /=# x3 y1 of { \r\n __DEFAULT -> ds5; \r\n 1# -> \r\n case d of { I# y2 -> \r\n case /=# x3 (+# y1 y2) of { \r\n __DEFAULT -> ds5; \r\n 1# -> \r\n case /=# x3 (-# y1 y2) of { \r\n __DEFAULT -> ds5; \r\n 1# -> \r\n jump safe \r\n wild6 (I# (+# y2 1#)) l \r\n }}}}}}\r\n }; } in \r\njump safe ds4 lvl y; } in ...\r\n}}}\r\nThis is terrible! `lvl6` is a join point, but `ds5` is not. And it's easy to fix: we should simply float `ds5` into `lvl6`'s rhs.\r\n\r\n**Backgound**. In general, if GHC sees this\r\n{{{\r\n-- Version A\r\nx = f x : g y\r\n}}}\r\nit'll float out the `(f x)` and `(g y)` parts thus (B):\r\n{{{\r\n-- Version B\r\na1 = f x\r\na2 = g y\r\nx = a1 : a2\r\n}}}\r\nReasons:\r\n\r\n1. In the scope of this binding we might have `case x of (a:b) -> rhs`, and the new form allows us to eliminate the case.\r\n\r\n2. Version A builds a thunk (which, when eval'd) builds thunks for `(f x)` and `(g y)` and returns a cons; whereas Version B builds the two thunks ahead of time and allocates a cons directly. If `x` gets evaluated this is a win, by avoiding an extra thunk. 
We measured this in our let-floating paper, and B is much better on average.\r\n\r\nBut if `x` is a join point, all this is wrong wrong wrong:\r\n\r\n1. A join point is never scrutinised by a nested case, so there is no benefit in floating.\r\n2. A join point is not implemented by a thunk, so there is no thunk alloc/update to avoid.\r\n\r\nIn fact, for join points we want to turn Version B into Version A!\r\n\r\nSpecifically:\r\n\r\n* In `Simplify`, do not call `prepareRhs` on join RHSs; nor do floating (see `tick LetFloatFromLet`) from the RHS of a join. In fact this seems to be already the case.\r\n\r\n* In `OccurAnal`, do not use `rhsCtxt` for the RHS of a non-recursive join point (see `occAnalNonRecRhs`). Setting this context makes `a1` and `a2` get \"inside-lam\" occurrences (see `occAnalApp`), and that in turn stops `a1` and `a2` getting inlined straight back into `x`'s RHS in Version B. But for join points we **want** them to be inlined, to get back to Version A.\r\n\r\n For a recursive join point we probably do not want to inline `a1` and `a2` because then they'd be allocated each time round the loop. But in fact we can't have a recursive join point whose RHS is just a cons, so it doesn't really arise. The point is: we only need to take care with `rhsCtxt` for non-recursive join points.\r\n\r\nBottom line: just that one fix to `OccurAnal` should do the job.\r\n\r\n\r\n","type_of_failure":"OtherFailure","blocking":[]} -->Joachim Breitnermail@joachim-breitner.deJoachim Breitnermail@joachim-breitner.dehttps://gitlab.haskell.org/ghc/ghc/-/issues/14068Loopification using join points2024-03-04T22:34:57ZJoachim Breitnermail@joachim-breitner.deLoopification using join pointsThis is the master ticket for doing **loopification in Core**. See also:
* `Note [Self-recursive tail calls]` in GHC.StgToCmm.Expr
* See wiki page: https://gitlab.haskell.org/ghc/ghc/-/wikis/commentary/compiler/loopification#new-idea-use-join-points
* #22227 has several examples of the value of loopification in Core
* #13966
* #14287
* See https://gitlab.haskell.org/ghc/ghc/issues/14068#note_140494 for a wrinkle.
* See https://gitlab.haskell.org/ghc/ghc/issues/22227#note_457985 for another wrinkle
## The idea
The function
```
let f x y = case y of
A -> f x' y'
B -> e2
C -> e3
in g f
```
is not turned into a recursive join point, because the call to `f` is not in tail call position. But the recursive calls are, and these matter performance-wise! Hence, it would be beneficial to turn this into
```
let f x y = joinrec $j x y = case y of
A -> $j x' y'
B -> e2
C -> e3
in $j x y
in g f
```
This has the additional effect that now `f` is no longer recursive and may get inlined.
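At the source level, the same trick can be played by hand today: rewrite a self-recursive function with a local loop so that the outer binding becomes non-recursive (and hence inlinable). A small self-contained sketch (the names are illustrative, not from the ticket):

```haskell
-- Directly recursive: the compiler sees a recursive binding.
count :: Int -> Int -> Int
count acc 0 = acc
count acc n = count (acc + 1) (n - 1)

-- "Loopified" by hand: the recursion is confined to the local 'go',
-- which can become a recursive join point; count' itself is
-- non-recursive and free to inline.
count' :: Int -> Int -> Int
count' acc0 n0 = go acc0 n0
  where
    go acc 0 = acc
    go acc n = go (acc + 1) (n - 1)
```

Both compute the same result; the ticket proposes having GHC perform this rewrite automatically.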
The idea is described under "New idea: use join points" in [Commentary/Compiler/Loopification](commentary/compiler/loopification).
Some notes:
- We should do this both at top level and for nested definitions.
- We can remove the "loopification" code from the code generator when this is done.
- It came up in #13966, and might go well with #14067.
- It might work well with #13051, which Thomas Jakway is still thinking about.
- Should fix #14287 too.
- See also [ticket:14068\#comment:140494](https://gitlab.haskell.org//ghc/ghc/issues/14068#note_140494) of #14620, for a wrinkle.Joachim Breitnermail@joachim-breitner.deJoachim Breitnermail@joachim-breitner.dehttps://gitlab.haskell.org/ghc/ghc/-/issues/14067Static Argument Transformation for tail-recursive functions2022-10-17T12:40:37ZJoachim Breitnermail@joachim-breitner.deStatic Argument Transformation for tail-recursive functionsIn #13966 it was determined that having a variant of the Static Argument Transformation (StaticArgumentTransformation) pass that would specifically work on recursive join points, would be beneficial. This ticket tracks this task.
Consider
```
joinrec $j x y = case y of
A -> $j x y'
B -> e2 x
C -> e3
in $j foo bar
```
Here the first argument to `$j` is "static"; that is, the same in every call. So we can transform like this
```
joinrec $j y = case y of
A -> $j y'
B -> e2 foo
C -> e3
in $j bar
```
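The same transformation is familiar at the source level; in this hand-written analogue (an illustrative example, not from the ticket), the static predicate `p` is captured by the inner loop instead of being passed on every iteration:

```haskell
-- 'p' is static: every recursive call would pass it along unchanged,
-- so we capture it in the closure of 'go' instead.
filterSAT :: (a -> Bool) -> [a] -> [a]
filterSAT p = go
  where
    go []     = []
    go (x:xs) = if p x then x : go xs else go xs
```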
Note that `x` isn't passed around in every iteration any more.https://gitlab.haskell.org/ghc/ghc/-/issues/13928Providing a more specific argument prevents fusion caused by join point float...2019-07-07T18:19:23ZMatthew PickeringProviding a more specific argument prevents fusion caused by join point floating.I don't know whether this is expected or not but I am writing it down here for the record.
Defining `any` as in section 5 of the paper "compiling without continuations" produces nice fused code as promised. However, fixing the predicate in `any` causes the fusion to stop happening, producing potentially worse code.
```
module ListFusion where
find :: (a -> Bool) -> [a] -> Maybe a
find p xs = go xs
where
go [] = Nothing
go (x:xs) = if p x then Just x else go xs
fuses :: (Int -> Bool) -> [Int] -> Bool
fuses p xs = case find p xs of
Just x -> True
Nothing -> False
fuseNot :: (Int -> Bool) -> [Int] -> Bool
fuseNot _p xs = case find (> 4) xs of
Just x -> True
Nothing -> False
```
Core output
```
[1 of 1] Compiling ListFusion ( listfusion.hs, listfusion.o )
==================== Tidy Core ====================
Result size of Tidy Core
= {terms: 87, types: 82, coercions: 0, joins: 2/2}
-- RHS size: {terms: 21, types: 20, coercions: 0, joins: 1/1}
find :: forall a. (a -> Bool) -> [a] -> Maybe a
[GblId,
Arity=2,
Caf=NoCafRefs,
Str=<L,C(U)><S,1*U>,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=2,unsat_ok=True,boring_ok=False)
Tmpl= \ (@ a_a1UH)
(p_aSB [Occ=OnceL!] :: a_a1UH -> Bool)
(xs_aSC [Occ=Once] :: [a_a1UH]) ->
joinrec {
go_s28Z [Occ=LoopBreakerT[1]] :: [a_a1UH] -> Maybe a_a1UH
[LclId[JoinId(1)], Arity=1, Unf=OtherCon []]
go_s28Z (ds_d27L [Occ=Once!] :: [a_a1UH])
= case ds_d27L of {
[] -> GHC.Base.Nothing @ a_a1UH;
: x_aSE xs1_aSF [Occ=Once] ->
case p_aSB x_aSE of {
False -> jump go_s28Z xs1_aSF;
True -> GHC.Base.Just @ a_a1UH x_aSE
}
}; } in
jump go_s28Z xs_aSC}]
find
= \ (@ a_a1UH) (p_aSB :: a_a1UH -> Bool) (xs_aSC :: [a_a1UH]) ->
joinrec {
go_s28Z [Occ=LoopBreaker] :: [a_a1UH] -> Maybe a_a1UH
[LclId[JoinId(1)], Arity=1, Str=<S,1*U>, Unf=OtherCon []]
go_s28Z (ds_d27L :: [a_a1UH])
= case ds_d27L of {
[] -> GHC.Base.Nothing @ a_a1UH;
: x_aSE xs1_aSF ->
case p_aSB x_aSE of {
False -> jump go_s28Z xs1_aSF;
True -> GHC.Base.Just @ a_a1UH x_aSE
}
}; } in
jump go_s28Z xs_aSC
-- RHS size: {terms: 19, types: 15, coercions: 0, joins: 1/1}
fuses :: (Int -> Bool) -> [Int] -> Bool
[GblId,
Arity=2,
Caf=NoCafRefs,
Str=<L,C(U)><S,1*U>,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=2,unsat_ok=True,boring_ok=False)
Tmpl= \ (p_aSG [Occ=OnceL!] :: Int -> Bool)
(xs_aSH [Occ=Once] :: [Int]) ->
joinrec {
go_s28X [Occ=LoopBreakerT[1]] :: [Int] -> Bool
[LclId[JoinId(1)], Arity=1, Unf=OtherCon []]
go_s28X (ds_d27L [Occ=Once!] :: [Int])
= case ds_d27L of {
[] -> GHC.Types.False;
: x_aSE [Occ=Once] xs1_aSF [Occ=Once] ->
case p_aSG x_aSE of {
False -> jump go_s28X xs1_aSF;
True -> GHC.Types.True
}
}; } in
jump go_s28X xs_aSH}]
fuses
= \ (p_aSG :: Int -> Bool) (xs_aSH :: [Int]) ->
joinrec {
go_s28X [Occ=LoopBreaker] :: [Int] -> Bool
[LclId[JoinId(1)], Arity=1, Str=<S,1*U>, Unf=OtherCon []]
go_s28X (ds_d27L :: [Int])
= case ds_d27L of {
[] -> GHC.Types.False;
: x_aSE xs1_aSF ->
case p_aSG x_aSE of {
False -> jump go_s28X xs1_aSF;
True -> GHC.Types.True
}
}; } in
jump go_s28X xs_aSH
-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
ListFusion.$trModule4 :: GHC.Prim.Addr#
[GblId,
Caf=NoCafRefs,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 20 0}]
ListFusion.$trModule4 = "main"#
-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
ListFusion.$trModule3 :: GHC.Types.TrName
[GblId,
Caf=NoCafRefs,
Str=m1,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 20}]
ListFusion.$trModule3 = GHC.Types.TrNameS ListFusion.$trModule4
-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
ListFusion.$trModule2 :: GHC.Prim.Addr#
[GblId,
Caf=NoCafRefs,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 40 0}]
ListFusion.$trModule2 = "ListFusion"#
-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
ListFusion.$trModule1 :: GHC.Types.TrName
[GblId,
Caf=NoCafRefs,
Str=m1,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 20}]
ListFusion.$trModule1 = GHC.Types.TrNameS ListFusion.$trModule2
-- RHS size: {terms: 3, types: 0, coercions: 0, joins: 0/0}
ListFusion.$trModule :: GHC.Types.Module
[GblId,
Caf=NoCafRefs,
Str=m,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 30}]
ListFusion.$trModule
= GHC.Types.Module ListFusion.$trModule3 ListFusion.$trModule1
Rec {
-- RHS size: {terms: 20, types: 13, coercions: 0, joins: 0/0}
ListFusion.fuseNot_go [Occ=LoopBreaker] :: [Int] -> Maybe Int
[GblId, Arity=1, Caf=NoCafRefs, Str=<S,1*U>]
ListFusion.fuseNot_go
= \ (ds_d27L :: [Int]) ->
case ds_d27L of {
[] -> GHC.Base.Nothing @ Int;
: x_aSE xs_aSF ->
case x_aSE of wild1_a28o { GHC.Types.I# x1_a28q ->
case GHC.Prim.tagToEnum# @ Bool (GHC.Prim.># x1_a28q 4#) of {
False -> ListFusion.fuseNot_go xs_aSF;
True -> GHC.Base.Just @ Int wild1_a28o
}
}
}
end Rec }
-- RHS size: {terms: 9, types: 7, coercions: 0, joins: 0/0}
fuseNot :: (Int -> Bool) -> [Int] -> Bool
[GblId,
Arity=2,
Caf=NoCafRefs,
Str=<L,A><S,1*U>,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=2,unsat_ok=True,boring_ok=False)
Tmpl= \ _ [Occ=Dead] (xs_aSK [Occ=Once] :: [Int]) ->
case ListFusion.fuseNot_go xs_aSK of {
Nothing -> GHC.Types.False;
Just _ [Occ=Dead] -> GHC.Types.True
}}]
fuseNot
= \ _ [Occ=Dead] (xs_aSK :: [Int]) ->
case ListFusion.fuseNot_go xs_aSK of {
Nothing -> GHC.Types.False;
Just x_a1U5 -> GHC.Types.True
}
```
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.0.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Providing a more specific argument prevents fusion caused by join point floating.","status":"New","operating_system":"","component":"Compiler","related":[],"milestone":"","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"8.0.1","keywords":["JoinPoints"],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Bug","description":"I don't know whether this is expected or not but I am writing it down here for the record.\r\n\r\nDefining `any` as in section 5 of the paper \"compiling without continuations\" produces nice fused code as promised. However, fixing the predicate in `any` causes the fusion to stop happening producing potentially worse code. \r\n\r\n{{{\r\nmodule ListFusion where\r\n\r\nfind :: (a -> Bool) -> [a] -> Maybe a\r\nfind p xs = go xs\r\n where\r\n go [] = Nothing\r\n go (x:xs) = if p x then Just x else go xs\r\n\r\nfuses :: (Int -> Bool) -> [Int] -> Bool\r\nfuses p xs = case find p xs of\r\n Just x -> True\r\n Nothing -> False\r\n\r\nfuseNot :: (Int -> Bool) -> [Int] -> Bool\r\nfuseNot _p xs = case find (> 4) xs of\r\n Just x -> True\r\n Nothing -> False\r\n}}}\r\n\r\nCore output\r\n\r\n{{{\r\n[1 of 1] Compiling ListFusion ( listfusion.hs, listfusion.o )\r\n\r\n==================== Tidy Core ====================\r\nResult size of Tidy Core\r\n = {terms: 87, types: 82, coercions: 0, joins: 2/2}\r\n\r\n-- RHS size: {terms: 21, types: 20, coercions: 0, joins: 1/1}\r\nfind :: forall a. (a -> Bool) -> [a] -> Maybe a\r\n[GblId,\r\n Arity=2,\r\n Caf=NoCafRefs,\r\n Str=<L,C(U)><S,1*U>,\r\n Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,\r\n WorkFree=True, Expandable=True,\r\n Guidance=ALWAYS_IF(arity=2,unsat_ok=True,boring_ok=False)\r\n Tmpl= \\ (@ a_a1UH)\r\n (p_aSB [Occ=OnceL!] 
:: a_a1UH -> Bool)\r\n (xs_aSC [Occ=Once] :: [a_a1UH]) ->\r\n joinrec {\r\n go_s28Z [Occ=LoopBreakerT[1]] :: [a_a1UH] -> Maybe a_a1UH\r\n [LclId[JoinId(1)], Arity=1, Unf=OtherCon []]\r\n go_s28Z (ds_d27L [Occ=Once!] :: [a_a1UH])\r\n = case ds_d27L of {\r\n [] -> GHC.Base.Nothing @ a_a1UH;\r\n : x_aSE xs1_aSF [Occ=Once] ->\r\n case p_aSB x_aSE of {\r\n False -> jump go_s28Z xs1_aSF;\r\n True -> GHC.Base.Just @ a_a1UH x_aSE\r\n }\r\n }; } in\r\n jump go_s28Z xs_aSC}]\r\nfind\r\n = \\ (@ a_a1UH) (p_aSB :: a_a1UH -> Bool) (xs_aSC :: [a_a1UH]) ->\r\n joinrec {\r\n go_s28Z [Occ=LoopBreaker] :: [a_a1UH] -> Maybe a_a1UH\r\n [LclId[JoinId(1)], Arity=1, Str=<S,1*U>, Unf=OtherCon []]\r\n go_s28Z (ds_d27L :: [a_a1UH])\r\n = case ds_d27L of {\r\n [] -> GHC.Base.Nothing @ a_a1UH;\r\n : x_aSE xs1_aSF ->\r\n case p_aSB x_aSE of {\r\n False -> jump go_s28Z xs1_aSF;\r\n True -> GHC.Base.Just @ a_a1UH x_aSE\r\n }\r\n }; } in\r\n jump go_s28Z xs_aSC\r\n\r\n-- RHS size: {terms: 19, types: 15, coercions: 0, joins: 1/1}\r\nfuses :: (Int -> Bool) -> [Int] -> Bool\r\n[GblId,\r\n Arity=2,\r\n Caf=NoCafRefs,\r\n Str=<L,C(U)><S,1*U>,\r\n Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,\r\n WorkFree=True, Expandable=True,\r\n Guidance=ALWAYS_IF(arity=2,unsat_ok=True,boring_ok=False)\r\n Tmpl= \\ (p_aSG [Occ=OnceL!] :: Int -> Bool)\r\n (xs_aSH [Occ=Once] :: [Int]) ->\r\n joinrec {\r\n go_s28X [Occ=LoopBreakerT[1]] :: [Int] -> Bool\r\n [LclId[JoinId(1)], Arity=1, Unf=OtherCon []]\r\n go_s28X (ds_d27L [Occ=Once!] 
:: [Int])\r\n = case ds_d27L of {\r\n [] -> GHC.Types.False;\r\n : x_aSE [Occ=Once] xs1_aSF [Occ=Once] ->\r\n case p_aSG x_aSE of {\r\n False -> jump go_s28X xs1_aSF;\r\n True -> GHC.Types.True\r\n }\r\n }; } in\r\n jump go_s28X xs_aSH}]\r\nfuses\r\n = \\ (p_aSG :: Int -> Bool) (xs_aSH :: [Int]) ->\r\n joinrec {\r\n go_s28X [Occ=LoopBreaker] :: [Int] -> Bool\r\n [LclId[JoinId(1)], Arity=1, Str=<S,1*U>, Unf=OtherCon []]\r\n go_s28X (ds_d27L :: [Int])\r\n = case ds_d27L of {\r\n [] -> GHC.Types.False;\r\n : x_aSE xs1_aSF ->\r\n case p_aSG x_aSE of {\r\n False -> jump go_s28X xs1_aSF;\r\n True -> GHC.Types.True\r\n }\r\n }; } in\r\n jump go_s28X xs_aSH\r\n\r\n-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}\r\nListFusion.$trModule4 :: GHC.Prim.Addr#\r\n[GblId,\r\n Caf=NoCafRefs,\r\n Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,\r\n WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 20 0}]\r\nListFusion.$trModule4 = \"main\"#\r\n\r\n-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}\r\nListFusion.$trModule3 :: GHC.Types.TrName\r\n[GblId,\r\n Caf=NoCafRefs,\r\n Str=m1,\r\n Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,\r\n WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 20}]\r\nListFusion.$trModule3 = GHC.Types.TrNameS ListFusion.$trModule4\r\n\r\n-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}\r\nListFusion.$trModule2 :: GHC.Prim.Addr#\r\n[GblId,\r\n Caf=NoCafRefs,\r\n Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,\r\n WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 40 0}]\r\nListFusion.$trModule2 = \"ListFusion\"#\r\n\r\n-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}\r\nListFusion.$trModule1 :: GHC.Types.TrName\r\n[GblId,\r\n Caf=NoCafRefs,\r\n Str=m1,\r\n Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,\r\n WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 20}]\r\nListFusion.$trModule1 = GHC.Types.TrNameS ListFusion.$trModule2\r\n\r\n-- RHS size: 
{terms: 3, types: 0, coercions: 0, joins: 0/0}\r\nListFusion.$trModule :: GHC.Types.Module\r\n[GblId,\r\n Caf=NoCafRefs,\r\n Str=m,\r\n Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,\r\n WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 30}]\r\nListFusion.$trModule\r\n = GHC.Types.Module ListFusion.$trModule3 ListFusion.$trModule1\r\n\r\nRec {\r\n-- RHS size: {terms: 20, types: 13, coercions: 0, joins: 0/0}\r\nListFusion.fuseNot_go [Occ=LoopBreaker] :: [Int] -> Maybe Int\r\n[GblId, Arity=1, Caf=NoCafRefs, Str=<S,1*U>]\r\nListFusion.fuseNot_go\r\n = \\ (ds_d27L :: [Int]) ->\r\n case ds_d27L of {\r\n [] -> GHC.Base.Nothing @ Int;\r\n : x_aSE xs_aSF ->\r\n case x_aSE of wild1_a28o { GHC.Types.I# x1_a28q ->\r\n case GHC.Prim.tagToEnum# @ Bool (GHC.Prim.># x1_a28q 4#) of {\r\n False -> ListFusion.fuseNot_go xs_aSF;\r\n True -> GHC.Base.Just @ Int wild1_a28o\r\n }\r\n }\r\n }\r\nend Rec }\r\n\r\n-- RHS size: {terms: 9, types: 7, coercions: 0, joins: 0/0}\r\nfuseNot :: (Int -> Bool) -> [Int] -> Bool\r\n[GblId,\r\n Arity=2,\r\n Caf=NoCafRefs,\r\n Str=<L,A><S,1*U>,\r\n Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,\r\n WorkFree=True, Expandable=True,\r\n Guidance=ALWAYS_IF(arity=2,unsat_ok=True,boring_ok=False)\r\n Tmpl= \\ _ [Occ=Dead] (xs_aSK [Occ=Once] :: [Int]) ->\r\n case ListFusion.fuseNot_go xs_aSK of {\r\n Nothing -> GHC.Types.False;\r\n Just _ [Occ=Dead] -> GHC.Types.True\r\n }}]\r\nfuseNot\r\n = \\ _ [Occ=Dead] (xs_aSK :: [Int]) ->\r\n case ListFusion.fuseNot_go xs_aSK of {\r\n Nothing -> GHC.Types.False;\r\n Just x_a1U5 -> GHC.Types.True\r\n }\r\n}}}","type_of_failure":"OtherFailure","blocking":[]} -->https://gitlab.haskell.org/ghc/ghc/-/issues/13236Improve floating for join points2019-09-02T13:03:36ZSimon Peyton JonesImprove floating for join pointsHaving looked at the code in `SetLevels`, am very uncomfortable. My nose tells me that there is far too much chuffing about; it all makes my head spin.
**Question 1**. Why do we ever "ruin" a join point? See
```
Note [When to ruin a jo...Having looked at the code in `SetLevels`, am very uncomfortable. My nose tells me that there is far too much chuffing about; it all makes my head spin.
**Question 1**. Why do we ever "ruin" a join point? See
```
Note [When to ruin a join point]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Generally, we protect join points zealously. However, there are two situations
in which it can pay to promote a join point to a function:
1. If the join point has no value arguments, then floating it outward will make
it a *thunk*, not a function, so we might get increased sharing.
2. If we float the join point all the way to the top level, it still won't be
allocated, so the cost is much less.
Refusing to lose a join point in either of these cases can be disastrous---for
instance, allocation in imaginary/x2n1 *triples* because $w$s^ becomes too big
to inline, which prevents Float In from making a particular binding strictly
demanded.
```
But I don't agree at all. If we have
```
let-join j = e in b
```
then we can leave `j` in place as a join point. We can float `e` (via `lvlMFE`) if doing so would
increase sharing. Indeed this applies uniformly to all join points. For example
```
f x = let g y = let-join j z1 z2 = expensive x
in case y of
A p q -> j p q
B r -> j r r
C -> True
```
Here it would make sense to float `(expensive x)` out of the `\y` abstraction:
```
f x = let lvl = expensive x
g y = let-join j z1 z2 = lvl
in case y of
A p q -> j p q
B r -> j r r
C -> True
```
But doing so does not affect the join point `j`. Nullary join points are no different.
This includes floating to the top level. Incidentally, the RHS of the join point then becomes tiny, so the join point will be inlined.
In short, I think we can just delete all this special code.
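To make the sharing argument concrete, here is a minimal runnable sketch. It is illustrative only (the names `expensive`, `g1`, `g2`, and `lvl` are hypothetical, not from the ticket), and it uses an `IORef` counter to make the cost observable at the source level, standing in for the work that `lvlMFE`-style floating saves:

```haskell
import Data.IORef

main :: IO ()
main = do
  counter <- newIORef (0 :: Int)
  -- Hypothetical "expensive x": bumps the counter so we can see how often it runs.
  let expensive :: Int -> IO Int
      expensive x = modifyIORef' counter (+ 1) >> pure (x * x)

  -- Unfloated: the expensive computation sits under the \y, so it reruns per call.
  let g1 :: Int -> IO Int
      g1 y = do e <- expensive 10; pure (e + y)
  _ <- g1 1
  _ <- g1 2
  unfloated <- readIORef counter      -- two evaluations so far

  -- Floated (the lvl binding from the example above): compute once, share it.
  lvl <- expensive 10
  let g2 :: Int -> IO Int
      g2 y = pure (lvl + y)
  _ <- g2 1
  _ <- g2 2
  total <- readIORef counter          -- only one more evaluation
  print (unfloated, total - unfloated)
```

Running it prints `(2,1)`: two evaluations for the unfloated version against one for the floated one, while the join point itself (not modelled here) would be left untouched.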
**Question 2**: `Note [Join points and MFEs]`. Why do we *ever* float out an MFE that has a free join variable? SLPJ claim: if there is a free join variable, do not float it anywhere.
**Question 3**: Do we actually need to float join points *at all*? Why?
I think the reason is just to make them small:
```
let-join j1 x = let-join j2 y = y+1
in ...
```
Here, perhaps floating `j2` out of `j1` would make `j1` small enough to inline. But if that is the only motivation (unlike most of `FloatOut`, which is about saving work), we'd better say so loud and clear. And the question is a bit more complicated; e.g. we might want to abstract over Ids to achieve this:
```
let-join j1 x = let-join j2 y = x+y
in ...
===>
let-join j2' x y = x+y
j1 x = let-join j2 y = j2' x y
in ...
```
Now we can inline `j2`. (The example would be more realistic if the `x+y` was a big expression.) It's not just join parameters; we can abstract over any free variables:
```
let-join j1 x = let p = x+y in
let-join j2 z = p+z
in ....
```
Here we could abstract `j2` over `p` in order to float it.
It is not clear how best to do this; but I worry that we are asking `FloatOut` to do too much.
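For concreteness, here is a source-level analogue of the abstraction step in Question 3. This is only a sketch: an ordinary `let` stands in for `let-join` (GHC discovers join points in Core, not in source), the names follow the ticket's examples, and the bodies are illustrative:

```haskell
-- Before: j2's body mentions the free variable x, so j2 cannot be floated
-- past the binding of x.
j1Before :: Int -> Int
j1Before x =
  let j2 y = x + y          -- x is free in j2's body
  in j2 (x * 2)

-- After: abstract the body over x and float it out as j2'; the residual
-- j2 is tiny, so it can be inlined at every call site.
j2' :: Int -> Int -> Int
j2' x y = x + y

j1After :: Int -> Int
j1After x =
  let j2 y = j2' x y        -- now trivially small
  in j2 (x * 2)

main :: IO ()
main = print (j1Before 3, j1After 3)
```

Both versions compute the same result (here `(9,9)`); the point is only that the second shape leaves a residual binding small enough to inline.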
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.0.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>