GHC issues
https://gitlab.haskell.org/ghc/ghc/-/issues

https://gitlab.haskell.org/ghc/ghc/-/issues/9792
map/coerce rule does not fire until the coercion is known (David Feuer)

If I write a nice, simple use, like:
```hs
myPotato = map coerce
```
or even
```hs
myPotato = Data.OldList.map coerce
```
the "`map/coerce`" rule does not fire.
It looks like there are two things getting in the way:
1. The `map` rule fires first, turning `map` into a `build/foldr` form, which `map/coerce` does not recognize.
1. Even if `map` gets written back, `coerce` has been expanded into something the rule doesn't recognize.
So we end up with something that looks like
```hs
myPotato =
\ (@ a_are)
(@ b_arf)
($dCoercible_arz :: Coercible a_are b_arf)
(lst_aqx :: [a_are]) ->
map
@ a_are
@ b_arf
(\ (tpl_B2 [OS=ProbOneShot] :: a_are) ->
case $dCoercible_arz
of _ [Occ=Dead] { GHC.Types.MkCoercible tpl1_B3 ->
tpl_B2 `cast` (tpl1_B3 :: a_are ~R# b_arf)
})
lst_aqx
```
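For context, the rule under discussion lives in `GHC.Base` and matches the application `map coerce` purely syntactically, which is why both rewriting `map` into build/foldr form and expanding `coerce` into a cast defeat it. It is (roughly):
```hs
{-# RULES "map/coerce" map coerce = coerce #-}
```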
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------------------------------ |
| Version | 7.9 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Core Libraries |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | core-libraries-committee@haskell.org |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"map/coerce rule never seems to fire","status":"New","operating_system":"","component":"Core Libraries","related":[],"milestone":"7.10.1","resolution":"Unresolved","owner":{"tag":"OwnedBy","contents":"ekmett"},"version":"7.9","keywords":[],"differentials":[],"test_case":"","architecture":"","cc":["core-libraries-committee@haskell.org"],"type":"Bug","description":"If I write a nice, simple use, like:\r\n\r\n{{{#!hs\r\nmyPotato = map coerce\r\n}}}\r\n\r\nor even\r\n\r\n{{{#!hs\r\nmyPotato = Data.OldList.map coerce\r\n}}}\r\n\r\nthe \"`map/coerce`\" rule does not fire.\r\n\r\nIt looks like there are two things getting in the way:\r\n\r\n1. The `map` rule fires first, turning `map` into a `build/foldr` form, which `map/coerce` does not recognize.\r\n\r\n2. Even if `map` gets written back, `coerce` has been expanded into something the rule doesn't recognize.\r\n\r\nSo we end up with something that looks like\r\n\r\n{{{#!hs\r\nmyPotato =\r\n \\ (@ a_are)\r\n (@ b_arf)\r\n ($dCoercible_arz :: Coercible a_are b_arf)\r\n (lst_aqx :: [a_are]) ->\r\n map\r\n @ a_are\r\n @ b_arf\r\n (\\ (tpl_B2 [OS=ProbOneShot] :: a_are) ->\r\n case $dCoercible_arz\r\n of _ [Occ=Dead] { GHC.Types.MkCoercible tpl1_B3 ->\r\n tpl_B2 `cast` (tpl1_B3 :: a_are ~R# b_arf)\r\n })\r\n lst_aqx\r\n}}}","type_of_failure":"OtherFailure","blocking":[]} -->8.0.1Edward KmettEdward Kmetthttps://gitlab.haskell.org/ghc/ghc/-/issues/9786Make quot/rem/div/mod with known divisors fast2023-07-08T16:00:31ZDavid FeuerMake quot/rem/div/mod with known divisors fastGHC (with NCG) currently optimizes `Int` division by powers of two, but not by other known divisors. The Intel Optimization Manual section 9.2.4 describes a general technique for replacing division by known, sufficiently small, divisors ...GHC (with NCG) currently optimizes `Int` division by powers of two, but not by other known divisors. The Intel Optimization Manual section 9.2.4 describes a general technique for replacing division by known, sufficiently small, divisors with multiplication. LLVM appears to go further in some fashion. There's no reason we can't do something similar.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.9 |
| Type | Task |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Make quot/rem/div/mod with known divisors fast","status":"New","operating_system":"","component":"Compiler","related":[],"milestone":"7.12.1","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"7.9","keywords":[],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Task","description":"GHC (with NCG) currently optimizes `Int` division by powers of two, but not by other known divisors. The Intel Optimization Manual section 9.2.4 describes a general technique for replacing division by known, sufficiently small, divisors with multiplication. LLVM appears to go further in some fashion. There's no reason we can't do something similar.","type_of_failure":"OtherFailure","blocking":[]} -->8.0.1https://gitlab.haskell.org/ghc/ghc/-/issues/9251ghc does not expose branchless max/min operations as primops2020-04-29T13:20:47ZCarter Schonwaldghc does not expose branchless max/min operations as primopsIt was pointed out in #9246 that GHC currently does not expose primops for Max and Min on various unlifted types such as Float\# and Double\# and Int\# and Word\#, but that on most modern CPU architectures there is instruction level supp...It was pointed out in #9246 that GHC currently does not expose primops for Max and Min on various unlifted types such as Float\# and Double\# and Int\# and Word\#, but that on most modern CPU architectures there is instruction level support for these operations. (Eg, Float\# and \#Double min and max can be provided on any 64bit x86 system pretty easily )
This ticket is to track doing that task. I'll probably do it one of the next two weekends to get back in the GHC hacking groove, or maybe this should be used as a "getting started" ticket for a new contributor?
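For illustration, here is a sketch in plain Haskell (not a primop) of the branch-free select-by-mask idiom that an integer min primop could lower to; floating-point min/max map directly onto instructions like SSE2's minsd/maxsd:
```hs
import Data.Bits (xor, (.&.))

-- Branch-free minimum: 'mask' is all ones when x < y and all zeros otherwise,
-- so the xor/and combination selects x or y without a conditional jump.
branchlessMin :: Int -> Int -> Int
branchlessMin x y = y `xor` ((x `xor` y) .&. mask)
  where mask = negate (fromEnum (x < y))

main :: IO ()
main = print [branchlessMin 3 7, branchlessMin 7 3, branchlessMin (-2) 5]
```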
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.8.2 |
| Type | Task |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"ghc does not expose brancheless max/min operations as primops","status":"New","operating_system":"","component":"Compiler","related":[],"milestone":"7.10.1","resolution":"Unresolved","owner":{"tag":"OwnedBy","contents":"carter"},"version":"7.8.2","keywords":[],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Task","description":"It was pointed out in #9246 that GHC currently does not expose primops for Max and Min on various unlifted types such as Float# and Double# and Int# and Word#, but that on most modern CPU architectures there is instruction level support for these operations. (Eg, Float# and #Double min and max can be provided on any 64bit x86 system pretty easily )\r\n\r\nThis ticket is to track doing that task. I'll probably do it one of the next two weekends to get back in the GHC hacking groove, or maybe this should be used as a \"getting started\" ticket for a new contributor? \r\n\r\n","type_of_failure":"OtherFailure","blocking":[]} -->8.0.1https://gitlab.haskell.org/ghc/ghc/-/issues/8733I/O manager causes unnecessary syscalls in send/recv loops2019-07-07T18:43:44ZtibbeI/O manager causes unnecessary syscalls in send/recv loopsNetwork applications often call `send` followed by `recv`, to send a message and then read an answer. This causes syscall traces like this one:
```
recvfrom(13, ) -- Haskell thread A
sendto(13, ) -- Haske...Network applications often call `send` followed by `recv`, to send a message and then read an answer. This causes syscall traces like this one:
```
recvfrom(13, ) -- Haskell thread A
sendto(13, ) -- Haskell thread A
recvfrom(13, ) = -1 EAGAIN -- Haskell thread A
epoll_ctl(3, ) -- Haskell thread A (a job for the IO manager)
recvfrom(14, ) -- Haskell thread B
sendto(14, ) -- Haskell thread B
recvfrom(14, ) = -1 EAGAIN -- Haskell thread B
epoll_ctl(3, ) -- Haskell thread B (a job for the IO manager)
```
The `recvfrom` call that follows the `sendto` always fails, as the response from the partner we're communicating with won't be available right after we send the request.
We ought to consider descheduling the thread as soon as sending is "done". The hard part is to figure out when that is.
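For reference, a minimal sketch (using the `network` package; names are illustrative) of the request/response pattern that produces this trace:
```hs
import qualified Data.ByteString as B
import Network.Socket (Socket)
import Network.Socket.ByteString (recv, sendAll)

-- One round trip: the recv issued immediately after sendAll is the call that
-- almost always returns EAGAIN, costing an epoll_ctl and a reschedule.
roundTrip :: Socket -> B.ByteString -> IO B.ByteString
roundTrip sock request = do
  sendAll sock request   -- sendto(...)
  recv sock 4096         -- recvfrom(...) = -1 EAGAIN, then block in the IO manager
```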
See http://www.yesodweb.com/blog/2014/02/new-warp for a real-world example.

Milestone: 8.0.1

https://gitlab.haskell.org/ghc/ghc/-/issues/8732
Global big object heap allocator lock causes contention (tibbe)

The lock `allocate` takes when allocating big objects hurts the scalability of I/O-bound applications. `Network.Socket.ByteString.recv` is typically called with a buffer size of 4096, which causes a `ByteString` of that size to be allocated. The size of this `ByteString` causes it to be allocated from the big object space, which causes contention on the global lock that guards that space.
See http://www.yesodweb.com/blog/2014/02/new-warp for a real-world example.

Milestone: 8.0.1 | Assignee: Simon Marlow

https://gitlab.haskell.org/ghc/ghc/-/issues/8623
Strange slowness when using async library with FFI callbacks (JohnWiegley)

I've attached a Haskell and a C file, which are compiled as follows:
```
ghc -DSPEED_BUG=0 -threaded -O2 -main-is SpeedTest SpeedTest.hs SpeedTestC.c
```
You should find that with 7.4.2, 7.6.3, or a recent build of 7.8, building with `SPEED_BUG=0` produces an executable that takes more than a second to run, while building with `SPEED_BUG=1` produces one that runs very quickly. I've also attached the Core for both scenarios.

Milestone: 8.0.1 | Assignee: Simon Marlow

https://gitlab.haskell.org/ghc/ghc/-/issues/8279
bad alignment in code gen yields substantial perf issue (Carter Schonwald)

Independently, a number of folks have noticed that, in various ways, GHC currently has quite a few different memory-alignment-related performance problems that can have >= 10% perf impact!

Nicolas Frisby notes:

```
On my laptop, a program showed a consistent slowdown with -fdicts-strict
I didn't find any obvious causes in the Core differences, so I turned to Intel's
Performance Counter Monitor for measurements. After trying a few counters, I eventually
saw that there are about an order of magnitude more misaligned memory loads with
-fdicts-strict than without, so I think that may be a significant part of the slowdown.
I'm not sure if these are code or data reads.
Can anyone suggest how to validate this hypothesis about misaligned reads?
A subsequent commit has changed the behavior I was seeing, so I'm not interested
in alternatives means to determine if -fdicts-strict is somehow at fault — I'm just
asking specifically about data/code memory alignment in GHC and how to
diagnose/experiment with it.
```
Reid Barton has independently noted
```
so I did a nofib run with llvm libraries, ghc quickbuild
so there's this really simple benchmark tak,
https://github.com/ghc/nofib/blob/master/imaginary/tak/Main.hs
it doesn't use any libraries at all in the main loop because the Ints all get unboxed
but it's still 8% slower with quick-llvm (vs -fasm)
weird right?
[14:36:30] <carter> could you post the asm it generates for that function?
[14:36:49] <rwbarton> well it's identical between the two versions
<rwbarton> but they get linked at different offsets because some llvm sections are different sizes
<rwbarton> if I add a 128-byte symbol to the .text section to move it to the same address... then the llvm libs version is just as fast
<rwbarton> well, apparently 404000 is good and 403f70 is bad
<rwbarton> I guess I can test other alignments easily enough
<rwbarton> I imagine it wants to start on a cache line
<rwbarton> but I don't know if it's just a coincidence that it worked with the ncg libraries
<rwbarton> that it got a good location
<rwbarton> for this program every 32-byte aligned address is 10+% faster than any merely 16-byte aligned address
<rwbarton> and by alignment I mean alignment of the start of the Haskell code section
<carter> haswell, sandybridge, ivy bridge, other?
<rwbarton> dunno
<rwbarton> I have similar results on Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz
<rwbarton> and on Quad-Core AMD Opteron(tm) Processor 2374 HE
<carter> ok
<rwbarton> trying a patch now that aligns all *_entry symbols to 32 bytes
```
The key point in there is that, on the tak benchmark, better alignment for the code made a 10% perf difference on Core2 and Opteron CPUs!
Benjamin Scarlet and Luite are speculating that this may be further induced by tables-next-to-code (TNC) accidentally creating bad alignment, so there's cache-line pollution / conflicts between the L1 instruction cache and the data caches.
So one experiment would be to have the TNC transform pad after the table so that the function entry point starts on the next cache line?

Milestone: 8.0.1

https://gitlab.haskell.org/ghc/ghc/-/issues/7647
UNPACK polymorphic fields (liyang)

[ticket:3990#comment:56197](https://gitlab.haskell.org//ghc/ghc/issues/3990#note_56197) mentions the possibility of unpacking polymorphic fields. To quote:
> \[…\]
> What I mean by "polymorphic unpack" is this:
```
data Poly a = MkP Bool {-# UNPACK #-} a
data Mango = MkMango {-# UNPACK #-} (Poly Int)
```
> Now a value of type Poly t would be represented using two pointer fields, as usual (ie the UNPACK would have no direct effect on Poly). But a Mango value would be represented thus:
```
data Mango = MkMangoRep Bool Int#
MkMango :: Poly Int -> MangoRep
MkMango (MkP b (I# i)) = MkMangoRep b i
-- Pattern match (MkMango p -> rhs)
-- is transformed to (MkMangoRep b i -> let p = MkP b (I# i) in rhs)
```
Something like that would be rather nice. This ticket is just a reminder.
Thanks,
/Liyang
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | -------------- |
| Version | 7.6.1 |
| Type | FeatureRequest |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | #3990 |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"UNPACK polymorphic fields","status":"New","operating_system":"","component":"Compiler","related":[3990],"milestone":"","resolution":"Unresolved","owner":{"tag":"OwnedBy","contents":"simonpj"},"version":"7.6.1","keywords":[],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"FeatureRequest","description":"comment:9:ticket:3990 mentions the possibility of unpacking polymorphic fields. To quote:\r\n\r\n […]\r\n\r\n What I mean by \"polymorphic unpack\" is this:\r\n{{{\r\n data Poly a = MkP Bool {-# UNPACK #-} a \r\n data Mango = MkMango {-# UNPACK #-} (Poly Int)\r\n}}}\r\n Now a value of type Poly t would be represented using two pointer fields, as usual (ie the UNPACK would have no direct effect on Poly). But a Mango value would be represented thus:\r\n{{{\r\n data Mango = MkMangoRep Bool Int#\r\n\r\n MkMango :: Poly Int -> MangoRep\r\n MkMango (MkP b (I# i)) = MkMangoRep b i\r\n\r\n -- Pattern match (MkMango p -> rhs)\r\n -- is transformed to (MkMangoRep b i -> let p = MkP b (I# i) in rhs\r\n}}}\r\n\r\nSomething like that would be rather nice. This ticket is just a reminder.\r\n\r\nThanks,[[br]]\r\n/Liyang","type_of_failure":"OtherFailure","blocking":[]} -->8.0.1https://gitlab.haskell.org/ghc/ghc/-/issues/7542GHC doesn't optimize (strict) composition with id2020-02-26T21:24:22ZshachafGHC doesn't optimize (strict) composition with idNewtype constructors and selectors have no runtime overhead, but some uses of them do. For example, given `newtype Identity a = Identity { runIdentity :: a }`, `Identity` turns into `id`, but `Identity . f` turns into `id . f`, which is ...Newtype constructors and selectors have no runtime overhead, but some uses of them do. For example, given `newtype Identity a = Identity { runIdentity :: a }`, `Identity` turns into `id`, but `Identity . f` turns into `id . f`, which is distinct from `f`, because it gets eta-expanded to `\x -> f x`.
It would be nice to be able to compose a newtype constructor with a function without any overhead. The obvious thing to try is strict composition:
```
(#) :: (b -> c) -> (a -> b) -> a -> c
(#) f g = f `seq` g `seq` \x -> f (g x)
```
In theory this should get rid of the eta-expansion. In practice, the generated Core looks like this:
```
foo :: (a -> b) -> [a] -> [b]
foo f = map (id # f)
-- becomes
foo = \f e -> map (case f of g { __DEFAULT -> \x -> g x }) e
```
Different variations of `(#)`, and turning on `-fpedantic-bottoms`, don't seem to affect this. A simpler version, ``foo f = map (f `seq` \x -> f x)``, generates the same sort of Core.
In one library we resorted to defining a bunch of functions of the form `identityDot :: (a -> b) -> a -> Identity b; identityDot = unsafeCoerce`. It would be better to be able to rely on GHC to do the optimization directly, if we use strict composition anyway.
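(For what it's worth, `Data.Coerce`, which postdates this ticket, lets the workaround be written without `unsafeCoerce`; the operator name `#.` below follows the one later used by the `profunctors` package:)
```hs
import Data.Coerce (Coercible, coerce)

-- Zero-cost composition with a representationally-equal function: the first
-- argument is ignored at runtime and the second is coerced wholesale, so no
-- eta-expanding lambda is ever built.
(#.) :: Coercible b c => (b -> c) -> (a -> b) -> (a -> c)
(#.) _ = coerce

newtype Identity a = Identity { runIdentity :: a }

wrapAll :: (a -> b) -> [a] -> [Identity b]
wrapAll f = map (Identity #. f)
```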
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.6.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"GHC doesn't optimize (strict) composition with id","status":"New","operating_system":"","component":"Compiler","related":[],"milestone":"","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"7.6.1","keywords":[],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Bug","description":"Newtype constructors and selectors have no runtime overhead, but some uses of them do. For example, given `newtype Identity a = Identity { runIdentity :: a }`, `Identity` turns into `id`, but `Identity . f` turns into `id . f`, which is distinct from `f`, because it gets eta-expanded to `\\x -> f x`.\r\n\r\nIt would be nice to be able to compose a newtype constructor with a function without any overhead. The obvious thing to try is strict composition:\r\n\r\n{{{\r\n(#) :: (b -> c) -> (a -> b) -> a -> c\r\n(#) f g = f `seq` g `seq` \\x -> f (g x)\r\n}}}\r\n\r\nIn theory this should get rid of the eta-expansion. In practice, the generated Core looks like this:\r\n\r\n{{{\r\nfoo :: (a -> b) -> [a] -> [b]\r\nfoo f = map (id # f)\r\n-- becomes\r\nfoo = \\f e -> map (case f of g { __DEFAULT -> \\x -> g x }) e\r\n}}}\r\n\r\nDifferent variations of `(#)`, and turning `-fpedantic-bottoms` on, don't seem to affect this. A simpler version, `foo f = map (f `seq` \\x -> f x)`, generates the same sort of Core.\r\n\r\nIn one library we resorted to defining a bunch of functions of the form `identityDot :: (a -> b) -> a -> Identity b; identityDot = unsafeCoerce`. It would be better to be able to rely on GHC to do the optimization directly, if we use strict composition anyway.","type_of_failure":"OtherFailure","blocking":[]} -->8.0.1Simon Peyton JonesSimon Peyton Joneshttps://gitlab.haskell.org/ghc/ghc/-/issues/7511Room for GHC runtime improvement >~5%, inlining related2019-07-07T18:49:26ZdanielvRoom for GHC runtime improvement >~5%, inlining relatedI compare running nofib under GHC with default optimization flags vs. with somewhat extreme settings to inlining aggressiveness.
1. Many programs are unaffected, but some are improved significantly. Geometric mean: -6.5% allocs, -5.9% runtime.
1. Some regress significantly, beyond what the obvious code size growth would explain.
```
circsim +0.7% -4.5% -0.7% -0.9% +2.1%
comp_lab_zift +0.5% -0.1% -3.3% -3.3% +28.6%
fulsom +13.8% -2.6% +4.3% +4.3% +10.8%
mandel2 +0.4% +4.5% 0.01 0.01 +0.0%
paraffins +0.5% +0.0% +4.0% +2.9% +0.0%
rewrite +0.4% +15.9% 0.03 0.03 +0.0%
tak +0.1% +4.0% 0.02 0.02 +0.0%
treejoin -0.1% -0.0% +2.2% +1.7% +0.0%
wave4main +1.6% +29.4% +19.9% +19.9% +30.8%
```
This presents two opportunities:
1. Find better default flag settings (maybe not quite as extreme) and make them the default.
1. Find the reasons behind the regressions, and fix them in GHC. In addition to improving the performance we perceive through GHC, hopefully performance will become more predictable to users: Simon PJ has told me he expects (paraphrasing) "more inlining should make things better, except for/through code size", which would be a very useful invariant; the data here clearly show some cases where it does not hold.
Included are a complete report, an extract of the highlights (significantly improved or any regressed benchmarks) and the script to reproduce given a ghc7.6 devel2 build.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | --------------------- |
| Version | 7.6.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | simonpj@microsoft.com |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Room for GHC runtime improvement >~5%, inlining related","status":"New","operating_system":"","component":"Compiler","related":[],"milestone":"","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"7.6.1","keywords":[],"differentials":[],"test_case":"","architecture":"","cc":["simonpj@microsoft.com"],"type":"Bug","description":"I compare running nofib under GHC with default optimization flags vs. with somewhat extreme settings to inlining aggressiveness.\r\n\r\n1. Many programs are unaffected, but some are improved significantly. Geometric mean: -6.5% allocs, -5.9% runtime.\r\n\r\n2. Some regress significantly, beyond what the obvious code size growth would explain.\r\n{{{\r\ncircsim +0.7% -4.5% -0.7% -0.9% +2.1%\r\ncomp_lab_zift +0.5% -0.1% -3.3% -3.3% +28.6%\r\nfulsom +13.8% -2.6% +4.3% +4.3% +10.8%\r\nmandel2 +0.4% +4.5% 0.01 0.01 +0.0%\r\nparaffins +0.5% +0.0% +4.0% +2.9% +0.0%\r\nrewrite +0.4% +15.9% 0.03 0.03 +0.0%\r\ntak +0.1% +4.0% 0.02 0.02 +0.0%\r\ntreejoin -0.1% -0.0% +2.2% +1.7% +0.0%\r\nwave4main +1.6% +29.4% +19.9% +19.9% +30.8%\r\n}}}\r\n\r\nThis presents two opportunities:\r\n\r\n1. Find better default flags settings (maybe not quite as extreme) and make them the default.\r\n\r\n2. Find the reasons behind the regressions, and fix them in GHC. In addition to improving the performance we perceive through GHC, hopefully performance will become more predictable to users: Simon PJ has told me he expects (paraphrasing) \"more inlining should make things better, except for/through code size\", which would be a very useful invariant; the data here clearly show some cases where it does not hold.\r\n\r\nIncluded are a complete report, an extract of the highlights (significantly improved or any regressed benchmarks) and the script to reproduce given a ghc7.6 devel2 build.","type_of_failure":"OtherFailure","blocking":[]} -->8.0.1https://gitlab.haskell.org/ghc/ghc/-/issues/7300Allow CAFs kept reachable by FFI to be forcibly made unreachable for GC2019-07-07T18:50:28ZabsenceAllow CAFs kept reachable by FFI to be forcibly made unreachable for GCCAFs used by a foreign exported function are kept reachable the entire session because GHC can't know when the function will be called from C. If such a CAF is an evolving expression, like an FRP network, a space leak occurs because (I'm...CAFs used by a foreign exported function are kept reachable the entire session because GHC can't know when the function will be called from C. If such a CAF is an evolving expression, like an FRP network, a space leak occurs because (I'm guessing) the thunks that build up during iteration go all the way back to the initial CAF, and the GC can't start collecting because it considers the CAF reachable. According to JaffaCake on the \#haskell IRC channel, the runtime is capable of sovling this problem, it just needs a function that tells it to consider the specific CAF unreachable. It is then the responsibility of the user to not call the foreign exported function after the CAF is forced unreachable.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | -------------- |
| Version | 7.4.1 |
| Type | FeatureRequest |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler (FFI) |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Allow CAFs kept reachable by FFI to be forcibly made unreachable for GC","status":"New","operating_system":"","component":"Compiler (FFI)","related":[],"milestone":"","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"7.4.1","keywords":["caf","gc","unsafe"],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"FeatureRequest","description":"CAFs used by a foreign exported function are kept reachable the entire session because GHC can't know when the function will be called from C. If such a CAF is an evolving expression, like an FRP network, a space leak occurs because (I'm guessing) the thunks that build up during iteration go all the way back to the initial CAF, and the GC can't start collecting because it considers the CAF reachable. According to JaffaCake on the #haskell IRC channel, the runtime is capable of sovling this problem, it just needs a function that tells it to consider the specific CAF unreachable. It is then the responsibility of the user to not call the foreign exported function after the CAF is forced unreachable.","type_of_failure":"OtherFailure","blocking":[]} -->8.0.1https://gitlab.haskell.org/ghc/ghc/-/issues/7283Specialise INLINE functions2019-07-07T18:50:32Zrl@cse.unsw.edu.auSpecialise INLINE functionsQuick summary: At the moment, INLINE means inline a function if it is fully applied and '''don't use its unfolding** otherwise. I think we might want to change this to INLINE a function if it is fully applied and **treat it as to INLINAB...Quick summary: At the moment, INLINE means inline a function if it is fully applied and '''don't use its unfolding** otherwise. I think we might want to change this to INLINE a function if it is fully applied and **treat it as to INLINABLE''' otherwise.
Here is a small example:
```
module T where
f :: Num a => a -> a
{-# INLINE [1] f #-}
f = \x -> x+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1
g :: Num a => a -> a
{-# INLINE [1] g #-}
g x = x+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2
h :: Num a => a -> a
{-# INLINABLE [1] h #-}
h x = x+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3
apply :: (a -> b) -> a -> b
{-# NOINLINE apply #-}
apply f x = f x
```
```
module U where
import T
ff :: Int -> Int -> Int
ff x y = apply f x + apply f y
gg :: Int -> Int -> Int
gg x y = apply g x + apply g y
hh :: Int -> Int -> Int
hh x y = apply h x + apply h y
```
With -O2 -fno-cse (CSE does optimise this example but doesn't solve the problem of intermediate code bloat), GHC produces the following:
- The RHS of `f` is duplicated since it is inlined twice - **bad**.
- `g` is neither inlined nor specialised since it isn't fully applied - **bad**.
- `h` is specialised but its RHS isn't duplicated - **good**.
But of course, `h` isn't guaranteed to be inlined even if it is fully applied which, in the real-world examples I have in mind, is **bad**.
I think `INLINE [n] f` should mean that:
1. `f` will always be inlined if it is fully applied,
1. `f` will be specialised when possible,
1. specialisations of `f` will also be `INLINE [n]`.
I don't think it's possible to achieve this effect at the moment. If we treated `INLINE [n]` as `INLINABLE [n]` until the function is fully applied, we would get exactly this, though, except for 3 which probably isn't too hard to implement.
Now, if I understand correctly, INLINABLE also means that GHC is free to inline the function if it wants, but the function isn't treated as cheap. I think it would be fine if we did this for unsaturated `INLINE` functions rather than not inlining them under any circumstances. The original motivation for only inlining when fully applied was code bloat in DPH. But IIRC that only happened because INLINE functions were always cheap, and so GHC was very keen to inline them even when they weren't saturated, which wouldn't be the case with my proposal. We would have to check how this affects DPH and related libraries, though.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | -------------- |
| Version | 7.7 |
| Type | FeatureRequest |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Specialise INLINE functions","status":"New","operating_system":"","component":"Compiler","related":[],"milestone":"","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"7.7","keywords":[],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"FeatureRequest","description":"Quick summary: At the moment, INLINE means inline a function if it is fully applied and '''don't use its unfolding''' otherwise. I think we might want to change this to INLINE a function if it is fully applied and '''treat it as to INLINABLE''' otherwise.\r\n\r\nHere is a small example:\r\n\r\n{{{\r\nmodule T where\r\n\r\nf :: Num a => a -> a\r\n{-# INLINE [1] f #-}\r\nf = \\x -> x+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1\r\n\r\ng :: Num a => a -> a\r\n{-# INLINE [1] g #-}\r\ng x = x+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2+2\r\n\r\nh :: Num a => a -> a\r\n{-# INLINABLE [1] h #-}\r\nh x = x+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3+3\r\n\r\napply :: (a -> b) -> a -> b\r\n{-# NOINLINE apply #-}\r\napply f x = f x\r\n}}}\r\n\r\n{{{\r\nmodule U where\r\n\r\nimport T\r\n\r\nff :: Int -> Int -> Int\r\nff x y = apply f x + apply f y\r\n\r\ngg :: Int -> Int -> Int\r\ngg x y = apply g x + apply g y\r\n\r\nhh :: Int -> Int -> Int\r\nhh x y = apply h x + apply h y\r\n}}}\r\n\r\nWith -O2 -fno-cse (CSE does optimise this example but doesn't solve the problem of intermediate code bloat), GHC produces the following:\r\n\r\n * The RHS of `f` is duplicated since it is inlined twice - '''bad'''.\r\n * `g` is neither inlined nor specialised since it isn't fully applied - '''bad'''.\r\n * `h` is specialised but its RHS isn't duplicated - '''good'''.\r\n\r\nBut of course, `h` isn't guaranteed to be inlined even if it is fully applied which, in the real-world examples I have in mind, is '''bad'''.\r\n\r\nI think `INLINE [n] f` should mean that:\r\n\r\n 1. `f` will always be inlined if it is fully applied,\r\n 2. `f` will be specialised when possible,\r\n 3. specialisations of `f` will also be `INLINE [n]`.\r\n\r\nI don't think it's possible to achieve this effect at the moment. If we treated `INLINE [n]` as `INLINABLE [n]` until the function is fully applied, we would get exactly this, though, except for 3 which probably isn't too hard to implement.\r\n\r\nNow, if I understand correctly, INLINABLE also means that GHC is free to inline the function if it wants but the function isn't treated as cheap. I think it would be fine if we did this for unsaturated `INLINE` functions rather than not inlining them under any circumstances. The original motivation for only inlining when fully applied was code bloat in DPH. But IIRC that only happened because INLINE functions were always cheap and so GHC was very keen to inline them even when they weren't saturated which wouldn't be the case with my proposal. We would have to check how this affects DPH and related libraries, though.","type_of_failure":"OtherFailure","blocking":[]} -->8.0.1https://gitlab.haskell.org/ghc/ghc/-/issues/7063Register allocators can't handle non-uniform register sets2019-07-07T18:51:34ZSimon MarlowRegister allocators can't handle non-uniform register setsNeither the linear scan register allocator nor the graph-colouring allocator can properly handle the fact that some registers on x86 have 8 and 16-bit versions and some don't. 
We got away with this until now because the only free registe...Neither the linear scan register allocator nor the graph-colouring allocator can properly handle the fact that some registers on x86 have 8 and 16-bit versions and some don't. We got away with this until now because the only free registers on x86 were `%eax`, `%ecx` and `%edx`, but now we can also treat `%esi` as free when it isn't being used for R1 (see f857f0741515b9ebf186beb38fe64448de355817). However, `%esi` doesn't have an 8-bit version, so we cannot treat it as allocatable because the register allocator will try to use it when an 8-bit register is needed (see 105754792adac0802a9a59b0df188b58fb53503f).
LLVM doesn't have this problem, so one workaround is to compile with `-fllvm` to get the extra register(s) on x86.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | -------------- |
| Version | 7.4.2 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler (NCG) |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Register allocators can't handle non-uniform register sets","status":"New","operating_system":"","component":"Compiler (NCG)","related":[],"milestone":"7.8.3","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"7.4.2","keywords":[],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Bug","description":"Neither the linear scan register allocator nor the graph-colouring allocator can properly handle the fact that some registers on x86 have 8 and 16-bit versions and some don't. We got away with this until now because the only free registers on x86 were `%eax`, `%ecx` and `%edx`, but now we can also treat `%esi` as free when it isn't being used for R1 (see f857f0741515b9ebf186beb38fe64448de355817). However, `%esi` doesn't have an 8-bit version, so we cannot treat it as allocatable because the register allocator will try to use it when an 8-bit register is needed (see 105754792adac0802a9a59b0df188b58fb53503f).\r\n\r\nLLVM doesn't have this problem, so one workaround is to compile with `-fllvm` to get the extra register(s) on x86.","type_of_failure":"OtherFailure","blocking":[]} -->8.0.1benlbenlhttps://gitlab.haskell.org/ghc/ghc/-/issues/5954Performance regression 7.0 -> 7.2 (still in 7.4)2019-07-07T18:52:52ZSimon MarlowPerformance regression 7.0 -> 7.2 (still in 7.4)The program in `nofib/parallel/blackscholes` regressed quite badly in performance between 7.0.x and 7.2.1. This is just sequential performance, no parallelism.
With 7.0:
```
3,084,786,008 bytes allocated in the heap
5,150,592 bytes copied during GC
33,741,048 bytes maximum residency (7 sample(s))
1,541,904 bytes maximum slop
64 MB total memory in use (2 MB lost due to fragmentation)
Generation 0: 5760 collections, 0 parallel, 0.08s, 0.08s elapsed
Generation 1: 7 collections, 0 parallel, 0.01s, 0.01s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 17.43s ( 17.47s elapsed)
GC time 0.09s ( 0.09s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 17.53s ( 17.56s elapsed)
```
With 7.2.2:
```
3,062,127,752 bytes allocated in the heap
4,714,784 bytes copied during GC
34,370,232 bytes maximum residency (7 sample(s))
1,553,968 bytes maximum slop
64 MB total memory in use (2 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 5781 colls, 0 par 0.08s 0.08s 0.0000s 0.0006s
Gen 1 7 colls, 0 par 0.01s 0.01s 0.0014s 0.0017s
INIT time 0.00s ( 0.00s elapsed)
MUT time 23.93s ( 23.93s elapsed)
GC time 0.09s ( 0.09s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 24.02s ( 24.03s elapsed)
```
and with 7.4.1:
```
3,061,924,144 bytes allocated in the heap
4,733,760 bytes copied during GC
34,210,896 bytes maximum residency (7 sample(s))
1,552,640 bytes maximum slop
64 MB total memory in use (2 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 5781 colls, 0 par 0.08s 0.08s 0.0000s 0.0007s
Gen 1 7 colls, 0 par 0.01s 0.01s 0.0015s 0.0017s
INIT time 0.00s ( 0.00s elapsed)
MUT time 23.90s ( 23.91s elapsed)
GC time 0.09s ( 0.09s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 24.00s ( 24.00s elapsed)
```
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.4.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | high |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Performance regression 7.0 -> 7.2 (still in 7.4)","status":"New","operating_system":"","component":"Compiler","related":[],"milestone":"7.4.2","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"7.4.1","keywords":[],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Bug","description":"The program in `nofib/parallel/blackscholes` regressed quite badly in performance between 7.0.x and 7.2.1. This is just sequential performance, no parallelism.\r\n\r\nWith 7.0:\r\n\r\n{{{\r\n 3,084,786,008 bytes allocated in the heap\r\n 5,150,592 bytes copied during GC\r\n 33,741,048 bytes maximum residency (7 sample(s))\r\n 1,541,904 bytes maximum slop\r\n 64 MB total memory in use (2 MB lost due to fragmentation)\r\n\r\n Generation 0: 5760 collections, 0 parallel, 0.08s, 0.08s elapsed\r\n Generation 1: 7 collections, 0 parallel, 0.01s, 0.01s elapsed\r\n\r\n INIT time 0.00s ( 0.00s elapsed)\r\n MUT time 17.43s ( 17.47s elapsed)\r\n GC time 0.09s ( 0.09s elapsed)\r\n EXIT time 0.00s ( 0.00s elapsed)\r\n Total time 17.53s ( 17.56s elapsed)\r\n}}}\r\n\r\nWith 7.2.2:\r\n\r\n{{{\r\n 3,062,127,752 bytes allocated in the heap\r\n 4,714,784 bytes copied during GC\r\n 34,370,232 bytes maximum residency (7 sample(s))\r\n 1,553,968 bytes maximum slop\r\n 64 MB total memory in use (2 MB lost due to fragmentation)\r\n\r\n Tot time (elapsed) Avg pause Max pause\r\n Gen 0 5781 colls, 0 par 0.08s 0.08s 0.0000s 0.0006s\r\n Gen 1 7 colls, 0 par 0.01s 0.01s 0.0014s 0.0017s\r\n\r\n INIT time 0.00s ( 0.00s elapsed)\r\n MUT time 23.93s ( 23.93s elapsed)\r\n GC time 0.09s ( 0.09s elapsed)\r\n EXIT time 0.00s ( 0.00s elapsed)\r\n Total time 24.02s ( 24.03s elapsed)\r\n}}}\r\n\r\nand with 7.4.1:\r\n\r\n{{{\r\n 3,061,924,144 bytes allocated in the heap\r\n 4,733,760 bytes copied during GC\r\n 34,210,896 bytes maximum residency (7 sample(s))\r\n 1,552,640 bytes maximum slop\r\n 64 MB total memory in use (2 MB lost due to fragmentation)\r\n\r\n Tot time (elapsed) Avg pause Max pause\r\n Gen 0 5781 colls, 0 par 0.08s 0.08s 0.0000s 0.0007s\r\n Gen 1 7 colls, 0 par 0.01s 0.01s 0.0015s 0.0017s\r\n\r\n INIT time 0.00s ( 0.00s elapsed)\r\n MUT time 23.90s ( 23.91s elapsed)\r\n GC time 0.09s ( 0.09s elapsed)\r\n EXIT time 0.00s ( 0.00s elapsed)\r\n Total time 24.00s ( 24.00s elapsed)\r\n}}}\r\n","type_of_failure":"OtherFailure","blocking":[]} -->8.0.1Simon Peyton JonesSimon Peyton Joneshttps://gitlab.haskell.org/ghc/ghc/-/issues/5567LLVM: Improve alias analysis / performance2022-07-13T21:11:11ZdtereiLLVM: Improve alias analysis / performance- LLVM doesn't generate as good as code as we feel it should in many situations
- Why?
- We've often felt it's an alias analysis issue.
- I'm a little more doubtful of that than others (I feel it's part of the bigger problem, not the whole thing).
- I think there may be some register allocation / instruction selection / live range splitting issue going on.
- We could also do with looking at what optimisation passes we should run and in what order...
Here is some work Max did on the alias issue; his results for nofib weren't good:
http://blog.omega-prime.co.uk/?p=135
So this ticket is just a high-level ticket about figuring out and improving the performance of the LLVM backend.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | --------------- |
| Version | |
| Type | Task |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler (LLVM) |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"LLVM: Improve alias analysis / performance","status":"New","operating_system":"","component":"Compiler (LLVM)","related":[],"milestone":"","resolution":"Unresolved","owner":{"tag":"OwnedBy","contents":"dterei"},"version":"","keywords":[],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Task","description":" * LLVM doesn't generate as good as code as we feel it should in many situations\r\n * Why?\r\n * We've often felt its a alias anlysis issue.\r\n * I'm a little more doubtful of that than others (I feel its part of the bigger problem, not the whole thing).\r\n * I think there may be some register allocation / instruction selection / live range splitting issue going on.\r\n * We could also do with looking at what optimisation passes we should run and in what order...\r\n\r\nHere is some work Max did on the alias issue, his results for nofib weren't good:\r\n\r\nhttp://blog.omega-prime.co.uk/?p=135\r\n\r\nSo this ticket is just a high level ticket about figuring out and improving the performance of LLVM backend.","type_of_failure":"OtherFailure","blocking":[]} -->8.0.1dtereidtereihttps://gitlab.haskell.org/ghc/ghc/-/issues/5444Slow 64-bit primops on 32 bit system2022-11-02T16:24:07ZKhudyakovSlow 64-bit primops on 32 bit systemGHC primops for 64-bit arithmetic are implemented as FFI calls. It leads to serious performance penalty for 32 bit code which heavily uses 64-bit arithmetics.
I found this while investigating the poor performance of mwc-random on 32-bit systems. The 32-bit build runs 3-4 times slower than the 64-bit build on the same hardware. It's difficult to estimate how much faster an optimal implementation would run, since one doesn't exist, but the current code probably carries at least a 2x slowdown.
Here is a simple program to demonstrate the issue:
```
sqr64 :: Int32 -> Int64
sqr64 x = y * y where y = fromIntegral x
```
Here is the optimized Core:
```
$wsqr64 :: Int# -> Int64
$wsqr64 =
\ (ww_sGO :: Int#) ->
case {__pkg_ccall ghc-prim hs_intToInt64 Int#
-> State# RealWorld -> (# State# RealWorld, Int64# #)}_aFY
ww_sGO realWorld#
of _ { (# _, ds2_aG2 #) ->
case {__pkg_ccall ghc-prim hs_timesInt64 Int64#
-> Int64# -> State# RealWorld -> (# State# RealWorld, Int64# #)}_aGc
ds2_aG2 ds2_aG2 realWorld#
of _ { (# _, ds4_aGi #) ->
I64# ds4_aGi
}
}
sqr64 :: Int32 -> Int64
sqr64 = \ (w_sGM :: Int32) ->
case w_sGM of _ { I32# ww_sGO -> $wsqr64 ww_sGO }
```
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.2.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Slow 64-bit primops on 32 bit system","status":"New","operating_system":"","component":"Compiler","related":[],"milestone":"","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"7.2.1","keywords":[],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Bug","description":"GHC primops for 64-bit arithmetic are implemented as FFI calls. It leads to serious performance penalty for 32 bit code which heavily uses 64-bit arithmetics.\r\n\r\nI found this while investigating poor performance of mwc-random on 32-bit systems. 32-bit build runs 3-4 times slower than 64-bit build on the same hardware. It's difficult to estimate how faster would run optimal implementation since it doesn't exist. But it's probably at least 2x slowdown.\r\n\r\n\r\nHere is simple program to demonstrate issue\r\n{{{\r\nsqr64 :: Int32 -> Int64\r\nsqr64 x = y * y where y = fromIntegral x\r\n}}}\r\n\r\nHere is optimized core\r\n{{{\r\n$wsqr64 :: Int# -> Int64\r\n$wsqr64 =\r\n \\ (ww_sGO :: Int#) ->\r\n case {__pkg_ccall ghc-prim hs_intToInt64 Int#\r\n -> State# RealWorld -> (# State# RealWorld, Int64# #)}_aFY\r\n ww_sGO realWorld#\r\n of _ { (# _, ds2_aG2 #) ->\r\n case {__pkg_ccall ghc-prim hs_timesInt64 Int64#\r\n -> Int64# -> State# RealWorld -> (# State# RealWorld, Int64# #)}_aGc\r\n ds2_aG2 ds2_aG2 realWorld#\r\n of _ { (# _, ds4_aGi #) ->\r\n I64# ds4_aGi\r\n }\r\n }\r\n\r\nsqr64 :: Int32 -> Int64\r\nsqr64 = \\ (w_sGM :: Int32) ->\r\n case w_sGM of _ { I32# ww_sGO -> $wsqr64 ww_sGO }\r\n}}}","type_of_failure":"OtherFailure","blocking":[]} -->8.0.1https://gitlab.haskell.org/ghc/ghc/-/issues/4945Another SpecConstr infelicity2021-03-31T02:36:04ZbatterseapowerAnother SpecConstr infelicityI'm beginning to sound like a broken record, but SpecConstr still doesn't seem to be right! The last problem has been fixed, but I've found a new one.
Please observe the output of compiling the attached code with:
```
./ghc -fforce-recomp -c -dverbose-core2core -O2 -fno-liberate-case STUArray-Rewrite2.hs
```
In the output of SpecConstr we have a local letrec:
```
(letrec {
$wa_s1G7 [Occ=LoopBreaker]
:: forall s_aJm.
Data.Array.Base.STUArray
s_aJm GHC.Types.Int GHC.Types.Int
-> GHC.Prim.Int#
-> GHC.Prim.State# s_aJm
-> (# GHC.Prim.State# s_aJm, () #)
[LclId, Arity=3, Str=DmdType LLL]
$wa_s1G7 =
\ (@ s_aJm)
(w_s1FS
:: Data.Array.Base.STUArray
s_aJm GHC.Types.Int GHC.Types.Int)
(ww_s1FV :: GHC.Prim.Int#)
(w_s1FX :: GHC.Prim.State# s_aJm) ->
case GHC.Prim.># ww_s1FV ww_s1FN
of wild_Xj [Dmd=Just A] {
GHC.Types.False ->
case w_s1FS
of wild_aTj [Dmd=Just L]
{ Data.Array.Base.STUArray ds1_aTl [Dmd=Just U]
ds2_aTm [Dmd=Just U]
n_aTn [Dmd=Just U(L)]
ds3_aTo [Dmd=Just A] ->
case n_aTn
of wild_aTs [Dmd=Just A]
{ GHC.Types.I# x_aTu [Dmd=Just L] ->
case $wa_s1G0
@ s_aJm
w_s1FS
(GHC.Types.I# ww_s1FV)
0
(GHC.Prim.-# x_aTu 1)
w_s1FX
of wild_XUw [Dmd=Just A]
{ (# new_s_XUB [Dmd=Just L], r_XUD [Dmd=Just A] #) ->
$wa_s1G7
@ s_aJm w_s1FS (GHC.Prim.+# ww_s1FV 1) new_s_XUB
}
}
};
GHC.Types.True -> (# w_s1FX, GHC.Unit.() #)
}; } in
$wa_s1G7)
```
This is a local recursive loop with an invariant first argument (w_s1FS) that is re-scrutinised every time! This seems deeply uncool.
This is with HEAD (7.1.20110203, incorporating the patch "Fix typo in SpecConstr that made it not work at all")
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.0.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Another SpecConstr infelicity","status":"New","operating_system":"","component":"Compiler","related":[],"milestone":"","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"7.0.1","keywords":[],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Bug","description":"I'm beginning to sound like a broken record, but SpecConstr still doesn't seem to be right! The last problem has been fixed, but I've found a new one.\r\n\r\nPlease observe the output of compiling the attached code with:\r\n\r\n{{{\r\n./ghc -fforce-recomp -c -dverbose-core2core -O2 -fno-liberate-case STUArray-Rewrite2.hs\r\n}}}\r\n\r\nIn the output of SpecConstr we have a local letrec:\r\n\r\n{{{\r\n(letrec {\r\n $wa_s1G7 [Occ=LoopBreaker]\r\n :: forall s_aJm.\r\n Data.Array.Base.STUArray\r\n s_aJm GHC.Types.Int GHC.Types.Int\r\n -> GHC.Prim.Int#\r\n -> GHC.Prim.State# s_aJm\r\n -> (# GHC.Prim.State# s_aJm, () #)\r\n [LclId, Arity=3, Str=DmdType LLL]\r\n $wa_s1G7 =\r\n \\ (@ s_aJm)\r\n (w_s1FS\r\n :: Data.Array.Base.STUArray\r\n s_aJm GHC.Types.Int GHC.Types.Int)\r\n (ww_s1FV :: GHC.Prim.Int#)\r\n (w_s1FX :: GHC.Prim.State# s_aJm) ->\r\n case GHC.Prim.># ww_s1FV ww_s1FN\r\n of wild_Xj [Dmd=Just A] {\r\n GHC.Types.False ->\r\n case w_s1FS\r\n of wild_aTj [Dmd=Just L]\r\n { Data.Array.Base.STUArray ds1_aTl [Dmd=Just U]\r\n ds2_aTm [Dmd=Just U]\r\n n_aTn [Dmd=Just U(L)]\r\n ds3_aTo [Dmd=Just A] ->\r\n case n_aTn\r\n of wild_aTs [Dmd=Just A]\r\n { GHC.Types.I# x_aTu [Dmd=Just L] ->\r\n case $wa_s1G0\r\n @ s_aJm\r\n w_s1FS\r\n (GHC.Types.I# ww_s1FV)\r\n 0\r\n (GHC.Prim.-# x_aTu 1)\r\n w_s1FX\r\n of wild_XUw [Dmd=Just A]\r\n { (# new_s_XUB [Dmd=Just L], r_XUD [Dmd=Just A] #) ->\r\n $wa_s1G7\r\n @ s_aJm w_s1FS (GHC.Prim.+# ww_s1FV 1) new_s_XUB\r\n }\r\n }\r\n };\r\n GHC.Types.True -> (# w_s1FX, GHC.Unit.() #)\r\n }; } in\r\n $wa_s1G7)\r\n}}}\r\n\r\nThis is a local recursive loop with an invariant first argument (w_s1FS) that is recrutinised every time! This seems deeply uncool.\r\n\r\nThis is with HEAD (7.1.20110203, incorporating the patch \"Fix typo in SpecConstr that made it not work at all\")","type_of_failure":"OtherFailure","blocking":[]} -->8.0.1https://gitlab.haskell.org/ghc/ghc/-/issues/5326Polymorphic instances aren't automatically specialised2019-07-07T18:55:53ZreinerpPolymorphic instances aren't automatically specialisedRelated to #255. Given (roughly) the example from that ticket:
```
f :: (Storable a, Eq a) => T a
g :: T (Ptr a)
g = f
```
we find that g is not specialised. Adding a SPECIALISE pragma fixes this, but it ought not to be necessary.
The following module is a complete example:
```
module C where
class C a where f :: a -> a
newtype Id a = Id a
instance C (Id a) where f = id
g :: C a => Int -> a -> a
g 0 a = a
g n a = g (n-1) (f a)
h :: Int -> Id a -> Id a
h = g
j :: Int -> Id Int -> Id Int
j = g
```
We find that h passes a dictionary to g:
```
C.h =
\ (@ a_alq) (w_smq :: Int) (w1_smu :: C.Id a_alq) ->
case w_smq of _ { I# ww_sms ->
C.$wg @ (C.Id a_alq) (C.$fCId @ a_alq) ww_sms w1_smu
}
```
whereas j is specialised as desired, getting its own worker and no dictionaries:
```
C.j =
\ (w_smg :: Int) (w1_smk :: C.Id Int) ->
case w_smg of _ { I# ww_smi ->
(C.$wj ww_smi w1_smk)
`cast` (sym (C.NTCo:Id Int)
:: Int ~ C.Id Int)
}
```
If we add
```
{-# SPECIALISE g :: Int -> Id a -> Id a #-}
```
then h is specialised as desired.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.0.3 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Polymorphic instances aren't automatically specialised","status":"New","operating_system":"","component":"Compiler","related":[],"milestone":"","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"7.0.3","keywords":["specialisation"],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Bug","description":"Related to #255. Given (roughly) the example from that ticket:\r\n\r\n{{{\r\nf :: (Storable a, Eq a) => T a\r\n\r\ng :: T (Ptr a)\r\ng = f\r\n}}}\r\nwe find that g is not specialised. Adding a SPECIALISE pragma fixes this, but it ought not to be necessary.\r\n\r\nThe following module is a complete example:\r\n\r\n{{{\r\nmodule C where\r\n\r\nclass C a where f :: a -> a\r\n\r\nnewtype Id a = Id a\r\ninstance C (Id a) where f = id\r\n\r\ng :: C a => Int -> a -> a\r\ng 0 a = a\r\ng n a = g (n-1) (f a)\r\n\r\nh :: Int -> Id a -> Id a\r\nh = g\r\n\r\nj :: Int -> Id Int -> Id Int\r\nj = g\r\n}}}\r\n\r\nWe find that h passes a dictionary to g:\r\n\r\n{{{\r\nC.h =\r\n \\ (@ a_alq) (w_smq :: Int) (w1_smu :: C.Id a_alq) ->\r\n case w_smq of _ { I# ww_sms ->\r\n C.$wg @ (C.Id a_alq) (C.$fCId @ a_alq) ww_sms w1_smu\r\n }\r\n}}}\r\n\r\nwhereas j is specialised as desired, getting its own worker and no dictionaries:\r\n\r\n{{{\r\nC.j =\r\n \\ (w_smg :: Int) (w1_smk :: C.Id Int) ->\r\n case w_smg of _ { I# ww_smi ->\r\n (C.$wj ww_smi w1_smk)\r\n `cast` (sym (C.NTCo:Id Int)\r\n :: Int ~ C.Id Int)\r\n }\r\n}}}\r\n\r\nIf we add\r\n\r\n{{{\r\n{-# SPECIALISE g :: Int -> Id a -> Id a #-}\r\n}}}\r\n\r\nthen h is specialised as desired.","type_of_failure":"OtherFailure","blocking":[]} -->8.0.1Simon Peyton JonesSimon Peyton Joneshttps://gitlab.haskell.org/ghc/ghc/-/issues/5171Misfeature of Cmm optimiser: no way to extract a branch of expression into a ...2019-07-07T18:56:35ZrtvdMisfeature of Cmm optimiser: no way to extract a branch of expression into a separate statementAFAIK, optimisations in Cmm are performed using cmmMachOpFold.
However, this function is pure and does not allow detaching a branch of expression in order to make sure that it is executed only once.
Let's say we have (Op1 arg1 arg2) and we want to transform it to (Op2 arg1 (Op3 arg2 arg1)). Doing this would mean that arg1 would be computed more than once. Instead, the following should be possible:
```
arg1_reg <- arg1
(Op2 arg1_reg (Op3 arg2 arg1_reg))
```
The lack of this feature already stops one of the optimisations from happening in most cases. See:
```
CmmReg _ <- x -> -- We duplicate x below, hence require
-- it is a reg. FIXME: remove this restriction.
```
in CmmOpt.hs.
There are a number of other useful optimisations which can be implemented only with this feature available.

Milestone: 8.0.1

https://gitlab.haskell.org/ghc/ghc/-/issues/4308
LLVM compiles Updates.cmm badly (dterei)

Simon M. reported that compiling rts/Updates.cmm on x86-64 with the LLVM backend produced some pretty bad code. The NCG produces this:
```
stg_upd_frame_info:
.Lco:
movq 8(%rbp),%rax
addq $16,%rbp
movq %rbx,8(%rax)
movq $stg_BLACKHOLE_info,0(%rax)
movq %rax,%rcx
andq $-1048576,%rcx
movq %rax,%rdx
andq $1044480,%rdx
shrq $6,%rdx
orq %rcx,%rdx
cmpw $0,52(%rdx)
jne .Lcf
jmp *0(%rbp)
.Lcf:
[...]
```
The LLVM backend produces this though:
```
stg_upd_frame_info: # @stg_upd_frame_info
# BB#0: # %co
subq $104, %rsp
movq 8(%rbp), %rax
movq %rax, 24(%rsp) # 8-byte Spill
movq %rbx, 8(%rax)
mfence
movq $stg_BLACKHOLE_info, (%rax)
movq %rax, %rcx
andq $-1048576, %rcx # imm = 0xFFFFFFFFFFF00000
andq $1044480, %rax # imm = 0xFF000
shrq $6, %rax
addq %rcx, %rax
addq $16, %rbp
cmpw $0, 52(%rax)
movsd %xmm6, 88(%rsp) # 8-byte Spill
movsd %xmm5, 80(%rsp) # 8-byte Spill
movss %xmm4, 76(%rsp) # 4-byte Spill
movss %xmm3, 72(%rsp) # 4-byte Spill
movss %xmm2, 68(%rsp) # 4-byte Spill
movss %xmm1, 64(%rsp) # 4-byte Spill
movq %r9, 56(%rsp) # 8-byte Spill
movq %r8, 48(%rsp) # 8-byte Spill
movq %rdi, 40(%rsp) # 8-byte Spill
movq %rsi, 32(%rsp) # 8-byte Spill
je .LBB1_4
```
This has two main problems:
1. The mfence instruction (write barrier) isn't required (write-write barriers aren't required on x86).
1. The LLVM backend is spilling a lot of stuff unnecessarily.
Both of these, I think, are fairly easy fixes. LLVM handles write barriers quite naively at the moment, so 1 is easy. The spilling problem, I think, is related to a previous fix I made where I need to explicitly kill some of the STG registers if they aren't live across the call; otherwise LLVM rightly thinks they are live, since I always pass the STG registers around (so they're live on entry and exit of every function unless I kill them).

Milestone: 8.0.1