GHC issues — https://gitlab.haskell.org/ghc/ghc/-/issues

## Issue 11783: Very large slowdown when using parallel garbage collector
https://gitlab.haskell.org/ghc/ghc/-/issues/11783 · luispedro · 2019-07-07

As part of debugging some performance issues in an application I am writing, I concluded that the problem lies in the parallel GC implemented in the GHC RTS. I extracted the attached code to make a self-contained test case; on my system the code runs in 16s when using a single thread, in 18s when using 6 threads but no parallel GC, and in over a minute when using 6 threads with parallel GC!
The true slowdown in the full code is actually worse and matters for the application (some steps take \>1 hour instead of \<1 minute!). Parts of the code do take full advantage of parallel processing; this is just one simple test case.
It seems worse on some machines than on others, and the input file (data.txt) needs to be quite large for the problem to really show up (the attached script generates a 16 million input file; this is still smaller than some of my real use cases, but I couldn't trigger it with only 1 million). Similarly, with 4 threads the slowdown is detectable, but not as large.
While running, CPU usage is very high (I tested with 16 threads and it uses 16 CPUs continuously; top reports 1600% CPU).
Using `+RTS -A64m` is another way around the issue, but for the full application it is still not as effective as `+RTS -qg`, so there still seems to be a performance issue here.

(Milestone: 8.0.1 · Assignee: Simon Marlow)

## Issue 8832: Constant-folding regression wrt `clearBit (bit 0) 0`
https://gitlab.haskell.org/ghc/ghc/-/issues/8832 · Herbert Valerio Riedel (hvr@gnu.org) · 2019-07-07

While implementing `zeroBits` (see \[83bd2f5fc7e/base\]) I noticed that constant folding of the expression `clearBit (bit 0) 0` regressed (and improved at the same time) from GHC 7.6.3 to GHC 7.8.1. Specifically, the following module
```haskell
{-# LANGUAGE CPP #-}
module M where
import Data.Bits
import Data.Int
import Data.Word
#define T(s,T) \
s :: T ; \
s = clearBit (bit 0) 0 ; \
T(i,Int)
T(i8,Int8)
T(i16,Int16)
T(i32,Int32)
T(i64,Int64)
T(w,Word)
T(w8,Word8)
T(w16,Word16)
T(w32,Word32)
T(w64,Word64)
T(z,Integer)
```
compiled with GHC 7.8.1RC2 results in the following Core output:
```haskell
-- GHC 7.8.1RC2
i = I# (andI# 1 (notI# 1))
i8 = I8# 0
i16 = I16# 0
i32 = I32# 0
i64 = I64# 0
w = W# (__word 0)
w8 = W8# (__word 0)
w16 = W16# (__word 0)
w32 = W32# (__word 0)
w64 = W64# (__word 0)
z2 = $w$cbit 0
z1 = complementInteger z2
z = andInteger z2 z1
```
Thus, `i` and `z` are not properly constant-folded in GHC 7.8.1RC2. With GHC 7.6.3, however, `i` and `z` were properly folded to `0`:
```haskell
-- GHC 7.6.3
i = I# 0
i8 =
case $fBitsInt8_$cbit i of _ { I8# x#_aDf ->
case $fBitsInt8_$cbit i of _ { I8# x#1_aDr ->
I8#
(word2Int#
(and#
(int2Word# x#_aDf)
(xor# (int2Word# x#1_aDr) (__word 18446744073709551615))))
}
}
i16,i32,i64 -- equivalent to i8
w = W# (__word 0)
w8 =
case $fBitsWord8_$cbit i of _ { W8# x#_aEV ->
case $fBitsWord8_$cbit i of _ { W8# x#1_aF5 ->
W8# (and# x#_aEV (xor# x#1_aF5 (__word 255)))
}
}
w16,w32,w64 -- equivalent to w8
z = __integer 0
```
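For reference, `clearBit b i` is defined via `b .&. complement (bit i)`, so the expression should fold to `0` at every type. A quick runtime check (my own illustration, not from the ticket) confirms the value that full constant folding ought to produce:

```haskell
import Data.Bits (bit, clearBit)

main :: IO ()
main = do
  -- clearBit b i = b .&. complement (bit i), so
  -- clearBit (bit 0) 0 = 1 .&. complement 1 = 0 at every width.
  print (clearBit (bit 0) 0 :: Int)      -- 0
  print (clearBit (bit 0) 0 :: Integer)  -- 0
```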
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.8.1-rc2 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
(Milestone: 8.0.1 · Assignee: Ben Gamari)

## Issue 11795: Performance issues with replicateM_
https://gitlab.haskell.org/ghc/ghc/-/issues/11795 · Michael Snoyman (michael@snoyman.com) · 2022-10-20

When working on optimizing a program by minimizing allocations, I ran into an issue with `replicateM_`. Consider the following code:
```hs
import Control.Monad (replicateM_)
import Foreign.C.String (withCString)
import Foreign.Storable (peek)
main :: IO ()
main = withCString "foo" $ replicateM_ 10000000 . peek
```
When I run this program, I get:
160,042,656 bytes allocated in the heap
The result is the same whether I compile with `-O0`, `-O`, or `-O2`. And as expected, the total allocation increases or decreases based on the number of times I replicate the action. On the other hand, replacing `replicateM_` with a hand-written version makes the total allocations for the program only 42KB, and they do not increase with the number of replications.
```hs
replicateM_ :: Monad m => Int -> m a -> m ()
replicateM_ cnt0 f =
loop cnt0
where
loop cnt
| cnt <= 0 = return ()
| otherwise = f >> loop (cnt - 1)
```
By contrast, `Control.Monad.replicateM_` looks like:
```hs
replicateM_ :: (Monad m) => Int -> m a -> m ()
{-# INLINEABLE replicateM_ #-}
{-# SPECIALISE replicateM_ :: Int -> IO a -> IO () #-}
{-# SPECIALISE replicateM_ :: Int -> Maybe a -> Maybe () #-}
replicateM_ n x = sequence_ (replicate n x)
```
I can't see an advantage to this implementation over the more direct implementation I've provided above. Unless there are objections, I'll send a patch to switch the implementation. (Since master already uses `Applicative`, I'll make the relevant updates to generalize the function signature too.)
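The proposed direct loop, generalized to `Applicative`, would look roughly like this (a sketch, not the actual patch; the name `replicateA_` is mine, chosen to avoid clashing with `Control.Monad.replicateM_`):

```haskell
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Direct counting loop; needs only Applicative, not Monad,
-- and allocates no intermediate list.
replicateA_ :: Applicative f => Int -> f a -> f ()
replicateA_ cnt0 f = loop cnt0
  where
    loop cnt
      | cnt <= 0  = pure ()
      | otherwise = f *> loop (cnt - 1)

main :: IO ()
main = do
  ref <- newIORef (0 :: Int)
  replicateA_ 1000 (modifyIORef' ref (+ 1))
  readIORef ref >>= print  -- prints 1000
```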
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | -------------- |
| Version | 7.10.3 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | libraries/base |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
(Milestone: 8.0.1 · Assignee: Michael Snoyman)

## Issue 11725: Performance Regression from 7.8.3 to 7.10.3
https://gitlab.haskell.org/ghc/ghc/-/issues/11725 · Dominic · 2019-07-07

Some time ago I wrote a little test program for random number generation. Under 7.8.3 I get
```
Total time 9.03s ( 9.15s elapsed)
```
Under 7.10.3 I get
```
Total time 24.773s ( 25.288s elapsed)
```
Now of course it could be the libraries that I am using rather than GHC itself, so I have tried to make them as similar as possible. For 7.8.3 I have
```
build-depends: base ==4.7.0.1,
               mtl ==2.1.3.1,
               primitive == 0.6,
               mwc-random == 0.13.3.2,
               vector == 0.10.12.3,
               random ==1.1,
               random-fu == 0.2.6.2,
               random-source == 0.3.0.6
```
For 7.10.3 I have
```
build-depends: base ==4.8.2.0,
               mtl ==2.2,
               primitive == 0.6,
               mwc-random == 0.13.3.2,
               vector == 0.10.12.3,
               random ==1.1,
               random-fu == 0.2.6.2,
               random-source == 0.3.0.6
```
So the only differences are in mtl and base. I don’t seem to be able to coax cabal into using mtl-2.2 for 7.8.3 with all the other required libraries.
```
dominic@ghcPerformance:~$ cabal install 'random-source ==0.3.0.6' 'random-fu ==0.2.6.2' 'random ==1.1' 'primitive ==0.6' 'mwc-random ==0.13.3.2' 'mtl ==2.2' 'vector ==0.10.12.3' --with-ghc=ghc-7.10.3
Resolving dependencies...
All the requested packages are already installed:
mtl-2.2
mwc-random-0.13.3.2
primitive-0.6
random-1.1
random-fu-0.2.6.2
random-source-0.3.0.6
vector-0.10.12.3
Use --reinstall if you want to reinstall anyway.
dominic@ghcPerformance:~$ cabal install 'random-source ==0.3.0.6' 'random-fu ==0.2.6.2' 'random ==1.1' 'primitive ==0.6' 'mwc-random ==0.13.3.2' 'mtl ==2.2' 'vector ==0.10.12.3' --with-ghc=ghc-7.8.3
Resolving dependencies...
cabal: Could not resolve dependencies:
trying: random-source-0.3.0.6/installed-70e... (user goal)
next goal: mtl (user goal)
rejecting: mtl-2.2.1, 2.2.0.1 (global constraint requires ==2.2)
rejecting: mtl-2.2/installed-cc5..., 2.2 (conflict: random-source =>
mtl==2.1.3.1/installed-8bc...)
rejecting: mtl-2.1.3.1/installed-8bc..., 2.1.3.1, 2.1.2, 2.1.1, 2.1, 2.0.1.1,
2.0.1.0, 2.0.0.0, 1.1.1.1, 1.1.1.0, 1.1.0.2, 1.1.0.1, 1.1.0.0, 1.0 (global
constraint requires ==2.2)
Backjump limit reached (change with --max-backjumps).
```
Cabal seems to be telling me that random-source-0.3.0.6 is the problem, but if I look at the constraints for that package at https://hackage.haskell.org/package/random-source I see
```
mtl (>=1 && <3)
```
I am not sure how to proceed from here. I’d like to solve this myself but I don’t want to start building versions of ghc to see which change caused the regression without first eliminating mtl.
Any ideas would be gratefully received.
```
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE FlexibleInstances #-}
import Data.Random
import Data.Random.Source
import qualified System.Random.MWC as MWC
import Control.Monad.Reader
import Control.Monad.Primitive
$(monadRandom [d|
    instance (PrimMonad m, s ~ PrimState m) => MonadRandom (ReaderT (MWC.Gen s) m) where
        getRandomWord16 = ask >>= lift . MWC.uniform
        getRandomWord32 = ask >>= lift . MWC.uniform
        getRandomWord64 = ask >>= lift . MWC.uniform
    |])

testUniform :: MonadRandom m => Int -> m [Double]
testUniform n = replicateM (fromIntegral n) (sample stdUniform)

n :: Int
n = 10^7

main :: IO ()
main = do
  seed <- MWC.create
  xs <- runReaderT (testUniform n) seed
  print (sum xs / fromIntegral n)
```
This cabal file will build this on 7.8.3
```
name: PerfTest8
version: 0.1.0.0
homepage: TBD
license: MIT
author: Dominic Steinitz
maintainer: idontgetoutmuch@gmail.com
category: System
build-type: Simple
cabal-version: >=1.10
executable Random8
  main-is:             TestMwcViaRandomSource.hs
  build-depends:       base ==4.7.0.1,
                       mtl ==2.1.3.1,
                       primitive == 0.6,
                       mwc-random == 0.13.3.2,
                       vector == 0.10.12.3,
                       random ==1.1,
                       random-fu == 0.2.6.2,
                       random-source == 0.3.0.6
  default-language:    Haskell2010
```
This cabal file will build this on 7.10.3
```
name: PerfTest10
version: 0.1.0.0
homepage: TBD
license: MIT
author: Dominic Steinitz
maintainer: idontgetoutmuch@gmail.com
category: System
build-type: Simple
cabal-version: >=1.10
executable Random10
  main-is:             TestMwcViaRandomSource.hs
  build-depends:       base ==4.8.2.0,
                       mtl ==2.2,
                       primitive == 0.6,
                       mwc-random == 0.13.3.2,
                       vector == 0.10.12.3,
                       random ==1.1,
                       random-fu == 0.2.6.2,
                       random-source == 0.3.0.6
  default-language:    Haskell2010
```
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.10.3 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
(Milestone: 8.0.1 · Assignee: Dominic)

## Issue 11710: Fusion of a simple listArray call is very fragile
https://gitlab.haskell.org/ghc/ghc/-/issues/11710 · Ben Gamari · 2019-07-07

Consider the following (taken from [ticket:11707\#comment:117750](https://gitlab.haskell.org//ghc/ghc/issues/11707#note_117750)):
```hs
module Test where
import Data.Array
arr, arr2 :: Array Int Int
arr = listArray (0,10) [ 1,1,1,1,1,1,1,1,1,1 ]
arr2 = listArray (0,10) [ 1,1,1,1,1,1,1,1,1,-1 ]
```
Given that these are small arrays, one might expect it to be worthwhile for GHC to fuse the lists with `listArray`, giving rise to two nicely unrolled construction procedures.
However, if you look at the Core produced by `-O1`, you'll find that this only happens in the case of `arr2`. `arr`, on the other hand, is mysteriously not fused. The fact that these expressions are so similar and yet produce entirely different code is quite worrying.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.10.3 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
(Milestone: 8.0.1)

## Issue 11707: Don't desugar large lists with build
https://gitlab.haskell.org/ghc/ghc/-/issues/11707 · Ben Gamari · 2019-07-07

When looking at `T9661` I noticed that a massive list was being inlined into `listArray`, where foldr/build fusion was being applied, turning a bunch of massive static data into a massive block of code.
It seems likely that keeping this as static data would be preferable here.
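For context, desugaring a list literal to a `build` form is what lets foldr/build fusion consume it; schematically (my own illustration, not from the ticket):

```haskell
import GHC.Exts (build)

-- A literal like [1,2,3] can be desugared to a build form,
-- so a following foldr (a "good consumer") can fuse with it.
-- For a large literal, this trades static data for code.
xs :: [Int]
xs = build (\c n -> c 1 (c 2 (c 3 n)))

main :: IO ()
main = print xs  -- [1,2,3]
```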
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.10.3 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
(Milestone: 8.0.1)

## Issue 11701: ghc generates significant slower code
https://gitlab.haskell.org/ghc/ghc/-/issues/11701 · HuStmpHrrr · 2019-07-07

I've already started a discussion on Stack Overflow: http://stackoverflow.com/questions/35941674/latest-ghc-generates-slower-code
So again, I realized that the latest version of ghc produces significantly slower code than older versions. My default ghc is the latest version as of now:
```
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3
```
I also have two other old versions installed on my local machine.
The test code is as follows:
```hs
import Data.Word
import Data.List
import System.Environment

collatzNext :: Word32 -> Word32
collatzNext a = (if even a then a else 3*a+1) `div` 2

-- new code
collatzLen :: Word32 -> Int
collatzLen a0 = lenIterWhile collatzNext (/= 1) a0

lenIterWhile :: (a -> a) -> (a -> Bool) -> a -> Int
lenIterWhile next notDone start = len start 0 where
    len n m = if notDone n
                then len (next n) (m+1)
                else m
-- End of new code

main = do
  [a0] <- getArgs
  let max_a0 = (read a0)::Word32
  print $ maximum $ map (\a0 -> (collatzLen a0, a0)) [1..max_a0]
```
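As a quick sanity check of the step-counting logic at a tiny scale (my own check, not part of the report): `collatzNext` combines each odd step with the following halving, which is why the step counts differ from the standard Collatz lengths.

```haskell
import Data.Word (Word32)

-- Same step function as in the report above.
collatzNext :: Word32 -> Word32
collatzNext a = (if even a then a else 3*a+1) `div` 2

-- Count steps until reaching 1 (equivalent to lenIterWhile above).
collatzLen :: Word32 -> Int
collatzLen = go 0
  where
    go m n = if n /= 1 then go (m+1) (collatzNext n) else m

main :: IO ()
main = do
  -- 6 -> 3 -> 5 -> 8 -> 4 -> 2 -> 1: six halved-Collatz steps
  print (collatzLen 6)  -- 6
  print (collatzLen 1)  -- 0
```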
Following are three runs on my machine:
```
$ ~/Tools/ghc-7.6.1/bin/ghc -O2 Test.hs
[1 of 1] Compiling Main ( Test.hs, Test.o )
Linking Test ...
$ time ./Test 1000000
(329,837799)
real 0m1.901s
user 0m1.896s
sys 0m0.000s
$ ~/Tools/ghc/bin/ghc -O2 Test.hs
[1 of 1] Compiling Main ( Test.hs, Test.o )
Linking Test ...
$ time ./Test 1000000
(329,837799)
real 0m10.562s
user 0m10.528s
sys 0m0.036s
$ ~/Tools/ghc-7.4.2/bin/ghc -O2 Test.hs
[1 of 1] Compiling Main ( Test.hs, Test.o )
Linking Test ...
$ time ./Test 1000000
(329,837799)
real 0m1.879s
user 0m1.876s
sys 0m0.000s
```
Obviously the latest version of ghc produces worse code than the two older versions.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.10.3 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
(Milestone: 8.0.1)

## Issue 11688: Bytestring break failing rewrite to breakByte and failing to eliminate boxing/unboxing
https://gitlab.haskell.org/ghc/ghc/-/issues/11688 · Alex Biehl · 2019-07-07

Tyson Whitehead gives a detailed description of the problem [here](https://github.com/haskell/bytestring/issues/70).
Key points:
> - Program fails to rewrite ` ByteString.break (== x) chunk ` to ` ByteString.breakByte x chunk `
> - Missed unboxing opportunity (which led to 34 GB of additional allocations, which is why he noticed it)
I can reproduce the problem on OSX and Windows with GHC 7.10.3. Haven't tested Linux.
I repost it here to make sure it doesn't go unnoticed.

(Milestone: 8.0.1)

## Issue 11533: Stack check not optimized out even if it could be
https://gitlab.haskell.org/ghc/ghc/-/issues/11533 · jscholl · 2019-07-07

CmmLayoutStack tries to remove a stack check if a function uses no stack space. It knows that Sp \< SpLim is always false, but not that Sp \>= SpLim is always true. However, the latter can arise when GHC flips the comparison (which it does sometimes).
An example would be the worker function generated for `go`:
```
countDown :: Int -> Int
countDown = go 0
  where
    go acc n | n > 0     = go (acc + 1) (n - 1)
             | otherwise = acc + n
```
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------------ |
| Version | 7.10.3 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler (CodeGen) |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Stack check not optimized out even if it could be","status":"New","operating_system":"","component":"Compiler (CodeGen)","related":[],"milestone":"","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"7.10.3","keywords":["cmm"],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Bug","description":"CmmLayoutStack tries to remove a stack check if a function uses no stack space. It knows that Sp < SpLim is always false, but not that Sp >= SpLim is always true. However, the latter can arise when GHC flips the comparison (which it does sometimes).\r\n\r\nAn example would be the worker function generated for {{{go}}}:\r\n\r\n{{{\r\ncountDown :: Int -> Int\r\ncountDown = go 0\r\n where\r\n go acc n | n > 0 = go (acc + 1) (n - 1)\r\n | otherwise = acc + n\r\n}}}","type_of_failure":"OtherFailure","blocking":[]} -->8.0.1https://gitlab.haskell.org/ghc/ghc/-/issues/11372Loopification does not trigger for IO even if it could2019-07-07T18:30:41ZjschollLoopification does not trigger for IO even if it couldThe loopification optimization, as I understand it, allows a self-recursive function to jump to a local label instead of the beginning of the function, thus skipping a potential stack check. However, I only observe it triggering for pure...The loopification optimization, as I understand it, allows a self-recursive function to jump to a local label instead of the beginning of the function, thus skipping a potential stack check. However, I only observe it triggering for pure functions while IO functions do not get that benefit, even when it would be possible.
I discovered this in #8793 after looking into an unexpected speedup from removing the IO context from an otherwise pure loop. While such a loop can simply be changed to a pure version (the `IOLoop` examples), other functions cannot, yet the optimization could still be applied to them (see `MapM.hs`).
I tried to benchmark the differences between a naive loop in IO and some horrible `inlinePerformIO` hacks to get the loopification to fire and the "optimized" version performs 3-5% faster on my machine.
See [Commentary/Compiler/Loopification](commentary/compiler/loopification) for details.

Milestone: 8.0.1

https://gitlab.haskell.org/ghc/ghc/-/issues/10788
**performance regression involving minimum** (2019-07-07, rwbarton)

This program (taken from http://stackoverflow.com/questions/32158319/difference-in-performance-for-coin-change-between-haskell-and-c) runs about 50% slower when compiled with `ghc-7.10.1 -O` compared to `ghc-7.8.4 -O`.
```hs
import Data.Vector.Unboxed as U ((!), constructN, length)

coinchangev :: Int -> [Int] -> Int
coinchangev n cs = v ! n
  where v = constructN (n+1) f
        f w = case U.length w of
                0 -> 0
                m -> 1 + minimum [w ! x | x <- map (m-) cs, x >= 0]

main = print $ coinchangev 10000000 [1, 5, 10, 25, 100]
```
However if I change `minimum` to `sum`, while the runtime in 7.8.4 is unchanged, the runtime in 7.10.1 drops by a factor of 5! Allocations also decrease by a large factor. So my guess is that something has gone wrong with call arity analysis for `minimum` (and gone very right for `sum`).
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.10.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
Milestone: 8.0.1 (assignee: Edward Kmett)

https://gitlab.haskell.org/ghc/ghc/-/issues/10750
**silly assembly for comparing Doubles** (2019-07-07, rwbarton)

A function like `f x = if x > 0.0 then ... else ...` compiles to the assembly
```
movsd 7(%rbx),%xmm0
xorpd %xmm1,%xmm1
ucomisd %xmm1,%xmm0
seta %al
movzbl %al,%eax
cmpq $1,%rax
jb _c1tb
```
This `seta/movzbl/cmpq/jb` is bad, we can just generate `ja` (and 7.10 did).
The cause is that the code generator produces Cmm like
```
switch [0 .. 1] %MO_F_Gt_W64(_s1sv::F64, 0.0 :: W64)::I64 {
case 0 : goto c1tb;
case 1 : goto c1tc;
}
```
which turns into
```
  if (%MO_F_Gt_W64(_s1sv::F64, 0.0 :: W64) < 1) goto c1tb; else goto c1tc;
```
and then GHC is stuck. It knows how to turn `condition >= 1` into `condition`, and it knows how to turn `condition < 1` into a negated version of `condition` when possible, but there is no negated version of `MO_F_Gt_W64` (it's not `MO_F_Le_W64` because of NaNs and signed zeros). It doesn't know how to turn a negated conditional into a conditional with the branches swapped because it doesn't look at that much at once.
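A small, self-contained illustration (ours, not from the ticket) of why the negation must be handled by swapping branches rather than by flipping the operator:

```hs
-- Demonstrates that not . (> 0.0) is not the same as (<= 0.0):
-- every ordered comparison against NaN is False, so negating the
-- *result* of (>) differs from replacing it with (<=).
main :: IO ()
main = do
  let nan = 0 / 0 :: Double
  print (not (nan > 0.0))  -- True
  print (nan <= 0.0)       -- False
```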
Presumably more fallout from #10137, and maybe can be fixed simultaneously with #10677.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------------ |
| Version | 7.11 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler (CodeGen) |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
Milestone: 8.0.1

https://gitlab.haskell.org/ghc/ghc/-/issues/10744
**Allow oneShot to work with unboxed types** (2019-07-07, takano-akio)

Currently `oneShot` requires that both the argument type and the return type of the given function have the kind `*`. It would be nice if I could use unlifted types there.
The following program demonstrates this:
```hs
{-# LANGUAGE MagicHash #-}
module Foo where
import GHC.Exts
import GHC.Magic
f0 :: Int -> Int
f0 = oneShot $ \n -> n -- OK
f1 :: Int# -> Int
f1 = oneShot $ \n# -> I# n# -- Error, the argument type is unlifted
f2 :: Int -> Int#
f2 = oneShot $ \(I# n#) -> n# -- Error, the result type is unlifted
```
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | -------------- |
| Version | 7.10.2 |
| Type | FeatureRequest |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
Milestone: 8.0.1

https://gitlab.haskell.org/ghc/ghc/-/issues/10717
**fannkuch-redux allocations increase by factor of 10000 between 7.4.2 and 7.6.3** (2019-07-07, Ben Gamari)

In a recent look at [historical](http://home.smart-cactus.org/~ben/nofib.html) nofib trends I noticed that fannkuch-redux appears to regress an almost unbelievable amount in its allocations:
## 7.6.3
```
<<ghc: 870987952 bytes, 1668 GCs (1666 + 2),
0/0 avg/max bytes residency (0 samples),
84640 bytes GC work, 1M in use,
0.00 INIT (0.00 elapsed), 2.43 MUT (2.43 elapsed), 0.00 GC (0.00 elapsed),
0.00 GC(0) (0.00 elapsed), 0.00 GC(1) (0.00 elapsed), 1 balance :ghc>>
```
## 7.4.2
```
<<ghc: 74944 bytes, 1 GCs (0 + 1),
0/0 avg/max bytes residency (0 samples),
3512 bytes GC work, 1M in use,
0.00 INIT (0.00 elapsed), 2.25 MUT (2.25 elapsed), 0.00 GC (0.00 elapsed),
0.00 GC(0) (0.00 elapsed), 0.00 GC(1) (0.00 elapsed), 1 balance :ghc>>
```
Given that [FoldrBuildNotes] suggests that this test is quite sensitive to fusion, I suspect something broke here.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.6.3 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
Milestone: 8.0.1

https://gitlab.haskell.org/ghc/ghc/-/issues/10678
**integer-gmp's runS seems unnecessarily expensive** (2021-02-01, rwbarton)

integer-gmp uses an unsafePerformIO-like operation to work with mutable BigNats (unsafePerformIO and even the IO type are not yet available, since integer-gmp is a dependency of base):
```hs
type S s a = State# s -> (# State# s, a #)
-- NB: equivalent of GHC.IO.unsafeDupablePerformIO, see notes there
runS :: S RealWorld a -> a
runS m = lazy (case m realWorld# of (# _, r #) -> r)
{-# NOINLINE runS #-}
```
It's tempting to think of such an operation as "free" like an unsafeCoerce, but it is actually somewhat expensive.
Consider `plusBigNat` for instance. (Most BigNat operations have a similar structure.)
```hs
plusBigNat :: BigNat -> BigNat -> BigNat
plusBigNat x y
  | isTrue# (eqBigNatWord# x 0##) = y
  | isTrue# (eqBigNatWord# y 0##) = x
  | isTrue# (nx# >=# ny#)         = go x nx# y ny#
  | True                          = go y ny# x nx#
  where
    go (BN# a#) na# (BN# b#) nb# = runS $ do
        mbn@(MBN# mba#) <- newBigNat# na#
        (W# c#) <- liftIO (c_mpn_add mba# a# na# b# nb#)
        case c# of
            0## -> unsafeFreezeBigNat# mbn
            _   -> unsafeSnocFreezeBigNat# mbn c#

    nx# = sizeofBigNat# x
    ny# = sizeofBigNat# y
```
The assembly for `go` begins
```
00000000000001d0 <integerzmgmp_GHCziIntegerziType_zdwgo_info>:
1d0: 49 83 c4 28 add $0x28,%r12
1d4: 4d 3b a5 58 03 00 00 cmp 0x358(%r13),%r12
1db: 77 26 ja 203 <integerzmgmp_GHCziIntegerziType_zdwgo_info+0x33>
1dd: 49 c7 44 24 e0 00 00 movq $0x0,-0x20(%r12)
1e4: 00 00
1e2: R_X86_64_32S .text+0x38
1e6: 4d 89 74 24 e8 mov %r14,-0x18(%r12)
1eb: 49 89 7c 24 f0 mov %rdi,-0x10(%r12)
1f0: 49 89 74 24 f8 mov %rsi,-0x8(%r12)
1f5: 4d 89 04 24 mov %r8,(%r12)
1f9: 4d 8d 74 24 e1 lea -0x1f(%r12),%r14
1fe: e9 00 00 00 00 jmpq 203 <integerzmgmp_GHCziIntegerziType_zdwgo_info+0x33>
1ff: R_X86_64_PC32 integerzmgmp_GHCziIntegerziType_runS_info-0x4
203: ... ; heap overflow
```
This allocates a 5-word closure (containing `a#`, `na#`, `b#`, `nb#`) whose code is at `.text+0x38` and passes it to `runS`, which does some `stg_ap`-y things to call back into the closure, which reads its free variables back from the heap and finally does all the real work. Altogether it's around two dozen instructions compared to if we could call directly from `go` to the argument of `runS`.
The old integer-gmp somehow avoided this particular overhead by instead using the implicit "unsafePerformIO" of a foreign import prim which performed both the allocation and the addition. Is this overhead a necessary consequence of doing the work in multiple steps in Haskell?
I understand that we cannot allow everything to be inlined and, for example, the `newBigNat#` to be shared between a `plusBigNat` and `minusBigNat` with the same arguments. But once `runS` has done its job of keeping the `newBigNat#/c_mpn_add/unsafeFreeze*` together, it would be nice to eliminate it completely in the backend when compiling `go`, or any inlined version of `go`.
I'm not sure whether this should be fixed in the code generator or in integer-gmp itself. I'm also aware that this is a tricky subject but haven't really done my homework on the related tickets, so I might be missing something important!

Milestone: 8.0.1

https://gitlab.haskell.org/ghc/ghc/-/issues/10359
**Tuple constraint synonym led to asymptotic performance lossage** (2021-12-16, axch)

I encountered a situation where using a tuple constraint synonym
produced an asymptotic runtime performance degradation, from linear to
quadratic time. The enclosed Bug.hs reproduces the potentially
relevant features of the problem. The function `test` there is
quadratic, but should be linear. The file also contains a definition
of `test2`, which differs only in eschewing the constraint synonym,
and is linear, as it should be.
In more detail: the program tries to count to 20,000 in a rather
contrived way. It maintains a Box, which holds the current count and
a function that can accept an increment and a count to produce a new
count (namely, `(+)`). The function `do_step` lifts incrementation to
Boxes, using their contained function to update their stored count.
The first tricky thing is that the incrementing function stored in the
Box is polymorphic in both arguments separately: it can accept
increments that are not the same type as the stored count (and
converts them). The second tricky thing is that I don't want to
expose the type variable for the increment type (as opposed to the
count type) to clients of the Box, so the Box's incrementing function
is stored polymorphic (with an explicit forall). Finally, I want to
impose the constraint `(Fractional num, Real num)` on allowable
increments (even though this example only needs `Real num`).
This constellation of desiderata conspires to produce quadratic
performance: doubling the parameter to `test` (but not `test2`)
multiplies its runtime by 4. Inspecting the heap profile indicates a
growing accumulation of closures when running `test` (but not
`test2`). Inspecting the generated stg (enclosed) locates a potential
culprit: apparently when `do_step` (but not `do_step2`) reconstructs
the Box, it changes the `func` stored there to be a fresh closure that
destructures the `Numerical` constraint tuple, constructs a fresh one,
and calls the function that was in that slot previously with it. I
hypothesize that this accumulates as a chain that performs a linear
number of such destructurings and restructurings at each increment,
leading to the observed quadratic runtime and linear memory growth.
Incidentally, I checked whether record wildcards were the issue (no:
`test4`) and whether updating just the `obj` field solves it (yes:
`test3`). Sadly, the latter solution is not directly usable in my
actual application because of "Record update for insufficiently
polymorphic field".
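Since Bug.hs itself is only attached to the ticket, here is a hypothetical reconstruction of the shape being described — the tuple constraint synonym, the Box with a rank-2 polymorphic field, and the full-Box rebuild in `do_step`. All names here are guesses, not the ticket's actual code:

```hs
{-# LANGUAGE ConstraintKinds, RankNTypes #-}

-- Hypothetical sketch of the structure described above (not the
-- attached Bug.hs): the stored function is polymorphic over the
-- increment type, constrained by a tuple constraint synonym.
type Numerical num = (Fractional num, Real num)

data Box count = Box
  { obj  :: count
  , func :: forall num. Numerical num => num -> count -> count
  }

-- Rebuilding the whole Box (rather than updating only `obj`) is the
-- variant reported to accumulate closures.
do_step :: Box count -> Box count
do_step (Box o f) = Box (f (1 :: Double) o) f
```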
For reference: I compiled with `ghc -O2 Bug.hs -fprof-auto -rtsopts -ddump-to-file -ddump-simpl -ddump-st`

Milestone: 8.0.1

https://gitlab.haskell.org/ghc/ghc/-/issues/10289
**compiling huge HashSet hogs memory** (2019-07-07, zudov)

Compiling a huge (~2.5k elements) set with GHC-7.10.1 or GHC-head and unordered-containers-0.2.5.1 or unordered-containers-head takes way too much memory.
Here is a [file](https://github.com/zudov/html5-entity/blob/unordered-containers/src/Text/Html/Entity/Data/EntitySet.hs) which I am trying to compile. I've also set up a [travis build](https://travis-ci.org/zudov/html5-entity/builds/58063328) which demonstrates the behaviour with different versions of GHC and unordered-containers. Further I would be referring to [this build-job](https://travis-ci.org/zudov/html5-entity/jobs/58063332) which uses GHC-7.10.1 and unordered-containers-0.2.5.1.
- [Compilation](https://travis-ci.org/zudov/html5-entity/jobs/58063332#L397) with default optimization options takes around [12GB](https://travis-ci.org/zudov/html5-entity/jobs/58063332#L408) of memory.
- [Compilation](https://travis-ci.org/zudov/html5-entity/jobs/58063332#L367) with `-O0` takes around [400MB](https://travis-ci.org/zudov/html5-entity/jobs/58063332#L378).
- [Compilation](https://travis-ci.org/zudov/html5-entity/jobs/58063332#L457) with `-O2` takes around [12GB](https://travis-ci.org/zudov/html5-entity/jobs/58063332#L468) of memory.
- [Compilation](https://travis-ci.org/zudov/html5-entity/jobs/58063332#L487) with `-O2 -fignore-interface-pragmas` takes around [500MB](https://travis-ci.org/zudov/html5-entity/jobs/58063332#L498) of memory, which solves the problem.
When the build uses GHC-7.8.4, neither of this hogging [occurs](https://travis-ci.org/zudov/html5-entity/jobs/58063330).
Another interesting observation is that [compiling](https://travis-ci.org/zudov/html5-entity/jobs/58063330#L339) [HashMap](https://github.com/zudov/html5-entity/blob/unordered-containers/src/Text/Html/Entity/Data/EntityMap.hs) of the same size, doesn't cause memory hogging even with `-O2`. This attracted my attention as `HashSet` is implemented in terms of `HashMap`.
I reported the issue to `unordered-containers` as well: [link](https://github.com/tibbe/unordered-containers/issues/100).
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------------ |
| Version | 7.10.1 |
| Type | Bug |
| TypeOfFailure | RuntimePerformance |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
Milestone: 8.0.1

https://gitlab.haskell.org/ghc/ghc/-/issues/10067
**The Read Integer instance is too slow** (2019-07-07, redneb)

The current implementation of the Read Integer instance has quadratic complexity and thus performs badly on large inputs. On my system, it takes 65 seconds to perform the following:
```hs
read (take 1000000 $ cycle "1234567890") :: Integer
```
Note that we already provide an ad-hoc instance for Show Integer, so maybe we can do the same for Read Integer.
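One way such an ad-hoc instance could beat the quadratic left fold is balanced combining of digit chunks. A hedged sketch (our names, not a proposed base API; it assumes a non-empty, all-digit string):

```hs
import Data.Char (digitToInt)

-- Each digit takes part in only O(log n) Integer multiplications, so
-- with GMP's subquadratic multiplication the whole parse beats the
-- naive (acc * 10 + d) left fold. Chunks are (value, scale) pairs,
-- where scale = 10 ^ chunk length, so "12" ++ "34" combines as
-- 12 * 100 + 34.
readDecimal :: String -> Integer
readDecimal = fst . untilOne . map (\c -> (fromIntegral (digitToInt c), 10))
  where
    untilOne [x] = x
    untilOne xs  = untilOne (pairUp xs)
    pairUp ((x, bx) : (y, by) : rest) = (x * by + y, bx * by) : pairUp rest
    pairUp xs                         = xs
```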
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | -------------- |
| Version | 7.11 |
| Type | FeatureRequest |
| TypeOfFailure | OtherFailure |
| Priority | low |
| Resolution | Unresolved |
| Component | libraries/base |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | ekmett, hvr |
| Operating system | |
| Architecture | |
</details>
Milestone: 8.0.1

https://gitlab.haskell.org/ghc/ghc/-/issues/10034
**Regression in mapM_ performance** (2019-07-07, David Feuer)

The current `mapM_` is sometimes worse than the version in 7.8.3. This can be fixed easily by eta-expansion, matching `mapM`.
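Eta-expansion here means giving the list argument explicitly so the definition has arity 2. A minimal sketch of that shape (illustrative only — base's real definition also interacts with foldr/build fusion):

```hs
-- Hypothetical eta-expanded variant: note the explicit `as`, which is
-- the whole point -- the un-eta-expanded `mapM_ f = foldr ...` has
-- arity 1 and can be costlier to call.
mapM_Eta :: Monad m => (a -> m b) -> [a] -> m ()
mapM_Eta f as = foldr (\x k -> f x >> k) (return ()) as
```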
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------------------------------ |
| Version | 7.10.1-rc2 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Core Libraries |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | core-libraries-committee@haskell.org |
| Operating system | |
| Architecture | |
</details>
Milestone: 8.0.1 (assignee: Ben Gamari)

https://gitlab.haskell.org/ghc/ghc/-/issues/10014
**Data.Array.Base.elems needlessly calls bounds.** (2019-07-07, Edward Kmett)

```
elems arr = case bounds arr of
    (_l, _u) -> [unsafeAt arr i | i <- [0 .. numElements arr - 1]]
```
It never uses the result.
I'd propose simplifying it to
```
elems arr = [unsafeAt arr i | i <- [0 .. numElements arr - 1]]
```
It appears at some point someone optimized it to use the `unsafeAt`, but never removed the bounds check.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------------------------------ |
| Version | 7.8.4 |
| Type | FeatureRequest |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Core Libraries |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | core-libraries-committee@haskell.org |
| Operating system | |
| Architecture | |
</details>
Milestone: 8.0.1 (assignee: Edward Kmett)