GHC issues (https://gitlab.haskell.org/ghc/ghc/-/issues)

https://gitlab.haskell.org/ghc/ghc/-/issues/15524
Performance regression when using the GHC API to evaluate code compared to 8.4 (2021-11-23, Vaibhav Sagar)

I've been in the process of updating IHaskell to support GHC 8.6, and the test suite has become agonisingly slow compared to GHC 8.4 and below. I've been able to work around this by using `--enable-executable-dynamic` as suggested by `christiaanb` on `#ghc`, but I thought this was worth reporting as a bug.
To reproduce this on a Linux installation with Nix:
1. `git clone https://github.com/gibiansky/IHaskell`
1. `cd IHaskell`
1. `git checkout 6058cd4fac01a2023dbd09d174f1f8d4c36e7475`
1. Comment out `--enable-executable-dynamic` in `release-8.6.nix`
1. `nix-build release-8.6.nix`

https://gitlab.haskell.org/ghc/ghc/-/issues/15503
interpreter: sequence_ (replicate 100000000 (return ())) gobbles up memory (2021-09-07, Bertram Felgenhauer)

I consider this to be more of a curiosity than a bug, but it might still be worth investigating. The following line,

```hs
sequence_ (replicate 100000000 (return ()))
```
causes `ghci` to grow quite big (here it's 6GB on 64 bit Linux; it's roughly proportional to the `100000000` number).
This used to be better in ancient times--before version 8.0.1, ghci ran this code in what looks like constant memory.
(This came up indirectly on #haskell; somebody had adapted Hutton's game of life program, http://www.cs.nott.ac.uk/~gmh/life.lhs, which uses busy waiting for a delay.)
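For comparison, a busy-wait written as explicit recursion (a hedged sketch, not code from the ticket) builds no intermediate list at all, so it should stay in constant memory under both the compiler and the interpreter; the ticket's `100000000` is scaled down here to keep the example quick:

```haskell
module Main where

-- A counting loop with no intermediate list: only one suspended
-- subtraction exists at a time, so memory use stays constant.
busyWait :: Int -> IO ()
busyWait 0 = return ()
busyWait n = busyWait (n - 1)

main :: IO ()
main = busyWait 1000000 >> putStrLn "done"
```

The key difference from `sequence_ (replicate n (return ()))` is that nothing here ever materializes an n-element structure.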
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.4.3 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | low |
| Resolution | Unresolved |
| Component | GHCi |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
https://gitlab.haskell.org/ghc/ghc/-/issues/15176
Superclass `Monad m =>` makes program run 100 times slower (2019-11-16, danilo2)

Hi! I've just encountered a very bizarre error.

### General description

The code:

```hs
class LayersFoldableBuilder__ t (layers :: [Type]) m where
buildLayersFold__ :: SomePtr -> m (Fold.Result t) -> m (Fold.Result t)
instance Monad m => LayersFoldableBuilder__ t '[] m where
buildLayersFold__ = \_ a -> a
{-# INLINE buildLayersFold__ #-}
instance ( MonadIO m
, Storable.Storable (Layer.Cons l ())
, Layer.StorableLayer l m
, LayerFoldableBuilder__ (EnabledLayer t l) t m l
, LayersFoldableBuilder__ t ls m )
=> LayersFoldableBuilder__ t (l ': ls) m where
buildLayersFold__ = \ptr mr -> do
let fs = buildLayersFold__ @t @ls ptr'
ptr' = Ptr.plusPtr ptr $ Layer.byteSize @l
layerBuild__ @(EnabledLayer t l) @t @m @l ptr $! fs mr
{-# INLINE buildLayersFold__ #-}
```
This is the typeclass `LayersFoldableBuilder__` and ALL of its instances. Please note that every instance has a `Monad m` or `MonadIO m` constraint. The program which uses this code heavily runs in 40ms. If we only add the constraint `Monad m =>` to the class definition:
```hs
class Monad m => LayersFoldableBuilder__ t (layers :: [Type]) m where
buildLayersFold__ :: SomePtr -> m (Fold.Result t) -> m (Fold.Result t)
```
The program runs in 3.5s, which is almost 100 times slower.
Unfortunately I do not have a minimal example, but it is reproducible. It is part of the Luna Language codebase: https://github.com/luna/luna-core/blob/60bf6130691c23e52b97b067b52becb8fdb0c72e/core/src/Data/Graph/Traversal/Scoped.hs#L102
It was introduced in commit 60bf6130691c23e52b97b067b52becb8fdb0c72e on the `static-layers` branch. However, building it is simple: `stack bench luna-core`. After invoking it we see the described results.
### Why it's important and what we should do to fix it
1. I am writing this because I care about the Haskell community. I want GHC and Haskell to be widely used. Right now, the one thing I hear from all the companies around ours is that unpredictable performance, even when following the rules for how to write efficient code, is the biggest pain people have. Haskell is gathering attention: pure functional programming, immutability, and so on are great. But it will not become a popular choice unless we care about predictable performance.
1. Such performance changes are unacceptable if we think of Haskell and GHC as production-ready systems. We need a clear way to write high-performance Haskell without the need to benchmark every part of our programs whenever we refactor. GHC has enough information to discover that we want high performance here and there (even by looking at INLINE pragmas) and should warn us about a lack of optimization. We should also have a way to force GHC to apply optimizations in particular places, for example by explicitly marking code to always be specialized during compilation, so that GHC would never fall back to dictionary-passing there. Such a possibility would solve MANY related problems and user fears.
1. Point 2 also applies to strictness. In my opinion, having clearer strictness resolution rules and tools is important. Right now the only way to know whether strictness analysis did a good job and we are not constantly boxing/unboxing things is to read Core, which is tedious, and 99% of Haskell users do not even know how to do it (we've got 10 really, really good Haskellers here; 2 of them are capable of reading Core, but not very fluently). I would love to chat more about these topics, because they are crucial for growing the Haskell community and making Haskell a more popular choice, which is what we want, right? We don't want Haskell to be just a research project with "all its users being its authors", am I right?
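For reference, the closest thing GHC offers today to "explicitly marking code to always be specialized" is the pragma pair below; this is a hedged sketch of the standard mechanism, not code from the ticket and not a fix for it:

```haskell
module Main where

-- INLINABLE keeps the function's unfolding available across modules so
-- call sites can be specialised; SPECIALIZE additionally forces GHC to
-- emit a monomorphic copy compiled without the Num dictionary.
double :: Num a => a -> a
double x = x + x
{-# INLINABLE double #-}
{-# SPECIALIZE double :: Int -> Int #-}

main :: IO ()
main = print (double (21 :: Int))
```

At an `Int` call site GHC can then use the dictionary-free copy instead of the generic, dictionary-passing one.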
### What happens in Core
I inspected the Core and found that, indeed, after adding the constraint GHC does not apply all optimizations to the definitions. To be honest, I completely don't understand it, because the code uses explicit `INLINE` pragmas everywhere to be sure everything is optimized away at compilation time:
```
--------------------------------------------------------------------------------
SLOW, with Monad m =>
--------------------------------------------------------------------------------
-- RHS size: {terms: 5, types: 12, coercions: 4, joins: 0/0}
buildLayersFold__ [InlPrag=INLINE]
:: forall t (layers :: [*]) (m :: * -> *).
LayersFoldableBuilder__ t layers m =>
SomePtr -> m (Fold.Result t) -> m (Fold.Result t)
[GblId[ClassOp],
Arity=1,
Caf=NoCafRefs,
Str=<S,U>,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=1,unsat_ok=False,boring_ok=True)
Tmpl= \ (@ t_ao0O)
(@ (layers_ao0P :: [*]))
(@ (m_ao0Q :: * -> *))
(v_B1 [Occ=Once]
:: LayersFoldableBuilder__ t_ao0O layers_ao0P m_ao0Q) ->
v_B1
`cast` (Data.Graph.Traversal.Scoped.N:LayersFoldableBuilder__[0]
<t_ao0O>_N <layers_ao0P>_N <m_ao0Q>_N
:: (LayersFoldableBuilder__
t_ao0O layers_ao0P m_ao0Q :: Constraint)
~R# (SomePtr
-> m_ao0Q (Fold.Result t_ao0O)
-> m_ao0Q (Fold.Result t_ao0O) :: *))}]
buildLayersFold__
= \ (@ t_ao0O)
(@ (layers_ao0P :: [*]))
(@ (m_ao0Q :: * -> *))
(v_B1 :: LayersFoldableBuilder__ t_ao0O layers_ao0P m_ao0Q) ->
v_B1
`cast` (Data.Graph.Traversal.Scoped.N:LayersFoldableBuilder__[0]
<t_ao0O>_N <layers_ao0P>_N <m_ao0Q>_N
:: (LayersFoldableBuilder__
t_ao0O layers_ao0P m_ao0Q :: Constraint)
~R# (SomePtr
-> m_ao0Q (Fold.Result t_ao0O)
-> m_ao0Q (Fold.Result t_ao0O) :: *))
--------------------------------------------------------------------------------
FAST, without Monad m =>
--------------------------------------------------------------------------------
-- RHS size: {terms: 8, types: 25, coercions: 0, joins: 0/0}
Data.Graph.Traversal.Scoped.$p1LayersFoldableBuilder__
:: forall t (layers :: [*]) (m :: * -> *).
LayersFoldableBuilder__ t layers m =>
Monad m
[GblId[ClassOp],
Arity=1,
Caf=NoCafRefs,
Str=<S(SL),U(U,A)>,
RULES: Built in rule for Data.Graph.Traversal.Scoped.$p1LayersFoldableBuilder__: "Class op $p1LayersFoldableBuilder__"]
Data.Graph.Traversal.Scoped.$p1LayersFoldableBuilder__
= \ (@ t_ao0P)
(@ (layers_ao0Q :: [*]))
(@ (m_ao0R :: * -> *))
(v_B1 :: LayersFoldableBuilder__ t_ao0P layers_ao0Q m_ao0R) ->
case v_B1 of v_B1
{ Data.Graph.Traversal.Scoped.C:LayersFoldableBuilder__ v_B2
v_B3 ->
v_B2
}
-- RHS size: {terms: 8, types: 25, coercions: 0, joins: 0/0}
buildLayersFold__
:: forall t (layers :: [*]) (m :: * -> *).
LayersFoldableBuilder__ t layers m =>
SomePtr -> m (Fold.Result t) -> m (Fold.Result t)
[GblId[ClassOp],
Arity=1,
Caf=NoCafRefs,
Str=<S(LS),U(A,U)>,
RULES: Built in rule for buildLayersFold__: "Class op buildLayersFold__"]
buildLayersFold__
= \ (@ t_ao0P)
(@ (layers_ao0Q :: [*]))
(@ (m_ao0R :: * -> *))
(v_B1 :: LayersFoldableBuilder__ t_ao0P layers_ao0Q m_ao0R) ->
case v_B1 of v_B1
{ Data.Graph.Traversal.Scoped.C:LayersFoldableBuilder__ v_B2
v_B3 ->
v_B3
}
```

https://gitlab.haskell.org/ghc/ghc/-/issues/14980
Runtime performance regression with binary operations on vectors (2021-09-07, ttylec)

Let me briefly explain our use case: we have a machine learning tool created in Haskell. Basically it builds association rules for a database. In plain English: predicates on data rows. We need to score them, so we need to check how many rows are "matched" by the predicate.
In order to optimize performance, our code uses two main representations for the data: one is "human readable", where values are real values; the second is a binarized representation for categorical data. The latter is actually a family of representations, since we pack bits into tuples of Word64 and use bitwise operations to implement the logic.
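To make the packed representation concrete, here is a hedged reconstruction of the idea (our own illustration, not the attached benchmark code): each Word64 holds one binary column for 64 rows, matching a conjunction of predicates is a bitwise AND, and scoring is a popCount over the matched bits.

```haskell
module Main where

import Data.Bits (popCount, (.&.))
import Data.Word (Word64)

-- Count rows matched by both predicates: AND the 64-row chunks
-- pairwise, then count the set bits.
matchCount :: [Word64] -> [Word64] -> Int
matchCount colA colB = sum (zipWith (\a b -> popCount (a .&. b)) colA colB)

main :: IO ()
main = print (matchCount [0xFF, 0x0F] [0x0F, 0xF0])
```

This is why the binary representation can be an order of magnitude faster: one AND plus one popCount processes 64 rows at a time.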
Simplified but representative code is attached.
In GHC 8.0.2 binary representation is faster by one order of magnitude than the "human readable" one:
```
➜ ghc-bug stack exec performance-bug-pair-1
"Generated"
benchmarking 256 columns/raw unbox vectors
time 342.6 μs (338.3 μs .. 348.0 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 339.3 μs (337.5 μs .. 342.0 μs)
std dev 7.203 μs (5.596 μs .. 10.07 μs)
variance introduced by outliers: 13% (moderately inflated)
benchmarking 256 columns/binary packed
time 32.07 μs (29.69 μs .. 34.14 μs)
0.982 R² (0.976 R² .. 0.997 R²)
mean 29.97 μs (29.41 μs .. 31.03 μs)
std dev 2.428 μs (1.526 μs .. 3.750 μs)
variance introduced by outliers: 78% (severely inflated)
```
In GHC 8.2.2 (and later) binary representation performance is similar to "human readable", so no performance gain:
```
➜ ghc-bug stack exec performance-bug-pair-1
"Generated"
benchmarking 256 columns/raw unbox vectors
time 442.4 μs (406.7 μs .. 474.5 μs)
0.978 R² (0.969 R² .. 0.993 R²)
mean 399.3 μs (391.3 μs .. 415.1 μs)
std dev 34.73 μs (20.36 μs .. 53.29 μs)
variance introduced by outliers: 71% (severely inflated)
benchmarking 256 columns/binary packed
time 378.6 μs (295.8 μs .. 488.0 μs)
0.637 R² (0.492 R² .. 0.780 R²)
mean 568.1 μs (437.1 μs .. 747.6 μs)
std dev 308.7 μs (233.6 μs .. 386.1 μs)
variance introduced by outliers: 98% (severely inflated)
```
However, for certain compilation paths we can still get a speedup similar to GHC 8.0.2's. In the above examples we used a 4-tuple of Word64 as the binary representation. In the following code we run two tests: one on just Word64 and one on a 4-tuple of Word64. The only difference is that we add the plain Word64 case:
```
➜ ghc-bug stack exec performance-bug-pair-2
"Generated"
benchmarking 64 columns/raw unbox vectors
time 337.6 μs (336.1 μs .. 339.3 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 349.6 μs (344.4 μs .. 359.7 μs)
std dev 23.22 μs (15.39 μs .. 38.22 μs)
variance introduced by outliers: 60% (severely inflated)
benchmarking 64 columns/binary packed
time 21.66 μs (21.53 μs .. 21.79 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 21.68 μs (21.53 μs .. 21.89 μs)
std dev 613.2 ns (466.0 ns .. 806.0 ns)
variance introduced by outliers: 30% (moderately inflated)
benchmarking 256 columns/raw unbox vectors
time 344.5 μs (341.6 μs .. 348.2 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 345.1 μs (342.5 μs .. 349.3 μs)
std dev 10.66 μs (7.997 μs .. 16.34 μs)
variance introduced by outliers: 25% (moderately inflated)
benchmarking 256 columns/binary packed
time 28.04 μs (27.70 μs .. 28.46 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 28.05 μs (27.85 μs .. 28.30 μs)
std dev 758.2 ns (628.2 ns .. 907.6 ns)
variance introduced by outliers: 27% (moderately inflated)
```
I made a variant of the code with a simpler accumulating function (in our real code we accumulate a pair of integers; the simplified accumulator works on one integer). GHC 8.2.2 in that case loses the speed-up with 4-tuples:
```
➜ ghc-bug stack exec performance-bug-2
"Generated"
benchmarking 64 columns/raw unbox vectors
time 333.8 μs (333.0 μs .. 335.1 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 333.6 μs (332.4 μs .. 335.8 μs)
std dev 5.651 μs (3.233 μs .. 9.582 μs)
benchmarking 64 columns/binary packed
time 39.06 μs (38.98 μs .. 39.14 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 38.94 μs (38.83 μs .. 39.14 μs)
std dev 495.0 ns (310.2 ns .. 782.1 ns)
benchmarking 256 columns/raw unbox vectors
time 336.9 μs (336.2 μs .. 337.9 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 337.5 μs (336.2 μs .. 339.8 μs)
std dev 5.757 μs (2.946 μs .. 8.979 μs)
benchmarking 256 columns/binary packed
time 239.8 μs (237.6 μs .. 243.0 μs)
0.998 R² (0.996 R² .. 0.999 R²)
mean 251.4 μs (247.9 μs .. 259.8 μs)
std dev 11.50 μs (5.662 μs .. 19.79 μs)
variance introduced by outliers: 37% (moderately inflated)
```
In GHC 8.0.2 we have speed-up in both cases.
What may be related: using random-fu to generate random numbers seems to produce broken code on GHC 8.2.2 (the `performance-bug-rfu.hs` source).
Short description of attached sources:
- performance-bug-pair-2.hs: using two binary representations
- performance-bug-pair-1.hs: using one binary representation (one case commented)
- performance-bug-1.hs: using one binary representation with simplified accumulator
- performance-bug-2.hs: using two binary representation with simplified accumulator
- performance-bug-rfu.hs: using random-fu to generate data (optional)
- stack-8.0.yaml: stack file for GHC-8.0.2
- stack-8.2.yaml: stack file for GHC-8.2.2
- stack-nightly.yaml: stack file for GHC-8.4
- performance-bug.cabal: to be able to stack build everything.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.2.2 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
https://gitlab.haskell.org/ghc/ghc/-/issues/14226
Common Block Elimination pass doesn't eliminate common blocks (2021-09-07, Ben Gamari)

In #14222 it was noted that something appears to be broken in `CmmCommonBlockElim`. Consider the program from that ticket,
```hs
module T14221 where
import Data.Text as T
isNumeric :: Text -> Bool
isNumeric t =
T.all isNumeric' t && T.any isNumber t
where
isNumber c = '0' <= c && c <= '9'
isNumeric' c = isNumber c
|| c == 'e'
|| c == 'E'
|| c == '.'
|| c == '-'
|| c == '+'
```
This program produces six copies of a block of the form,
```
c6JT:
R2 = I64[R1 + 7];
R1 = P64[Sp + 8];
Sp = Sp + 16;
call $wloop_all_s6CQ_info(R2, R1) args: 8, res: 0, upd: 8;
```
in the `-ddump-opt-cmm` output, which are manifest in the assembler as,
```asm
block_c6JT_info:
_c6JT:
movq 7(%rbx),%r14
movq 8(%rbp),%rbx
addq $16,%rbp
jmp $wloop_all_s6CQ_info
```
CBE really ought to be catching these.

https://gitlab.haskell.org/ghc/ghc/-/issues/14013
Bad monads performance (2022-07-18, danilo2)

Hi! We've been struggling with a very strange GHC behavior on IRC today. Let's consider the following code (needs mtl and criterion to be compiled):
```hs
module Main where
import Prelude
import Criterion.Main
import qualified Control.Monad.State.Strict as Strict
import qualified Control.Monad.State.Class as State
import Control.DeepSeq (NFData, rnf, force)
import GHC.IO (evaluate)
import Data.Monoid
-----------------------------
-- === Criterion utils === --
-----------------------------
eval :: NFData a => a -> IO a
eval = evaluate . force ; {-# INLINE eval #-}
liftExp :: (Int -> a) -> (Int -> a)
liftExp f = f . (10^) ; {-# INLINE liftExp #-}
expCodeGen :: NFData a => (Int -> a) -> (Int -> IO a)
expCodeGen f i = do
putStrLn $ "generating input code (10e" <> show i <> " chars)"
out <- eval $ liftExp f i
  putStrLn "code generated successfully"
return out
{-# INLINE expCodeGen #-}
expCodeGenBench :: (NFData a, NFData b) => (Int -> a) -> (a -> b) -> Int -> Benchmark
expCodeGenBench f p i = env (expCodeGen f i) $ bench ("10e" <> show i) . nf p ; {-# INLINE expCodeGenBench #-}
-------------------------------
-- === (a*) list parsing === --
-------------------------------
genList_a :: Int -> [Char]
genList_a i = replicate i 'a' ; {-# INLINE genList_a #-}
pureListParser_a :: [Char] -> Bool
pureListParser_a = \case
'a':s -> pureListParser_a s
[] -> True
_ -> False
{-# INLINE pureListParser_a #-}
mtlStateListParser_a :: State.MonadState [Char] m => m Bool
mtlStateListParser_a = State.get >>= \case
'a':s -> State.put s >> mtlStateListParser_a
[] -> return True
_ -> return False
{-# INLINE mtlStateListParser_a #-}
mtlStateListParser_a_typed :: Strict.State [Char] Bool
mtlStateListParser_a_typed = State.get >>= \case
'a':s -> State.put s >> mtlStateListParser_a_typed
[] -> return True
_ -> return False
{-# INLINE mtlStateListParser_a_typed #-}
mtlStateListParser_a_let :: Strict.MonadState [Char] m => m Bool
mtlStateListParser_a_let = go where
go = Strict.get >>= \case
'a':s -> Strict.put s >> go
[] -> return True
_ -> return False
{-# INLINE mtlStateListParser_a_let #-}
{-# SPECIALIZE mtlStateListParser_a :: Strict.State [Char] Bool #-}
{-# SPECIALIZE mtlStateListParser_a_typed :: Strict.State [Char] Bool #-}
main = do
defaultMain
[ bgroup "a*" $
[ bgroup "pure" $ expCodeGenBench genList_a pureListParser_a <$> [6..6]
, bgroup "mtl.State.Strict" $ expCodeGenBench genList_a (Strict.evalState mtlStateListParser_a) <$> [6..6]
, bgroup "mtl.State.Strict typed" $ expCodeGenBench genList_a (Strict.evalState mtlStateListParser_a_typed) <$> [6..6]
, bgroup "mtl.State.Strict let" $ expCodeGenBench genList_a (Strict.evalState mtlStateListParser_a_let) <$> [6..6]
]
]
```
The code was compiled with the following options (and many other variations): `-threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100 -funfolding-use-threshold=10000 -fexpose-all-unfoldings -fsimpl-tick-factor=1000 -flate-dmd-anal`
Everything in this code has an `INLINE` pragma. The important parts we should focus on are these two functions:
```hs
pureListParser_a :: [Char] -> Bool
pureListParser_a = \case
'a':s -> pureListParser_a s
[] -> True
_ -> False
{-# INLINE pureListParser_a #-}
mtlStateListParser_a :: State.MonadState [Char] m => m Bool
mtlStateListParser_a = State.get >>= \case
'a':s -> State.put s >> mtlStateListParser_a
[] -> return True
_ -> return False
{-# INLINE mtlStateListParser_a #-}
```
These are just "parsers" accepting strings containing only 'a' characters. The former is a pure one, while the latter uses `State` to keep the remaining input. The following list contains performance-related observations:
1. For the rest of the points, let's call the performance of `pureListParser_a` "good" and everything worse "bad".
1. The performance of `mtlStateListParser_a` is bad: it runs 10 times slower than `pureListParser_a`. Inspecting the Core we can observe that GHC jumps between the `(# a,b #)` and `(a,b)` representations all the time.
1. If we add a specialize pragma `{-# SPECIALIZE mtlStateListParser_a :: Strict.State [Char] Bool #-}`, the performance of `mtlStateListParser_a` is good (exactly the same as `pureListParser_a`'s).
1. If we do NOT use a specialize pragma but do use an explicit, non-polymorphic type signature `mtlStateListParser_a_typed :: Strict.State [Char] Bool`, the performance is bad (!), identical to the polymorphic version without specialization.
1. If we use a SPECIALIZE pragma together with an explicit, non-polymorphic type, i.e. BOTH `mtlStateListParser_a_typed :: Strict.State [Char] Bool` AND `{-# SPECIALIZE mtlStateListParser_a_typed :: Strict.State [Char] Bool #-}`, we get the good performance.
1. If we transform `mtlStateListParser_a` to
```hs
mtlStateListParser_a_let :: Strict.MonadState [Char] m => m Bool
mtlStateListParser_a_let = go where
go = Strict.get >>= \case
'a':s -> Strict.put s >> go
[] -> return True
_ -> return False
{-# INLINE mtlStateListParser_a_let #-}
```
we again get the good performance without the need to use `SPECIALIZE` pragmas.
1. The performance of all the functions that are not optimized as well as `pureListParser_a` is a lot worse in GHC 8.2.1-rc3 than in 8.0.2.
1. The not-yet-documented flag `-fspecialise-aggressively` does NOT affect the results (https://ghc.haskell.org/trac/ghc/ticket/12463).
1. If you do NOT use the `INLINE` pragma on `mtlStateListParser_a` and `mtlStateListParser_a_typed`, their performance is good (so the `INLINE` pragma makes it bad until we provide an explicit specialization). Moreover, if we use the `INLINABLE` pragma instead of `INLINE` on these functions (which logically makes more sense, because they are recursive), the performance of the polymorphic `mtlStateListParser_a` is good, while the performance of the explicitly typed `mtlStateListParser_a_typed` is bad until we provide an explicit specialization.
The above points raise the following questions:
1. Why does GHC not optimize `mtlStateListParser_a` the same way as `pureListParser_a`, and where does the jumping between `(# a,b #)` and `(a,b)` come from?
1. Is there any way to tell GHC to automatically insert `SPECIALIZE` pragmas, especially in performance-critical code?
1. Why does providing the very explicit type signature `mtlStateListParser_a_typed :: Strict.State [Char] Bool` not solve the problem unless we also use a `SPECIALIZE` pragma that says the same thing as the signature? (GHC even warns: `SPECIALISE pragma for non-overloaded function ‘mtlStateListParser_a_typed’`, but it affects performance.)
1. Why does the trick of aliasing the body of a recursive function to a local variable `go` affect performance at all, especially when it does NOT bring any variable into the local let scope?
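On question 4, a hedged sketch of the local-alias trick outside the parser setting (our own illustration, essentially a static argument transformation): by tying the recursion on the local `go`, the dictionary can be captured once in `go`'s closure instead of being passed again on every recursive call.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- The constrained entry point calls a local worker; only the worker
-- recurses, so the Num dictionary need not travel with each call.
count :: Num n => [a] -> n
count = go 0
  where
    go !acc []     = acc
    go !acc (_:xs) = go (acc + 1) xs

main :: IO ()
main = print (count "abc" :: Int)
```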
We've been testing this behavior in GHC 8.0.2 and 8.2.1-rc3, and several people reported exactly the same observations.

https://gitlab.haskell.org/ghc/ghc/-/issues/13763
Performance regression (~34%) in 8.2.1, poor register allocation(?) in an inner loop over an array (2021-09-07, jberryman)

Testing GHC 8.0.1 against the RC 8.2.0.20170507.

I've distilled a smallish test-case from a much larger case in my 'hashabler' library, and validated that the same modifications also make that regression disappear in the real case. It's probably possible to get this smaller but I don't know if I'll have time to work on it more for a while:
repro3.hs:
```hs
{-# LANGUAGE BangPatterns #-}
module Main(main) where
import Criterion.Main
import qualified Data.Primitive as P
import Data.Bits
import Data.Word
import Control.DeepSeq
main :: IO ()
main = do
defaultMain
[ env (newByteArr64 5) $ \ ~bs ->
bench "ByteArray 5" $ nf (hashTextSip 99) bs
, env (newByteArr64 8) $ \ ~bs ->
bench "ByteArray 8" $ nf (hashTextSip 99) bs
, env (newByteArr64 512) $ \ ~bs ->
bench "ByteArray 512" $ nf (hashTextSip 99) bs
, env (newByteArr64 1000000) $ \ ~bs ->
bench "ByteArray 1000000" $ nf (hashTextSip 99) bs
]
instance NFData P.ByteArray where rnf _ = ()
newByteArr64 n = P.newAlignedPinnedByteArray (8*n) 8 >>= P.unsafeFreezeByteArray
sipRound :: Word64 -> Word64 -> Word64 -> Word64 -> (Word64, Word64, Word64, Word64)
{-# INLINE sipRound #-}
sipRound v0 v1 v2 v3 = (v3 `xor` v0, v0 `xor` v1, v1 `xor` v2, v2 `xor` v3)
hashTextSip :: Word64 -> P.ByteArray -> Word64
{-# INLINE hashTextSip #-}
hashTextSip h = \ ba ->
let !lenWord16 = P.sizeofByteArray ba `unsafeShiftR` 1
!word16sRem = lenWord16 .&. 3
!word16sIx = lenWord16-word16sRem
!ixFinal = lenWord16-1
!word16sIxWd = word16sIx `unsafeShiftR` 2 -- `div` 4
hash4Word16sLoop hAcc@(!w0,!w1,!w2,!w3) !ix
| ix == word16sIxWd = hashRemainingWord16s hAcc word16sIx
| otherwise =
let w64Dirty = P.indexByteArray ba ix
w64 = clean4xWord16ChunkLE w64Dirty
in hash4Word16sLoop (sipRound (w0 `xor` w64) w1 w2 w3) (ix + 1)
-- NOTE: Removing this causes regression to disappear as well.
hashRemainingWord16s (!w0,!w1,!w2,!w3) !ix
| ix > ixFinal = w0
| otherwise =
let w16 = P.indexByteArray ba ix
in hashRemainingWord16s (sipRound (w0 `xor` (fromIntegral (w16 :: Word16))) w1 w2 w3) (ix+1)
in hash4Word16sLoop (h,1,2,3) 0
clean4xWord16ChunkLE :: Word64 -> Word64
{-# INLINE clean4xWord16ChunkLE #-}
clean4xWord16ChunkLE w64Dirty =
-- NOTE: no regression when just this (8.2 is faster)
-- (((byteSwap64 w64Dirty) `unsafeShiftR` 8) .&. 0x00FF00FF00FF00FF)
-- ...but this is a big regression:
(((byteSwap64 w64Dirty) `unsafeShiftR` 8) .&. 0x00FF00FF00FF00FF)
.|.
(((byteSwap64 w64Dirty) `unsafeShiftL` 8) .&. 0xFF00FF00FF00FF00)
```
Here are the results of the benchmark above on my machine:
On GHC \*\*8.0.1\*\*:
```
benchmarking ByteArray 5
time 24.70 ns (24.00 ns .. 26.25 ns)
0.987 R² (0.967 R² .. 1.000 R²)
mean 24.44 ns (24.13 ns .. 25.80 ns)
std dev 1.859 ns (318.3 ps .. 4.227 ns)
variance introduced by outliers: 86% (severely inflated)
benchmarking ByteArray 8
time 32.66 ns (32.58 ns .. 32.76 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 32.79 ns (32.64 ns .. 33.09 ns)
std dev 683.7 ps (365.4 ps .. 1.175 ns)
variance introduced by outliers: 31% (moderately inflated)
benchmarking ByteArray 512
time 1.428 μs (1.382 μs .. 1.522 μs)
0.986 R² (0.970 R² .. 1.000 R²)
mean 1.398 μs (1.384 μs .. 1.454 μs)
std dev 91.12 ns (4.475 ns .. 193.9 ns)
variance introduced by outliers: 76% (severely inflated)
benchmarking ByteArray 1000000
time 2.658 ms (2.653 ms .. 2.663 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 2.672 ms (2.665 ms .. 2.691 ms)
std dev 35.00 μs (10.88 μs .. 59.58 μs)
```
And on \*\*GHC 8.2\*\* RC:
```
benchmarking ByteArray 5
time 23.78 ns (23.68 ns .. 23.88 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 23.83 ns (23.76 ns .. 23.95 ns)
std dev 298.8 ps (183.2 ps .. 482.5 ps)
variance introduced by outliers: 14% (moderately inflated)
benchmarking ByteArray 8
time 35.81 ns (35.44 ns .. 36.27 ns)
0.999 R² (0.998 R² .. 1.000 R²)
mean 35.56 ns (35.45 ns .. 35.94 ns)
std dev 596.8 ps (134.5 ps .. 1.184 ns)
variance introduced by outliers: 22% (moderately inflated)
benchmarking ByteArray 512
time 1.706 μs (1.698 μs .. 1.716 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.701 μs (1.698 μs .. 1.707 μs)
std dev 13.27 ns (5.825 ns .. 24.41 ns)
benchmarking ByteArray 1000000
time 3.322 ms (3.284 ms .. 3.377 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 3.296 ms (3.287 ms .. 3.332 ms)
std dev 44.62 μs (20.55 μs .. 87.29 μs)
```
Looking at the core wasn't fruitful, but I think dumping the asm shows that this is a case of bad (or worse) register allocation. I've attached two screenshots showing the instructions added (in blue), when moving from the one-line `clean4xWord16ChunkLE` to the two-line version, for both 8.0 and 8.2 (there wasn't anything in the diff besides instances of this change).
It looks in the 8.2 version like we've decided we're out of registers and need to use the stack.
In my real code I'm seeing a 35% regression on very long Text, as well as a 21% regression on very long ByteString; the latter implementation is structured similarly to `hashTextSip` and doesn't call `clean4xWord16ChunkLE`, but it does do a byteswap.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | -------------- |
| Version | 8.2.1-rc2 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler (NCG) |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
https://gitlab.haskell.org/ghc/ghc/-/issues/13233
typePrimRep panic while compiling GHC with profiling (Ben Gamari, 2021-10-06)

I am seeing this panic while compiling GHC stage2,
```
ghc-stage1: panic! (the 'impossible' happened)
(GHC version 8.1.20170124 for x86_64-unknown-linux):
runtimeRepPrimRep
typePrimRep (a :: TYPE k0)
k0
Call stack:
?callStack, called at compiler/utils/Util.hs:1352:50 in ghc:Util
prettyCurrentCallStack, called at compiler/utils/Outputable.hs:1166:58 in ghc:Outputable
callStackDoc, called at compiler/utils/Outputable.hs:1170:37 in ghc:Outputable
pprPanic, called at compiler/simplStg/RepType.hs:360:5 in ghc:RepType
Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
<<ghc: 1038844096 bytes, 158 GCs, 21564561/73278928 avg/max bytes residency (8 samples), 148M in use, 0.000 INIT (0.000 elapsed), 1.113 MUT (1.125 elapsed), 0.935 GC (0.940 elapsed) :ghc>>
compiler/ghc.mk:582: recipe for target 'compiler/stage2/build/StgCmmMonad.p_o' failed
```
This appears to be due to my enabling of profiling. `build.mk` contains,
```
BuildFlavour=prof
ifneq "$(BuildFlavour)" ""
include mk/flavours/$(BuildFlavour).mk
endif
HADDOCK_DOCS=YES
compiler_HC_OPTS += -fprof-auto-exported
```
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.0.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | high |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | goldfire |
| Operating system | |
| Architecture | |
</details>
Assignee: sheaf (sam.derbyshire@gmail.com)

https://gitlab.haskell.org/ghc/ghc/-/issues/8949
Deprecate -msse2 and -msse flags (errge, 2021-09-07)

I propose msse2 to be on by default. I guess the default was chosen way back, when Pentium III support was still relevant.
Nowadays we don't really win much from the broader CPU support, because e.g. https://github.com/tibbe/hashable/blob/master/hashable.cabal is built by default with SSE2 at the injected-C-code level. And hashable has a lot of reverse dependencies, so on an end-user system (Red Hat or Debian) a user with a Pentium III CPU is most probably out of luck anyway.
Flipping this default would also fix the excess precision problem for most users of GHC on the i686 platform.
GHC should provide a -mno-sse2 flag for the cases when this needs to be disabled.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------------ |
| Version | 7.9 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler (CodeGen) |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | simonmar |
| Operating system | |
| Architecture | |
</details>
https://gitlab.haskell.org/ghc/ghc/-/issues/8702
floor, ceiling, truncate and so on… on NaN fail (phaazon, 2019-07-07)

Those functions return a totally random value when given NaN. This is not acceptable behavior, I guess, and we might inquire on what to do.
For instance:
```
let nan = 0 / 0 in floor nan
```
result:
```
-269653970229347386159395778618353710042696546841345985910145121736599013708251444699062715983611304031680170819807090036488184653221624933739271145959211186566651840137298227914453329401869141179179624428127508653257226023513694322210869665811240855745025766026879447359920868907719574457253034494436336205824
```
Maybe an arithmetic exception would be a better behavior?
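Until the behaviour changes, callers can guard against NaN (and infinities, which misbehave the same way) before converting; `safeFloor` below is a hypothetical helper, not part of `base`:

```haskell
-- Guard against non-finite inputs before calling floor; without the
-- guard, floor on NaN or ±Infinity yields a meaningless huge Integer.
safeFloor :: Double -> Maybe Integer
safeFloor x
  | isNaN x || isInfinite x = Nothing
  | otherwise               = Just (floor x)

main :: IO ()
main = do
  print (safeFloor (0 / 0))  -- Nothing
  print (safeFloor 2.5)      -- Just 2
```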
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | -------------- |
| Version | 7.4.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | high |
| Resolution | Unresolved |
| Component | libraries/base |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
https://gitlab.haskell.org/ghc/ghc/-/issues/5642
Deriving Generic of a big type takes a long time and lots of space (basvandijk, 2024-03-28)

If I load the following module into `ghci` my system will run out of memory after about 15 minutes:
```
{-# LANGUAGE DeriveGeneric #-}
import GHC.Generics
data BigSum =
C0 | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9
| C10 | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19
| C20 | C21 | C22 | C23 | C24 | C25 | C26 | C27 | C28 | C29
| C30 | C31 | C32 | C33 | C34 | C35 | C36 | C37 | C38 | C39
| C40 | C41 | C42 | C43 | C44 | C45 | C46 | C47 | C48 | C49
| C50 | C51 | C52 | C53 | C54 | C55 | C56 | C57 | C58 | C59
| C60 | C61 | C62 | C63 | C64 | C65 | C66 | C67 | C68 | C69
| C70 | C71 | C72 | C73 | C74 | C75 | C76 | C77 | C78 | C79
| C80 | C81 | C82 | C83 | C84 | C85 | C86 | C87 | C88 | C89
| C90 | C91 | C92 | C93 | C94 | C95 | C96 | C97 | C98 | C99
| C100 | C101 | C102 | C103 | C104 | C105 | C106 | C107 | C108 | C109
| C110 | C111 | C112 | C113 | C114 | C115 | C116 | C117 | C118 | C119
| C120 | C121 | C122 | C123 | C124 | C125 | C126 | C127 | C128 | C129
| C130 | C131 | C132 | C133 | C134 | C135 | C136 | C137 | C138 | C139
| C140 | C141 | C142 | C143 | C144 | C145 | C146 | C147 | C148 | C149
| C150 | C151 | C152 | C153 | C154 | C155 | C156 | C157 | C158 | C159
| C160 | C161 | C162 | C163 | C164 | C165 | C166 | C167 | C168 | C169
| C170 | C171 | C172 | C173 | C174 | C175 | C176 | C177 | C178 | C179
| C180 | C181 | C182 | C183 | C184 | C185 | C186 | C187 | C188 | C189
| C190 | C191 | C192 | C193 | C194 | C195 | C196 | C197 | C198 | C199
| C200 | C201 | C202 | C203 | C204 | C205 | C206 | C207 | C208 | C209
| C210 | C211 | C212 | C213 | C214 | C215 | C216 | C217 | C218 | C219
| C220 | C221 | C222 | C223 | C224 | C225 | C226 | C227 | C228 | C229
| C230 | C231 | C232 | C233 | C234 | C235 | C236 | C237 | C238 | C239
| C240 | C241 | C242 | C243 | C244 | C245 | C246 | C247 | C248 | C249
| C250 | C251 | C252 | C253 | C254 | C255 | C256 | C257 | C258 | C259
| C260 | C261 | C262 | C263 | C264 | C265 | C266 | C267 | C268 | C269
| C270 | C271 | C272 | C273 | C274 | C275 | C276 | C277 | C278 | C279
| C280 | C281 | C282 | C283 | C284 | C285 | C286 | C287 | C288 | C289
| C290 | C291 | C292 | C293 | C294 | C295 | C296 | C297 | C298 | C299
deriving Generic
```
Big products have the same problem:
```
data BigProduct = C
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
() () () () () () () () () ()
deriving Generic
```
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.2.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | dreixel |
| Operating system | |
| Architecture | |
</details>
https://gitlab.haskell.org/ghc/ghc/-/issues/4012
Compilation results [interface files] are not deterministic (kili, 2022-08-11)

**[Editor's note: this ticket focused primarily on interface files. For deterministic object files, see #12935.]**
There are some issues with non-determinism in the output of GHC, which means that compilations are not repeatable. This affects some users (e.g. Debian packagers) who need to be able to get repeatable hashes for the packages of a GHC build.
The cases we know about that lead to non-deterministic results are:
- The `spec_ids` (specialised Ids) attached to an Id have a non-deterministic ordering
- CSE can give different results depending on the order in which the bindings are considered, and since the ordering is non-deterministic, the result of CSE is also non-deterministic. e.g. in `x = z; y = z; z = 3`, where `y` and `x` are exported, we can end up with either `x = y; y = 3` or `y = x; x = 3`.
- There seems to be something unpredictable about the order of arguments to SpecConstr-generated specialisations, see [http://www.haskell.org/pipermail/glasgow-haskell-users/2011-April/020287.html](http://www.haskell.org/pipermail/glasgow-haskell-users/2011-April/020287.html)
- The wrappers generated by the `CApiFFI` extension have non-deterministic names. (see [ticket:4012\#comment:59594](https://gitlab.haskell.org//ghc/ghc/issues/4012#note_59594) below).
- See [ticket:4012\#comment:98753](https://gitlab.haskell.org//ghc/ghc/issues/4012#note_98753) for another, different, example
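The CSE point above can be made concrete with a toy model (an illustration only, not GHC's actual CSE pass; all names are made up): visiting the bindings in a different order yields a different but equally valid output program.

```haskell
import qualified Data.Map.Strict as M

-- A toy expression language: just literals and variable references.
data Expr = Lit Int | Var String deriving (Eq, Ord, Show)

-- Naive CSE over a flat list of bindings: the first binder seen with
-- a given right-hand side "wins", and later identical right-hand
-- sides are rewritten to reference it.
cse :: [(String, Expr)] -> [(String, Expr)]
cse = go M.empty
  where
    go _ [] = []
    go seen ((x, rhs) : rest) =
      case M.lookup rhs seen of
        Just y  -> (x, Var y) : go seen rest
        Nothing -> (x, rhs)   : go (M.insert rhs x seen) rest

main :: IO ()
main = do
  -- The same set of bindings, visited in two different orders:
  print (cse [("x", Lit 3), ("y", Lit 3)])  -- y becomes Var "x"
  print (cse [("y", Lit 3), ("x", Lit 3)])  -- x becomes Var "y"
```

If the visit order is non-deterministic, so is the choice of surviving binder, which is exactly the `x = y; y = 3` versus `y = x; x = 3` situation described above.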
**Old ticket description follows**
Short story: if you use ghc-6.12.1.20100318 (or similar, probably
ghc-6.12.1 release will produce the same results) to bootstrap
ghc-6.12, and then use that ghc-6.12 to bootstrap another ghc-6.12,
those two instances of ghc-6.12 will have different ABI hashes and
interfaces in the ghc package. If you use ghc-6.10 for the
bootstrapping, you'll even get differences in the ghc, base and
Cabal packages.
Long story: see logfiles and descriptions at http://darcs.volkswurst.de/boot-tests/ (note that the logfiles are quite large, I really don't want to attach 150 MB of logs to this ticket).

Assignee: niteria