# GHC issues
https://gitlab.haskell.org/ghc/ghc/-/issues

---

# stimes for Endo (#20565) — David Feuer — 2022-05-07
https://gitlab.haskell.org/ghc/ghc/-/issues/20565

## Summary
Currently,
```haskell
instance Semigroup (Endo a) where
...
stimes = stimesMonoid
```
`stimesMonoid` produces a balanced tree. I guess this enhances sharing within the resulting function (if it's realized as a closure), but I'm wondering if it's really a good idea, especially when the argument is small.
Here's a simpler option to play with:
```haskell
stimes n _ | n < 0 = error "Cannot apply a function a negative number of times."
stimes n0 e = go (fromIntegral n0)
where
go 0 = mempty
go n = e <> go (n - 1 :: Word64)
```
Another possibility might be to make a tree that only *leans* to the right. That would get immediate access to the leftmost (most important) nodes while maintaining sharing for the large case.
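As a sketch of that right-leaning idea (the name `stimesRight` and the code are hypothetical, not a concrete proposal): repeated squaring still shares each doubled chunk, but the result associates to the right, so the leftmost applications are reachable without descending a balanced tree.

```haskell
-- Hypothetical sketch, not GHC code: right-leaning repeated squaring.
stimesRight :: (Semigroup a, Integral b) => b -> a -> a
stimesRight n0 x0
  | n0 <= 0   = error "stimesRight: positive multiplier expected"
  | otherwise = go n0 x0
  where
    go 1 x = x
    go n x
      | even n    = go (n `quot` 2) xx
      | otherwise = x <> go (n `quot` 2) xx
      where
        xx = x <> x   -- each doubling is computed once per recursion level
```

For example, `stimesRight 5 x` builds `x <> ((x <> x) <> (x <> x))` with the inner `x <> x` shared. Whether this beats the balanced tree for small arguments would need measuring.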
## Environment
* GHC version used:
Optional:
* Operating System:
* System Architecture:

---

# Missed Optimization in numeric code (#20347) — Neo — 2022-02-17
https://gitlab.haskell.org/ghc/ghc/-/issues/20347

In functions like the ones below, the two negation operations can be safely eliminated. The optimization is valid for all integer and floating-point types. The same story is of course true for division.
```
f_int_1 :: Int -> Int -> Int
f_int_1 x y = (- x) * (-y)
f_double_1 :: Double -> Double -> Double
f_double_1 x y = (- x) * (- y)
```
Right now we produce:
```
f_int_1
= \ (x_a2tn :: Int) (y_a2to :: Int) ->
case x_a2tn of { GHC.Types.I# x1_i2ve ->
case y_a2to of { GHC.Types.I# x2_X2vE ->
GHC.Types.I#
(GHC.Prim.*#
(GHC.Prim.negateInt# x1_i2ve) (GHC.Prim.negateInt# x2_X2vE))
}
}
f_double_1
= \ (x_a2tq :: Double) (y_a2tr :: Double) ->
case x_a2tq of { GHC.Types.D# x1_a2vt ->
case y_a2tr of { GHC.Types.D# x2_X2vV ->
GHC.Types.D#
(GHC.Prim.*##
(GHC.Prim.negateDouble# x1_a2vt) (GHC.Prim.negateDouble# x2_X2vV))
}
}
```
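The shape of the first rewrite can be sketched with RULES pragmas (illustration only: GHC typically warns that rules over class methods like `negate` and `*` may never fire, and the real change would be extra constant-folding rules over the primops, e.g. `negateInt#`/`*#`):

```haskell
{-# RULES
"negate*negate/Int"    forall (x :: Int)    y. negate x * negate y = x * y
"negate*negate/Double" forall (x :: Double) y. negate x * negate y = x * y
  #-}
```

For finite and infinite `Double` values the rewrite is exact, since IEEE-754 multiplication determines the result's sign as the XOR of the operands' sign bits, so negating both operands cancels out.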
Moreover, for the following two functions we could (at compile time) remove the negation from the variable and push it into the constant:
```
f_int_2 :: Int -> Int
f_int_2 x = 3 * (-x)
f_double_2 :: Double -> Double
f_double_2 x = 3.0 * (-x)
```
we should be transforming these into:
```
f_int_2 :: Int -> Int
f_int_2 x = -3 * x
f_double_2 :: Double -> Double
f_double_2 x = -3.0 * x
```

Assignee: Sylvain Henry

---

# Should we inline constructor wrappers into boring contexts? (#20313) — Andreas Klebinger — 2021-09-02
https://gitlab.haskell.org/ghc/ghc/-/issues/20313

While working on other things I came across this constructor wrapper:
```
-- RHS size: {terms: 7, types: 6, coercions: 0, joins: 0/0}
GHC.Unit.Types.$WRealUnit [InlPrag=INLINE[final] CONLIKE]
:: forall uid. Definite uid %1 -> GenUnit uid
[GblId[DataConWrapper],
Arity=1,
Caf=NoCafRefs,
Str=<SL>,
Cpr=1,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=1,unsat_ok=True,boring_ok=False)
Tmpl= \ (@uid_a2UY)
(conrep_a3pk [Occ=Once1] :: Definite uid_a2UY) ->
case conrep_a3pk of conrep_X0 [Occ=Once1] { __DEFAULT ->
GHC.Unit.Types.RealUnit @uid_a2UY conrep_X0
}}]
GHC.Unit.Types.$WRealUnit
= \ (@uid_a2UY) (conrep_a3pk [Occ=Once1] :: Definite uid_a2UY) ->
case conrep_a3pk of conrep_X0 [Occ=Once1] { __DEFAULT ->
GHC.Unit.Types.RealUnit @uid_a2UY conrep_X0
}
```
There are a few cases where this kind of wrapper does not get inlined. For example here:
```
case ds8_saQo sat_saQs GHC.Prim.void# of {
Solo# ipv12_saQv [Occ=Once1] ->
let {
sat_saQw [Occ=Once1]
:: GHC.Unit.Types.GenUnit GHC.Unit.Types.UnitId
[LclId] =
{ipv12_saQv} \u [] GHC.Unit.Types.$WRealUnit ipv12_saQv;
} in Solo# [sat_saQw];
```
This is all expected (see the boring_ok=False attribute). I do wonder if it's the right thing to do.
In the example above we will generate a CMM function for the binding `sat_saQw`. This means we pay the overhead of two function calls when evaluating `sat_saQw` but save a bit in code size/compile time.
Things are different for wrappers which evaluate multiple arguments: these can become non-trivial in code size and probably *shouldn't* be inlined into boring contexts.
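For illustration (a made-up type, not one from GHC): a constructor with several strict fields gets a wrapper whose body is a chain of `case`s, one per field, so inlining it into every boring context duplicates all of them:

```haskell
data T = MkT !Int !Bool !Char

-- The wrapper GHC generates for MkT is morally:
--
--   $WMkT a b c = case a of a' { __DEFAULT ->
--                 case b of b' { __DEFAULT ->
--                 case c of c' { __DEFAULT -> MkT a' b' c' }}}
```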
It seems like there are pros and cons for either choice. We could also make it dependent on the number of cases inside the wrapper. But currently there is no reason given in any of the notes for doing it one way or another. So I'm opening this ticket in case anyone wonders about the current choice or wants to evaluate it going forward.

---

# Let LLVM and Unregisterized lower greater than native sized primops (#20145) — John Ericson — 2021-08-18
https://gitlab.haskell.org/ghc/ghc/-/issues/20145

When using the LLVM or Unregisterized backends and compiling primops for prim types exceeding the native width, we should avoid our C stubs and let LLVM/C lower them more efficiently in-line.
NCG needs to call slow FFI functions where we "borrow" the C compiler's implementation, but there is no reason why we need to do that for LLVM or C.

Assignee: John Ericson

---

# Evaluate inlining defaults (#19103) — Andreas Klebinger — 2021-09-07
https://gitlab.haskell.org/ghc/ghc/-/issues/19103

In general inlining more produces larger but faster code.
It might be worthwhile to reevaluate the configurable settings for how much we inline.
In particular some questions are:
* How much benefit is there to gain in terms of runtime, and what's the cost in terms of code size and compile time?
* Should the defaults perhaps be higher for `-O2`? After all, `-O2` is generally an indicator for *speed at (almost) all cost*.
* Should we use different settings than the default for GHC itself?

---

# Investigate encoding recursive demand types (#17128) — Sebastian Graf — 2020-06-07
https://gitlab.haskell.org/ghc/ghc/-/issues/17128

@jmct brought up an interesting point in https://gitlab.haskell.org/ghc/ghc/issues/12364#note_122148 that I just came to remember today: instead of cutting off strictness/usage/nested CPR analysis information at some arbitrary cut-off in fixed-point iteration, we should try to encode the recursive nature in the lattice instead. If all went well, we would be able to encode the subset of *regular* recursive types (in the sense of e.g. TAPL 21.8; [here's the gist of it](https://gitlab.haskell.org/ghc/ghc/issues/17128#note_219722)), for which I'm pretty confident we can define equality and lattice operations. According to the comment above, we'll have problems with polymorphic recursion, but I'm not ready to give in on this yet.
Implementing recursive lattices would allow faster termination of fixed-point iteration, while also providing the perfect amount of information! To make this more concrete, let's say we augment the nested CPR lattice with
```
data Cpr
= ...
| CprVar Unique
| CprFix Unique Cpr
```
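As a self-contained toy of that encoding (constructor and function names are mine, not GHC's), the key operation is one-step unfolding, μX.C ↦ C[μX.C/X], modulo which equality and lub on regular recursive terms would be defined:

```haskell
data Cpr
  = CprTop
  | CprProd [Cpr]
  | CprVar Int        -- a variable bound by an enclosing CprFix
  | CprFix Int Cpr    -- μX. C
  deriving (Eq, Show)

-- One-step unfolding: μX.C  ↦  C[μX.C / X]
unfold :: Cpr -> Cpr
unfold t@(CprFix u body) = subst u t body
unfold t                 = t

-- Capture is not an issue if binders are globally unique (GHC Uniques).
subst :: Int -> Cpr -> Cpr -> Cpr
subst u r (CprVar v) | v == u   = r
subst u r (CprProd cs)          = CprProd (map (subst u r) cs)
subst u r (CprFix v b) | v /= u = CprFix v (subst u r b)
subst _ _ t                     = t
```

For the `μX.(-,X)` term discussed below, one unfolding step exposes a pair constructor whose second component is the original term again.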
Apart from introducing all kinds of well-formedness problems, here's an example that shows their usefulness (there are probably more compelling examples somewhere out in the wild):
```
repeat :: a -> [a]
repeat x = x : repeat x
```
`repeat` has an infinitely deep CPR property, captured by `μX.(-,X)`. I'm using some new syntax from my nested CPR branch here, where `-` encodes Top and `(,)` a constructed pair. The term above would correspond to `CprFix 42 (CprProd [CprTop, CprVar 42])` in our embedding. This is at the same time much more expressive than cutting off at depth, say, 3, corresponding to `(-,(-,(-,-)))`, and also more efficient, since fixed-point iteration will stabilise after the second iteration. Here's the trace of `repeat`s CprType over each fixed-point iteration, under the assumption we introduced a fresh variable X standing for `repeat`:
```
initially:
. -> . -> μX.X <-- this is probably similar/equivalent to bottom?! Not sure about the scope of X
after iteration #1:
. -> . -> μX.(-, X)
after iteration #2, reaching the fixed-point:
. -> . -> μX.(-, X)
```
Turns out I'm not really able to express the proper formalism/lattice here, so this needs some more thinking. But I think this should be the general story.

---

# Implement escape analysis for stack allocation (#16891) — Andreas Klebinger — 2021-04-23
https://gitlab.haskell.org/ghc/ghc/-/issues/16891

# Motivation
Reduce GC pressure.
# Proposal
Figure out if a function stores (parts of) its arguments in its return value.
With this information we can look at let bindings. Any binding which is only used as a non-escaping argument can then be allocated on the stack instead and implicitly freed on return.
This is fairly well understood and should mostly carry over to Haskell. The benefits might be different due to the lazy nature of Haskell though.

---

# WorkerWrapper: Wrapper takes apart constructor which is then reconstructed in the worker (#16859) — Andreas Klebinger — 2021-10-24
https://gitlab.haskell.org/ghc/ghc/-/issues/16859

We have this function from GHC's Name module as our starting point:
```haskell
data Name = Name {
n_sort :: NameSort, -- What sort of name it is
n_occ :: !OccName, -- Its occurrence name
n_uniq :: {-# UNPACK #-} !Unique,
n_loc :: !SrcSpan -- Definition site
}
{-# NOINLINE mkInternalName #-}
mkInternalName :: Unique -> OccName -> SrcSpan -> Name
mkInternalName uniq occ loc = Name { n_uniq = uniq
, n_sort = Internal
, n_occ = occ
, n_loc = loc }
```
This results in the following worker/wrapper
```haskell
-- RHS size: {terms: 14, types: 13, coercions: 0, joins: 0/0}
$wmkInternalName
= \ ww_sl5G ww1_sl5K ww2_sl5L w_sl5D ->
case w_sl5D of dt_XaRo { __DEFAULT ->
(# Internal, OccName ww1_sl5K ww2_sl5L, ww_sl5G, dt_XaRo #)
}
-- RHS size: {terms: 21, types: 21, coercions: 1, joins: 0/0}
mkInternalName
= \ w_sl5B w1_sl5C w2_sl5D ->
case w_sl5B `cast` <Co:1> of { I# ww1_sl5G ->
case w1_sl5C of { OccName ww3_sl5K ww4_sl5L ->
case $wmkInternalName ww1_sl5G ww3_sl5K ww4_sl5L w2_sl5D of
{ (# ww6_slkS, ww7_slkT, ww8_slkU, ww9_slkV #) ->
Name ww6_slkS ww7_slkT ww8_slkU ww9_slkV
}
}
}
```
This code is quite odd. It:
* Evaluates `w1_sl5C` to an OccName.
* Takes the OccName apart and passes its fields to the worker.
* The worker constructs a **new** OccName from the passed fields.
* The constructed OccName (and other things) are returned.
* The returned OccName is ultimately returned by the wrapper.
This causes the "old" OccName to become garbage and a new one to be allocated on the heap, unless I missed something.

---

# Abortion of fixed-point iteration in Demand Analyser discards sound results (#16284) — Sebastian Graf — 2019-07-24
https://gitlab.haskell.org/ghc/ghc/-/issues/16284

Consider the following program:
```hs
-- Extracted from T4903
module Foo where
data Tree = Bin Tree Tree
tree :: Tree
tree = Bin tree tree
eq :: Tree -> Bool
eq (Bin l r) = eq l && eq r
```
`eq` amounts to the following Core:
```hs
eq
= \ (ds_d20O :: Tree) ->
case ds_d20O of { Bin l_aY8 r_aY9 ->
case eq l_aY8 of {
False -> False;
True -> eq r_aY9
}
}
```
It clearly diverges. That's also what pure strictness/termination/CPR analysis would find out. But because usage analysis can't find out in finite time (and that's fine) that `eq` uses its argument completely, we abort fixed-point iteration [1] and return `nopSig`, discarding useful and sound strictness information we found out along the way, like the fact that it diverges.
This isn't hard to fix: Just track which parts of the signature were still unsound during abortion and only zap them. I'm only recording this for posterity and as an argument in a discussion I'll maybe start in a few months of time...
[1] This is the ascending chain of demands on `ds_d20O` during fixed-point iteration:
```
H
U(1*H,1*H)
U(1*U(1*H,1*H),1*U(1*H,1*H))
...
```
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.6.3 |
| Type | Task |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
---

# Improve the worst case performance of weak pointers (#15642) — David Feuer — 2021-09-07
https://gitlab.haskell.org/ghc/ghc/-/issues/15642

Garbage collecting weak pointers involves repeatedly traversing the weak pointer list, checking which keys are still alive, and in response marking the relevant values and finalizers live. This works well in most cases, but if there are long chains of weak pointers, it's terrible. I wonder if we can do something about that without significantly hurting performance elsewhere. Here's the (probably naive) idea:
1. Maintain a hash table mapping weak pointer keys to lists of weak pointers. This replaces the current weak pointer list.
1. Use a bit in the info table pointer of every heap object to indicate whether that object is referenced from any `Weak#`. This is the weakest link: if we don't have a spare bit there, or don't want to use one, then I think this whole idea is sunk.
When creating a weak pointer, set the appropriate bit in the key and insert the key and pointer into the hash table. When performing early finalization, clear the bit in the key if no other `Weak#` points to it.
When performing garbage collection: check the bit in each object as it is marked. If the bit is set, move the key and its attached `Weak#`s from the hash table into a new hash table (or an association list that gets turned into a hash table after evacuation or whatever), and mark all the linked weak pointer targets and finalizers live. In the end, the old hash table contains only unreachable objects. Now mark those objects live (for finalization and in case of resurrection), and queue up the finalizers.
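A toy model of the partitioning step (pure Haskell standing in for the RTS's C; all names here are mine): given the set of keys found reachable so far, split the weak table into live and dead entries and collect the extra roots. The RTS would have to iterate this to a fixed point, since the new roots can make further keys reachable.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- Keys and targets are modelled as Ints; the real thing would be heap
-- pointers, and "targets" covers both the weak values and the finalizers.
sweepWeaks :: S.Set Int -> M.Map Int [Int]
           -> (M.Map Int [Int], [Int], M.Map Int [Int])
sweepWeaks reachable weakTab = (live, newRoots, dead)
  where
    (live, dead) = M.partitionWithKey (\k _ -> k `S.member` reachable) weakTab
    newRoots     = concat (M.elems live)   -- to be marked live next
```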
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | -------------- |
| Version | 8.6.1-beta1 |
| Type | FeatureRequest |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Runtime System |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | simonmar |
| Operating system | |
| Architecture | |
</details>
---

# Performance regression when using the GHC API to evaluate code compared to 8.4 (#15524) — Vaibhav Sagar — 2021-11-23
https://gitlab.haskell.org/ghc/ghc/-/issues/15524

I've been in the process of updating IHaskell to support GHC 8.6 and the test suite has become agonisingly slow compared to GHC 8.4 and below. I've been able to work around this by using `--enable-executable-dynamic` as suggested by `christiaanb` on `#ghc`, but I thought this was worth reporting as a bug.
To reproduce this on a Linux installation with Nix:
1. `git clone https://github.com/gibiansky/IHaskell`
1. `cd IHaskell`
1. `git checkout 6058cd4fac01a2023dbd09d174f1f8d4c36e7475`
1. Comment out `--enable-executable-dynamic` in `release-8.6.nix`
1. `nix-build release-8.6.nix`

---

# interpreter: sequence_ (replicate 100000000 (return ())) gobbles up memory (#15503) — Bertram Felgenhauer — 2021-09-07
https://gitlab.haskell.org/ghc/ghc/-/issues/15503

I consider this to be more of a curiosity than a bug, but it might still be worth investigating. The following line,
```hs
sequence_ (replicate 100000000 (return ()))
```
causes `ghci` to grow quite big (here it's 6GB on 64 bit Linux; it's roughly proportional to the `100000000` number).
This used to be better in ancient times--before version 8.0.1, ghci ran this code in what looks like constant memory.
(This came up indirectly on #haskell; somebody had adapted Hutton's game of life program, http://www.cs.nott.ac.uk/~gmh/life.lhs, which uses busy waiting for a delay.)
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.4.3 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | low |
| Resolution | Unresolved |
| Component | GHCi |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
Assignee: Ömer Sinan Ağacan

---

# Runtime performance regression with binary operations on vectors (#14980) — ttylec — 2021-09-07
https://gitlab.haskell.org/ghc/ghc/-/issues/14980

Let me briefly explain our use case: we have a machine learning tool created in Haskell. Basically it builds association rules for a database. In plain English: predicates on data rows. We need to score them, so we need to check how many rows are "matched" by the predicate.
In order to optimize performance, our code uses two main representations for the data: one is "human readable", where the values are real values. The second is a binarized representation for categorical data. The latter is actually a family of representations, since we pack bits into tuples of Word64 and use bitwise operations to implement logic.
Simplified but representative code is attached.
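A guess at the style of representation being described (the attachment isn't reproduced here, so these names are mine): 256 boolean columns packed into four `Word64`s, with predicate conjunction as bitwise AND and scoring via `popCount`.

```haskell
import Data.Bits (popCount, (.&.))
import Data.Word (Word64)

-- One row of 256 binarized columns.
type Row256 = (Word64, Word64, Word64, Word64)

-- How many of the 256 bits are set in both the row and the predicate mask.
match :: Row256 -> Row256 -> Int
match (a1, a2, a3, a4) (b1, b2, b3, b4) =
    popCount (a1 .&. b1) + popCount (a2 .&. b2)
  + popCount (a3 .&. b3) + popCount (a4 .&. b4)
```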
In GHC 8.0.2 binary representation is faster by one order of magnitude than the "human readable" one:
```
➜ ghc-bug stack exec performance-bug-pair-1
"Generated"
benchmarking 256 columns/raw unbox vectors
time 342.6 μs (338.3 μs .. 348.0 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 339.3 μs (337.5 μs .. 342.0 μs)
std dev 7.203 μs (5.596 μs .. 10.07 μs)
variance introduced by outliers: 13% (moderately inflated)
benchmarking 256 columns/binary packed
time 32.07 μs (29.69 μs .. 34.14 μs)
0.982 R² (0.976 R² .. 0.997 R²)
mean 29.97 μs (29.41 μs .. 31.03 μs)
std dev 2.428 μs (1.526 μs .. 3.750 μs)
variance introduced by outliers: 78% (severely inflated)
```
In GHC 8.2.2 (and later) binary representation performance is similar to "human readable", so no performance gain:
```
➜ ghc-bug stack exec performance-bug-pair-1
"Generated"
benchmarking 256 columns/raw unbox vectors
time 442.4 μs (406.7 μs .. 474.5 μs)
0.978 R² (0.969 R² .. 0.993 R²)
mean 399.3 μs (391.3 μs .. 415.1 μs)
std dev 34.73 μs (20.36 μs .. 53.29 μs)
variance introduced by outliers: 71% (severely inflated)
benchmarking 256 columns/binary packed
time 378.6 μs (295.8 μs .. 488.0 μs)
0.637 R² (0.492 R² .. 0.780 R²)
mean 568.1 μs (437.1 μs .. 747.6 μs)
std dev 308.7 μs (233.6 μs .. 386.1 μs)
variance introduced by outliers: 98% (severely inflated)
```
However, for certain compilation paths we can still get a similar speedup as with GHC 8.0.2. In the above examples we used a 4-tuple of Word64 as the binary representation. In the following code we run two tests: one on just Word64 and one on a 4-tuple of Word64. The difference is that we just add the Word64 case:
```
➜ ghc-bug stack exec performance-bug-pair-2
"Generated"
benchmarking 64 columns/raw unbox vectors
time 337.6 μs (336.1 μs .. 339.3 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 349.6 μs (344.4 μs .. 359.7 μs)
std dev 23.22 μs (15.39 μs .. 38.22 μs)
variance introduced by outliers: 60% (severely inflated)
benchmarking 64 columns/binary packed
time 21.66 μs (21.53 μs .. 21.79 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 21.68 μs (21.53 μs .. 21.89 μs)
std dev 613.2 ns (466.0 ns .. 806.0 ns)
variance introduced by outliers: 30% (moderately inflated)
benchmarking 256 columns/raw unbox vectors
time 344.5 μs (341.6 μs .. 348.2 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 345.1 μs (342.5 μs .. 349.3 μs)
std dev 10.66 μs (7.997 μs .. 16.34 μs)
variance introduced by outliers: 25% (moderately inflated)
benchmarking 256 columns/binary packed
time 28.04 μs (27.70 μs .. 28.46 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 28.05 μs (27.85 μs .. 28.30 μs)
std dev 758.2 ns (628.2 ns .. 907.6 ns)
variance introduced by outliers: 27% (moderately inflated)
```
I made a variant of the code with a simpler accumulating function (in our code we accumulate a pair of integers; the simplified accumulator works on one integer). GHC 8.2.2 in that case loses the speed-up with 4-tuples:
```
➜ ghc-bug stack exec performance-bug-2
"Generated"
benchmarking 64 columns/raw unbox vectors
time 333.8 μs (333.0 μs .. 335.1 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 333.6 μs (332.4 μs .. 335.8 μs)
std dev 5.651 μs (3.233 μs .. 9.582 μs)
benchmarking 64 columns/binary packed
time 39.06 μs (38.98 μs .. 39.14 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 38.94 μs (38.83 μs .. 39.14 μs)
std dev 495.0 ns (310.2 ns .. 782.1 ns)
benchmarking 256 columns/raw unbox vectors
time 336.9 μs (336.2 μs .. 337.9 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 337.5 μs (336.2 μs .. 339.8 μs)
std dev 5.757 μs (2.946 μs .. 8.979 μs)
benchmarking 256 columns/binary packed
time 239.8 μs (237.6 μs .. 243.0 μs)
0.998 R² (0.996 R² .. 0.999 R²)
mean 251.4 μs (247.9 μs .. 259.8 μs)
std dev 11.50 μs (5.662 μs .. 19.79 μs)
variance introduced by outliers: 37% (moderately inflated)
```
In GHC 8.0.2 we have speed-up in both cases.
What may be related: using random-fu to generate random numbers seems to produce broken code on GHC 8.2.2 (the `performance-bug-rfu.hs` source).
Short description of attached sources:
- performance-bug-pair-2.hs: using two binary representations
- performance-bug-pair-1.hs: using one binary representation (one case commented)
- performance-bug-1.hs: using one binary representation with simplified accumulator
- performance-bug-2.hs: using two binary representation with simplified accumulator
- performance-bug-rfu.hs: using random-fu to generate data (optional)
- stack-8.0.yaml: stack file for GHC-8.0.2
- stack-8.2.yaml: stack file for GHC-8.2.2
- stack-nightly.yaml: stack file for GHC-8.4
- performance-bug.cabal: to be able to stack build everything.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.2.2 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
Assignee: Ben Gamari

---

# tagToEnum# leads to some silly closures (#14295) — David Feuer — 2019-09-05
https://gitlab.haskell.org/ghc/ghc/-/issues/14295
Suppose I write
```hs
foo :: (Bool -> a) -> Int# -> a
foo f x = f (tagToEnum# x)
```
Since `tagToEnum#` can fail, GHC compiles this to
```hs
foo
= \ (@ a_a10...I don't know how important this is in practice, but it looks unfortunate.
Suppose I write
```hs
foo :: (Bool -> a) -> Int# -> a
foo f x = f (tagToEnum# x)
```
Since `tagToEnum#` can fail, GHC compiles this to
```hs
foo
= \ (@ a_a10v)
(f_s1by [Occ=Once!] :: GHC.Types.Bool -> a_a10v)
(x_s1bz [Occ=Once] :: GHC.Prim.Int#) ->
let {
sat_s1bA [Occ=Once] :: GHC.Types.Bool
[LclId]
sat_s1bA = GHC.Prim.tagToEnum# @ GHC.Types.Bool x_s1bz } in
f_s1by sat_s1bA
```
That seems pretty bad! We know that `tagToEnum#` is applied to `Bool`, so we can transform this to something like
```hs
foo f x = case leWord# (int2Word# x) 1## of
1# -> f $! tagToEnum# x
_ -> f (error "tagToEnum# was used at Bool with tag ...")
```
which avoids an extra closure at the cost of a single `Word#` comparison. The same goes for arbitrary known enumeration types. I suspect the right place to fix this up is in CorePrep.⊥https://gitlab.haskell.org/ghc/ghc/-/issues/14226Common Block Elimination pass doesn't eliminate common blocks2021-09-07T15:40:17ZBen GamariCommon Block Elimination pass doesn't eliminate common blocksIn #14222 it was noted that something appears to be broken in `CmmCommonBlockElim`. Consider the program from that ticket,
```hs
module T14221 where
import Data.Text as T
isNumeric :: Text -> Bool
isNumeric t =
T.all isNumeric' t ...In #14222 it was noted that something appears to be broken in `CmmCommonBlockElim`. Consider the program from that ticket,
```hs
module T14221 where
import Data.Text as T
isNumeric :: Text -> Bool
isNumeric t =
T.all isNumeric' t && T.any isNumber t
where
isNumber c = '0' <= c && c <= '9'
isNumeric' c = isNumber c
|| c == 'e'
|| c == 'E'
|| c == '.'
|| c == '-'
|| c == '+'
```
This program produces six copies of a block of the form,
```
c6JT:
R2 = I64[R1 + 7];
R1 = P64[Sp + 8];
Sp = Sp + 16;
call $wloop_all_s6CQ_info(R2, R1) args: 8, res: 0, upd: 8;
```
in the `-ddump-opt-cmm` output, which are manifest in the assembler as,
```asm
block_c6JT_info:
_c6JT:
movq 7(%rbx),%r14
movq 8(%rbp),%rbx
addq $16,%rbp
jmp $wloop_all_s6CQ_info
```
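For intuition, the core of a common-block-elimination pass is simple: group blocks by (hashes of) their bodies and keep one representative per group, redirecting every jump to it. A toy sketch of that idea — not GHC's actual `CmmCommonBlockElim`, which must also iterate to a fixed point as merging blocks makes further bodies equal:

```hs
import qualified Data.Map.Strict as M

type Label = String
data Block = Block { blockLabel :: Label, blockBody :: [String] }

-- Map every label to the representative of its body-equivalence class,
-- then keep only the representatives.
cbe :: [Block] -> ([Block], M.Map Label Label)
cbe blocks = (kept, subst)
  where
    -- 'flip (++)' keeps labels in insertion order, so the first block
    -- with a given body becomes the representative.
    groups = M.fromListWith (flip (++))
               [ (blockBody b, [blockLabel b]) | b <- blocks ]
    subst  = M.fromList [ (l, rep) | (rep:rest) <- M.elems groups
                                   , l <- rep : rest ]
    kept   = [ b | b <- blocks
                 , M.lookup (blockLabel b) subst == Just (blockLabel b) ]

main :: IO ()
main = do
  let bs = [ Block "c6JT" ["R2 = I64[R1 + 7];", "call $wloop"]
           , Block "c6JU" ["R2 = I64[R1 + 7];", "call $wloop"]
           , Block "c6JV" ["Sp = Sp + 16;"] ]
      (kept, subst) = cbe bs
  print (map blockLabel kept)   -- ["c6JT","c6JV"]: the duplicate is merged
  print (M.toList subst)        -- c6JU now maps to c6JT
```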
CBE really ought to be catching these.⊥Michal TerepetaMichal Terepetahttps://gitlab.haskell.org/ghc/ghc/-/issues/13763Performance regression (~34%) in 8.2.1, poor register allocation(?) in an inn...2021-09-07T15:37:04ZjberrymanPerformance regression (~34%) in 8.2.1, poor register allocation(?) in an inner loop over an arrayTesting GHC 8.0.1 against the RC 8.2.0.20170507
I've distilled a smallish test-case from a much larger case in my 'hashabler' library, and validated that the same modifications also make that regression disappear in the real case. It's ...Testing GHC 8.0.1 against the RC 8.2.0.20170507
I've distilled a smallish test-case from a much larger case in my 'hashabler' library, and validated that the same modifications also make that regression disappear in the real case. It's probably possible to get this smaller but I don't know if I'll have time to work on it more for a while:
repro3.hs:
```hs
{-# LANGUAGE BangPatterns #-}
module Main(main) where
import Criterion.Main
import qualified Data.Primitive as P
import Data.Bits
import Data.Word
import Control.DeepSeq
main :: IO ()
main = do
defaultMain
[ env (newByteArr64 5) $ \ ~bs ->
bench "ByteArray 5" $ nf (hashTextSip 99) bs
, env (newByteArr64 8) $ \ ~bs ->
bench "ByteArray 8" $ nf (hashTextSip 99) bs
, env (newByteArr64 512) $ \ ~bs ->
bench "ByteArray 512" $ nf (hashTextSip 99) bs
, env (newByteArr64 1000000) $ \ ~bs ->
bench "ByteArray 1000000" $ nf (hashTextSip 99) bs
]
instance NFData P.ByteArray where rnf _ = ()
newByteArr64 n = P.newAlignedPinnedByteArray (8*n) 8 >>= P.unsafeFreezeByteArray
sipRound :: Word64 -> Word64 -> Word64 -> Word64 -> (Word64, Word64, Word64, Word64)
{-# INLINE sipRound #-}
sipRound v0 v1 v2 v3 = (v3 `xor` v0, v0 `xor` v1, v1 `xor` v2, v2 `xor` v3)
hashTextSip :: Word64 -> P.ByteArray -> Word64
{-# INLINE hashTextSip #-}
hashTextSip h = \ ba ->
let !lenWord16 = P.sizeofByteArray ba `unsafeShiftR` 1
!word16sRem = lenWord16 .&. 3
!word16sIx = lenWord16-word16sRem
!ixFinal = lenWord16-1
!word16sIxWd = word16sIx `unsafeShiftR` 2 -- `div` 4
hash4Word16sLoop hAcc@(!w0,!w1,!w2,!w3) !ix
| ix == word16sIxWd = hashRemainingWord16s hAcc word16sIx
| otherwise =
let w64Dirty = P.indexByteArray ba ix
w64 = clean4xWord16ChunkLE w64Dirty
in hash4Word16sLoop (sipRound (w0 `xor` w64) w1 w2 w3) (ix + 1)
-- NOTE: Removing this causes regression to disappear as well.
hashRemainingWord16s (!w0,!w1,!w2,!w3) !ix
| ix > ixFinal = w0
| otherwise =
let w16 = P.indexByteArray ba ix
in hashRemainingWord16s (sipRound (w0 `xor` (fromIntegral (w16 :: Word16))) w1 w2 w3) (ix+1)
in hash4Word16sLoop (h,1,2,3) 0
clean4xWord16ChunkLE :: Word64 -> Word64
{-# INLINE clean4xWord16ChunkLE #-}
clean4xWord16ChunkLE w64Dirty =
-- NOTE: no regression when just this (8.2 is faster)
-- (((byteSwap64 w64Dirty) `unsafeShiftR` 8) .&. 0x00FF00FF00FF00FF)
-- ...but this is a big regression:
(((byteSwap64 w64Dirty) `unsafeShiftR` 8) .&. 0x00FF00FF00FF00FF)
.|.
(((byteSwap64 w64Dirty) `unsafeShiftL` 8) .&. 0xFF00FF00FF00FF00)
```
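As an aside, the `clean4xWord16ChunkLE` expression is just a byte permutation: after the `byteSwap64` and the two masked shifts, the four little-endian `Word16` lanes come out in reversed order. A quick self-contained check against a naive per-lane reference (the reference implementation and sample values here are mine, not from the benchmark):

```hs
import Data.Bits (unsafeShiftL, unsafeShiftR, (.&.), (.|.))
import Data.Word (Word64, byteSwap64)

-- The expression under test, copied from repro3.hs above.
clean4xWord16ChunkLE :: Word64 -> Word64
clean4xWord16ChunkLE w64Dirty =
  (((byteSwap64 w64Dirty) `unsafeShiftR` 8) .&. 0x00FF00FF00FF00FF)
  .|.
  (((byteSwap64 w64Dirty) `unsafeShiftL` 8) .&. 0xFF00FF00FF00FF00)

-- Reference: split into four 16-bit lanes and reassemble in reverse order.
referenceClean :: Word64 -> Word64
referenceClean w =
  foldl (\acc lane -> (acc `unsafeShiftL` 16) .|. lane) 0
        [ (w `unsafeShiftR` (16 * i)) .&. 0xFFFF | i <- [0, 1, 2, 3] ]

main :: IO ()
main = print (all (\w -> clean4xWord16ChunkLE w == referenceClean w)
                  [0, 0x0123456789ABCDEF, maxBound, 0x00FF00FF00FF00FF])
-- prints True
```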
Here are the results of the benchmark above on my machine:
On GHC \*\*8.0.1\*\*:
```
benchmarking ByteArray 5
time 24.70 ns (24.00 ns .. 26.25 ns)
0.987 R² (0.967 R² .. 1.000 R²)
mean 24.44 ns (24.13 ns .. 25.80 ns)
std dev 1.859 ns (318.3 ps .. 4.227 ns)
variance introduced by outliers: 86% (severely inflated)
benchmarking ByteArray 8
time 32.66 ns (32.58 ns .. 32.76 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 32.79 ns (32.64 ns .. 33.09 ns)
std dev 683.7 ps (365.4 ps .. 1.175 ns)
variance introduced by outliers: 31% (moderately inflated)
benchmarking ByteArray 512
time 1.428 μs (1.382 μs .. 1.522 μs)
0.986 R² (0.970 R² .. 1.000 R²)
mean 1.398 μs (1.384 μs .. 1.454 μs)
std dev 91.12 ns (4.475 ns .. 193.9 ns)
variance introduced by outliers: 76% (severely inflated)
benchmarking ByteArray 1000000
time 2.658 ms (2.653 ms .. 2.663 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 2.672 ms (2.665 ms .. 2.691 ms)
std dev 35.00 μs (10.88 μs .. 59.58 μs)
```
And on \*\*GHC 8.2\*\* RC:
```
benchmarking ByteArray 5
time 23.78 ns (23.68 ns .. 23.88 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 23.83 ns (23.76 ns .. 23.95 ns)
std dev 298.8 ps (183.2 ps .. 482.5 ps)
variance introduced by outliers: 14% (moderately inflated)
benchmarking ByteArray 8
time 35.81 ns (35.44 ns .. 36.27 ns)
0.999 R² (0.998 R² .. 1.000 R²)
mean 35.56 ns (35.45 ns .. 35.94 ns)
std dev 596.8 ps (134.5 ps .. 1.184 ns)
variance introduced by outliers: 22% (moderately inflated)
benchmarking ByteArray 512
time 1.706 μs (1.698 μs .. 1.716 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.701 μs (1.698 μs .. 1.707 μs)
std dev 13.27 ns (5.825 ns .. 24.41 ns)
benchmarking ByteArray 1000000
time 3.322 ms (3.284 ms .. 3.377 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 3.296 ms (3.287 ms .. 3.332 ms)
std dev 44.62 μs (20.55 μs .. 87.29 μs)
```
Looking at the core wasn't fruitful, but I think dumping the asm shows that this is a case of bad (or worse) register allocation. I've attached two screenshots showing the instructions added (in blue), when moving from the one-line `clean4xWord16ChunkLE` to the two-line version, for both 8.0 and 8.2 (there wasn't anything in the diff besides instances of this change).
It looks in the 8.2 version like we've decided we're out of registers and need to use the stack.
In my real code I'm seeing a 35% regression on very long Text, as well as a 21% regression on very long ByteString; the latter implementation is structured similarly to `hashTextSip`, but does a byteswap without calling `clean4xWord16ChunkLE`.
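One thing worth noting when minimizing further: the loop threads its state through a boxed 4-tuple and relies on the simplifier to unbox it. A hedged variant that returns an unboxed tuple from `sipRound` makes the no-allocation intent explicit; whether this changes what the 8.2 register allocator does is exactly the open question here, so this is a sketch to experiment with, not a fix:

```hs
{-# LANGUAGE MagicHash, UnboxedTuples, BangPatterns #-}
import Data.Bits (xor)
import Data.Word (Word64)

-- Same mixing step as sipRound above, but returning an unboxed tuple so
-- no intermediate (,,,) closure can be allocated.
sipRound# :: Word64 -> Word64 -> Word64 -> Word64
          -> (# Word64, Word64, Word64, Word64 #)
sipRound# v0 v1 v2 v3 =
  (# v3 `xor` v0, v0 `xor` v1, v1 `xor` v2, v2 `xor` v3 #)

-- Illustrative driver: iterate the round n times over a seed.
iterRounds :: Int -> Word64 -> Word64
iterRounds n h = go n h 1 2 3
  where
    go 0 !w0 _ _ _ = w0
    go k !w0 !w1 !w2 !w3 =
      case sipRound# w0 w1 w2 w3 of
        (# a, b, c, d #) -> go (k - 1) a b c d

main :: IO ()
main = print (iterRounds 4 99)
```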
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | -------------- |
| Version | 8.2.1-rc2 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler (NCG) |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Performance regression (~34%) in 8.2.1, poor register allocation(?) in an inner loop over an array","status":"New","operating_system":"","component":"Compiler (NCG)","related":[],"milestone":"","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"8.2.1-rc2","keywords":[],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Bug","description":"Testing GHC 8.0.1 against the RC 8.2.0.20170507\r\n\r\nI've distilled a smallish test-case from a much larger case in my 'hashabler' library, and validated that the same modifications also make that regression disappear in the real case. It's probably possible to get this smaller but I don't know if I'll have time to work on it more for a while:\r\n\r\nrepro3.hs:\r\n\r\n{{{#!hs\r\n{-# LANGUAGE BangPatterns #-}\r\nmodule Main(main) where\r\n\r\nimport Criterion.Main\r\nimport qualified Data.Primitive as P\r\n\r\nimport Data.Bits\r\nimport Data.Word\r\nimport Control.DeepSeq\r\n\r\nmain :: IO ()\r\nmain = do\r\n defaultMain \r\n [ env (newByteArr64 5) $ \\ ~bs ->\r\n bench \"ByteArray 5\" $ nf (hashTextSip 99) bs\r\n , env (newByteArr64 8) $ \\ ~bs ->\r\n bench \"ByteArray 8\" $ nf (hashTextSip 99) bs\r\n , env (newByteArr64 512) $ \\ ~bs ->\r\n bench \"ByteArray 512\" $ nf (hashTextSip 99) bs\r\n , env (newByteArr64 1000000) $ \\ ~bs ->\r\n bench \"ByteArray 1000000\" $ nf (hashTextSip 99) bs\r\n ]\r\n\r\ninstance NFData P.ByteArray where rnf _ = ()\r\n\r\nnewByteArr64 n = P.newAlignedPinnedByteArray (8*n) 8 >>= P.unsafeFreezeByteArray\r\n\r\nsipRound :: Word64 -> Word64 -> Word64 -> Word64 -> (Word64, Word64, Word64, Word64)\r\n{-# INLINE sipRound #-}\r\nsipRound v0 v1 v2 v3 = (v3 `xor` v0, v0 `xor` v1, v1 `xor` v2, v2 `xor` v3)\r\n\r\nhashTextSip :: Word64 -> P.ByteArray -> Word64\r\n{-# INLINE hashTextSip #-}\r\nhashTextSip h = \\ ba -> \r\n let !lenWord16 = P.sizeofByteArray ba `unsafeShiftR` 1\r\n !word16sRem = lenWord16 .&. 
3 \r\n !word16sIx = lenWord16-word16sRem\r\n !ixFinal = lenWord16-1\r\n !word16sIxWd = word16sIx `unsafeShiftR` 2 -- `div` 4\r\n\r\n hash4Word16sLoop hAcc@(!w0,!w1,!w2,!w3) !ix \r\n | ix == word16sIxWd = hashRemainingWord16s hAcc word16sIx\r\n | otherwise = \r\n let w64Dirty = P.indexByteArray ba ix\r\n w64 = clean4xWord16ChunkLE w64Dirty\r\n in hash4Word16sLoop (sipRound (w0 `xor` w64) w1 w2 w3) (ix + 1)\r\n \r\n -- NOTE: Removing this causes regression to disappear as well.\r\n hashRemainingWord16s (!w0,!w1,!w2,!w3) !ix \r\n | ix > ixFinal = w0 \r\n | otherwise = \r\n let w16 = P.indexByteArray ba ix\r\n in hashRemainingWord16s (sipRound (w0 `xor` (fromIntegral (w16 :: Word16))) w1 w2 w3) (ix+1)\r\n in hash4Word16sLoop (h,1,2,3) 0 \r\n\r\nclean4xWord16ChunkLE :: Word64 -> Word64\r\n{-# INLINE clean4xWord16ChunkLE #-}\r\nclean4xWord16ChunkLE w64Dirty = \r\n -- NOTE: no regression when just this (8.2 is faster)\r\n -- (((byteSwap64 w64Dirty) `unsafeShiftR` 8) .&. 0x00FF00FF00FF00FF) \r\n\r\n -- ...but this is a big regression:\r\n (((byteSwap64 w64Dirty) `unsafeShiftR` 8) .&. 0x00FF00FF00FF00FF) \r\n .|.\r\n (((byteSwap64 w64Dirty) `unsafeShiftL` 8) .&. 0xFF00FF00FF00FF00)\r\n}}}\r\n\r\nHere are the results of the benchmark above on my machine:\r\n\r\nOn GHC **8.0.1**:\r\n{{{\r\nbenchmarking ByteArray 5\r\ntime 24.70 ns (24.00 ns .. 26.25 ns)\r\n 0.987 R² (0.967 R² .. 1.000 R²)\r\nmean 24.44 ns (24.13 ns .. 25.80 ns)\r\nstd dev 1.859 ns (318.3 ps .. 4.227 ns)\r\nvariance introduced by outliers: 86% (severely inflated)\r\n\r\nbenchmarking ByteArray 8\r\ntime 32.66 ns (32.58 ns .. 32.76 ns)\r\n 1.000 R² (1.000 R² .. 1.000 R²)\r\nmean 32.79 ns (32.64 ns .. 33.09 ns)\r\nstd dev 683.7 ps (365.4 ps .. 1.175 ns)\r\nvariance introduced by outliers: 31% (moderately inflated)\r\n\r\nbenchmarking ByteArray 512\r\ntime 1.428 μs (1.382 μs .. 1.522 μs)\r\n 0.986 R² (0.970 R² .. 1.000 R²)\r\nmean 1.398 μs (1.384 μs .. 1.454 μs)\r\nstd dev 91.12 ns (4.475 ns .. 
193.9 ns)\r\nvariance introduced by outliers: 76% (severely inflated)\r\n\r\nbenchmarking ByteArray 1000000\r\ntime 2.658 ms (2.653 ms .. 2.663 ms)\r\n 1.000 R² (1.000 R² .. 1.000 R²)\r\nmean 2.672 ms (2.665 ms .. 2.691 ms)\r\nstd dev 35.00 μs (10.88 μs .. 59.58 μs)\r\n\r\n}}}\r\n\r\nAnd on **GHC 8.2** RC:\r\n{{{\r\nbenchmarking ByteArray 5\r\ntime 23.78 ns (23.68 ns .. 23.88 ns)\r\n 1.000 R² (1.000 R² .. 1.000 R²)\r\nmean 23.83 ns (23.76 ns .. 23.95 ns)\r\nstd dev 298.8 ps (183.2 ps .. 482.5 ps)\r\nvariance introduced by outliers: 14% (moderately inflated)\r\n\r\nbenchmarking ByteArray 8\r\ntime 35.81 ns (35.44 ns .. 36.27 ns)\r\n 0.999 R² (0.998 R² .. 1.000 R²)\r\nmean 35.56 ns (35.45 ns .. 35.94 ns)\r\nstd dev 596.8 ps (134.5 ps .. 1.184 ns)\r\nvariance introduced by outliers: 22% (moderately inflated)\r\n\r\nbenchmarking ByteArray 512\r\ntime 1.706 μs (1.698 μs .. 1.716 μs)\r\n 1.000 R² (1.000 R² .. 1.000 R²)\r\nmean 1.701 μs (1.698 μs .. 1.707 μs)\r\nstd dev 13.27 ns (5.825 ns .. 24.41 ns)\r\n\r\nbenchmarking ByteArray 1000000\r\ntime 3.322 ms (3.284 ms .. 3.377 ms)\r\n 0.999 R² (0.998 R² .. 1.000 R²)\r\nmean 3.296 ms (3.287 ms .. 3.332 ms)\r\nstd dev 44.62 μs (20.55 μs .. 87.29 μs)\r\n\r\n}}}\r\n\r\nLooking at the core wasn't fruitful, but I think dumping the asm shows that this is a case of bad (or worse) register allocation. I've attached two screenshots showing the instructions added (in blue), when moving from the one-line `clean4xWord16ChunkLE` to the two-line version, for both 8.0 and 8.2 (there wasn't anything in the diff besides instances of this change). 
\r\n\r\nIt looks in the 8.2 version like we've decided we're out of registers and need to use the stack.\r\n\r\nIn my real code I'm seeing 35% regression on very long Text, as well as 21% regression on very long ByteString; the latter implementation is similarly structured to `hashTextSip`, but doesn't call `clean4xWord16ChunkLE` but does do a byteswap.","type_of_failure":"OtherFailure","blocking":[]} -->⊥https://gitlab.haskell.org/ghc/ghc/-/issues/8313Poor performance of higher-order functions with unboxing2019-07-07T18:45:36ZdolioPoor performance of higher-order functions with unboxingI was testing out some code to see how GHC handled some unboxed higher-order functions, and was suprised by the results I found. Here is some sample code:
```
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE MagicH...I was testing out some code to see how GHC handled some unboxed higher-order functions, and was surprised by the results I found. Here is some sample code:
```
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE MagicHash #-}
import GHC.Exts
import System.Environment
rel# :: Int# -> Int# -> Int# -> Bool
rel# i# j# k# = (i# +# j# +# k#) ># 100000000#
rel :: Int -> Int -> Int -> Bool
rel (I# i#) (I# j#) (I# k#) = rel# i# j# k#
manual :: (Int# -> Int# -> Int# -> Bool) -> (Int, Int, Int)
manual r# = go 0# 0# 0#
where
go i# j# k# | r# i# j# k# = (I# i#, I# j#, I# k#)
| otherwise = go j# k# (i# +# 1#)
{-# NOINLINE manual #-}
auto :: (Int -> Int -> Int -> Bool) -> (Int, Int, Int)
auto r = go 0 0 0
where
go !i !j !k | r i j k = (i, j, k)
| otherwise = go j k (i+1)
{-# NOINLINE auto #-}
main = getArgs >>= \case
"manual" : _ -> print $ manual rel# -- This case is significantly slower.
"auto" : _ -> print $ auto rel -- Why?
```
A loop that has to box its loop parameters to call a predicate turns out to be significantly faster than one that uses a predicate that takes unboxed values directly.
The answer turns out to be (I believe) in ghc/utils/genapply/GenApply.hs. applyTypes has an entry for \[P,P,P\], but for non-pointer arguments only \[N\] — there is no \[N,N,N\]. This means that the manual loop has to use a slower calling convention than the boxed loop.
I'm not sure whether this should be 'fixed,' as my encounter with it was experimental in nature, and I don't have a real use case. The comment on applyTypes says its cases cover 99% of uses, and mine is not a real use. This ticket may serve as documentation at least, though.
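A toy model of what that applyTypes table does may make the asymmetry concrete: an unknown call whose argument pattern has a pre-generated `stg_ap_*` entry takes the fast path, and anything else falls back to the generic slow-call machinery. The table and strings below are illustrative only, not the real genapply output:

```hs
-- P = pointer, N = non-pointer word, etc., as in GenApply.hs.
data ArgRep = P | N | F | D | L | V deriving (Eq, Show)

-- Illustrative subset of the pre-generated patterns: note [P,P,P] is
-- present but no multi-N patterns are.
applyTypes :: [[ArgRep]]
applyTypes = [[V],[F],[D],[L],[N],[P],[P,V],[P,P],[P,P,V],[P,P,P]]

callKind :: [ArgRep] -> String
callKind args
  | args `elem` applyTypes = "fast: stg_ap_" ++ concatMap show args
  | otherwise              = "slow: generic apply, argument chunks"

main :: IO ()
main = do
  putStrLn (callKind [P, P, P])  -- boxed predicate: fast path
  putStrLn (callKind [N, N, N])  -- Int# -> Int# -> Int#: slow path
```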
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.6.3 |
| Type | Task |
| TypeOfFailure | OtherFailure |
| Priority | low |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Poor performance of higher-order functions with unboxing","status":"New","operating_system":"","component":"Compiler","related":[],"milestone":"⊥","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"7.6.3","keywords":["higher","order","slow","unboxed"],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Task","description":"I was testing out some code to see how GHC handled some unboxed higher-order functions, and was suprised by the results I found. Here is some sample code:\r\n\r\n\r\n{{{\r\n{-# LANGUAGE BangPatterns #-}\r\n{-# LANGUAGE LambdaCase #-}\r\n{-# LANGUAGE MagicHash #-}\r\n\r\nimport GHC.Exts\r\nimport System.Environment\r\n\r\nrel# :: Int# -> Int# -> Int# -> Bool\r\nrel# i# j# k# = (i# +# j# +# k#) ># 100000000#\r\n\r\nrel :: Int -> Int -> Int -> Bool\r\nrel (I# i#) (I# j#) (I# k#) = rel# i# j# k#\r\n\r\nmanual :: (Int# -> Int# -> Int# -> Bool) -> (Int, Int, Int)\r\nmanual r# = go 0# 0# 0#\r\n where\r\n go i# j# k# | r# i# j# k# = (I# i#, I# j#, I# k#)\r\n | otherwise = go j# k# (i# +# 1#)\r\n{-# NOINLINE manual #-}\r\n\r\nauto :: (Int -> Int -> Int -> Bool) -> (Int, Int, Int)\r\nauto r = go 0 0 0\r\n where\r\n go !i !j !k | r i j k = (i, j, k)\r\n | otherwise = go j k (i+1)\r\n{-# NOINLINE auto #-}\r\n\r\nmain = getArgs >>= \\case\r\n \"manual\" : _ -> print $ manual rel# -- This case is significantly slower.\r\n \"auto\" : _ -> print $ auto rel -- Why?\r\n}}}\r\n\r\nA loop that has to box its loop parameters to call a predicate turns out to be significantly faster than one that uses a predicate that takes unboxed values directly.\r\n\r\nThe answer turns out to be (I believe) in ghc/utils/genapply/GenApply.hs. applyTypes has an entry [P,P,P], but only [N]. This means that the manual loop has to use a slower calling convention than the boxed loop.\r\n\r\nI'm not sure whether this should be 'fixed,' as my encounter with it was experimental in nature, and I don't have a real use case. 
The comment on applyTypes says its cases cover 99% of uses, and mine is not a real use. This ticket may serve as documentation at least, though.","type_of_failure":"OtherFailure","blocking":[]} -->⊥https://gitlab.haskell.org/ghc/ghc/-/issues/8048Register spilling produces ineffecient/highly contending code2022-02-22T11:08:04ZschylerRegister spilling produces ineffecient/highly contending codeThe native codegen and llvm both produce ineffecient code for functions using structures with many strict fields or unboxed values.
Consider the following program:
```
{-# LANGUAGE BangPatterns #-}
module Spill where
import GHC.Exts
...The native codegen and LLVM both produce inefficient code for functions using structures with many strict fields or unboxed values.
Consider the following program:
```
{-# LANGUAGE BangPatterns #-}
module Spill where
import GHC.Exts
data S = S !Int !Int !Int !Int !Int !Int !Int !Int !Int
spill :: S -> S -> S -> S
spill (S !a !b !c !d !e !f !g !h !i) (S !j !k !l !m !n !o !p !q !r) (S !s !t !u !v !w !x !y !z _)
= S (a + j + s) (b + c) (k + r) (a + b + c + d + e + f + g + h + i) (j + k + l + m + n + o + p + q + r) (s + t + u + v + w + x + y + z) (a + b + c) (j + k + l) (s + t + u)
```
Parts of the code produced for this (which is identical regardless of -funbox-strict-fields) looks like:
```
_cnc:
addq $80,%r12
cmpq 144(%r13),%r12
ja _cni
movq $Spill.S_con_info,-72(%r12)
movq 32(%rbp),%rax
movq %rax,-64(%r12)
movq 24(%rbp),%rax
movq %rax,-56(%r12)
movq 16(%rbp),%rax
movq %rax,-48(%r12)
movq 8(%rbp),%rax
movq %rax,-40(%r12)
movq 40(%rbp),%rax
movq %rax,-32(%r12)
movq 48(%rbp),%rax
movq %rax,-24(%r12)
movq 56(%rbp),%rax
movq %rax,-16(%r12)
movq 64(%rbp),%rax
movq %rax,-8(%r12)
movq 7(%rbx),%rax
movq %rax,0(%r12)
leaq -71(%r12),%rbx
addq $72,%rbp
jmp *0(%rbp)
```
```
_csv:
movq 63(%rbx),%rax
movq %rax,-56(%rbp)
movq 55(%rbx),%rax
movq %rax,-48(%rbp)
movq 47(%rbx),%rax
movq %rax,-40(%rbp)
movq 39(%rbx),%rax
movq %rax,-32(%rbp)
movq 31(%rbx),%rax
movq %rax,-24(%rbp)
movq 23(%rbx),%rax
movq %rax,-16(%rbp)
movq 71(%rbx),%rax
movq %rax,-8(%rbp)
movq 15(%rbx),%rax
movq %rax,0(%rbp)
```
And likewise for LLVM:
```
.LBB10_1: # %coZ
movq 7(%rbx), %rcx
movq $Spill_S_con_info, 8(%rax)
movq 8(%rbp), %rdx
movq %rdx, 16(%rax)
movq 16(%rbp), %rdx
movq %rdx, 24(%rax)
movq 24(%rbp), %rdx
movq %rdx, 32(%rax)
movq 32(%rbp), %rdx
movq %rdx, 40(%rax)
movq 40(%rbp), %rdx
movq %rdx, 48(%rax)
movq 48(%rbp), %rdx
movq %rdx, 56(%rax)
movq 56(%rbp), %rdx
movq %rdx, 64(%rax)
movq 64(%rbp), %rdx
movq %rdx, 72(%rax)
movq %rcx, (%r12)
movq 72(%rbp), %rax
leaq 72(%rbp), %rbp
leaq -71(%r12), %rbx
jmpq *%rax # TAILCALL
```
Quoting from \#ghc: "the \[register allocator\] core algo is '96 vintage". Improvements are needed:
- Take into consideration pipelining and handle spills less dramatically, attempting to reduce register contention
- Sink memory reads in order to reduce register pressure
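The kind of cost model the first bullet asks for is textbook material: when the allocator runs out of colours, spill the virtual register with the lowest ratio of use count to interference degree, so hot values stay in registers and high-pressure cold values go to the stack. A Chaitin-style sketch of that heuristic (illustrative only, not GHC's allocator; the register names and numbers are made up):

```hs
import Data.List (minimumBy)
import Data.Ord (comparing)

data VReg = VReg { name :: String, uses :: Int, degree :: Int }

-- Cheap-to-spill = rarely used relative to how many live ranges it blocks.
spillCost :: VReg -> Double
spillCost r = fromIntegral (uses r) / fromIntegral (max 1 (degree r))

chooseSpill :: [VReg] -> VReg
chooseSpill = minimumBy (comparing spillCost)

main :: IO ()
main = do
  let regs = [ VReg "loopCounter" 40 3   -- hot, cheap to keep resident
             , VReg "fieldTemp"    2 9 ] -- cold but blocking many ranges
  putStrLn ("spill: " ++ name (chooseSpill regs))
-- prints "spill: fieldTemp"
```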
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.6.3 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
<!-- {"blocked_by":[],"summary":"Register spilling produces ineffecient/highly contending code","status":"New","operating_system":"","component":"Compiler","related":[],"milestone":"⊥","resolution":"Unresolved","owner":{"tag":"Unowned"},"version":"7.6.3","keywords":["allocator","register","spill"],"differentials":[],"test_case":"","architecture":"","cc":[""],"type":"Bug","description":"The native codegen and llvm both produce ineffecient code for functions using structures with many strict fields or unboxed values.\r\n\r\nConsider the following program:\r\n{{{\r\n{-# LANGUAGE BangPatterns #-}\r\nmodule Spill where\r\n\r\nimport GHC.Exts\r\n\r\ndata S = S !Int !Int !Int !Int !Int !Int !Int !Int !Int\r\n\r\nspill :: S -> S -> S -> S\r\nspill (S !a !b !c !d !e !f !g !h !i) (S !j !k !l !m !n !o !p !q !r) (S !s !t !u !v !w !x !y !z _)\r\n = S (a + j + s) (b + c) (k + r) (a + b + c + d + e + f + g + h + i) (j + k + l + m + n + o + p + q + r) (s + t + u + v + w + x + y + z) (a + b + c) (j + k + l) (s + t + u)\r\n}}}\r\n\r\nParts of the code produced for this (which is identical regardless of -funbox-strict-fields) looks like:\r\n{{{\r\n_cnc:\r\n addq $80,%r12\r\n cmpq 144(%r13),%r12\r\n ja _cni\r\n movq $Spill.S_con_info,-72(%r12)\r\n movq 32(%rbp),%rax\r\n movq %rax,-64(%r12)\r\n movq 24(%rbp),%rax\r\n movq %rax,-56(%r12)\r\n movq 16(%rbp),%rax\r\n movq %rax,-48(%r12)\r\n movq 8(%rbp),%rax\r\n movq %rax,-40(%r12)\r\n movq 40(%rbp),%rax\r\n movq %rax,-32(%r12)\r\n movq 48(%rbp),%rax\r\n movq %rax,-24(%r12)\r\n movq 56(%rbp),%rax\r\n movq %rax,-16(%r12)\r\n movq 64(%rbp),%rax\r\n movq %rax,-8(%r12)\r\n movq 7(%rbx),%rax\r\n movq %rax,0(%r12)\r\n leaq -71(%r12),%rbx\r\n addq $72,%rbp\r\n jmp *0(%rbp)\r\n}}}\r\n{{{\r\n_csv:\r\n movq 63(%rbx),%rax\r\n movq %rax,-56(%rbp)\r\n movq 55(%rbx),%rax\r\n movq %rax,-48(%rbp)\r\n movq 47(%rbx),%rax\r\n movq %rax,-40(%rbp)\r\n movq 39(%rbx),%rax\r\n movq %rax,-32(%rbp)\r\n movq 31(%rbx),%rax\r\n movq %rax,-24(%rbp)\r\n movq 
23(%rbx),%rax\r\n movq %rax,-16(%rbp)\r\n movq 71(%rbx),%rax\r\n movq %rax,-8(%rbp)\r\n movq 15(%rbx),%rax\r\n movq %rax,0(%rbp)\r\n}}}\r\n\r\nAnd likewise for LLVM:\r\n{{{\r\n.LBB10_1: # %coZ\r\n movq 7(%rbx), %rcx\r\n movq $Spill_S_con_info, 8(%rax)\r\n movq 8(%rbp), %rdx\r\n movq %rdx, 16(%rax)\r\n movq 16(%rbp), %rdx\r\n movq %rdx, 24(%rax)\r\n movq 24(%rbp), %rdx\r\n movq %rdx, 32(%rax)\r\n movq 32(%rbp), %rdx\r\n movq %rdx, 40(%rax)\r\n movq 40(%rbp), %rdx\r\n movq %rdx, 48(%rax)\r\n movq 48(%rbp), %rdx\r\n movq %rdx, 56(%rax)\r\n movq 56(%rbp), %rdx\r\n movq %rdx, 64(%rax)\r\n movq 64(%rbp), %rdx\r\n movq %rdx, 72(%rax)\r\n movq %rcx, (%r12)\r\n movq 72(%rbp), %rax\r\n leaq 72(%rbp), %rbp\r\n leaq -71(%r12), %rbx\r\n jmpq *%rax # TAILCALL\r\n}}}\r\n\r\nQuoting from #ghc \"the [register allocator] core algo is '96 vintage\". Improvements are needed;\r\n\r\n* Take into consideration pipelining and handle spills less dramatically, attempting to reduce register contention\r\n* Sink memory reads in order to reduce register pressure\r\n","type_of_failure":"OtherFailure","blocking":[]} -->⊥