GHC issues (https://gitlab.haskell.org/ghc/ghc/-/issues, 2024-03-26)

# Arity analysis fails badly
https://gitlab.haskell.org/ghc/ghc/-/issues/24591 (Simon Peyton Jones, 2024-03-26)

Here's an interesting program, derived from the implementation of
`coVarsOfType` in GHC.Core.TyCo.FVs. It came up in @Mikolaj's MR !12037, where the bug
described below made GHC allocate **six times as much** to compile the same program. Six times!
Here's the issue.
```haskell
module Foo where
import Data.List
import Data.Monoid
data Type = Tv Int | Tc [Type]
foldTyCo :: Monoid a => (Int -> a) -> Type -> a
{-# INLINE foldTyCo #-}
foldTyCo do_tv = go
where
go (Tv x) = do_tv x
----------------- foldl version --------------
go (Tc tys) = foldl (\ acc ty -> acc `mappend` go ty) mempty tys
----------------- foldr version --------------
-- go (Tc tys) = foldr (\t acc -> go t `mappend` acc) mempty tys
----------------- Explicit version --------------
-- go (Tc tys) = go_tys tys
-- go_tys [] = mempty
-- go_tys (t:ts) = go t `mappend` go_tys ts
free_tvs :: Type -> [Int]
free_tvs ty = appEndo (foldTyCo f ty) []
where
f :: Int -> Endo [Int]
f x = Endo (\xs -> x:xs)
```
Reminder:
* `foldTyCo` is specialised (by inlining) at every call site.
* In this case `free_tvs` specialises it at `Endo [Int]`.
* `newtype Endo a = Endo (a -> a)`; so `free_tvs` has an accumulating parameter.
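To see what the `Endo` accumulator buys us, here is a tiny standalone sketch (my own illustration, not part of the ticket; the names `collectFree` and `example` are made up):

```haskell
import Data.Monoid (Endo (..))

-- Each variable is recorded as a "prepend" closure; mappend for Endo is
-- function composition, so the traversal builds one chain of (x:) closures
-- which appEndo finally applies to the initial list [].
collectFree :: [Int] -> Endo [Int]
collectFree = foldMap (\x -> Endo (x :))

example :: [Int]
example = appEndo (collectFree [3, 1, 4]) []  -- composes (3:) . (1:) . (4:), giving [3,1,4]
```

The hope is that specialisation turns the composed `Endo` chain into a plain accumulating parameter, as in the Core below.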
We expect to get nice code like this:
```
Foo.free_tvs1 [Occ=LoopBreaker] :: [Type] -> [Int] -> [Int]
[GblId, Arity=2, Str=<1L><ML>, Unf=OtherCon []]
Foo.free_tvs1
= \ (ds_a1ez :: [Type]) (eta_X2 [OS=OneShot] :: [Int]) ->
case ds_a1ez of {
[] -> eta_X2;
: y_a1eC ys_a1eD ->
case y_a1eC of {
Tv x_aBb -> GHC.Types.: @Int x_aBb (Foo.free_tvs1 ys_a1eD eta_X2);
Tc tys_aBc -> Foo.free_tvs1 tys_aBc (Foo.free_tvs1 ys_a1eD eta_X2)
}
}
```
And we do get this code with
* The `Explicit version` which uses explicit recursion.
* The `foldr version` which uses foldr.
But with the `foldl version` we get
```
go1_r1f2 :: [Type] -> Endo [Int] -> Endo [Int]
[GblId[StrictWorker([!])], Arity=2, Str=<1L><L>, Unf=OtherCon []]
go1_r1f2
= \ (ds_a1eC :: [Type]) (eta_B0 [OS=OneShot] :: Endo [Int]) ->
case ds_a1eC of {
[] -> eta_B0;
: y_a1eF ys_a1eG ->
go1_r1f2
ys_a1eG
(let {
g_s1eq [Dmd=LC(S,L)] :: Endo [Int]
[LclId]
g_s1eq = Foo.free_tvs_go y_a1eF } in
(\ (x_a1ej :: [Int]) ->
(eta_B0
`cast` (base:Data.Semigroup.Internal.N:Endo[0] <[Int]>_R
:: Endo [Int] ~R# ([Int] -> [Int])))
((g_s1eq
`cast` (base:Data.Semigroup.Internal.N:Endo[0] <[Int]>_R
:: Endo [Int] ~R# ([Int] -> [Int])))
x_a1ej))
`cast` (Sym (base:Data.Semigroup.Internal.N:Endo[0] <[Int]>_R)
:: ([Int] -> [Int]) ~R# Endo [Int])) }
Foo.free_tvs_go [Occ=LoopBreaker] :: Type -> Endo [Int]
[GblId, Arity=1, Str=<1L>, Unf=OtherCon []]
Foo.free_tvs_go
= \ (ds_dVv :: Type) ->
case ds_dVv of {
Tv x_aB5 ->
(\ (xs_aRa :: [Int]) -> GHC.Types.: @Int x_aB5 xs_aRa)
`cast` (Sym (base:Data.Semigroup.Internal.N:Endo[0] <[Int]>_R)
:: ([Int] -> [Int]) ~R# Endo [Int]);
Tc tys_aB6 ->
go1_r1f2
tys_aB6
((id @[Int])
`cast` (Sym (base:Data.Semigroup.Internal.N:Endo[0] <[Int]>_R)
:: ([Int] -> [Int]) ~R# Endo [Int])) }
```
This is very bad: two mutually-recursive functions, with higher-order arguments. No no no.
A heavy cost for switching from `foldr` to `foldl`.
Reminder: not only are we using higher-order stuff in the form of `Endo`, but also `foldl` is implemented in terms of `foldr`. See
this in `GHC.Internal.List`:
```
foldl k z0 xs =
foldr (\(v::a) (fn::b->b) -> oneShot (\(z::b) -> fn (k z v))) (id :: b -> b) xs z0
-- See Note [Left folds via right fold]
{-
Note [Left folds via right fold]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
-}
```
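Stripped of the `oneShot` annotation, that definition can be replayed standalone (hypothetical name `foldlViaFoldr`; a sketch, not the library source):

```haskell
-- foldl as a right fold: each list element contributes one closure,
-- and applying the chain of closures to z0 replays the left fold.
foldlViaFoldr :: (b -> a -> b) -> b -> [a] -> b
foldlViaFoldr k z0 xs = foldr (\v fn -> \z -> fn (k z v)) id xs z0

main :: IO ()
main = print (foldlViaFoldr (-) 10 [1, 2, 3])  -- prints 4, matching foldl (-) 10 [1,2,3]
```

That closure-per-element chain is exactly the shape that shows up in the Core in the Diagnosis below.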
## Diagnosis
At an earlier stage in the pipeline we have
```
go_s1em [Occ=LoopBreaker] :: Type -> Endo [Int]
go_s1em
= \ (ds_dVv :: Type) ->
case ds_dVv of {
Tv x_aB6 ->
(\ (xs_aRa :: [Int]) -> GHC.Types.: @Int x_aB6 xs_aRa)
`cast` <Co:4> :: ([Int] -> [Int]) ~R# Endo [Int];
Tc tys_aB7 ->
letrec {
go1_a1et [Occ=LoopBreaker] :: [Type] -> Endo [Int]
go1_a1et
= \ (ds_a1eu :: [Type]) ->
case ds_a1eu of {
[] -> (id @[Int]) `cast` <Co:4> :: ([Int] -> [Int]) ~R# Endo [Int];
: y_a1ex ys_a1ey ->
(let {
f_s1ei :: Endo [Int]
f_s1ei = go_s1em y_a1ex } in
let {
acc_aLK [OS=OneShot] :: Endo [Int]
acc_aLK = go1_a1et ys_a1ey } in
\ (x_a1e9 :: [Int]) ->
(f_s1ei `cast` <Co:3> :: Endo [Int] ~R# ([Int] -> [Int]))
((acc_aLK `cast` <Co:3> :: Endo [Int] ~R# ([Int] -> [Int]))
x_a1e9))
`cast` <Co:4> :: ([Int] -> [Int]) ~R# Endo [Int]
}; } in
go1_a1et tys_aB7
}
```
We want to eta-expand `go_s1em` and `go1_a1et`; then all would be well. And indeed, if we
hypothesise that both have arity 2, we can indeed eta-expand. But `GHC.Core.Opt.Arity.findRhsArity`
only works on **self-recursive** functions; it is defeated by nested mutual recursion like
the example above. And apparently Call Arity analysis is defeated too.
Maybe @sgraf812's new plan will catch this.
Meanwhile, this ticket just serves to immortalise the missed opportunity.

# `inline` may only inline the wrapper and not the worker
https://gitlab.haskell.org/ghc/ghc/-/issues/24589 (Michael Peyton Jones, 2024-03-26)

## Summary

Using the `inline` magic function on a reference to a function that has been worker-wrapper-transformed may result in only the _wrapper_ being inlined, which is unlikely to be what the user intended.
## Steps to reproduce
Observed when updating https://gitlab.haskell.org/ghc/ghc/-/merge_requests/5316 : `getTraceFlags` needs to be inlined to get the best code, but I am observing that the worker is still present.
## Expected behavior
The worker should be inlined.
## Environment
* GHC version used: 9.6.2
Optional:
* Operating System: NixOS
* System Architecture: x86_64

# NCG: Use 32bit-comparisons when operating on pointer tags
https://gitlab.haskell.org/ghc/ghc/-/issues/24550 (Andreas Klebinger, 2024-03-20)

## Summary
I came across this pointer check in Cmm:
```
cGv: // global
call (I64[R1])(R1) returns to cGu, args: 8, res: 8, upd: 8;
cGu: // global
if (R1 & 7 != 1) goto cGz; else goto cGy;
```
Which turns into this assembly:
```
movq %rbx,%rax
andl $7,%eax
cmpq $1,%rax
je .LcGy
.LcGz:
```
Using `cmpq` wastes an extra byte here on the 64bit encoding when the 32bit encoding would do.
I think the issue is that we generate the literal 1 as a 64bit-wide number, resulting in a 64bit comparison being used.
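For comparison, since the masked tag already lives in `%eax`, the hoped-for encoding would presumably use the 32bit compare throughout (my sketch, not actual NCG output):

```
movq %rbx,%rax
andl $7,%eax
cmpl $1,%eax      # 32bit compare: drops the REX.W prefix byte of cmpq
je .LcGy
```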
To reproduce:
```
module M where
{-# NOINLINE incMaybe #-}
incMaybe :: Maybe Int -> Int
incMaybe (Just x) = x + 1
incMaybe Nothing = 0
```
Compile with `ghc A.hs -O2 -ddump-cmm -ddump-asm -dno-typeable-binds -fforce-recomp -ddump-to-file -fno-worker-wrapper -dcmm-lint`
## Environment
* GHC version used:
Optional:
* Operating System:
* System Architecture:

# CmmToAsm/x86: Poor unrolling for unaligned memcpy
https://gitlab.haskell.org/ghc/ghc/-/issues/24519 (Matthew Craven <clyring@gmail.com>, 2024-03-20)

Compile the following module, and look at the generated assembly:
```haskell
module Memcpy where
import Foreign
import Data.Word
fun :: Ptr Word8 -> Ptr Word8 -> IO ()
fun p q = copyBytes p q 16
```
When I compile with `ghc-9.8.2 -O -dno-typeable-binds Memcpy.hs -S` on my x86-64 machine, this is what I see:
<details>
```asm
.section .text
.align 8
.align 8
.quad 12884901903
.quad 0
.long 14
.long 0
.globl Memcpy_fun1_info
.type Memcpy_fun1_info, @function
Memcpy_fun1_info:
.LcRu:
leaq -16(%rbp),%rax
cmpq %r15,%rax
jb .LcRy
.LcRz:
movq $.LcRr_info,-16(%rbp)
movq %r14,%rbx
movq %rsi,-8(%rbp)
addq $-16,%rbp
testb $7,%bl
jne .LcRr
.LcRs:
jmp *(%rbx)
.align 8
.quad 1
.long 30
.long 0
.LcRr_info:
.LcRr:
movq $.LcRx_info,(%rbp)
movq 7(%rbx),%rax
movq 8(%rbp),%rbx
movq %rax,8(%rbp)
testb $7,%bl
jne .LcRx
.LcRB:
jmp *(%rbx)
.align 8
.quad 65
.long 30
.long 0
.LcRx_info:
.LcRx:
movq 8(%rbp),%rax
movq 7(%rbx),%rbx
;; !!!
movb 0(%rbx),%cl
movb %cl,0(%rax)
movb 1(%rbx),%cl
movb %cl,1(%rax)
movb 2(%rbx),%cl
movb %cl,2(%rax)
movb 3(%rbx),%cl
movb %cl,3(%rax)
movb 4(%rbx),%cl
movb %cl,4(%rax)
movb 5(%rbx),%cl
movb %cl,5(%rax)
movb 6(%rbx),%cl
movb %cl,6(%rax)
movb 7(%rbx),%cl
movb %cl,7(%rax)
movb 8(%rbx),%cl
movb %cl,8(%rax)
movb 9(%rbx),%cl
movb %cl,9(%rax)
movb 10(%rbx),%cl
movb %cl,10(%rax)
movb 11(%rbx),%cl
movb %cl,11(%rax)
movb 12(%rbx),%cl
movb %cl,12(%rax)
movb 13(%rbx),%cl
movb %cl,13(%rax)
movb 14(%rbx),%cl
movb %cl,14(%rax)
movb 15(%rbx),%bl
movb %bl,15(%rax)
leaq ghczmprim_GHCziTupleziPrim_Z0T_closure+1(%rip),%rbx
addq $16,%rbp
jmp *(%rbp)
.LcRy:
leaq Memcpy_fun1_closure(%rip),%rbx
jmp *-8(%r13)
.size Memcpy_fun1_info, .-Memcpy_fun1_info
.section .data
.align 8
.align 1
.globl Memcpy_fun1_closure
.type Memcpy_fun1_closure, @object
Memcpy_fun1_closure:
.quad Memcpy_fun1_info
.section .text
.align 8
.align 8
.quad 12884901903
.quad 0
.long 14
.long 0
.globl Memcpy_fun_info
.type Memcpy_fun_info, @function
Memcpy_fun_info:
.LcRU:
jmp Memcpy_fun1_info
.size Memcpy_fun_info, .-Memcpy_fun_info
.section .data
.align 8
.align 1
.globl Memcpy_fun_closure
.type Memcpy_fun_closure, @object
Memcpy_fun_closure:
.quad Memcpy_fun_info
.section .note.GNU-stack,"",@progbits
.ident "GHC 9.8.2"
```
</details>
Notice the 32 single-byte `movb` instructions used to implement the copy. (I have marked this with `!!!`.)
We are probably using single-byte move instructions because the given `Ptr`s can have arbitrary alignment. But on x86-64, the penalty for an unaligned `movq` is typically at most a few cycles. It would be much better for both speed and code size to emit four eight-byte move instructions or even two unaligned 16-byte move instructions. (There are a few of the latter in SSE and SSE2, which we assume are available on every x86-64 machine.) And indeed I see the latter if I also add `-fllvm`.
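For illustration, the 16-byte copy via two unaligned SSE moves might look roughly like this (a sketch of the idea, not actual compiler output):

```
movups 0(%rbx),%xmm0     # one unaligned 16-byte load
movups %xmm0,0(%rax)     # one unaligned 16-byte store
```

This is two instructions instead of thirty-two, with at most a small penalty when the pointers happen to be misaligned.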
I haven't checked, but would not be surprised if there were similar opportunities for improvement available on other platforms.

# regression with 9.6/9.8, observed as heap usage but could be due to behavior change
https://gitlab.haskell.org/ghc/ghc/-/issues/24487 (quark, 2024-03-13)

## Summary
The [Bluespec Compiler (bsc)](github.com/b-lang-org/bsc) is compiled with GHC and we have observed a regression in our testsuite when using GHC 9.6 and 9.8 (including recent 9.8.2), that was not a failure with GHC 9.4 and earlier.
Specifically, in the BSC repo, in directory `/testsuite/bsc.bugs/bluespec_inc/b1490/`, there are two tests which exhaust the heap when it is capped with `+RTS -M256M`, but which execute fine when the cap is increased to `-M260M` or `-M265M`. However, with earlier versions of GHC (such as 9.4.8), the heap size can be reduced dramatically and BSC still executes without exhausting the heap, even down to `-M4K`!
Something clearly changed in GHC, but I don't think it's necessarily directly related to heap usage; I would guess that maybe a code optimization has changed the behavior, and it has resulted in BSC using more heap. I say that because BSC is itself a compiler, and the test cases which are now failing involve a for-loop in the Bluespec design being compiled; the tests check that BSC is able to unroll the loops and properly prune the growing structure so that it doesn't explode. The extra heap usage could be because BSC is now failing to prune properly, as a result of a behavior change in GHC.
However, I don't know how to diagnose this. Are there RTS flags that can dump information about heap usage, that I can compare between GHC versions? I can use `+RTS -s`, but it doesn't show max heap usage, and the info that it does show isn't significantly different between GHC 9.8 and 9.4.
I assume that a binary search of the GHC commits might find the point where the issue was introduced, but each iteration would take a while.
Any advice you can provide on how to investigate this would be appreciated.
## Steps to reproduce
BSC is a large program, but if you want to reproduce, you can do this:
```
git clone --recursive https://github.com/B-Lang-org/bsc
cd bsc
make install-src
cd testsuite/bsc.bugs/bluespec_inc/b1490/
../../../../inst/bin/bsc -verilog Bug1490MyUnion.bsv +RTS -M256M
```
The result of running BSC in the last line is (when compiled with 9.8 or 9.6):
```
bsc: Heap exhausted;
bsc: Current maximum heap size is 268435456 bytes (256 MB).
bsc: Use `+RTS -M<size>' to increase it.
```
## Expected behavior
I expect BSC to run to completion and output a message like this (which is what you get with GHC 9.4.8 and earlier version):
```
Verilog file created: module_arbitrate_myunion.v
```
In fact, with GHC 9.4.8 (and presumably earlier), the heap can be reduced significantly and BSC still executes fine:
```
../../../../inst/bin/bsc -verilog Bug1490MyUnion.bsv +RTS -M4K
```
With GHC 9.6 and 9.8, the heap needs to be increased to around 260 or 265 MB for it to compile. This difference between essentially nothing (4K) and 260M seems significant, and so I felt it was worth reporting (particularly if the heap usage is just a symptom and it's due to a bug in behavior).
## Environment
* GHC versions that fail: 9.8.2, 9.8.1, 9.6.4, 9.6.2
* GHC versions that work: 9.4.8 (and many earlier versions)
* Operating System: My results reported here are on macOS 11, but the same failures were seen on GitHub VMs for Ubuntu 22.04 and macOS 12
* System Architecture: x86_64
We have been using various versions of GHC in our CI over the last few years and had not noticed an issue up through 9.4.x (on various Ubuntu and macOS VMs). We then tested with 9.8.1 (on Ubuntu 22.04 and macOS 12) and saw the error. Today, on my macOS 11 system, I confirmed that 9.8.1 fails and that the recently released 9.8.2 also fails, and I checked 9.6.2 and 9.6.4 and saw that they also failed. So I would guess that the issue was introduced in 9.6. I confirm that we currently do not observe the failure when using 9.4.8 (on Ubuntu 20.04 and 22.04 and macOS 11, 12, and 13).

# Improve FloatIn for case expressions
https://gitlab.haskell.org/ghc/ghc/-/issues/24466 (Simon Peyton Jones, 2024-02-29)

Related tickets:
* #2988
Consider this function:
```
f :: Int -> Int -> (Int,Int)
f z x = let y = x+1 in
case z of
0 -> (y,y)
1 -> (y,y+1)
_ -> (99,1)
```
It would be better if we floated that `y` thunk into the two case branches (eliminating the outer binding), thus
```
f z x = case z of
          0 -> let y = x+1 in (y,y)
          1 -> let y = x+1 in (y,y+1)
          _ -> (99,1)
```
We get a bit of code duplication (just as we do with inlining) but in exchange we may never
allocate that thunk, which may matter if the default branch is the hot path.
Also `y` might be used strictly. Suppose one RHS was simply `g y y` where `g` is strict.
Then if we float the thunk into that RHS we can use call-by-value for it.
Currently we sort-of handle this case in `postInlineUnconditionally`; see
this Note in GHC.Core.Opt.Simplify.Utils:
```
{- Note [Inline small things to avoid creating a thunk]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The point of examining occ_info here is that for *non-values* that
occur outside a lambda, the call-site inliner won't have a chance
(because it doesn't know that the thing only occurs once). The
pre-inliner won't have gotten it either, if the thing occurs in more
than one branch So the main target is things like
let x = f y in
case v of
True -> case x of ...
False -> case x of ...
This is very important in practice; e.g. wheel-seive1 doubles
in allocation if you miss this out. And bits of GHC itself start
to allocate more. An egregious example is test perf/compiler/T14697,
where GHC.Driver.CmdLine.$wprocessArgs allocated hugely more.
```
But it's very ad-hoc. It only applies if `x` is used exactly once in each branch!
## A better plan
A better plan would be for FloatIn to float `y` inwards, perhaps entirely replacing this
stuff in `postInlineUnconditionally`. This ticket records the idea.
## A problem
Consider
```
foo1 z = let x = <thunk> in
join j y = .x..x... in
case z of
A -> j 1
B -> j 2
C -> x+1
foo2 z = let x = <thunk> in
join j y = ...x... in
case h x z of
A -> j 1
B -> j 2
C -> x+1
```
* In the case of `foo1` we can float `x` inwards, both into the RHS of the join point `j`, and into the C branch.
* But we **must not** do that for `foo2` because it would lose sharing. We would evaluate `<thunk>` twice:
  once in the call `h <thunk> z`, and once in the RHS of `j`.
Annoyingly, if we inlined the join point we would do the Right Thing in both cases.
How can we do the Right Thing even when the join point is there?
Note that no occurrence analysis (which pins a single OccInfo on each binder) can get this right. Consider this combination of `foo1` and `foo2`:
```
foo1 z = let x = <thunk> in
join j1 y = .x..x... in
case z of
A -> j1 1
B -> j1 2
C -> join j2 p = ...x...
case h x z of
A -> j2 1
B -> j2 2
C -> x+1
```
Here we can push `x` into the RHS of `j1` and into the `C` branch; but **not** into the RHS of `j2`, for the same reason as `foo2`.

# STM: Remove batch-commit logic on 64bit platforms.
https://gitlab.haskell.org/ghc/ghc/-/issues/24438 (Andreas Klebinger, 2024-02-13)

## Summary
We jump through some hoops to avoid overflows of a TVar's `num_updates` field. However, on 64bit platforms we will never manage to overflow these, so we could just remove the checks.
This would primarily have minor performance benefits.
Alternatively, we could make these fields 64-bit unconditionally, simplifying the code in general.
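As a rough sanity check (my numbers, not from the ticket): even bumping the counter once per nanosecond, a 64bit field takes centuries to wrap, so overflow checks on 64bit platforms buy nothing in practice:

```haskell
import Data.Word (Word64)

main :: IO ()
main = do
  let bumpsPerSecond = 1e9 :: Double   -- one num_updates bump per nanosecond
      seconds = fromIntegral (maxBound :: Word64) / bumpsPerSecond
  print (seconds / (86400 * 365))      -- years until the counter wraps: roughly 585
```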
## Environment
* GHC version used:
Optional:
* Operating System:
* System Architecture:

# Difference in FPS between GHCi and Executable
https://gitlab.haskell.org/ghc/ghc/-/issues/24416 (Siyuan Chen, 2024-02-15)

## Summary
I am trying the [gloss](http://gloss.ouroborus.net/) package in Haskell and have been able to build and run some examples successfully on windows.
The problem is that the animations are jerky.
From the gloss' official page:
> Q: Animations seem jerky.
>
> A: Make sure you're compiling with -O2 -threaded. Without the threaded runtime, the code that manages the frame rate will behave badly. This is because GHC takes too long to reschedule the gloss process after the sleep timer has expired. With the threaded runtime, most simple animations should run at about 100fps, which is our internal frame-rate cap.
Yeah. After adding `ghc-options: -O2 -threaded` in my cabal file, the jerkiness was eliminated (when running the `.exe` file).
But it is still jerky in the REPL.
I have tried the following commands in the REPL:
```
cabal repl
ghci> :set -threaded
ghci> :set -O2
when making flags consistent: warning:
Optimization flags are incompatible with the byte-code interpreter; optimization flags ignored.
ghci> main
```
It doesn't work.
As you can see, GHCi prints the warning "Optimization flags are incompatible with the byte-code interpreter", and the animation is still jerky.
I asked [the question on StackOverflow](https://stackoverflow.com/questions/77894178/gloss-animations-jerky-and-hope-to-add-o2-to-ghci) a few days ago and @Noughtmare suggested two ways to improve GHCi performance.
- Enabling -O2 with the bytecode interpreter (Since GHC 9.8)
See https://downloads.haskell.org/ghc/latest/docs/users_guide/debugging.html#ghc-flag--funoptimized-core-for-interpreter
```
cabal repl yampa-examples-gloss-rotatingcolor --repl-options="-fno-unoptimized-core-for-interpreter -O2 -threaded"
```
Yeah. There is no "warning" for `:set -O2` in REPL now, but the animation is still jerky.
- Using -fobject-code to avoid bytecode altogether
See https://downloads.haskell.org/ghc/9.8.1/docs/users_guide/ghci.html#compiling-to-object-code-inside-ghci
```
cabal repl yampa-examples-gloss-rotatingcolor --repl-options="-fno-ghci-sandbox -O -fobject-code -threaded"
```
It seems faster than before (?), but the animation is still jerky.
After some discussion, @Noughtmare suggested that I open an issue with GHC.
### The question
Why is there clearly a difference in FPS between the REPL and the compiled executable?
Is there a way to make the REPL as fast as the executable? (This would be useful, because the REPL is very convenient for debugging.)
Related link: https://stackoverflow.com/questions/77894178/gloss-animations-jerky-and-hope-to-add-o2-to-ghci
Thanks.
## Steps to reproduce
Check out this repo https://github.com/ivanperez-keera/yampa-gloss/tree/v0.2.1 and build with ghc-9.8.1 and cabal-install 3.10.2.1.
Note that `yampa-examples-gloss-rotatingcolor` is very small. I can't believe that such a small app can cause performance issues (in the REPL only, though)...
## Expected behavior
No obvious difference in FPS between repl and run.
## Environment
* GHC version used: ghc-9.8.1 and cabal-install 3.10.2.1

# Execution time of reads/writes for TVars during transactions slows down with the number of TVars used during the transaction
https://gitlab.haskell.org/ghc/ghc/-/issues/24410 (Andreas Klebinger, 2024-02-28)

## Summary
Reading a list of TVars takes O(n²) as for each read we traverse the linked list of all transaction entries.
## Steps to reproduce
This program showcases the issue:
```haskell
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE NumericUnderscores #-}
{-# LANGUAGE ScopedTypeVariables #-}
module Main where

import Control.Concurrent (ThreadId, myThreadId, forkIO)
import Control.Concurrent.STM (STM, atomically, newTVar, readTVar, TVar)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import GHC.IO (unsafePerformIO)
import System.IO (hFlush, stdout)
import Data.Time.Clock
import Control.Concurrent.MVar
import Control.Monad

main :: IO ()
main = do
  let nums = [ 10_000 :: Int
             , 50_000
             , 100_000
             , 200_000
             , 300_000
             , 400_000
             ]
  forM_ nums $ \tvar_count -> do
    print $ "Measuring: " ++ show tvar_count
    (tvars :: [TVar Int]) <- atomically $ mapM newTVar [1 .. tvar_count]
    start_time <- getCurrentTime
    s <- atomically $ do
      values <- mapM readTVar tvars :: STM [Int]
      !s <- return $! sum values
      return s
    print $ "done, " ++ show s
    done_time <- getCurrentTime
    print $ done_time `diffUTCTime` start_time
```
## Expected behavior
While some overhead might be expected, I would have expected the cost of using additional TVars in a transaction to be roughly linear.
The issue is that the STM transaction log for each transaction is a linked list of all used TVars and their expected results.
When reading from a TVar, we traverse the log to check if the value originates from an earlier write during the same transaction.
When writing, we traverse the log to either update the cached value or to add an entry with the new value.
This isn't quite ideal and causes STM transactions to be far slower than necessary when used with large numbers of TVars.
One potential solution would be to use a HashMap rather than a linked list. I believe this would be sound since we only need to know:
* The last value written to a TVar during a transaction.
* The expected value for each TVar at the end of the transaction.
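To illustrate the idea, here is a toy model of a per-transaction log keyed by TVar identity (my sketch, not the RTS data structure; all names are made up). Both the read and the write path become O(log n) map operations instead of an O(n) list walk:

```haskell
import qualified Data.Map.Strict as M

-- One entry per TVar touched: the value we expect to validate at commit
-- time, and the latest value written during this transaction.
data LogEntry v = LogEntry { expected :: v, current :: v }

type TxLog v = M.Map Int (LogEntry v)  -- keyed by some TVar identity

-- Reading consults the log first; a miss records the freshly read value.
readLogged :: Int -> v -> TxLog v -> (v, TxLog v)
readLogged tv fresh log_ = case M.lookup tv log_ of
  Just e  -> (current e, log_)
  Nothing -> (fresh, M.insert tv (LogEntry fresh fresh) log_)

-- Writing updates the cached value, or inserts a new entry.
writeLogged :: Int -> v -> v -> TxLog v -> TxLog v
writeLogged tv fresh new =
  M.insertWith (\_ old -> old { current = new }) tv (LogEntry fresh new)
```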
## Potential Improvements
### Replace the linked list with a HashMap altogether.
This would mean very fast lookups, but we would have to rebuild the HashMap during GCs.
### Use a hashmap merely as a cache in addition to the linked list
This could be done by rebuilding it once it's first needed after GC, or adding TVars to it as they are used.
Either way it would make for slightly less predictable performance, slightly more memory use, but improved performance.
### Further we could allocate TVars into non-moving parts of the heap.
But I'm sure that has its own disadvantages.

# Wanted: better arity analysis
https://gitlab.haskell.org/ghc/ghc/-/issues/24371 (Simon Peyton Jones, 2024-01-30)

@tomjaguarpaw1, in [this email](https://mail.haskell.org/pipermail/ghc-devs/2024-January/021510.html) showed an example where arity analysis fails badly.

<details><summary>His program</summary>
```
{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE DerivingStrategies #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE PolyKinds #-}
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE TypeFamilies #-}
module Main where
import Control.Monad.Trans.Class (MonadTrans)
import Control.Monad.Trans.State.Strict (StateT)
import qualified Control.Monad.Trans.State.Strict as State
import Data.Coerce (coerce)
import Data.Functor.Identity (Identity (runIdentity))
import Data.Kind (Type)
import Data.Foldable (for_)
-- These "myModify" functions all compile to the same code --
-- *exactly* the same code, i.e. they share the implementation (which
-- is just a constant)! Good.
myModifyMTL :: Int
myModifyMTL =
runIdentity $
flip State.evalStateT 0 $ do
s <- State.get
State.put $! s + 1
State.get
myModifyNewtype :: Int
myModifyNewtype =
runIdentity $
unE $
flip State.evalStateT 0 $
unL $
unB $ do
s <- Bm $ Lm State.get
Bm $ Lm $ State.put $! s + 1
Bm $ Lm State.get
myModifyEff :: Int
myModifyEff =
runIdentity $
unMkEff $
flip State.evalStateT 0 $
unMkEff $
unMkEff @(Branch (Leaf (StateT Int)) Empty) $ do
s <- MkEff $ MkEff State.get
MkEff $ MkEff $ State.put $! s + 1
MkEff $ MkEff State.get
-- These "mySum" functions don't all compile to the same code. Bad!
-- "mySumMTL" and "mySumNewtype" compile to the same code but it's not
-- shared. That's fine. I don't care. I really care that "mySumEff"
-- compiles to the same code, but it doesn't. It compiles to an
-- inefficient loop with allocation.
mySumMTL :: Int
mySumMTL =
runIdentity $
flip State.evalStateT 0 $ do
for_ [1 .. 10] $ \i -> do
s <- State.get
State.put $! s + i
State.get
mySumNewtype :: Int
mySumNewtype =
runIdentity $
unE $
flip State.evalStateT 0 $
unL $
unB $ do
for_ [1.. 10] $ \i -> do
s <- Bm $ Lm State.get
Bm $ Lm $ State.put $! s + i
Bm $ Lm State.get
mySumEff :: Int
mySumEff =
runIdentity $
unMkEff $
flip State.evalStateT 0 $
unMkEff $
unMkEff @(Branch (Leaf (StateT Int)) Empty) $
do
for_ [1..10] $ \i -> do
s <- MkEff $ MkEff State.get
MkEff $ MkEff $ State.put $! s + i
MkEff $ MkEff State.get
-- Definitions for the "Newtype" versions
newtype Lt t m a = Lm {unL :: t m a}
deriving newtype (Functor, Applicative, Monad)
newtype Et m a = Em {unE :: m a}
deriving newtype (Functor, Applicative, Monad)
newtype Bt t1 t2 m a = Bm {unB :: t1 (t2 m) a}
deriving newtype (Functor, Applicative, Monad)
-- Definitions for the "Eff" versions, that use Eff, an "effect
-- system"
-- A type that encodes the structure of a composed series of monad
-- transformers
data Effects
= Branch Effects Effects
| Leaf ((Type -> Type) -> Type -> Type)
| Empty
-- A singleton type for Effects
data SEffects i where
SBranch :: (SingI i1, SingI i2) => SEffects (Branch i1 i2)
SLeaf :: (MonadTrans t) => SEffects (Leaf t)
SEmpty :: SEffects Empty
-- For passing the singleton implicitly, like in the singletons
-- library
class SingI i where
sing :: SEffects i
instance (MonadTrans t) => SingI (Leaf t) where
sing = SLeaf
instance (SingI t1, SingI t2) => SingI (Branch t1 t2) where
sing = SBranch
instance SingI Empty where
sing = SEmpty
newtype Eff es m a = MkEff {unMkEff :: EffF es m a}
type family EffF (es :: Effects) m where
EffF Empty m = m
EffF (Leaf t) m = t m
EffF (Branch s1 s2) m = Eff s1 (Eff s2 m)
coerceFmapEmpty ::
forall a b m. (Functor m) => (a -> b) -> Eff Empty m a -> Eff Empty m b
coerceFmapEmpty = coerce (fmap @m @a @b)
coerceFmapLeaf ::
forall a b t m.
(MonadTrans t, Monad m) =>
(a -> b) ->
Eff (Leaf t) m a ->
  Eff (Leaf t) m b
coerceFmapLeaf = coerce (fmap @(t m) @a @b)

coerceFmapBranch ::
  forall a b s1 s2 m.
  (SingI s1, SingI s2, Monad m) =>
  (a -> b) ->
  Eff (Branch s1 s2) m a ->
  Eff (Branch s1 s2) m b
coerceFmapBranch = coerce (fmap @(Eff s1 (Eff s2 m)) @a @b)

coercePureEmpty :: forall a m. (Applicative m) => a -> Eff Empty m a
coercePureEmpty = coerce (pure :: a -> m a)

coercePureLeaf ::
  forall a t m.
  (MonadTrans t, Monad m) =>
  a ->
  Eff (Leaf t) m a
coercePureLeaf = coerce (pure :: a -> t m a)

coercePureBranch ::
  forall a s1 s2 m.
  (Monad m, SingI s1, SingI s2) =>
  a ->
  Eff (Branch s1 s2) m a
coercePureBranch = coerce (pure :: a -> Eff s1 (Eff s2 m) a)

coerceAndThenEmpty ::
  forall m a b. (Applicative m) => Eff Empty m a -> Eff Empty m b -> Eff Empty m b
coerceAndThenEmpty = coerce ((*>) :: m a -> m b -> m b)

coerceAndThenLeaf ::
  forall t m a b.
  (MonadTrans t, Monad m) =>
  Eff (Leaf t) m a ->
  Eff (Leaf t) m b ->
  Eff (Leaf t) m b
coerceAndThenLeaf = coerce ((*>) :: t m a -> t m b -> t m b)

coerceAndThenBranch ::
  forall s1 s2 m a b.
  (SingI s1, SingI s2, Monad m) =>
  Eff (Branch s1 s2) m a ->
  Eff (Branch s1 s2) m b ->
  Eff (Branch s1 s2) m b
coerceAndThenBranch =
  coerce
    ( (*>) ::
        Eff s1 (Eff s2 m) a ->
        Eff s1 (Eff s2 m) b ->
        Eff s1 (Eff s2 m) b
    )

coerceAppEmpty ::
  forall m a b. (Applicative m) => Eff Empty m (a -> b) -> Eff Empty m a -> Eff Empty m b
coerceAppEmpty = coerce ((<*>) :: m (a -> b) -> m a -> m b)

coerceAppLeaf ::
  forall t m a b.
  (MonadTrans t, Monad m) =>
  Eff (Leaf t) m (a -> b) ->
  Eff (Leaf t) m a ->
  Eff (Leaf t) m b
coerceAppLeaf = coerce ((<*>) :: t m (a -> b) -> t m a -> t m b)

coerceAppBranch ::
  forall s1 s2 m a b.
  (SingI s1, SingI s2, Monad m) =>
  Eff (Branch s1 s2) m (a -> b) ->
  Eff (Branch s1 s2) m a ->
  Eff (Branch s1 s2) m b
coerceAppBranch =
  coerce
    ( (<*>) ::
        Eff s1 (Eff s2 m) (a -> b) ->
        Eff s1 (Eff s2 m) a ->
        Eff s1 (Eff s2 m) b
    )

coerceBindEmpty ::
  forall m a b. (Monad m) => Eff Empty m a -> (a -> Eff Empty m b) -> Eff Empty m b
coerceBindEmpty = coerce ((>>=) @m @a @b)

coerceBindLeaf ::
  forall t m a b.
  (MonadTrans t, Monad m) =>
  Eff (Leaf t) m a ->
  (a -> Eff (Leaf t) m b) ->
  Eff (Leaf t) m b
coerceBindLeaf = coerce ((>>=) @(t m) @a @b)

coerceBindBranch ::
  forall s1 s2 m a b.
  (SingI s1, SingI s2, Monad m) =>
  Eff (Branch s1 s2) m a ->
  (a -> Eff (Branch s1 s2) m b) ->
  Eff (Branch s1 s2) m b
coerceBindBranch =
  coerce ((>>=) @(Eff s1 (Eff s2 m)) @a @b)

instance (SingI es, Monad m) => Functor (Eff es m) where
  fmap = case sing @es of
    SEmpty -> coerceFmapEmpty
    SBranch -> coerceFmapBranch
    SLeaf -> coerceFmapLeaf
  {-# INLINE fmap #-}

instance (SingI es, Monad m) => Applicative (Eff es m) where
  pure = case sing @es of
    SEmpty -> coercePureEmpty
    SLeaf -> coercePureLeaf
    SBranch -> coercePureBranch
  {-# INLINE pure #-}
  (<*>) = case sing @es of
    SEmpty -> coerceAppEmpty
    SLeaf -> coerceAppLeaf
    SBranch -> coerceAppBranch
  {-# INLINE (<*>) #-}
  (*>) = case sing @es of
    SEmpty -> coerceAndThenEmpty
    SLeaf -> coerceAndThenLeaf
    SBranch -> coerceAndThenBranch
  {-# INLINE (*>) #-}

instance (SingI es, Monad m) => Monad (Eff es m) where
  (>>=) = case sing @es of
    SEmpty -> coerceBindEmpty
    SLeaf -> coerceBindLeaf
    SBranch -> coerceBindBranch
  {-# INLINE (>>=) #-}
main :: IO ()
main = pure ()
```
</details>
Focus on the code for `mySumEff` and you get this monstrosity:
```
Main.mySumEff_go3 :: GHC.Prim.Int# -> Eff (Branch (Leaf (StateT Int)) Empty) Identity ()
Main.mySumEff_go3
  = \ (x_a3In :: GHC.Prim.Int#) ->
      let {
        k_a2Fj :: Eff (Branch (Leaf (StateT Int)) Empty) Identity ()
        k_a2Fj
          = case x_a3In of wild_X1H {
              __DEFAULT -> Main.mySumEff_go3 (GHC.Prim.+# wild_X1H 1#);
              10# ->
                n_r3O8
                `cast` (co1 :: (Int -> ((), Int))
                              ~R# Eff (Branch (Leaf (StateT Int)) Empty) Identity ())
            } } in
      (\ (eta2_a2GN :: Int) ->
         case eta2_a2GN of { GHC.Types.I# x1_a2Yy ->
         (k_a2Fj
          `cast` (co2 :: Eff (Branch (Leaf (StateT Int)) Empty) Identity ()
                        ~R# (Int -> Eff Empty Identity ((), Int))))
           (GHC.Types.I# (GHC.Prim.+# x1_a2Yy x_a3In))
         })
      `cast` (co3 :: (Int -> Eff Empty Identity ((), Int))
                    ~R# Eff (Branch (Leaf (StateT Int)) Empty) Identity ())
```
Not efficient at all.
## Diagnosis
GHC is reluctant to inline `k_a2Fj` inside the `\eta` in case that would duplicate work: inlining
`k` would push the call `mySumEff_go3 (wild_X1H + 1)` inside the lambda. What we *want* is to recognise that `k` has arity 1 -- it could have a lambda at the top. Absent the casts, GHC will work that out -- look at GHC.Core.Opt.Arity.findRhsArity.
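As a source-level illustration of what "has arity 1" means here (a toy example of my own, not GHC's internal Core), the transformation the arity analyser performs is pushing a lambda to the top of a case, so a thunk of function type becomes a manifest function:

```haskell
-- Toy sketch of eta-expansion past a case: `kBefore` is a thunk that merely
-- *evaluates to* a function; `kAfter` has a lambda at the top, i.e. arity 1.
kBefore :: Int -> Int -> Int
kBefore x = case x of
  10 -> negate
  _  -> (+ 1)

-- Eta-expanded form: semantically the same, but now syntactically a function,
-- so a caller can apply it without forcing the case first.
kAfter :: Int -> Int -> Int
kAfter x = \eta -> case x of
  10 -> negate eta
  _  -> eta + 1
```

In the ticket's Core, `k_a2Fj` is in the `kBefore` shape, but wrapped in casts that the analyser cannot currently see through.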
But the casts get in the way. What we'd like is
```
k_a2Fj = (\eta. case x of wild
                  DEFAULT -> ((go3 (wild + 1)) `cast` sym co1) eta
                  10      -> n eta)
         `cast` co1
  where
    co1 :: (Int -> ((), Int))
           ~R# Eff (Branch (Leaf (StateT Int)) Empty) Identity ()
```
Now, GHC's arity analyser is clever enough to do this when `co1` is just newtypes.
See `Note [The EtaInfo mechanism]` in GHC.Core.Opt.Arity.
But here it isn't. We have to execute the recursive `EffF` type-level function
```
Eff (Branch (Leaf (StateT Int)) Empty) Identity ()
--> EffF (Branch (Leaf (StateT Int)) Empty) Identity ()
--> Eff (Leaf (StateT Int)) (Eff Empty Identity) ()
--> EffF (Leaf (StateT Int)) (EffF Empty Identity) ()
--> StateT Int Identity ()
```
Running type functions is (currently) only done in the type checker, not the Simplifier.
Tantalisingly, though, the casts we need are lying around in the tree, so it ought
to be possible to do this. But it would take some careful work, for example ...

---

# Issue #24280: Improve performance of `Data.List.sort` by adding a (3,4,5)-way merge
https://gitlab.haskell.org/ghc/ghc/-/issues/24280 (Jade, updated 2024-02-19)

Hiya, I believe I found a way to decently (up to ~20%) improve performance of the `sort` in `base`.
The simple optimization is to build on the current implementation, which is a simple merge sort which pre-scans the list for ascending and descending sequences, by adding a static 3-way merge which reduces internal function calls as well as calls to the comparison function passed by the user.
I also attempted 4-way and 5-way merges which improved performance further but not as significantly.
The implemented sort as well as very rudimentary benchmarks are in my [github repository](https://github.com/Celestial-Gemstone/sort-perf).
I would like to open a pull request as well, but first gather some feedback. Specifically
- did I screw up in some way / is the sort not actually equivalent
- how could I more intricately benchmark the sort? -> Currently it only tests for one random list per size
- What n-way merge is "most worth it"? -> You could take this ad absurdum and add a 100-way merge, which I'm sure we don't want
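A minimal sketch of what such a static 3-way merge might look like (my own illustration, not the code from the linked repository):

```haskell
-- Hypothetical 3-way merge: pick the least of the three heads each step,
-- falling back to a 2-way merge once one list runs dry. Using
-- `cmp x y /= GT` (i.e. x <= y prefers the earlier list) keeps it stable.
merge3 :: (a -> a -> Ordering) -> [a] -> [a] -> [a] -> [a]
merge3 cmp = go
  where
    go [] ys zs = merge2 ys zs
    go xs [] zs = merge2 xs zs
    go xs ys [] = merge2 xs ys
    go xs@(x : xs') ys@(y : ys') zs@(z : zs')
      | cmp x y /= GT =
          if cmp x z /= GT then x : go xs' ys zs else z : go xs ys zs'
      | otherwise =
          if cmp y z /= GT then y : go xs ys' zs else z : go xs ys zs'

    merge2 [] ys = ys
    merge2 xs [] = xs
    merge2 xs@(x : xs') ys@(y : ys')
      | cmp x y /= GT = x : merge2 xs' ys
      | otherwise = y : merge2 xs ys'
```

The point of the 3-way step is that each emitted element costs at most two comparisons inside one call frame, instead of being routed through two nested 2-way merges.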
Thanks in advance :)

---

# Issue #24271: CprAnal: Record evaluatedness of result fields
https://gitlab.haskell.org/ghc/ghc/-/issues/24271 (Matthew Craven, updated 2023-12-22)

CPR analysis already gathers this information, since it is needed for nested CPR. Recording this information would allow the simplifier to use it to drop redundant evals around call sites. (It would also make the special case for `seq#` added for the same reasons in #15226 not special anymore.)
With !9874 this evaluatedness information will help case-of-known-constructor and the like _create_ fewer redundant evals.
We probably eventually want to use these field-evaluatedness-signatures to improve the evaluatedness detection in CPR analysis, to catch recursive functions like this:
```haskell
f :: Integer -> Bool -> (Integer, Integer)
f !x True = (expensive x, x)
f x False = case f (x `quot` 2) (even x) of (p, q) -> (p + 1, q)
-- $wf :: Integer -> Bool -> (# Integer, Integer #)
```
The second field of `f`'s result is always evaluated, but an induction argument is required to prove it. Tag inference can already detect that the second component of `$wf`'s result will be evaluated and properly tagged, so in order to replace the tag inference analysis with worker/wrapper as suggested in #23890 our result-component-evaluatedness analysis in Core must catch this, too.

---

# Issue #24247: Poor test for empty list
https://gitlab.haskell.org/ghc/ghc/-/issues/24247 (Simon Peyton Jones, updated 2023-12-12)

If you say
```haskell
g :: Eq a => [a] -> Bool
g xs = xs /= []
```
you get bad code:
```
g = \ (@a_aIv) ($dEq_aIw :: Eq a_aIv) (x_aI3 :: [a_aIv]) ->
case GHC.Classes.$fEqList_$c==
@a_aIv $dEq_aIw x_aI3 (GHC.Types.[] @a_aIv)
of {
False -> GHC.Types.True;
True -> GHC.Types.False
}
```
This is a bit silly -- `Data.List.null` will do the job. Perhaps that is a silly example, but `xs == ""` is not so unusual.
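In source terms, the rewrite being asked for corresponds to the following, which needs no `Eq` dictionary at all and only inspects the outermost constructor:

```haskell
import qualified Data.List as List

-- Equivalent to `xs /= []`, but compiles to a single case on the list's
-- outermost constructor rather than a call to list equality.
gGood :: [a] -> Bool
gGood xs = not (List.null xs)
```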
We should do better.

Assignee: Simon Peyton Jones

---

# Issue #24246: A pathological large object allocation pattern that leads to fragmentation
https://gitlab.haskell.org/ghc/ghc/-/issues/24246 (Teo Camarasu, updated 2023-12-12)

## Summary
This is about a pathological allocation pattern of large objects that leads to some quite high fragmentation, discovered while investigating #24150.
I think it's somewhat similar to #7831
## The issue
Let's first recall how the block allocator works.
The block allocator has free lists that are power of 2 sized.
Each free list covers group sizes from `2^x` to `2^(x+1)-1`.
So, if I wish to allocate X blocks, then I will look for free groups in the free list of index `log2_ceil(X)`, i.e., we round up to the next power of 2.
This means that when allocating quite large objects, we will be able to fit fewer into a megablock than optimally possible.
For instance, a megablock can hold at most 252 blocks (256 - 4 used for block descriptors).
Two 126 block sized objects could fill it perfectly. But in practice, we can only hold one. This is because after allocating one, we have 126 blocks left, which would go onto the 64 - 127 block free list, but we would need to allocate from the next free list, namely the 128 - 255 one. In this instance we are wasting half a megablock.
This also applies to a lesser extent to smaller allocation requests. For instance, only two 65 block groups can be serviced by the block allocator using one megablock, but technically 3 could fit.
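The rounding described above can be sketched numerically (my own illustration of the free-list index arithmetic, not the RTS code):

```haskell
-- Free-list index for a request of n blocks: round up to the next power
-- of 2, as the text describes.
log2Ceil :: Int -> Int
log2Ceil n = head [i | i <- [0 ..], 2 ^ i >= n]

-- Smallest group size guaranteed to be found on that free list.
roundedRequest :: Int -> Int
roundedRequest n = 2 ^ log2Ceil n
```

So a 126-block request is served from the 128 - 255 list (`roundedRequest 126 == 128`), and a free 126-block group sitting on the 64 - 127 list can never satisfy it.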
On the face of it this isn't too concerning. These gaps can be used to service other types of allocations and large objects normally account for a small percentage of total memory usage.
This behaviour only leads to high fragmentation when another factor comes into play. For me, this was the nonmoving heap. Because the nonmoving heap is made up of block groups that do not get copied by GC, they tend to hem in these large objects.
The pattern I saw was a 65 block large object, some blocks used by the non-moving GC, then another 65 block large object. The issue arises when the 2 large objects are GC'd. Now we have two 65 block wide gaps. Yet, these gaps cannot be used to service new allocations of the same size as before, since recall that we round up to the next power of 2. So, when we next go to allocate we are quite likely to need to allocate a new megablock to service the allocation. And because the non-moving heap is quite likely to hold onto its blocks for a long time, we are quite unlikely to see this fragmented megablock be de-allocated.
So, what I was seeing is a pattern where an application would occasionally allocate 65 block large objects, these would get promoted to the oldest generation, and the space around them would be filled with non-moving heap blocks. Then at the next major GC they would get freed, but this space could not be used to service future requests for 65 block allocations because we would require a 128 block gap due to the rounding up done by the free list.
## Solution
In my case, the solution was just to avoid allocating such large objects. This made the fragmentation disappear.
FWIW I think this is a fine solution to something like this. It's unavoidable that any allocator will have some pathological edge-cases.
Perhaps we could do better by being less extreme with rounding up. Maybe we could have a free list for each block size. That would give us 252, which isn't that many, but maybe the performance cost would be too high. Or maybe a compromise between the two would work, eg, having free lists for multiples of 8 blocks, etc.
Now that I have a workaround, I'm not too motivated to investigate this further, but I thought I'd make a ticket to document this behaviour and maybe help someone who runs into it in the future.

Assignee: Ben Gamari

---

# Issue #24236: `+RTS --numa` can significantly degrade mutator performance
https://gitlab.haskell.org/ghc/ghc/-/issues/24236 (Ben Gamari, updated 2023-12-12)

While validating some training materials I noticed that `+RTS --numa` can often hurt performance significantly more than it helps. For instance, using my https://gitlab.haskell.org/bgamari/raytrace workload on GHC 9.8.1 running on an AMD 2990WX (32 cores, 64 threads), I find the following:
```
$ ray +RTS -s -N16 --numa
Rays: 16777216
Figures: 10559
Sampling...
1,179,032,039,144 bytes allocated in the heap
438,388,616 bytes copied during GC
21,102,328 bytes maximum residency (11 sample(s))
1,413,384 bytes maximum slop
141 MiB total memory in use (24 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 21674 colls, 21674 par 17.358s 2.010s 0.0001s 0.0022s
Gen 1 11 colls, 10 par 0.096s 0.019s 0.0018s 0.0037s
Parallel GC work balance: 63.91% (serial 0%, perfect 100%)
TASKS: 34 (1 bound, 33 peak workers (33 total), using -N16)
SPARKS: 256 (256 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.007s ( 0.003s elapsed)
MUT time 642.288s ( 41.830s elapsed)
GC time 17.454s ( 2.029s elapsed)
EXIT time 0.012s ( 0.002s elapsed)
Total time 659.761s ( 43.864s elapsed)
Alloc rate 1,835,673,759 bytes per MUT second
Productivity 97.4% of total user, 95.4% of total elapsed
$ ray +RTS -s -N16
Rays: 16777216
Figures: 10559
Sampling...
1,179,032,005,208 bytes allocated in the heap
450,982,840 bytes copied during GC
20,707,696 bytes maximum residency (11 sample(s))
1,382,032 bytes maximum slop
140 MiB total memory in use (21 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 22545 colls, 22545 par 18.835s 2.214s 0.0001s 0.0015s
Gen 1 11 colls, 10 par 0.109s 0.020s 0.0018s 0.0036s
Parallel GC work balance: 63.71% (serial 0%, perfect 100%)
TASKS: 34 (1 bound, 33 peak workers (33 total), using -N16)
SPARKS: 256 (256 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.007s ( 0.003s elapsed)
MUT time 302.602s ( 20.498s elapsed)
GC time 18.944s ( 2.234s elapsed)
EXIT time 0.015s ( 0.002s elapsed)
Total time 321.567s ( 22.737s elapsed)
Alloc rate 3,896,317,999 bytes per MUT second
Productivity 94.1% of total user, 90.2% of total elapsed
```
So, while GC is helped slightly, mutator time regresses by nearly half.
This machine's 64 GBytes of physical memory is spread across four DIMMs and evenly split across the machine's two memory controllers:
```
$ sudo numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 32081 MB
node 0 free: 19003 MB
node 1 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 2 size: 32190 MB
node 2 free: 11339 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node 0 1 2 3
0: 10 16 16 16
1: 16 10 16 16
2: 16 16 10 16
3: 16 16 16 10
```

---

# Issue #24232: maximum and minimum are not good consumers in GHC 9.8.1
https://gitlab.haskell.org/ghc/ghc/-/issues/24232 (george.colpitts, updated 2023-12-11)
## Motivation
maximum and minimum are not good consumers in ghc 9.8.1 which results in poor performance. I think this is true in earlier ghc versions also.
Program that demonstrates this (first do: cabal install --lib list-fusion-probe) :
```haskell
{-# OPTIONS_GHC -Wall #-}
import Data.List.Fusion.Probe (fuseThis)
-- must compile with -O for fuseThis to work
main :: IO ()
main = print $ maximum $ fuseThis ['a'..'z']
```
Put the preceding in a file called enh.hs
```
% ghc -O enh.hs
Loaded package environment from /Users/avie/.ghc/aarch64-darwin-9.8.1/environments/default
[1 of 2] Compiling Main ( enh.hs, enh.o ) [Source file changed]
[2 of 2] Linking enh [Objects changed]
ld: warning: ignoring duplicate libraries: '-lm'
% ./enh
enh: fuseThis: List did not fuse
CallStack (from HasCallStack):
error, called at ./Data/List/Fusion/Probe.hs:52:16 in lst-fsn-prb-0.1.0.8-6ea610f9:Data.List.Fusion.Probe
% ghc --version
The Glorious Glasgow Haskell Compilation System, version 9.8.1
```
Change "maximum" to "minimum" to see this.
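For comparison, an accumulator-based definition built on `foldl'`, which does participate in list fusion, can be sketched like this (an illustration, not a proposed patch to `base`):

```haskell
import Data.List (foldl')

-- A maximum that consumes its input via foldl', a good consumer, so a
-- producer like ['a'..'z'] can fuse away. Total: returns Nothing on [].
maximumFusing :: Ord a => [a] -> Maybe a
maximumFusing = foldl' step Nothing
  where
    step Nothing x = Just x
    step (Just m) x = Just $! max m x
```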
## Proposal
Change maximum and minimum to be good consumers, which will improve performance.

---

# Issue #24220: hSeek and hTell do unnecessary stat
https://gitlab.haskell.org/ghc/ghc/-/issues/24220 (Niklas Hambüchen, updated 2023-12-16)

In `base`, [`hTell`](https://hackage.haskell.org/package/base-4.19.0.0/docs/GHC-IO-Handle.html#v:hTell) and [`hSeek`](https://hackage.haskell.org/package/base-4.19.0.0/docs/GHC-IO-Handle.html#v:hSeek) are unnecessarily inefficient.
On Linux, they _should_ translate to just 1 `lseek()` syscall, but `base` inserts additional `newfstatat()` (type of `stat()`) syscalls in front of each.
This can be observed in `strace` e.g. against `ghci`.
## Repro
```haskell
import System.IO
writeFile "myfile" "" -- just to create the file
f <- openFile "myfile" ReadMode
hTell f
```
Example `strace` output (e.g. `strace -fyp "$(pidof ghc)" -P "myfile"`) for the `hTell` call:
```
newfstatat(12</home/niklas/myfile>, "", {st_mode=S_IFREG|0644, st_size=919, ...}, AT_EMPTY_PATH) = 0
lseek(12</home/niklas/myfile>, 0, SEEK_CUR) = 0
```
(Note in the above that `lseek(..., SEEK_CUR)` with offset 0 returns the current seek position instead of setting it, thus implementing `hTell`.)
## Problems
These additionl stat syscalls have various problems:
* Syscall spam: Costs context switches and makes debugging harder in strace.
* Greatly increased latency on networked file systems.
Thus they make IO-based code much slower than in e.g. C or Python.
## Cause
The reason for the additional stats seems to be the "Handle must be seekable" requirement.
Instead of obtaining the information whether a Handle is seekable once, and storing it, `hSeek` and `hTell` obtain this information repeatedly on each call, even though it cannot change in between:
* [`hSeek`](https://hackage.haskell.org/package/base-4.19.0.0/docs/src/GHC.IO.Handle.html#hSeek)/[`hTell`](https://hackage.haskell.org/package/base-4.19.0.0/docs/src/GHC.IO.Handle.html#hTell) call
* [`wantSeekableHandle`](https://hackage.haskell.org/package/base-4.19.0.0/docs/src/GHC.IO.Handle.Internals.html#wantSeekableHandle) which on `FileHandle`s calls
* [`checkSeekableHandle`](https://hackage.haskell.org/package/base-4.19.0.0/docs/src/GHC.IO.Handle.Internals.html#checkSeekableHandle) which on non-closed files calls
* `IODevice.isSeekable dev` which for the `instance IODevice FD` [calls](https://hackage.haskell.org/package/base-4.19.0.0/docs/src/GHC.IO.FD.html#line-121)
* [`GHC.IO.FD.isSeekable`](https://hackage.haskell.org/package/base-4.19.0.0/docs/src/GHC.IO.FD.html#isSeekable) which calls
* [`devType`](https://hackage.haskell.org/package/base-4.19.0.0/docs/src/GHC.IO.FD.html#devType) which calls
* [`fdStat`](https://hackage.haskell.org/package/base-4.19.0.0/docs/src/System.Posix.Internals.html#fdStat)
## Solutions
I believe that `devType`, and seekableness, cannot change during the lifetime of an open `FD` or `Handle`.
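A toy sketch of that compute-once idea (illustrative IORef caching with invented names, not `base`'s actual code):

```haskell
import Data.IORef

-- A device whose "kind" is expensive to query; devKindIO stands in for the
-- fdStat/devType call that currently runs on every hSeek/hTell.
data Dev = Dev
  { devKindIO :: IO String            -- e.g. performs a stat syscall
  , devSeekableCache :: IORef (Maybe Bool) -- filled on first query
  }

-- Query seekability at most once per device; later calls hit the cache.
isSeekableCached :: Dev -> IO Bool
isSeekableCached dev = do
  cached <- readIORef (devSeekableCache dev)
  case cached of
    Just b -> pure b
    Nothing -> do
      kind <- devKindIO dev
      let b = kind `elem` ["RegularFile", "RawDevice"]
      writeIORef (devSeekableCache dev) (Just b)
      pure b
```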
So the easiest solution would be to store an `isSeekable :: Bool` field in either [`data FD`](https://hackage.haskell.org/package/base-4.19.0.0/docs/src/GHC.IO.FD.html#fdFD) next to similar fields
```haskell
data FD = FD {
    fdFD :: {-# UNPACK #-} !CInt,
    fdIsSocket_ :: {-# UNPACK #-} !Int,
    fdIsNonBlocking :: {-# UNPACK #-} !Int
}
```
or in [`data Handle__`](https://hackage.haskell.org/package/base-4.19.0.0/docs/src/GHC.IO.Handle.Types.html#Handle__).

---

# Issue #24210: Massive performance issue between interpreted and compiled code (interpreted is faster!)
https://gitlab.haskell.org/ghc/ghc/-/issues/24210 (Levent Erkök, updated 2023-11-28)

I was unable to reduce this to a minimal example, but it's not hard to recreate. I've observed this all the way back to GHC 8.8, and perhaps earlier. Both Mac and Linux. (However, I think this is a regression, i.e., older GHCs--from back about 3 years ago--didn't have this issue. Though I don't have any specific info on versions where it was not an issue.)
To replicate, you'll need to have SBV installed. (https://hackage.haskell.org/package/sbv)
For this program:
```haskell
import Data.SBV
import Data.SBV.Tools.CodeGen
import Documentation.SBV.Examples.Crypto.SHA
main :: IO ()
main = compileToC (Just "testOutput_delete") "sha512" $ do
  cgOverwriteFiles True
  let algorithm = sha512P
  hInBytes <- cgInputArr 64 "hIn"
  blockBytes <- cgInputArr 128 "block"
  let hIn = chunkBy 8 fromBytes hInBytes
      block = chunkBy 8 fromBytes blockBytes
      result = hashBlock algorithm hIn (Block block)
  cgOutputArr "hash" $ concatMap toBytes result
```
I'm seeing:
```
$ time ghc -e main a
ghc -e main a 62.22s user 9.47s system 85% cpu 1:24.18 total
```
That's kind of slow, but not too bad. But when I compile:
```
$ ghc a.hs
Loaded package environment from /Users/leventerkok/.ghc/x86_64-darwin-9.8.1/environments/default
[1 of 2] Compiling Main ( a.hs, a.o )
[2 of 2] Linking a
ld: warning: ignoring duplicate libraries: '-lm'
$ time ./a
./a 820.27s user 15.34s system 99% cpu 14:04.02 total
```
That is significantly slower; in fact 10 times slower than the interpreted code. Compiling with optimizations (-O2) doesn't change the run-time.
I'd appreciate any insights into what might be going wrong.

Assignee: Matthew Pickering

---

# Issue #24203: GHC is reluctant to unbox Int with -O when returned in a sum field
https://gitlab.haskell.org/ghc/ghc/-/issues/24203 (meooow, updated 2023-11-22)

## Summary
### Example 1
I have some code like
```hs
{-# LANGUAGE BangPatterns #-}

module M where

foo :: [Int] -> Maybe Int
foo = go 1 0
  where
    go :: Int -> Int -> [Int] -> Maybe Int
    go !i !n [] = Just n
    go i n (x:xs)
      | i < 10 = go (i+1) (n+x) xs
      | otherwise = Nothing
```
If I compile it with -O on GHC 9.4.8 or 9.6.3, `n` is not unboxed even with the bang.
<details>
<summary>Core:</summary>
```hs
-- RHS size: {terms: 31, types: 16, coercions: 0, joins: 0/0}
M.$wgo [InlPrag=[2], Occ=LoopBreaker]
  :: GHC.Prim.Int# -> Int -> [Int] -> Maybe Int
[GblId[StrictWorker([~, !, !])],
 Arity=3,
 Str=<L><1L><1L>,
 Unf=OtherCon []]
M.$wgo
  = \ (ww_sNC :: GHC.Prim.Int#) (n_sNE :: Int) (ds_sNF :: [Int]) ->
      case n_sNE of n1_X1 { GHC.Types.I# ipv_sN7 ->
      case ds_sNF of {
        [] -> GHC.Maybe.Just @Int n1_X1;
        : ipv1_sN9 ipv2_sNa ->
          case GHC.Prim.<# ww_sNC 10# of {
            __DEFAULT -> GHC.Maybe.Nothing @Int;
            1# ->
              case ipv1_sN9 of { GHC.Types.I# y_aNl ->
              M.$wgo
                (GHC.Prim.+# ww_sNC 1#)
                (GHC.Types.I# (GHC.Prim.+# ipv_sN7 y_aNl))
                ipv2_sNa
              }
          }
      }
      }
```
</details>
[Haskell playground link](https://play.haskell.org/saved/lOdYT1ZQ)
### Example 2
```hs
module M where
foo :: [Int] -> Maybe Int
foo [] = Nothing
foo xs = Just $! sum xs
```
The accumulator in `sum` remains boxed.
[Haskell playground link](https://play.haskell.org/saved/eM7gcscU)
This is not the case with GHC 9.2.8. Using -O2 with 9.4.8 and 9.6.3 also unboxes it.
Is this intended behavior?
## Steps to reproduce
Compile the snippet above.
## Expected behavior
`n` is unboxed.
## Environment
* GHC version used: 9.4.8, 9.6.3

---

# Issue #24150: Non-moving segment allocation strategy might be leading to fragmentation
https://gitlab.haskell.org/ghc/ghc/-/issues/24150 (Teo Camarasu, updated 2024-03-19)

## Summary
I think the current allocation strategy for the non-moving heap might be leading to unnecessary fragmentation.
## Issue
The non-moving heap consists of segments that are aligned and 32K (that's 8 4K blocks).
In order to allocate aligned blocks we currently allocate `(2*b - 1)` blocks and find the required aligned part and then immediately de-allocate the rest. See: https://gitlab.haskell.org/ghc/ghc/-/blob/ef3d20f83499cf129b1cacac07906b8d6188fc17/rts/sm/BlockAlloc.c#L570
My worry is that this strategy can lead to a situation where the heap ends up with lots of small free block groups (less than 15 blocks) that cannot be used by the allocator to service non-moving segment requests.
For instance, we could have a situation where every second 8 block group is free on a megablock, but we can't allocate into them because we need 15 to guarantee alignment.
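The over-allocate-then-trim arithmetic can be sketched like this (my own illustration of the strategy, not the `BlockAlloc.c` code). Within any run of `2*b - 1` consecutive blocks there is always a `b`-aligned run of `b` blocks:

```haskell
-- First b-aligned block number at or after `start`.
alignUp :: Int -> Int -> Int
alignUp start b = ((start + b - 1) `div` b) * b

-- Given a group of (2*b - 1) blocks starting at `start`, the aligned b-block
-- segment we keep; the slack on either side is freed again immediately.
alignedSegment :: Int -> Int -> (Int, Int)
alignedSegment start b = (alignUp start b, alignUp start b + b - 1)
```

For the 8-block (32K) segments above, a worst-case request therefore grabs 15 blocks, which is why a free gap of only 8 blocks cannot service a new segment.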
In the past, I've found that adding an extra copying generation to a heap that uses the non-moving GC really reduces fragmentation. I think the reason is that it can use these blocks that are otherwise unusable.
I don't have concrete evidence for this yet. I'll try to gather some more data and report back.
My plan is to add some instrumentation that records the sizes of free block groups throughout the lifetime of an application.
## Proposed solution
I think the best thing to do here is to claim entire megablocks for the use of the non-moving heap.
So it would have a list of megablocks with free space. If a megablock ends up completely free then we de-allocate it, but otherwise it would be held on a list by the non-moving gc.
It would probably be good to make this configurable since it might lead to some overhead, since now no other allocation could go into these blocks.
This also requires a bit more book keeping than the current approach.
The advantage is that we could basically eliminate this sort of fragmentation.
## Implementation
If this sounds like a good idea, I'd be happy to implement it.

Assignee: Teo Camarasu