Stabilise benchmarks wrt. GC
Summary: This addresses #15999 and follows up on #5793 and #15357.
It changes some benchmarks (e.g. wheel-sieve1, awards) rather drastically.
The general plan is outlined in #15999: identify GC-sensitive benchmarks by looking at how productivity rates change over different nursery sizes, and iterate the `main` logic of these benchmarks often enough for the non-monotonicity and discontinuities to go away.
I took care that the benchmarked logic is actually run n times more often, rather than just benchmarking IO operations that print the result of CAFs.
For benchmarks with insignificant runtime (#15357), I adjusted parameters/input files so that the runtimes of the different modes fall within the ranges proposed in https://ghc.haskell.org/trac/ghc/ticket/15357#comment:4:
- fast: 0.1-0.2s
- norm: 1-2s
- slow: 5-10s
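The CAF pitfall mentioned above can be sketched as follows. This is a minimal, hypothetical example, not code from nofib; `nsum` merely stands in for an arbitrary benchmark kernel:

```haskell
import Control.Monad (forM_)
import System.Environment (getArgs)

-- Hypothetical benchmark kernel, standing in for a real workload.
nsum :: Int -> Int
nsum m = sum [1 .. m]

main :: IO ()
main = do
  [n] <- map read <$> getArgs
  -- Bad: a fixed 'nsum 1000000' would float out as a CAF, be evaluated
  -- once, and the loop would effectively only benchmark repeated 'print':
  --   forM_ [1 .. n] $ \_ -> print (nsum 1000000)

  -- Good: weave the iteration index into the input so the work is
  -- genuinely redone n times and cannot be shared across iterations.
  forM_ [1 .. n] $ \i -> print (nsum (1000000 + i `mod` 10))
```

Making the input depend on the loop index is what prevents GHC from hoisting the computation out of the loop.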
This is what I did:
- Stabilise bernoulli
- Stabilise digits-of-e1
- Stabilise digits-of-e2
- Stabilise gen_regexp
- Adjust running time of integrate
- Adjust running time of kahan
- Stabilise paraffins
- Stabilise primes
- Adjust running time of rfib
- Adjust running time of tak
- Stabilise wheel-sieve1
- Stabilise wheel-sieve2
- Adjust running time of x2n1
- Adjust running time of ansi
- Adjust running time of atom
- Make awards benchmark something other than IO
- Adjust running time of banner
- Stabilise boyer
- Adjust running time of boyer2
- Adjust running time of queens
- Adjust running time of calendar
- Adjust running time of cichelli
- Stabilise circsim
- Stabilise clausify
- Stabilise constraints with moderate success
- Adjust running time of cryptarithm1
- Adjust running time of cryptarithm2
- Adjust running time of cse
- Adjust running time of eliza
- Adjust running time of exact-reals
- Adjust running time of expert
- Stabilise fft2
- Stabilise fibheaps
- Stabilise fish
- Adjust running time for gcd
- Stabilise comp_lab_zift
- Stabilise event
- Stabilise fft
- Stabilise genfft
- Stabilise ida
- Adjust running time for listcompr
- Adjust running time for listcopy
- Adjust running time of nucleic2
- Attempt to stabilise parstof
- Stabilise sched
- Stabilise solid
- Adjust running time of transform
- Adjust running time of typecheck
- Stabilise wang
- Stabilise wave4main
- Adjust running time of integer
- Adjust running time of knights
- Stabilise lambda
- Stabilise lcss
- Stabilise life
- Stabilise mandel
- Stabilise mandel2
- Adjust running time of mate
- Stabilise minimax
- Adjust running time of multiplier
- Adjust running time of para
- Stabilise power
- Adjust running time of primetest
- Stabilise puzzle with mild success
- Adjust running time for rewrite
- Stabilise simple with mild success
- Stabilise sorting
- Stabilise sphere
- Stabilise treejoin
- Stabilise anna
- Stabilise bspt
- Stabilise cacheprof
- Stabilise compress
- Stabilise compress2
- Stabilise fem
- Adjust running time of fluid
- Stabilise fulsom
- Stabilise gamteb
- Stabilise gg
- Stabilise grep
- Adjust running time of hidden
- Stabilise hpg
- Stabilise infer
- Stabilise lift
- Stabilise linear
- Attempt to stabilise maillist
- Stabilise mkhprog
- Stabilise parser
- Stabilise pic
- Stabilise prolog
- Attempt to stabilise reptile
- Adjust running time of rsa
- Adjust running time of scs
- Stabilise symalg
- Stabilise veritas
- Stabilise binary-trees
- Adjust running time of fasta
- Adjust running time of k-nucleotide
- Adjust running time of pidigits
- Adjust running time of reverse-complement
- Adjust running time of spectral-norm
- Adjust running time of fannkuch-redux
- Adjust running time for n-body
Problematic benchmarks:
- last-piece: Unclear how to stabilise. Runs for 300ms and I can't make up smaller inputs because I don't understand what it does.
- pretty: It's just much too small to be relevant at all. Maybe we want to get rid of this one?
- scc: Same as pretty. The input graph for which SCC analysis is done is much too small, and I can't find good directed example graphs on the internet.
- secretary: Apparently this needs the random package and consequently hasn't been run for a long time.
- simple: Same as last-piece. Decent runtime (70ms), but it's unstable and I see no way to iterate it ~100 times in fast mode.
- eff: Every benchmark is problematic here. Not from the point of view of allocations, but because the actual logic is vacuous. IMO, these should be performance tests, not actual benchmarks. Alternatively, write an actual application that makes use of algebraic effects.
- maillist: Too trivial. It's just String/list manipulation, not representative of any Haskell code we would write today (no use of base library functions which could be fused; uses String instead of Text). It's only 75 LOC according to cloc; that's not a real application.
Reviewers: simonpj, simonmar, bgamari, AndreasK, osa1, alpmestan, O26 nofib
GHC Trac Issues: #15999
Differential Revision: https://phabricator.haskell.org/D5438
added 4 commits
- 651416b0...95c1dccb - 3 commits from branch master
- 8c6e40f5 - Stabilise benchmarks wrt. GC
added 3 commits
- 8c6e40f5...e2d614e4 - 2 commits from branch master
- 63d5128e - Stabilise benchmarks wrt. GC
This is ready for review now.
Note again that this changes the runtimes of virtually all benchmarks. Some of them are changed rather drastically by weaving runtime arguments into the input data, so that meaningful runtimes can be measured at all.
All benchmarks now have runtimes approximately in the following brackets:
- fast: 0.1-0.2s
- norm: 1-2s
- slow: 5-10s
And where I found that benchmarks were sensitive to GC parameters (Trac #15999), I iterated the benchmarked logic often enough to make them stable.
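The "weaving runtime arguments into input data" scheme can be sketched like this. Again a hypothetical example rather than nofib code: `primesUpTo` stands in for a sieve-like benchmark, and printing a cheap checksum keeps output small while still forcing the result:

```haskell
import Control.Monad (forM_)
import Data.List (foldl')
import System.Environment (getArgs)

-- Illustrative stand-in for a benchmark such as wheel-sieve:
-- the amount of work scales with the input size.
primesUpTo :: Int -> [Int]
primesUpTo m = sieve [2 .. m]
  where
    sieve []       = []
    sieve (p : xs) = p : sieve [x | x <- xs, x `mod` p /= 0]

main :: IO ()
main = do
  [n, size] <- map read <$> getArgs
  forM_ [1 .. n] $ \i ->
    -- Weave the iteration index into the input so every iteration
    -- allocates fresh data and the GC sees n real runs, then print a
    -- checksum rather than the full result.
    print (foldl' (+) 0 (primesUpTo (size + i)))
```

Because each iteration's input differs, no iteration's result can be cached in a CAF, and the GC behaviour averages out over n runs.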
I don't expect that anybody will actually look at all the changes I made, but I hope you agree that this makes nofib vastly more useful and thus is a 'breaking change' (from the perspective of historic comparability) we should be willing to make.
> I don't expect that anybody will actually look at all the changes I made, but hope that you agree that this makes nofib vastly more useful and thus is a 'breaking change' (from the perspective of historic comparability) we should be willing to make.
I strongly agree with this.
Thanks for doing all this work, Sebastian.
It would be helpful to have a ReadMe or wiki page or something describing the setup and the steps you have taken, so that someone in 5 years time will know what to be careful about.
I added a section in the wiki and edited the README.md, also adding a section about GC impact.
WRT the eff benchmarks (which I added): they have caused some problems for various people, so I'm happy to move them to performance tests if you think that's better. However, I placed them in nofib because they are actual real-world programs, written in a style you might find in an application today, unlike most of the other programs, which are extremely dated.

I agree that we need more up-to-date, real benchmarks and that test-driving effect handlers is a perfect fit. But only in conjunction with an actual application they are applied to, rather than a counting loop. It's hard to come up with such an example (especially a self-contained one) these days. Maybe we need to ask around in the community...

The examples in his repo are all quite self-contained, but probably still too simple for what you are looking for.
https://github.com/AndrasKovacs/misc-stuff/tree/master/haskell/Eff
I don't think asking the community is going to be that helpful: now that there are package managers, adding dependencies isn't a problem for most people.
> The examples in his repo are all quite self-contained but still too simple probably for what you are looking for.
Interesting! Indeed, these suffer from the same symptoms.
> I don't think asking the community is going to be that helpful as now there are package managers adding dependencies isn't a problem for most people.
We really need some kind of vendoring tool that can just pull in the source of arbitrary packages. Then we could pipe the result through a mature version of https://github.com/nomeata/hs-all-in-one and strip dead code.
mentioned in issue ghc#9374
mentioned in issue ghc#16499 (closed)