WIP: parallel gc: use mutex + condvar instead of schedYield in main gc loop
Thank you for your contribution to GHC!
Please take a few moments to verify that your commits fulfill the following:
- are either individually buildable or squashed
- have commit messages which describe what they do (referring to Notes and tickets using #NNNN syntax when appropriate)
- have added source comments describing your change. For larger changes you likely should add a Note and cross-reference it from the relevant places.
- add a testcase to the testsuite.
- replace this message with a description motivating your change
If you have any questions don't hesitate to open your merge request and inquire in a comment. If your patch isn't quite done yet, please prefix your MR title with WIP:.
Merge request reports
Activity
Thanks @duog; great work!
added 37 commits
- e9c0f671...b58cb63a - 35 commits from branch ghc:master
- ee39de62 - [rts] introduce refill-wsdeques
- 3a94fcbb - [rts] [gc] kill any_work, add mutex + condition variable
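For context, the second commit replaces the busy-wait that idle GC threads do (polling for work and calling sched_yield) with blocking on a condition variable, as the MR title says. Below is a minimal sketch of that pattern; the names (gc_wakeup_mutex, gc_wakeup_cond, gc_running, wait_for_gc_*, start_gc) are illustrative stand-ins, not the identifiers used in the RTS.

```c
/* Sketch only: illustrative names, not the actual RTS identifiers. */
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

static pthread_mutex_t gc_wakeup_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gc_wakeup_cond  = PTHREAD_COND_INITIALIZER;
static bool gc_running = false;          /* set by the thread that initiates a GC */

/* Before: idle GC threads burn CPU polling for work. */
static void wait_for_gc_old(void)
{
    while (!gc_running) {
        sched_yield();                   /* busy-wait; hurts badly when cores are oversubscribed */
    }
}

/* After: idle GC threads sleep on a condition variable until woken. */
static void wait_for_gc_new(void)
{
    pthread_mutex_lock(&gc_wakeup_mutex);
    while (!gc_running) {                /* re-check to cope with spurious wakeups */
        pthread_cond_wait(&gc_wakeup_cond, &gc_wakeup_mutex);
    }
    pthread_mutex_unlock(&gc_wakeup_mutex);
}

static void start_gc(void)
{
    pthread_mutex_lock(&gc_wakeup_mutex);
    gc_running = true;
    pthread_cond_broadcast(&gc_wakeup_cond);  /* wake all idle GC worker threads */
    pthread_mutex_unlock(&gc_wakeup_mutex);
}
```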
@bgamari I've posted a nofib comparison here: https://gitlab.haskell.org/-/snippets/1865
The headline is:
- bytes allocated geom mean: -0.6%
- mutator time geom mean: -14.66%
- GC (cpu) time geom mean: +Infinity%
- GC (wall) time geom mean: 51%
- Elapsed (wall) time: -2.66%
- perf instructions: -5.30%
- perf cycles: -7.52%
Which I think is pretty good. I haven't investigated the infinity there. GC wall time is nuts; I suspect duog/ghc!1 will improve it. But Elapsed (wall) time supersedes it, I think? Perhaps some time is being accounted for as GC where previously it was mutator?
I'd appreciate any further guidance on which benchmarks are most important to look at.
I have a further iteration on this, removing gc_spin and mut_spin, here: duog/ghc!2. It's now passing CI.
My plan for benchmarking is:
- assess whether any programs are running too short to be useful and either remove or make them run longer
- benchmark across the following dimensions:
- 1: +RTS -N, N = {1,7,8,12} (on 8 cores)
- 2: with and without stress-ng (see below)
- 3: master, this MR, and the MR linked above
Running stress-ng in parallel is intended to generate a small amount of scheduling load (using --msg):
- I expect this to show very poor performance for the sched_yield-based code
Edited by Douglas Wilson
Sounds like a good plan.
One easy benchmark that I find useful is to simply use GHC to build the Cabal library (perhaps in this case with -j). I typically use this script to do this.
@bgamari I finally got round to trying this, incredible results.
with gc_spin and mut_spin replaced: duog/ghc!2
'/nix/store/353ifnmj08abfbzf2krpz6f9y475j2zs-ghc-9.1.0.20201222/lib/ghc-9.1.0.20201222/bin/ghc' '-B/nix/store/353ifnmj08abfbzf2krpz6f9y475j2zs-ghc-9.1.0.20201222/lib/ghc-9.1.0.20201222' '-hidir' '_cabal_out' '-odir' '_cabal_out' '-ilibraries/Cabal/Cabal' '-ilibraries/Cabal/Cabal/src' 'libraries/Cabal/Cabal/Setup.hs' '-j' +RTS '-s9f51594.out'

  18,752,381,656 bytes allocated in the heap
   2,745,971,744 bytes copied during GC
     192,634,720 bytes maximum residency (17 sample(s))
       5,066,912 bytes maximum slop
             529 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1567 colls,  1349 par    4.751s   1.874s     0.0012s    0.0156s
  Gen  1        17 colls,    10 par    2.524s   0.339s     0.0199s    0.0439s

  Parallel GC work balance: 49.39% (serial 0%, perfect 100%)

  TASKS: 39 (1 bound, 38 peak workers (38 total), using -N12)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   17.566s  (  5.155s elapsed)
  GC      time    7.276s  (  2.213s elapsed)
  EXIT    time    0.031s  (  0.002s elapsed)
  Total   time   24.873s  (  7.370s elapsed)

  Alloc rate    1,067,554,510 bytes per MUT second

  Productivity  70.6% of total user, 69.9% of total elapsed
With any_work killed: duog/ghc!1 (a little forward of this MR)
'/nix/store/h3cvya9jhqj5wlzcrvqfszd8z4svqlxw-ghc-9.1.0.20201222/lib/ghc-9.1.0.20201222/bin/ghc' '-B/nix/store/h3cvya9jhqj5wlzcrvqfszd8z4svqlxw-ghc-9.1.0.20201222/lib/ghc-9.1.0.20201222' '-hidir' '_cabal_out' '-odir' '_cabal_out' '-ilibraries/Cabal/Cabal' '-ilibraries/Cabal/Cabal/src' 'libraries/Cabal/Cabal/Setup.hs' '-j' +RTS '-s289869.out'

  18,755,805,856 bytes allocated in the heap
   2,750,120,408 bytes copied during GC
     188,575,256 bytes maximum residency (17 sample(s))
       5,095,992 bytes maximum slop
             515 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1773 colls,  1555 par   13.560s   3.473s     0.0020s    0.0219s
  Gen  1        17 colls,    10 par    2.627s   0.324s     0.0191s    0.0411s

  Parallel GC work balance: 49.59% (serial 0%, perfect 100%)

  TASKS: 38 (1 bound, 37 peak workers (37 total), using -N12)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   49.690s  (  5.625s elapsed)
  GC      time   16.188s  (  3.798s elapsed)
  EXIT    time    0.023s  (  0.008s elapsed)
  Total   time   65.901s  (  9.430s elapsed)

  Alloc rate    377,457,728 bytes per MUT second

  Productivity  75.4% of total user, 59.6% of total elapsed
baseline: commit b58cb63a
'/nix/store/gjhdwyi2v3hcxhpgz07dykq8q6cy1ss6-ghc-9.1.0.20201216/lib/ghc-9.1.0.20201216/bin/ghc' '-B/nix/store/gjhdwyi2v3hcxhpgz07dykq8q6cy1ss6-ghc-9.1.0.20201216/lib/ghc-9.1.0.20201216' '-hidir' '_cabal_out' '-odir' '_cabal_out' '-ilibraries/Cabal/Cabal' '-ilibraries/Cabal/Cabal/src' 'libraries/Cabal/Cabal/Setup.hs' '-j' +RTS '-sb58cb63.out'

  48,311,762,376 bytes allocated in the heap
   8,022,500,008 bytes copied during GC
     502,816,400 bytes maximum residency (23 sample(s))
       7,774,576 bytes maximum slop
            1148 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3258 colls,  3033 par  111.914s   9.619s     0.0030s    0.0365s
  Gen  1        23 colls,    16 par    7.786s   0.803s     0.0349s    0.1028s

  Parallel GC work balance: 47.50% (serial 0%, perfect 100%)

  TASKS: 38 (1 bound, 37 peak workers (37 total), using -N12)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   50.768s  ( 16.275s elapsed)
  GC      time  119.700s  ( 10.422s elapsed)
  EXIT    time    0.037s  (  0.002s elapsed)
  Total   time  170.505s  ( 26.700s elapsed)

  Alloc rate    951,611,067 bytes per MUT second

  Productivity  29.8% of total user, 61.0% of total elapsed
Edited by Douglas Wilson
I built 2 profiling compilers (--flavour=default+profiled_ghc+no_dynamic_ghc); the difference in allocation goes away, but the times are still much improved:
'./_build-nospingc-prof/stage1/bin/ghc' '-hidir' '_cabal_out' '-odir' '_cabal_out' '-ilibraries/Cabal/Cabal/src' 'libraries/Cabal/Cabal/Setup.hs' '-j' +RTS '-snospingc.s'

  76,870,365,632 bytes allocated in the heap
  13,611,660,688 bytes copied during GC
     716,962,280 bytes maximum residency (28 sample(s))
       7,480,856 bytes maximum slop
            2032 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      5514 colls,  5193 par   16.252s   6.885s     0.0012s    0.0229s
  Gen  1        28 colls,    20 par   12.298s   1.380s     0.0493s    0.1579s

  Parallel GC work balance: 46.99% (serial 0%, perfect 100%)

  TASKS: 39 (1 bound, 38 peak workers (38 total), using -N12)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   67.413s  ( 21.641s elapsed)
  GC      time   28.550s  (  8.264s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.039s  (  0.001s elapsed)
  Total   time   96.003s  ( 29.907s elapsed)

  Alloc rate    1,140,288,238 bytes per MUT second

  Productivity  70.2% of total user, 72.4% of total elapsed
'./_build-baseline/stage1/bin/ghc' '-hidir' '_cabal_out' '-odir' '_cabal_out' '-ilibraries/Cabal/Cabal/src' 'libraries/Cabal/Cabal/Setup.hs' '-j' +RTS '-sbaseline.s'

  76,883,250,696 bytes allocated in the heap
  13,115,276,168 bytes copied during GC
     803,563,760 bytes maximum residency (26 sample(s))
       7,722,768 bytes maximum slop
            1829 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4499 colls,  4172 par  139.861s  12.064s     0.0027s    0.0460s
  Gen  1        26 colls,    18 par   13.203s   1.358s     0.0522s    0.1686s

  Parallel GC work balance: 47.94% (serial 0%, perfect 100%)

  TASKS: 39 (1 bound, 38 peak workers (38 total), using -N12)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   77.108s  ( 22.081s elapsed)
  GC      time  153.064s  ( 13.422s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.044s  (  0.002s elapsed)
  Total   time  230.217s  ( 35.506s elapsed)

  Alloc rate    997,080,661 bytes per MUT second

  Productivity  33.5% of total user, 62.2% of total elapsed
Yes, it matches up with time. Obviously we have to be skeptical, but there is reason to think that this performance gain is up for grabs; see #9221.
I don't understand how the bytes allocated are so different in the reports above. My best guess is an error in the stats accounting.
The mutator time improvements could be explained by:
- the time between stat_endGC and the release of gc worker threads in gcWorkerThread is accounted for as mutator time (see the sketch after this list)
- #19147 (closed), which concerns gc cpu time accounting problems
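To illustrate the first point, here is a compilable toy sketch of the ordering being hypothesised. The function names are stand-ins I made up; this is not the real GC.c control flow, just the accounting hypothesis stated as code.

```c
/* Toy illustration of the accounting hypothesis above; not the real GC.c. */
#include <stdio.h>

static void do_parallel_gc_work(void)       { /* scavenge / evacuate ... */ }
static void stat_end_gc(void)               { puts("GC time accounting stops here"); }
static void release_gc_worker_threads(void) { puts("workers only resume after this"); }

int main(void)
{
    do_parallel_gc_work();
    stat_end_gc();                 /* GC time measurement ends here               */
    release_gc_worker_threads();   /* the gap between these two calls, plus the
                                      time workers spend waiting to be released,
                                      would be charged to mutator time            */
    return 0;
}
```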
If anyone would like to try to reproduce, CI artifacts are available:
Edited by Douglas Wilson
Some more numbers. The following graph is for running $GHC Cabal/Setup.hs -j$N through https://gitlab.haskell.org/bgamari/ghc-utils/-/blob/master/ghc_perf.py inside an exclusive cpuset (cset shield) with 10 cores.
I ran this 5 times, so the box+whiskers are each over 5 samples.
stat_name comes from either perf events or ghc machine-readable stats.
The x axis is $N above, i.e. the number of capabilities. The commits and tarballs are linked in previous comments.
These numbers look very good, with a few things to look into:
- nospingc has poor wall times when N in {1,2}, and when N >= 10, which is the number of cores in the cpuset.
- For the N > 1 case, perhaps duog/ghc!2 (4687cfbf) will help.
- For the N == 1 case, maybe the new global variables affect cache locality?
Edited by Douglas Wilson
That fixed it, see https://gitlab.haskell.org/-/snippets/1882. It's beautiful.
Good results running synth.bash from #9221
        stats.no_work);
    MR_STAT("scav_find_work", FMT_Word64,
        stats.scav_find_work);
    MR_STAT("max_n_todo_overflow", FMT_Word64,

I've not seen this be nonzero.
I intend to remove the refill-deques option, but I would prefer to leave this metric in. gc threads can get in a state where they have an empty todo_q, but loads of blocks in todo_overflow.
I'll at least document it in comments next to other internal counters.
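For context on that state, here is a hedged toy sketch: the struct layout, the fixed-size array standing in for the work-stealing deque, and the helper names are all illustrative, not the RTS's actual gen_workspace / WSDeque code. The idea it shows is that blocks which don't fit on the bounded todo_q spill to a todo_overflow list, and a high-water-mark counter (the role max_n_todo_overflow appears to play) records how large that overflow list gets.

```c
/* Illustrative sketch only: not the RTS's gen_workspace / WSDeque code. */
#include <stddef.h>
#include <stdint.h>

#define TODO_Q_CAP 64

typedef struct block_ { struct block_ *link; } block;

typedef struct {
    block   *todo_q[TODO_Q_CAP];   /* stand-in for the bounded work-stealing deque */
    size_t   todo_q_len;
    block   *todo_overflow;        /* blocks that did not fit on the deque         */
    size_t   n_todo_overflow;
    uint64_t max_n_todo_overflow;  /* high-water mark: the metric discussed above  */
} workspace;

/* Push a todo block, spilling to the overflow list when the deque is full. */
static void push_todo(workspace *ws, block *bd)
{
    if (ws->todo_q_len < TODO_Q_CAP) {
        ws->todo_q[ws->todo_q_len++] = bd;
    } else {
        bd->link = ws->todo_overflow;
        ws->todo_overflow = bd;
        ws->n_todo_overflow++;
        if (ws->n_todo_overflow > ws->max_n_todo_overflow) {
            ws->max_n_todo_overflow = ws->n_todo_overflow;
        }
    }
}

/* Pop a todo block, falling back to the overflow list once the deque drains.
 * If the deque empties (e.g. its work is taken) while overflow blocks remain,
 * the worker is in the state described above: empty todo_q, non-empty
 * todo_overflow. */
static block *pop_todo(workspace *ws)
{
    if (ws->todo_q_len > 0) {
        return ws->todo_q[--ws->todo_q_len];
    }
    if (ws->todo_overflow != NULL) {
        block *bd = ws->todo_overflow;
        ws->todo_overflow = bd->link;
        ws->n_todo_overflow--;
        return bd;
    }
    return NULL;
}
```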
mentioned in merge request !4729 (closed)
Closing, superseded by !4729 (closed)
mentioned in issue #9221