WIP: parallel gc: use mutex + condvar instead of schedYield in main gc loop
Thank you for your contribution to GHC!
Please take a few moments to verify that your commits fulfill the following:
- are either individually buildable or squashed
- have commit messages which describe what they do (referring to Notes and tickets using #NNNN syntax when appropriate)
- have added source comments describing your change. For larger changes you likely should add a Note and cross-reference it from the relevant places.
- add a testcase to the testsuite.
- replace this message with a description motivating your change
If you have any questions don't hesitate to open your merge request and inquire in a comment. If your patch isn't quite done yet, please prefix your MR title with WIP:.
Merge request reports
Activity
Thanks @duog; great work!
added 37 commits
- e9c0f671...b58cb63a - 35 commits from branch ghc:master
- ee39de62 - [rts] introduce refill-wsdeques
- 3a94fcbb - [rts] [gc] kill any_work, add mutex + condition variable
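For context, the second commit replaces the busy-wait that idle GC threads do (polling for work and calling sched_yield) with blocking on a condition variable, as the MR title says. Below is a minimal sketch of that pattern; the names (gc_wakeup_mutex, gc_wakeup_cond, gc_running, wait_for_gc_*, start_gc) are illustrative stand-ins, not the identifiers used in the RTS.

```c
/* Sketch only: illustrative names, not the actual RTS identifiers. */
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

static pthread_mutex_t gc_wakeup_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gc_wakeup_cond  = PTHREAD_COND_INITIALIZER;
static bool gc_running = false;          /* set by the thread that initiates a GC */

/* Before: idle GC threads burn CPU polling for work. */
static void wait_for_gc_old(void)
{
    while (!gc_running) {
        sched_yield();                   /* busy-wait; hurts badly when cores are oversubscribed */
    }
}

/* After: idle GC threads sleep on a condition variable until woken. */
static void wait_for_gc_new(void)
{
    pthread_mutex_lock(&gc_wakeup_mutex);
    while (!gc_running) {                /* re-check to cope with spurious wakeups */
        pthread_cond_wait(&gc_wakeup_cond, &gc_wakeup_mutex);
    }
    pthread_mutex_unlock(&gc_wakeup_mutex);
}

static void start_gc(void)
{
    pthread_mutex_lock(&gc_wakeup_mutex);
    gc_running = true;
    pthread_cond_broadcast(&gc_wakeup_cond);  /* wake all idle GC worker threads */
    pthread_mutex_unlock(&gc_wakeup_mutex);
}
```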
@bgamari I've posted a nofib comparison here: https://gitlab.haskell.org/-/snippets/1865
The headline is:
- bytes allocated geom mean: -0.6%
- mutator time geom mean: -14.66%
- GC (cpu) time geom mean: +Infinity%
- GC (wall) time geom mean: 51%
- Elapsed (wall) time: -2.66%
- perf instructions: -5.30%
- perf cycles: -7.52%
Which I think is pretty good. I haven't investigated the infinity there. GC wall time is nuts; I suspect duog/ghc!1 will improve it. But Elapsed (wall) time supersedes it, I think? Perhaps some time is being accounted for as GC where previously it was mutator?
I'd appreciate any further guidance on which benchmarks are most important to look at.
I have a further iteration on this, removing gc_spin and mut_spin, here: duog/ghc!2. It's now passing CI.
My plan for benchmarking is:
- assess whether any programs are running too short to be useful and either remove or make them run longer
- benchmark across the following dimensions:
- 1: +RTS -N, N = {1,7,8,12} (on 8 cores)
- 2: with and without stress-ng (see below)
- 3: master, this MR, and the MR linked above
Running stress-ng in parallel is intended to generate a small amount of scheduling load (using --msg):
- I expect this to show very poor performance for the sched_yield-based code
Edited by Douglas Wilson
Sounds like a good plan.
One easy benchmark that I find useful is to simply use GHC to build the Cabal library (perhaps in this case with -j). I typically use this script to do this.
@bgamari I finally got round to trying this, incredible results.
with gc_spin and mut_spin replaced: duog/ghc!2
'/nix/store/353ifnmj08abfbzf2krpz6f9y475j2zs-ghc-9.1.0.20201222/lib/ghc-9.1.0.20201222/bin/ghc' '-B/nix/store/353ifnmj08abfbzf2krpz6f9y475j2zs-ghc-9.1.0.20201222/lib/ghc-9.1.0.20201222' '-hidir' '_cabal_out' '-odir' '_cabal_out' '-ilibraries/Cabal/Cabal' '-ilibraries/Cabal/Cabal/src' 'libraries/Cabal/Cabal/Setup.hs' '-j' +RTS '-s9f51594.out'

  18,752,381,656 bytes allocated in the heap
   2,745,971,744 bytes copied during GC
     192,634,720 bytes maximum residency (17 sample(s))
       5,066,912 bytes maximum slop
             529 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1567 colls,  1349 par    4.751s   1.874s     0.0012s    0.0156s
  Gen  1        17 colls,    10 par    2.524s   0.339s     0.0199s    0.0439s

  Parallel GC work balance: 49.39% (serial 0%, perfect 100%)

  TASKS: 39 (1 bound, 38 peak workers (38 total), using -N12)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   17.566s  (  5.155s elapsed)
  GC      time    7.276s  (  2.213s elapsed)
  EXIT    time    0.031s  (  0.002s elapsed)
  Total   time   24.873s  (  7.370s elapsed)

  Alloc rate    1,067,554,510 bytes per MUT second

  Productivity  70.6% of total user, 69.9% of total elapsed
With any_work killed: duog/ghc!1 (a little forward of this MR)
'/nix/store/h3cvya9jhqj5wlzcrvqfszd8z4svqlxw-ghc-9.1.0.20201222/lib/ghc-9.1.0.20201222/bin/ghc' '-B/nix/store/h3cvya9jhqj5wlzcrvqfszd8z4svqlxw-ghc-9.1.0.20201222/lib/ghc-9.1.0.20201222' '-hidir' '_cabal_out' '-odir' '_cabal_out' '-ilibraries/Cabal/Cabal' '-ilibraries/Cabal/Cabal/src' 'libraries/Cabal/Cabal/Setup.hs' '-j' +RTS '-s289869.out'

  18,755,805,856 bytes allocated in the heap
   2,750,120,408 bytes copied during GC
     188,575,256 bytes maximum residency (17 sample(s))
       5,095,992 bytes maximum slop
             515 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1773 colls,  1555 par   13.560s   3.473s     0.0020s    0.0219s
  Gen  1        17 colls,    10 par    2.627s   0.324s     0.0191s    0.0411s

  Parallel GC work balance: 49.59% (serial 0%, perfect 100%)

  TASKS: 38 (1 bound, 37 peak workers (37 total), using -N12)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   49.690s  (  5.625s elapsed)
  GC      time   16.188s  (  3.798s elapsed)
  EXIT    time    0.023s  (  0.008s elapsed)
  Total   time   65.901s  (  9.430s elapsed)

  Alloc rate    377,457,728 bytes per MUT second

  Productivity  75.4% of total user, 59.6% of total elapsed
baseline: commit b58cb63a
'/nix/store/gjhdwyi2v3hcxhpgz07dykq8q6cy1ss6-ghc-9.1.0.20201216/lib/ghc-9.1.0.20201216/bin/ghc' '-B/nix/store/gjhdwyi2v3hcxhpgz07dykq8q6cy1ss6-ghc-9.1.0.20201216/lib/ghc-9.1.0.20201216' '-hidir' '_cabal_out' '-odir' '_cabal_out' '-ilibraries/Cabal/Cabal' '-ilibraries/Cabal/Cabal/src' 'libraries/Cabal/Cabal/Setup.hs' '-j' +RTS '-sb58cb63.out'

  48,311,762,376 bytes allocated in the heap
   8,022,500,008 bytes copied during GC
     502,816,400 bytes maximum residency (23 sample(s))
       7,774,576 bytes maximum slop
            1148 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3258 colls,  3033 par  111.914s   9.619s     0.0030s    0.0365s
  Gen  1        23 colls,    16 par    7.786s   0.803s     0.0349s    0.1028s

  Parallel GC work balance: 47.50% (serial 0%, perfect 100%)

  TASKS: 38 (1 bound, 37 peak workers (37 total), using -N12)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   50.768s  ( 16.275s elapsed)
  GC      time  119.700s  ( 10.422s elapsed)
  EXIT    time    0.037s  (  0.002s elapsed)
  Total   time  170.505s  ( 26.700s elapsed)

  Alloc rate    951,611,067 bytes per MUT second

  Productivity  29.8% of total user, 61.0% of total elapsed
Edited by Douglas Wilson
I built 2 profiling compilers (--flavour=default+profiled_ghc+no_dynamic_ghc); the difference in allocation goes away, but the times are still much improved:
'./_build-nospingc-prof/stage1/bin/ghc' '-hidir' '_cabal_out' '-odir' '_cabal_out' '-ilibraries/Cabal/Cabal/src' 'libraries/Cabal/Cabal/Setup.hs' '-j' +RTS '-snospingc.s'

  76,870,365,632 bytes allocated in the heap
  13,611,660,688 bytes copied during GC
     716,962,280 bytes maximum residency (28 sample(s))
       7,480,856 bytes maximum slop
            2032 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      5514 colls,  5193 par   16.252s   6.885s     0.0012s    0.0229s
  Gen  1        28 colls,    20 par   12.298s   1.380s     0.0493s    0.1579s

  Parallel GC work balance: 46.99% (serial 0%, perfect 100%)

  TASKS: 39 (1 bound, 38 peak workers (38 total), using -N12)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   67.413s  ( 21.641s elapsed)
  GC      time   28.550s  (  8.264s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.039s  (  0.001s elapsed)
  Total   time   96.003s  ( 29.907s elapsed)

  Alloc rate    1,140,288,238 bytes per MUT second

  Productivity  70.2% of total user, 72.4% of total elapsed
'./_build-baseline/stage1/bin/ghc' '-hidir' '_cabal_out' '-odir' '_cabal_out' '-ilibraries/Cabal/Cabal/src' 'libraries/Cabal/Cabal/Setup.hs' '-j' +RTS '-sbaseline.s'

  76,883,250,696 bytes allocated in the heap
  13,115,276,168 bytes copied during GC
     803,563,760 bytes maximum residency (26 sample(s))
       7,722,768 bytes maximum slop
            1829 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4499 colls,  4172 par  139.861s  12.064s     0.0027s    0.0460s
  Gen  1        26 colls,    18 par   13.203s   1.358s     0.0522s    0.1686s

  Parallel GC work balance: 47.94% (serial 0%, perfect 100%)

  TASKS: 39 (1 bound, 38 peak workers (38 total), using -N12)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   77.108s  ( 22.081s elapsed)
  GC      time  153.064s  ( 13.422s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.044s  (  0.002s elapsed)
  Total   time  230.217s  ( 35.506s elapsed)

  Alloc rate    997,080,661 bytes per MUT second

  Productivity  33.5% of total user, 62.2% of total elapsed
Yes, it matches up with time. Obviously we have to be skeptical, but there is reason to think that this performance gain is up for grabs; see #9221.
I don't understand how the bytes allocated are so different in the reports above. My best guess is an error in the stats accounting.
The mutator time improvements could be explained by:
- the time between stat_endGC and the release of gc worker threads in gcWorkerThread is accounted for as mutator time (see the sketch after this list)
- #19147 (closed), which concerns gc cpu time accounting problems
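To illustrate the first point, here is a compilable toy sketch of the ordering being hypothesised. The function names are stand-ins I made up; this is not the real GC.c control flow, just the accounting hypothesis stated as code.

```c
/* Toy illustration of the accounting hypothesis above; not the real GC.c. */
#include <stdio.h>

static void do_parallel_gc_work(void)       { /* scavenge / evacuate ... */ }
static void stat_end_gc(void)               { puts("GC time accounting stops here"); }
static void release_gc_worker_threads(void) { puts("workers only resume after this"); }

int main(void)
{
    do_parallel_gc_work();
    stat_end_gc();                 /* GC time measurement ends here               */
    release_gc_worker_threads();   /* the gap between these two calls, plus the
                                      time workers spend waiting to be released,
                                      would be charged to mutator time            */
    return 0;
}
```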
If anyone would like to try to reproduce, CI artifacts are available:
Edited by Douglas Wilson
Some more numbers. The following graph is for running $GHC Cabal/Setup.hs -j$N through https://gitlab.haskell.org/bgamari/ghc-utils/-/blob/master/ghc_perf.py inside an exclusive cpuset (cset shield) with 10 cores.
I ran this 5 times, so the box+whiskers are each over 5 samples.
stat_name comes from either perf events or ghc machine-readable stats.
The x axis is $N above, i.e. the number of capabilities. The commits and tarballs are linked in previous comments.
These numbers look very good, with a few things to look into:
- nospingc has poor wall times when N in {1,2}, and when N >= 10, which is the number of cores in the cpuset.
- For the N > 1 case, perhaps duog/ghc!2 (4687cfbf) will help.
- For the N == 1 case, maybe the new global variables affect cache locality?
Edited by Douglas Wilson
That fixed it, see https://gitlab.haskell.org/-/snippets/1882. It's beautiful.
Good results running synth.bash from #9221
        stats.no_work);
    MR_STAT("scav_find_work", FMT_Word64,
        stats.scav_find_work);
    MR_STAT("max_n_todo_overflow", FMT_Word64,

I've not seen this be nonzero.
I intend to remove the refill-deques option, but I would prefer to leave this metric in. gc threads can get in a state where they have an empty todo_q, but loads of blocks in todo_overflow.
I'll at least document it in comments next to other internal counters.
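For context on that state, here is a hedged toy sketch: the struct layout, the fixed-size array standing in for the work-stealing deque, and the helper names are all illustrative, not the RTS's actual gen_workspace / WSDeque code. The idea it shows is that blocks which don't fit on the bounded todo_q spill to a todo_overflow list, and a high-water-mark counter (the role max_n_todo_overflow appears to play) records how large that overflow list gets.

```c
/* Illustrative sketch only: not the RTS's gen_workspace / WSDeque code. */
#include <stddef.h>
#include <stdint.h>

#define TODO_Q_CAP 64

typedef struct block_ { struct block_ *link; } block;

typedef struct {
    block   *todo_q[TODO_Q_CAP];   /* stand-in for the bounded work-stealing deque */
    size_t   todo_q_len;
    block   *todo_overflow;        /* blocks that did not fit on the deque         */
    size_t   n_todo_overflow;
    uint64_t max_n_todo_overflow;  /* high-water mark: the metric discussed above  */
} workspace;

/* Push a todo block, spilling to the overflow list when the deque is full. */
static void push_todo(workspace *ws, block *bd)
{
    if (ws->todo_q_len < TODO_Q_CAP) {
        ws->todo_q[ws->todo_q_len++] = bd;
    } else {
        bd->link = ws->todo_overflow;
        ws->todo_overflow = bd;
        ws->n_todo_overflow++;
        if (ws->n_todo_overflow > ws->max_n_todo_overflow) {
            ws->max_n_todo_overflow = ws->n_todo_overflow;
        }
    }
}

/* Pop a todo block, falling back to the overflow list once the deque drains.
 * If the deque empties (e.g. its work is taken) while overflow blocks remain,
 * the worker is in the state described above: empty todo_q, non-empty
 * todo_overflow. */
static block *pop_todo(workspace *ws)
{
    if (ws->todo_q_len > 0) {
        return ws->todo_q[--ws->todo_q_len];
    }
    if (ws->todo_overflow != NULL) {
        block *bd = ws->todo_overflow;
        ws->todo_overflow = bd->link;
        ws->n_todo_overflow--;
        return bd;
    }
    return NULL;
}
```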
mentioned in merge request !4729 (closed)
Closing, superseded by !4729 (closed)
mentioned in issue #9221