
WIP: parallel gc: use mutex + condvar instead of schedYield in main gc loop

Closed Douglas Wilson requested to merge duog/ghc:wip/no-anywork into master
6 unresolved threads

Thank you for your contribution to GHC!

Please take a few moments to verify that your commits fulfill the following:

  • are either individually buildable or squashed
  • have commit messages which describe what they do (referring to Notes and tickets using #NNNN syntax when appropriate)
  • have added source comments describing your change. For larger changes you likely should add a Note and cross-reference it from the relevant places.
  • add a testcase to the testsuite.
  • replace this message with a description motivating your change

If you have any questions, don't hesitate to open your merge request and inquire in a comment. If your patch isn't quite done yet, please prefix your MR title with WIP:.
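
For context, the core change: the parallel GC's idle loop currently polls for work, calling sched_yield() between checks, and this MR instead has idle workers block on a mutex and condition variable until the leader signals that work is available. Below is a minimal pthreads sketch of the target idiom; all names are hypothetical and this is not the RTS implementation.

  /* Minimal sketch of the mutex + condvar idiom (hypothetical names,
   * not the RTS code). Build with: cc -pthread sketch.c */
  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  static pthread_mutex_t work_mutex = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  work_cv    = PTHREAD_COND_INITIALIZER;
  static bool work_available = false;

  /* Worker side: sleep until signalled, rather than the old
   *   while (!any_work()) sched_yield();                     */
  static void *worker(void *arg)
  {
      (void)arg;
      pthread_mutex_lock(&work_mutex);
      while (!work_available)              /* re-check guards against spurious wakeups */
          pthread_cond_wait(&work_cv, &work_mutex);
      work_available = false;              /* claim the work item */
      pthread_mutex_unlock(&work_mutex);
      puts("worker: woke up with work");
      return NULL;
  }

  /* Leader side: publish work under the mutex, then wake a sleeper. */
  int main(void)
  {
      pthread_t t;
      pthread_create(&t, NULL, worker, NULL);
      pthread_mutex_lock(&work_mutex);
      work_available = true;
      pthread_mutex_unlock(&work_mutex);
      pthread_cond_signal(&work_cv);
      pthread_join(t, NULL);
      return 0;
  }

The point of the change is that a blocked worker consumes no CPU and is woken by the scheduler when signalled, whereas sched_yield() gives no guarantee about when the yielding thread runs again.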

Activity

    • Yes, I think this is a good direction. However, this will need to be thoroughly benchmarked to ensure that we aren't regressing appreciably. Given how unpredictable sched_yield is, I doubt that we are. That being said, the proof is in the pudding.

    • Author Developer

      For sure. I'll get this validated in CI, then post some numbers.

  • Thanks @duog; great work!

  • Douglas Wilson added 37 commits

    • Author Developer

      @bgamari I've posted a nofib comparison here: https://gitlab.haskell.org/-/snippets/1865

      Headline is

      • bytes allocated geom mean: -0.6%
      • mutator time geom mean: -14.66%
      • GC (cpu) time geom mean: +Infinity%
      • GC (wall) time geom mean: 51%
      • Elapsed (wall) time: -2.66%
      • perf instructions: -5.30%
      • perf cycles: -7.52%

      Which I think is pretty good. I haven't investigated the Infinity there. GC wall time is nuts; I suspect duog/ghc!1 will improve it. But elapsed (wall) time supersedes it, I think? Perhaps some time is being accounted to GC where previously it was counted as mutator time?

      I'd appreciate any further guidance on which benchmarks are most important to look at.

    • As I noted on the snippet, I am similarly perplexed by the mutator time change. However, before reading too much into it I would try to set up better controls for your benchmark environment. That is:

      • stop as many background tasks as possible
      • disable CPU frequency scaling
    • Author Developer

      Thanks, I'll do this.

    • Author Developer

      I have a further iteration on this, removing gc_spin and mut_spin, here: duog/ghc!2. It's now passing CI.

      My plan for benchmarking is:

      • assess whether any programs run too briefly to be useful, and either remove them or make them run longer
      • benchmark across the following dimensions:
        1. +RTS -N, N = {1,7,8,12} (on 8 cores)
        2. with and without stress-ng (see below)
        3. master, this MR, and the MR linked above

      I'll run stress-ng in parallel to generate a small amount of scheduling load (using --msg):

      • I expect this to show very poor performance for the sched_yield version
    • Sounds like a good plan.

      One easy benchmark that I find useful is to simply use GHC to build the Cabal library (perhaps in this case with -j). I typically use this script to do this.

    • Author Developer

      @bgamari I finally got round to trying this; incredible results.

      with gc_spin and mut_spin replaced: duog/ghc!2

      '/nix/store/353ifnmj08abfbzf2krpz6f9y475j2zs-ghc-9.1.0.20201222/lib/ghc-9.1.0.20201222/bin/ghc' '-B/nix/store/353ifnmj08abfbzf2krpz6f9y475j2zs-ghc-9.1.0.20201222/lib/ghc-9.1.0.20201222' '-hidir' '_cabal_out' '-odir' '_cabal_out' '-ilibraries/Cabal/Cabal' '-ilibraries/Cabal/Cabal/src' 'libraries/Cabal/Cabal/Setup.hs' '-j' +RTS '-s9f51594.out' 
        18,752,381,656 bytes allocated in the heap
         2,745,971,744 bytes copied during GC
           192,634,720 bytes maximum residency (17 sample(s))
             5,066,912 bytes maximum slop
                   529 MiB total memory in use (0 MB lost due to fragmentation)
      
                                           Tot time (elapsed)  Avg pause  Max pause
        Gen  0      1567 colls,  1349 par    4.751s   1.874s     0.0012s    0.0156s
        Gen  1        17 colls,    10 par    2.524s   0.339s     0.0199s    0.0439s
      
        Parallel GC work balance: 49.39% (serial 0%, perfect 100%)
      
        TASKS: 39 (1 bound, 38 peak workers (38 total), using -N12)
      
        SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
      
        INIT    time    0.001s  (  0.001s elapsed)
        MUT     time   17.566s  (  5.155s elapsed)
        GC      time    7.276s  (  2.213s elapsed)
        EXIT    time    0.031s  (  0.002s elapsed)
        Total   time   24.873s  (  7.370s elapsed)
      
        Alloc rate    1,067,554,510 bytes per MUT second
      
        Productivity  70.6% of total user, 69.9% of total elapsed

      With any_work killed: duog/ghc!1 (a little ahead of this MR)

      '/nix/store/h3cvya9jhqj5wlzcrvqfszd8z4svqlxw-ghc-9.1.0.20201222/lib/ghc-9.1.0.20201222/bin/ghc' '-B/nix/store/h3cvya9jhqj5wlzcrvqfszd8z4svqlxw-ghc-9.1.0.20201222/lib/ghc-9.1.0.20201222' '-hidir' '_cabal_out' '-odir' '_cabal_out' '-ilibraries/Cabal/Cabal' '-ilibraries/Cabal/Cabal/src' 'libraries/Cabal/Cabal/Setup.hs' '-j' +RTS '-s289869.out' 
        18,755,805,856 bytes allocated in the heap
         2,750,120,408 bytes copied during GC
           188,575,256 bytes maximum residency (17 sample(s))
             5,095,992 bytes maximum slop
                   515 MiB total memory in use (0 MB lost due to fragmentation)
      
                                           Tot time (elapsed)  Avg pause  Max pause
        Gen  0      1773 colls,  1555 par   13.560s   3.473s     0.0020s    0.0219s
        Gen  1        17 colls,    10 par    2.627s   0.324s     0.0191s    0.0411s
      
        Parallel GC work balance: 49.59% (serial 0%, perfect 100%)
      
        TASKS: 38 (1 bound, 37 peak workers (37 total), using -N12)
      
        SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
      
        INIT    time    0.001s  (  0.001s elapsed)
        MUT     time   49.690s  (  5.625s elapsed)
        GC      time   16.188s  (  3.798s elapsed)
        EXIT    time    0.023s  (  0.008s elapsed)
        Total   time   65.901s  (  9.430s elapsed)
      
        Alloc rate    377,457,728 bytes per MUT second
      
        Productivity  75.4% of total user, 59.6% of total elapsed

      baseline: commit b58cb63a

      '/nix/store/gjhdwyi2v3hcxhpgz07dykq8q6cy1ss6-ghc-9.1.0.20201216/lib/ghc-9.1.0.20201216/bin/ghc' '-B/nix/store/gjhdwyi2v3hcxhpgz07dykq8q6cy1ss6-ghc-9.1.0.20201216/lib/ghc-9.1.0.20201216' '-hidir' '_cabal_out' '-odir' '_cabal_out' '-ilibraries/Cabal/Cabal' '-ilibraries/Cabal/Cabal/src' 'libraries/Cabal/Cabal/Setup.hs' '-j' +RTS '-sb58cb63.out' 
        48,311,762,376 bytes allocated in the heap
         8,022,500,008 bytes copied during GC
           502,816,400 bytes maximum residency (23 sample(s))
             7,774,576 bytes maximum slop
                  1148 MiB total memory in use (0 MB lost due to fragmentation)
      
                                           Tot time (elapsed)  Avg pause  Max pause
        Gen  0      3258 colls,  3033 par   111.914s   9.619s     0.0030s    0.0365s
        Gen  1        23 colls,    16 par    7.786s   0.803s     0.0349s    0.1028s
      
        Parallel GC work balance: 47.50% (serial 0%, perfect 100%)
      
        TASKS: 38 (1 bound, 37 peak workers (37 total), using -N12)
      
        SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
      
        INIT    time    0.001s  (  0.001s elapsed)
        MUT     time   50.768s  ( 16.275s elapsed)
        GC      time  119.700s  ( 10.422s elapsed)
        EXIT    time    0.037s  (  0.002s elapsed)
        Total   time  170.505s  ( 26.700s elapsed)
      
        Alloc rate    951,611,067 bytes per MUT second
      
        Productivity  29.8% of total user, 61.0% of total elapsed
      
    • Author Developer

      I don't understand how allocations are so much higher in the baseline case.

    • Author Developer

      I built two profiling compilers (--flavour=default+profiled_ghc+no_dynamic_ghc); the difference in allocations goes away, but times are still much improved:

      './_build-nospingc-prof/stage1/bin/ghc' '-hidir' '_cabal_out' '-odir' '_cabal_out' '-ilibraries/Cabal/Cabal/src' 'libraries/Cabal/Cabal/Setup.hs' '-j' +RTS '-snospingc.s' 
        76,870,365,632 bytes allocated in the heap
        13,611,660,688 bytes copied during GC
           716,962,280 bytes maximum residency (28 sample(s))
             7,480,856 bytes maximum slop
                  2032 MiB total memory in use (0 MB lost due to fragmentation)
      
                                           Tot time (elapsed)  Avg pause  Max pause
        Gen  0      5514 colls,  5193 par   16.252s   6.885s     0.0012s    0.0229s
        Gen  1        28 colls,    20 par   12.298s   1.380s     0.0493s    0.1579s
      
        Parallel GC work balance: 46.99% (serial 0%, perfect 100%)
      
        TASKS: 39 (1 bound, 38 peak workers (38 total), using -N12)
      
        SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
      
        INIT    time    0.001s  (  0.001s elapsed)
        MUT     time   67.413s  ( 21.641s elapsed)
        GC      time   28.550s  (  8.264s elapsed)
        RP      time    0.000s  (  0.000s elapsed)
        PROF    time    0.000s  (  0.000s elapsed)
        EXIT    time    0.039s  (  0.001s elapsed)
        Total   time   96.003s  ( 29.907s elapsed)
      
        Alloc rate    1,140,288,238 bytes per MUT second
      
        Productivity  70.2% of total user, 72.4% of total elapsed

      './_build-baseline/stage1/bin/ghc' '-hidir' '_cabal_out' '-odir' '_cabal_out' '-ilibraries/Cabal/Cabal/src' 'libraries/Cabal/Cabal/Setup.hs' '-j' +RTS '-sbaseline.s'
        76,883,250,696 bytes allocated in the heap
        13,115,276,168 bytes copied during GC
           803,563,760 bytes maximum residency (26 sample(s))
             7,722,768 bytes maximum slop
                  1829 MiB total memory in use (0 MB lost due to fragmentation)
      
                                           Tot time (elapsed)  Avg pause  Max pause
        Gen  0      4499 colls,  4172 par   139.861s  12.064s     0.0027s    0.0460s
        Gen  1        26 colls,    18 par   13.203s   1.358s     0.0522s    0.1686s
      
        Parallel GC work balance: 47.94% (serial 0%, perfect 100%)
      
        TASKS: 39 (1 bound, 38 peak workers (38 total), using -N12)
      
        SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
      
        INIT    time    0.001s  (  0.001s elapsed)
        MUT     time   77.108s  ( 22.081s elapsed)
        GC      time  153.064s  ( 13.422s elapsed)
        RP      time    0.000s  (  0.000s elapsed)
        PROF    time    0.000s  (  0.000s elapsed)
        EXIT    time    0.044s  (  0.002s elapsed)
        Total   time  230.217s  ( 35.506s elapsed)
      
        Alloc rate    997,080,661 bytes per MUT second
      
        Productivity  33.5% of total user, 62.2% of total elapsed
    • Errr wow, is this in the "too good to be true" range or do you think the results are accurate?

      Does it match up with the wall time reported by the time command?

    • Author Developer

      Yes, it matches up with time. Obviously we have to be skeptical, but there is reason to think that this performance gain is up for grabs; see #9221.

      I don't understand how the bytes allocated are so different in the reports above. My best guess is an error in the stats accounting.

      The mutator time improvements could be explained by:

      • the time between stat_endGC and the release of GC worker threads in gcWorkerThread is accounted for as mutator time
      • #19147 (closed), which covers GC CPU time accounting problems
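
      To make the first bullet concrete, here is a heavily simplified sketch of the suspected accounting window. stat_endGC and gcWorkerThread are the real RTS names mentioned above, but everything below is an illustrative stub, not the actual RTS control flow.

        /* Illustrative stubs only -- not the RTS implementation. */
        static void stat_startGC_(void)       { /* start the GC clock */ }
        static void do_gc_work_(void)         { /* scavenge, evacuate */ }
        static void stat_endGC_(void)         { /* stop the GC clock  */ }
        static void release_gc_workers_(void) { /* wake idle workers  */ }

        /* The hypothesis: GC time stops being counted at stat_endGC,
         * but worker threads still spend time winding down afterwards,
         * and that time is charged to the mutator in +RTS -s. */
        static void gc_leader(void)
        {
            stat_startGC_();
            do_gc_work_();
            stat_endGC_();          /* GC clock stops here ...         */
            release_gc_workers_();  /* ... worker wind-down after this
                                       point shows up as MUT time.    */
        }

        int main(void) { gc_leader(); return 0; }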
    • Author Developer

      If anyone would like to try to reproduce, CI artifacts are available:

    • Author Developer

      The difference in allocations went away when I tried to reproduce, and so did the grotesque improvement. It's still better, but more realistically so.

  • Author Developer

    Good results running synth.bash from #9221

    https://gitlab.haskell.org/-/snippets/1883

  • Ben Gamari
            stats.no_work);
    MR_STAT("scav_find_work", FMT_Word64,
            stats.scav_find_work);
  + MR_STAT("max_n_todo_overflow", FMT_Word64,
  +         stats.max_n_todo_overflow);
  • We really should document the meanings of these statistics in the users guide, but we can leave this for future work.

  • Author Developer

    I've not seen this be nonzero.

    I intend to remove the refill-deques option, but I would prefer to leave this metric in. GC threads can get into a state where they have an empty todo_q but loads of blocks in todo_overflow.

    I'll at least document it in comments next to other internal counters.
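
    For anyone unfamiliar with the counter: a sketch of how a high-water-mark statistic like max_n_todo_overflow could be maintained. The block type here is a stand-in for the RTS block descriptor and the helper is hypothetical; only the counter names follow the discussion above.

      #include <stdint.h>
      #include <stdio.h>

      /* Stand-in for the RTS block descriptor; illustrative only. */
      typedef struct block_ { struct block_ *link; } block;

      static block   *todo_overflow;        /* overflow list of todo blocks */
      static uint64_t n_todo_overflow;      /* current length of that list  */
      static uint64_t max_n_todo_overflow;  /* high-water mark: the stat    */

      static void push_todo_overflow(block *bd)
      {
          bd->link = todo_overflow;
          todo_overflow = bd;
          if (++n_todo_overflow > max_n_todo_overflow)
              max_n_todo_overflow = n_todo_overflow;
      }

      int main(void)
      {
          block b1, b2;
          push_todo_overflow(&b1);
          push_todo_overflow(&b2);
          printf("max_n_todo_overflow = %llu\n",
                 (unsigned long long)max_n_todo_overflow);
          return 0;
      }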

  • Douglas Wilson mentioned in merge request !4729 (closed)

  • Author Developer

    Closing, superseded by !4729 (closed)

  • Niklas Hambüchen mentioned in issue #9221
