1. 10 Aug, 2018 1 commit
  2. 06 Aug, 2018 1 commit
  3. 02 Jun, 2018 1 commit
  4. 21 Apr, 2018 1 commit
  5. 16 Apr, 2018 2 commits
  6. 30 Mar, 2018 1 commit
  7. 25 Mar, 2018 1 commit
    • Simon Marlow's avatar
      Run C finalizers incrementally during mutation · f7bbc343
      Simon Marlow authored
      With a large heap it's possible to build up a lot of finalizers
      between GCs.  We've observed GC spending up to 50% of its time running
      finalizers.  But there's no reason we have to run finalizers during
      GC, and especially no reason we have to block *all* the mutator
      threads while *one* GC thread runs finalizers one by one.
      
      I thought about a bunch of alternative ways to handle this, which are
      documented along with runSomeFinalizers() in Weak.c.  The approach I
      settled on is to have a capability run finalizers if it is idle.  So
      running finalizers is like a low-priority background thread. This
      requires some minor scheduler changes, but not much.  In the future we
      might be able to move more GC work into here (I have my eye on freeing
      large blocks, for example).
      
      Test Plan:
      * validate
      * tested on our system and saw reductions in GC pauses of 40-50%.
      
      Reviewers: bgamari, niteria, osa1, erikd
      
      Reviewed By: bgamari, osa1
      
      Subscribers: rwbarton, thomie, carter
      
      Differential Revision: https://phabricator.haskell.org/D4521
      f7bbc343
  8. 19 Mar, 2018 1 commit
    • Douglas Wilson's avatar
      rts: Add --internal-counters RTS flag and several counters · 2918abf7
      Douglas Wilson authored
      The existing internal counters:
      * gc_alloc_block_sync
      * whitehole_spin
      * gen[g].sync
      * gen[1].sync
      
      are now not shown in the -s report unless --internal-counters is also passed.
      
      If --internal-counters is passed we now show the counters above, reformatted, as
      well as several other counters. In particular, we now count the yieldThread()
      calls that SpinLocks do as well as their spins.
      
      The added counters are:
      * gc_spin (spin and yield)
      * mut_spin (spin and yield)
      * whitehole_threadPaused (spin only)
      * whitehole_executeMessage (spin only)
      * whitehole_lockClosure (spin only)
      * waitForGcThreadsd (spin and yield)
      
      As well as the following, which are not SpinLock-like things:
      * any_work
      * do_work
      * scav_find_work
      
      See the Note for descriptions of what these counters are.
      
      We add busy_wait_nops in these loops along with the counter increment where it
      was absent.
      
      Old internal counters output:
      ```
      gc_alloc_block_sync: 0
      whitehole_gc_spin: 0
      gen[0].sync: 0
      gen[1].sync: 0
      ```
      
      New internal counters output:
      ```
      Internal Counters:
                                                 Spins        Yields
          gc_alloc_block_sync                      323             0
          gc_spin                              9016713           752
          mut_spin                            57360944         47716
          whitehole_gc                               0           n/a
          whitehole_threadPaused                     0           n/a
          whitehole_executeMessage                   0           n/a
          whitehole_lockClosure                      0             0
          waitForGcThreads                           2           415
          gen[0].sync                                6             0
          gen[1].sync                                1             0
      
          any_work                                2017
          no_work                                 2014
          scav_find_work                          1004
      ```
      
      Test Plan:
      ./validate
      
      Check it builds with #define PROF_SPIN removed from includes/rts/Config.h
      
      Reviewers: bgamari, erikd, simonmar, hvr
      
      Reviewed By: simonmar
      
      Subscribers: rwbarton, thomie, carter
      
      GHC Trac Issues: #3553, #9221
      
      Differential Revision: https://phabricator.haskell.org/D4302
      2918abf7
  9. 06 Feb, 2018 1 commit
  10. 01 Feb, 2018 1 commit
    • blaze's avatar
      Make RTS keep less memory (fixes #14702) · 0171e09e
      blaze authored
      Currently runtime keeps hold to 4*used_memory. This includes, in
      particular, nursery, which can be quite large on multiprocessor
      machines: 16 CPUs x 64Mb each is 1GB. Multiplying it by 4 means whatever
      actual memory usage is, runtime will never release memory under 4GB, and
      this is quite excessive for processes which only need a lot of memory
      shortly (think building data structures from large files).
      
      This diff makes multiplier to apply only to GC-managed memory, leaving
      all "static" allocations alone.
      
      Test Plan: make test TEST="T14702"
      
      Reviewers: bgamari, erikd, simonmar
      
      Reviewed By: simonmar
      
      Subscribers: rwbarton, thomie, carter
      
      GHC Trac Issues: #14702
      
      Differential Revision: https://phabricator.haskell.org/D4338
      0171e09e
  11. 21 Jan, 2018 1 commit
    • Douglas Wilson's avatar
      [rts] Adjust whitehole_spin · 180ca65f
      Douglas Wilson authored
      Rename to whitehole_gc_spin, in preparation for adding stats for the
      whitehole busy-loop in SMPClosureOps.
      
      Make whitehole_gc_spin volatile, and move it to be defined and
      statically initialised in GC.c. This saves some #ifs, and I'm pretty
      sure it should be volatile.
      
      Test Plan: ./validate
      
      Reviewers: bgamari, erikd, simonmar
      
      Reviewed By: bgamari
      
      Subscribers: rwbarton, thomie, carter
      
      Differential Revision: https://phabricator.haskell.org/D4300
      180ca65f
  12. 16 Nov, 2017 1 commit
    • Simon Marlow's avatar
      Detect overly long GC sync · 2f463873
      Simon Marlow authored
      Summary:
      GC sync is the time between a GC being intiated and all the mutator
      threads finally stopping so that the GC can start. Problems that cause
      the GC sync to be delayed are hard to find and can cause dramatic
      slowdowns for heavily parallel programs.
      
      The new flag --long-gc-sync=<time> helps by emitting a warning and
      calling a user-overridable hook when the GC sync time exceeds the
      specified threshold. A debugger can be used to set a breakpoint when
      this happens and inspect the stacks of threads to find the culprit.
      
      Test Plan:
      ```
      $ ./inplace/bin/ghc-stage2 +RTS --long-gc-sync=0.0000001 -S
          Alloc    Copied     Live     GC     GC      TOT      TOT  Page Flts
          bytes     bytes     bytes   user   elap     user     elap
        1135856     51144    153736  0.000  0.000    0.002    0.002    0    0  (Gen:  0)
        1034760     94704    188752  0.000  0.000    0.002    0.002    0    0  (Gen:  0)
        1038888    134832    228888  0.009  0.009    0.011    0.011    0    0  (Gen:  1)
        1025288     90128    235184  0.000  0.000    0.012    0.012    0    0  (Gen:  0)
        1049088    130080    333984  0.000  0.000    0.013    0.013    0    0  (Gen:  0)
      Warning: waited 0us for GC sync
        1034424     73360    331976  0.000  0.000    0.013    0.013    0    0  (Gen:  0)
      ```
      
      Also tested on a real production problem.
      
      Reviewers: niteria, bgamari, erikd
      
      Subscribers: rwbarton, thomie
      
      Differential Revision: https://phabricator.haskell.org/D4193
      2f463873
  13. 11 Jul, 2017 1 commit
    • Douglas Wilson's avatar
      Fix Work Balance computation in RTS stats · 7c9e356d
      Douglas Wilson authored
      An additional stat is tracked per gc: par_balanced_copied This is the
      the number of bytes copied by each gc thread under the balanced lmit,
      which is simply (copied_bytes / num_gc_threads).  The stat is added to
      all the appropriate GC structures, so is visible in the eventlog and in
      GHC.Stats.
      
      A note is added explaining how work balance is computed.
      
      Remove some end of line whitespace
      
      Test Plan:
      ./validate
      experiment with the program attached to the ticket
      examine code changes carefully
      
      Reviewers: simonmar, austin, hvr, bgamari, erikd
      
      Reviewed By: simonmar
      
      Subscribers: Phyx, rwbarton, thomie
      
      GHC Trac Issues: #13830
      
      Differential Revision: https://phabricator.haskell.org/D3658
      7c9e356d
  14. 29 Apr, 2017 1 commit
  15. 02 Apr, 2017 1 commit
    • Simon Marlow's avatar
      Report heap overflow in the same way as stack overflow · 61ba4518
      Simon Marlow authored
      Now that we throw an exception for heap overflow, we should only print
      the heap overflow message in the main thread when the HeapOverflow
      exception is caught, rather than as a side effect in the GC.
      
      Stack overflows were already done this way, I just made heap overflow
      consistent with stack overflow, and did some related cleanup.
      
      Fixes broken T2592(profasm) which was reporting the heap overflow
      message twice (you would only notice when building with profiling
      libs enabled).
      
      Test Plan: validate
      
      Reviewers: bgamari, niteria, austin, DemiMarie, hvr, erikd
      
      Reviewed By: bgamari
      
      Subscribers: rwbarton, thomie
      
      Differential Revision: https://phabricator.haskell.org/D3394
      61ba4518
  16. 23 Feb, 2017 1 commit
  17. 06 Dec, 2016 1 commit
    • Simon Marlow's avatar
      Overhaul GC stats · 24e6594c
      Simon Marlow authored
      Summary:
      Visible API changes:
      
      * The C struct `GCDetails` gives the stats about a single GC.  This is
        passed to the `gcDone()` callback if one is set via the
        RtsConfig. (previously we just passed a collection of values, so this
        is more extensible, at the expense of breaking the existing API)
      
      * `RTSStats` gives cumulative stats since the start of the program,
        and includes the `GCDetails` for the most recent GC.  This struct
        can be obtained via `getRTSStats()` (the old `getGCStats()` has been
        removed, and `getGCStatsEnabled()` has been renamed to
        `getRTSStatsEnabled()`)
      
      Improvements:
      
      * The per-GC stats and cumulative stats are now cleanly separated.
      
      * Inside the RTS we have a top-level `RTSStats` struct to keep all our
        stats in, previously this was just a collection of strangely-named
        variables.  This struct is mostly just copied in `getRTSStats()`, so
        the implementation of that function is a lot shorter.
      
      * Types are more consistent.  We use a uint64_t byte count for all
        memory values, and Time for all time values.
      
      * Names are more consistent.  We use a suffix `_bytes` for all byte
        counts and `_ns` for all time values.
      
      * We now collect information about the amount of memory in large
        objects and compact objects in `GCDetails`. (the latter was the reason
        I started doing this patch but it seems to have ballooned a bit!)
      
      * I fixed a bug in the calculation of the elapsed MUT time, and added
        an ASSERT to stop the calculations going wrong in the future.
      
      For now I kept the Haskell API in `GHC.Stats` the same, by
      impedence-matching with the new API.  We could either break that API
      and make it match the C API more closely, or we could add a new API
      and deprecate the old one.  Opinions welcome.
      
      This stuff is very easy to get wrong, and it's hard to test.  Reviews
      welcome!
      
      Test Plan:
      manual testing
      validate
      
      Reviewers: bgamari, niteria, austin, ezyang, hvr, erikd, rwbarton, Phyx
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2756
      24e6594c
  18. 29 Nov, 2016 1 commit
  19. 29 Oct, 2016 1 commit
    • Simon Marlow's avatar
      Fix a bug in parallel GC synchronisation · 4e088b49
      Simon Marlow authored
      Summary:
      The problem boils down to global variables: in particular gc_threads[],
      which was being modified by a subsequent GC before the previous GC had
      finished with it.  The fix is to not use global variables.
      
      This was causing setnumcapabilities001 to fail (again!).  It's an old
      bug though.
      
      Test Plan:
      Ran setnumcapabilities001 in a loop for a couple of hours.  Before this
      patch it had been failing after a few minutes.  Not a very scientific
      test, but it's the best I have.
      
      Reviewers: bgamari, austin, fryguybob, niteria, erikd
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2654
      4e088b49
  20. 27 Jul, 2016 1 commit
  21. 20 Jul, 2016 1 commit
    • gcampax's avatar
      Compact Regions · cf989ffe
      gcampax authored
      This brings in initial support for compact regions, as described in the
      ICFP 2015 paper "Efficient Communication and Collection with Compact
      Normal Forms" (Edward Z. Yang et.al.) and implemented by Giovanni
      Campagna.
      
      Some things may change before the 8.2 release, but I (Simon M.) wanted
      to get the main patch committed so that we can iterate.
      
      What documentation there is is in the Data.Compact module in the new
      compact package.  We'll need to extend and polish the documentation
      before the release.
      
      Test Plan:
      validate
      (new test cases included)
      
      Reviewers: ezyang, simonmar, hvr, bgamari, austin
      
      Subscribers: vikraman, Yuras, RyanGlScott, qnikst, mboes, facundominguez, rrnewton, thomie, erikd
      
      Differential Revision: https://phabricator.haskell.org/D1264
      
      GHC Trac Issues: #11493
      cf989ffe
  22. 10 Jun, 2016 2 commits
    • Simon Marlow's avatar
      Rts flags cleanup · c88f31a0
      Simon Marlow authored
      * Remove unused/old flags from the structs
      * Update old comments
      * Add missing flags to GHC.RTS
      * Simplify GHC.RTS, remove C code and use hsc2hs instead
      * Make ParFlags unconditional, and add support to GHC.RTS
      c88f31a0
    • Simon Marlow's avatar
      NUMA support · 9e5ea67e
      Simon Marlow authored
      Summary:
      The aim here is to reduce the number of remote memory accesses on
      systems with a NUMA memory architecture, typically multi-socket servers.
      
      Linux provides a NUMA API for doing two things:
      * Allocating memory local to a particular node
      * Binding a thread to a particular node
      
      When given the +RTS --numa flag, the runtime will
      * Determine the number of NUMA nodes (N) by querying the OS
      * Assign capabilities to nodes, so cap C is on node C%N
      * Bind worker threads on a capability to the correct node
      * Keep a separate free lists in the block layer for each node
      * Allocate the nursery for a capability from node-local memory
      * Allocate blocks in the GC from node-local memory
      
      For example, using nofib/parallel/queens on a 24-core 2-socket machine:
      
      ```
      $ ./Main 15 +RTS -N24 -s -A64m
        Total   time  173.960s  (  7.467s elapsed)
      
      $ ./Main 15 +RTS -N24 -s -A64m --numa
        Total   time  150.836s  (  6.423s elapsed)
      ```
      
      The biggest win here is expected to be allocating from node-local
      memory, so that means programs using a large -A value (as here).
      
      According to perf, on this program the number of remote memory accesses
      were reduced by more than 50% by using `--numa`.
      
      Test Plan:
      * validate
      * There's a new flag --debug-numa=<n> that pretends to do NUMA without
        actually making the OS calls, which is useful for testing the code
        on non-NUMA systems.
      * TODO: I need to add some unit tests
      
      Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2199
      9e5ea67e
  23. 04 May, 2016 2 commits
    • Erik de Castro Lopo's avatar
      rts: Replace `nat` with `uint32_t` · db9de7eb
      Erik de Castro Lopo authored
      The `nat` type was an alias for `unsigned int` with a comment saying
      it was at least 32 bits. We keep the typedef in case client code is
      using it but mark it as deprecated.
      
      Test Plan: Validated on Linux, OS X and Windows
      
      Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20
      
      Differential Revision: https://phabricator.haskell.org/D2166
      db9de7eb
    • Simon Marlow's avatar
      Allow limiting the number of GC threads (+RTS -qn<n>) · 76ee2607
      Simon Marlow authored
      This allows the GC to use fewer threads than the number of capabilities.
      At each GC, we choose some of the capabilities to be "idle", which means
      that the thread running on that capability (if any) will sleep for the
      duration of the GC, and the other threads will do its work.  We choose
      capabilities that are already idle (if any) to be the idle capabilities.
      
      The idea is that this helps in the following situation:
      
      * We want to use a large -N value so as to make use of hyperthreaded
        cores
      * We use a large heap size, so GC is infrequent
      * But we don't want to use all -N threads in the GC, because that
        thrashes the memory too much.
      
      See docs for usage.
      76ee2607
  24. 12 Apr, 2016 2 commits
    • Simon Marlow's avatar
      Allocate blocks in the GC in batches · f4446c5b
      Simon Marlow authored
      Avoids contention for the block allocator lock in the GC; this can be
      seen in the gc_alloc_block_sync counter emitted by +RTS -s.
      
      I experimented with this a while ago, and there was already
      commented-out code for it in GCUtils.c, but I've now improved it so that
      it doesn't result in significantly worse memory usage.
      
      * The old method of putting spare blocks on ws->part_list was wasteful,
        the spare blocks are now shared between all generations and retained
        between GCs.
      
      * repeated allocGroup() results in fragmentation, so I switched to using
        allocLargeChunk() instead which is fragmentation-friendly; we already
        use it for the same reason in nursery allocation.
      f4446c5b
    • Simon Marlow's avatar
      Cache the size of part_list/scavd_list (#11783) · 5c4cd0e4
      Simon Marlow authored
      After a parallel GC, it is possible to have a long list of blocks in
      ws->part_list, if we did a lot of work stealing but didn't fill up the
      blocks we stole.  These blocks persist until the next load-balanced GC,
      which might be a long time, and during every GC we were traversing this
      list to find its size.  The fix is to maintain the size all the time, so
      we don't have to compute it.
      5c4cd0e4
  25. 07 Feb, 2016 1 commit
  26. 18 Nov, 2015 1 commit
  27. 28 Jul, 2015 1 commit
    • Simon Marlow's avatar
      Eliminate zero_static_objects_list() · f83aab95
      Simon Marlow authored
      Summary:
      [Revised version of D1076 that was committed and then backed out]
      
      In a workload with a large amount of code, zero_static_objects_list()
      takes a significant amount of time, and furthermore it is in the
      single-threaded part of the GC.
      
      This patch uses a slightly fiddly scheme for marking objects on the
      static object lists, using a flag in the low 2 bits that flips between
      two states to indicate whether an object has been visited during this
      GC or not.  We also have to take into account objects that have not
      been visited yet, which might appear at any time due to runtime linking.
      
      Test Plan: validate
      
      Reviewers: austin, ezyang, rwbarton, bgamari, thomie
      
      Reviewed By: bgamari, thomie
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D1106
      f83aab95
  28. 27 Jul, 2015 1 commit
  29. 22 Jul, 2015 1 commit
    • Simon Marlow's avatar
      Eliminate zero_static_objects_list() · b949c96b
      Simon Marlow authored
      Summary:
      In a workload with a large amount of code, zero_static_objects_list()
      takes a significant amount of time, and furthermore it is in the
      single-threaded part of the GC.
      
      This patch uses a slightly fiddly scheme for marking objects on the
      static object lists, using a flag in the low 2 bits that flips between
      two states to indicate whether an object has been visited during this
      GC or not.  We also have to take into account objects that have not
      been visited yet, which might appear at any time due to runtime linking.
      
      Test Plan: validate
      
      Reviewers: austin, bgamari, ezyang, rwbarton
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D1076
      b949c96b
  30. 07 Apr, 2015 1 commit
  31. 25 Nov, 2014 1 commit
  32. 21 Oct, 2014 1 commit
  33. 10 Oct, 2014 1 commit
  34. 29 Sep, 2014 1 commit
  35. 28 Jul, 2014 1 commit
  36. 10 Jul, 2014 1 commit