1. 06 Dec, 2016 1 commit
      Overhaul GC stats · 24e6594c
      Simon Marlow authored
      Summary:
      Visible API changes:
      
      * The C struct `GCDetails` gives the stats about a single GC.  This is
        passed to the `gcDone()` callback if one is set via the
        RtsConfig. (previously we just passed a collection of values, so this
        is more extensible, at the expense of breaking the existing API)
      
      * `RTSStats` gives cumulative stats since the start of the program,
        and includes the `GCDetails` for the most recent GC.  This struct
        can be obtained via `getRTSStats()` (the old `getGCStats()` has been
        removed, and `getGCStatsEnabled()` has been renamed to
        `getRTSStatsEnabled()`)
      
      Improvements:
      
      * The per-GC stats and cumulative stats are now cleanly separated.
      
      * Inside the RTS we have a top-level `RTSStats` struct to keep all our
        stats in; previously this was just a collection of strangely-named
        variables.  This struct is mostly just copied in `getRTSStats()`, so
        the implementation of that function is a lot shorter.
      
      * Types are more consistent.  We use a uint64_t byte count for all
        memory values, and Time for all time values.
      
      * Names are more consistent.  We use a suffix `_bytes` for all byte
        counts and `_ns` for all time values.
      
      * We now collect information about the amount of memory in large
        objects and compact objects in `GCDetails`. (the latter was the reason
        I started doing this patch but it seems to have ballooned a bit!)
      
      * I fixed a bug in the calculation of the elapsed MUT time, and added
        an ASSERT to stop the calculations going wrong in the future.
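
      The separation of per-GC and cumulative stats can be sketched roughly
      as below.  This is a minimal illustration, not the RTS definitions: the
      `Sketch*` types, the reduced field set and `sketch_gc_done()` are
      hypothetical, and the mutator-time formula (total minus init minus GC)
      is an assumption, since the message does not spell out the real
      calculation.  Only the `_bytes`/`_ns` naming and the idea of an ASSERT
      guarding the result come from the text above.

      ```c
      #include <assert.h>
      #include <stdint.h>

      typedef int64_t Time;  /* nanoseconds; stand-in for the RTS Time type */

      /* Stats for a single collection (hypothetical, reduced field set). */
      typedef struct {
          uint32_t gen;                  /* generation collected */
          uint64_t live_bytes;           /* live data after this GC */
          uint64_t large_objects_bytes;  /* memory in large objects */
          uint64_t compact_bytes;        /* memory in compact objects */
          Time     elapsed_ns;           /* wall-clock time of this GC */
      } SketchGCDetails;

      /* Cumulative stats since program start, plus the most recent GC. */
      typedef struct {
          uint32_t gcs;
          uint64_t max_live_bytes;
          Time     init_elapsed_ns;
          Time     gc_elapsed_ns;
          Time     mutator_elapsed_ns;
          Time     elapsed_ns;
          SketchGCDetails gc;            /* details of the latest GC */
      } SketchRTSStats;

      /* Fold one finished GC into the cumulative stats.
       * now_ns is the total elapsed time since program start. */
      static void sketch_gc_done(SketchRTSStats *s, const SketchGCDetails *d,
                                 Time now_ns)
      {
          s->gcs++;
          if (d->live_bytes > s->max_live_bytes)
              s->max_live_bytes = d->live_bytes;
          s->gc_elapsed_ns += d->elapsed_ns;
          s->elapsed_ns = now_ns;
          /* Assumed formula: whatever is neither init nor GC is mutator time. */
          s->mutator_elapsed_ns =
              s->elapsed_ns - s->init_elapsed_ns - s->gc_elapsed_ns;
          assert(s->mutator_elapsed_ns >= 0);  /* catch accounting mistakes */
          s->gc = *d;
      }
      ```

      A caller would zero-initialise a `SketchRTSStats`, then call
      `sketch_gc_done()` once per collection and read the struct whenever a
      snapshot is wanted.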
      
      For now I kept the Haskell API in `GHC.Stats` the same, by
      impedance-matching with the new API.  We could either break that API
      and make it match the C API more closely, or we could add a new API
      and deprecate the old one.  Opinions welcome.
      
      This stuff is very easy to get wrong, and it's hard to test.  Reviews
      welcome!
      
      Test Plan:
      manual testing
      validate
      
      Reviewers: bgamari, niteria, austin, ezyang, hvr, erikd, rwbarton, Phyx
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2756
  2. 29 Nov, 2016 1 commit
  3. 29 Oct, 2016 1 commit
      Fix a bug in parallel GC synchronisation · 4e088b49
      Simon Marlow authored
      Summary:
      The problem boils down to global variables: in particular gc_threads[],
      which was being modified by a subsequent GC before the previous GC had
      finished with it.  The fix is to not use global variables.
      
      This was causing setnumcapabilities001 to fail (again!).  It's an old
      bug though.
      
      Test Plan:
      Ran setnumcapabilities001 in a loop for a couple of hours.  Before this
      patch it had been failing after a few minutes.  Not a very scientific
      test, but it's the best I have.
      
      Reviewers: bgamari, austin, fryguybob, niteria, erikd
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2654
  4. 27 Jul, 2016 1 commit
  5. 20 Jul, 2016 1 commit
      Compact Regions · cf989ffe
      gcampax authored
      This brings in initial support for compact regions, as described in the
      ICFP 2015 paper "Efficient Communication and Collection with Compact
      Normal Forms" (Edward Z. Yang et al.) and implemented by Giovanni
      Campagna.
      
      Some things may change before the 8.2 release, but I (Simon M.) wanted
      to get the main patch committed so that we can iterate.
      
      What documentation there is is in the Data.Compact module in the new
      compact package.  We'll need to extend and polish the documentation
      before the release.
      
      Test Plan:
      validate
      (new test cases included)
      
      Reviewers: ezyang, simonmar, hvr, bgamari, austin
      
      Subscribers: vikraman, Yuras, RyanGlScott, qnikst, mboes, facundominguez, rrnewton, thomie, erikd
      
      Differential Revision: https://phabricator.haskell.org/D1264
      
      GHC Trac Issues: #11493
  6. 10 Jun, 2016 2 commits
      Rts flags cleanup · c88f31a0
      Simon Marlow authored
      * Remove unused/old flags from the structs
      * Update old comments
      * Add missing flags to GHC.RTS
      * Simplify GHC.RTS, remove C code and use hsc2hs instead
      * Make ParFlags unconditional, and add support to GHC.RTS
      NUMA support · 9e5ea67e
      Simon Marlow authored
      Summary:
      The aim here is to reduce the number of remote memory accesses on
      systems with a NUMA memory architecture, typically multi-socket servers.
      
      Linux provides a NUMA API for doing two things:
      * Allocating memory local to a particular node
      * Binding a thread to a particular node
      
      When given the +RTS --numa flag, the runtime will
      * Determine the number of NUMA nodes (N) by querying the OS
      * Assign capabilities to nodes, so cap C is on node C%N
      * Bind worker threads on a capability to the correct node
      * Keep a separate free list in the block layer for each node
      * Allocate the nursery for a capability from node-local memory
      * Allocate blocks in the GC from node-local memory
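
      The capability-to-node assignment (cap C on node C%N) can be sketched
      as follows.  The function names are hypothetical and the code only
      models the arithmetic; it does not query the OS or bind any threads.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Node for capability `cap` when the OS reports `n_nodes` nodes. */
      static uint32_t sketch_cap_to_node(uint32_t cap, uint32_t n_nodes)
      {
          assert(n_nodes > 0);
          return cap % n_nodes;
      }

      /* Count how many of `n_caps` capabilities land on `node`. */
      static uint32_t sketch_caps_on_node(uint32_t n_caps, uint32_t n_nodes,
                                          uint32_t node)
      {
          uint32_t count = 0;
          for (uint32_t c = 0; c < n_caps; c++) {
              if (sketch_cap_to_node(c, n_nodes) == node)
                  count++;
          }
          return count;
      }
      ```

      With 24 capabilities and 2 nodes this spreads the capabilities evenly,
      12 per node; when N does not divide the capability count, the
      lower-numbered nodes receive one extra capability each.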
      
      For example, using nofib/parallel/queens on a 24-core 2-socket machine:
      
      ```
      $ ./Main 15 +RTS -N24 -s -A64m
        Total   time  173.960s  (  7.467s elapsed)
      
      $ ./Main 15 +RTS -N24 -s -A64m --numa
        Total   time  150.836s  (  6.423s elapsed)
      ```
      
      The biggest win here is expected to be allocating from node-local
      memory, so th...
  7. 04 May, 2016 2 commits
      rts: Replace `nat` with `uint32_t` · db9de7eb
      Erik de Castro Lopo authored
      The `nat` type was an alias for `unsigned int` with a comment saying
      it was at least 32 bits. We keep the typedef in case client code is
      using it but mark it as deprecated.
      
      Test Plan: Validated on Linux, OS X and Windows
      
      Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20
      
      Differential Revision: https://phabricator.haskell.org/D2166
      Allow limiting the number of GC threads (+RTS -qn<n>) · 76ee2607
      Simon Marlow authored
      This allows the GC to use fewer threads than the number of capabilities.
      At each GC, we choose some of the capabilities to be "idle", which means
      that the thread running on that capability (if any) will sleep for the
      duration of the GC, and the other threads will do its work.  We choose
      capabilities that are already idle (if any) to be the idle capabilities.
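
      The selection described above might look something like this sketch.
      The function and parameter names are hypothetical, and the real RTS
      has to handle details this ignores (for example, the capability that
      initiates the GC); it only shows the "prefer already-idle
      capabilities" policy.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdint.h>

      /*
       * Decide which capabilities sit out a GC.
       *   already_idle[i]  capability i has nothing running right now
       *   gc_idle[i]       (output) capability i sleeps during the GC
       * Returns the number of capabilities that take part in the GC.
       */
      static uint32_t sketch_choose_idle(const bool *already_idle, bool *gc_idle,
                                         uint32_t n_caps, uint32_t max_gc_threads)
      {
          assert(max_gc_threads > 0);
          uint32_t need_idle =
              n_caps > max_gc_threads ? n_caps - max_gc_threads : 0;

          for (uint32_t i = 0; i < n_caps; i++)
              gc_idle[i] = false;

          /* First pass: prefer capabilities that are already idle. */
          for (uint32_t i = 0; i < n_caps && need_idle > 0; i++) {
              if (already_idle[i]) {
                  gc_idle[i] = true;
                  need_idle--;
              }
          }
          /* Second pass: still too many, so idle busy capabilities as well. */
          for (uint32_t i = 0; i < n_caps && need_idle > 0; i++) {
              if (!gc_idle[i]) {
                  gc_idle[i] = true;
                  need_idle--;
              }
          }

          uint32_t active = 0;
          for (uint32_t i = 0; i < n_caps; i++) {
              if (!gc_idle[i])
                  active++;
          }
          return active;
      }
      ```

      Here `max_gc_threads` plays the role of the `<n>` in `-qn<n>`.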
      
      The idea is that this helps in the following situation:
      
      * We want to use a large -N value so as to make use of hyperthreaded
        cores
      * We use a large heap size, so GC is infrequent
      * But we don't want to use all -N threads in the GC, because that
        thrashes the memory too much.
      
      See docs for usage.
  8. 12 Apr, 2016 2 commits
      Allocate blocks in the GC in batches · f4446c5b
      Simon Marlow authored
      Avoids contention for the block allocator lock in the GC; this can be
      seen in the gc_alloc_block_sync counter emitted by +RTS -s.
      
      I experimented with this a while ago, and there was already
      commented-out code for it in GCUtils.c, but I've now improved it so that
      it doesn't result in significantly worse memory usage.
      
      * The old method of putting spare blocks on ws->part_list was wasteful;
        the spare blocks are now shared between all generations and retained
        between GCs.
      
      * Repeated allocGroup() calls result in fragmentation, so I switched to
        using allocLargeChunk() instead, which is fragmentation-friendly; we
        already use it for the same reason in nursery allocation.
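
      The effect of batching on lock traffic can be sketched as below.  This
      is a single-threaded model with hypothetical names: a counter stands in
      for the block allocator lock, block numbers stand in for real blocks,
      and the batch size of 16 is arbitrary.  It does not use allocGroup()
      or allocLargeChunk().

      ```c
      #include <assert.h>
      #include <stddef.h>

      #define SKETCH_BATCH 16  /* blocks fetched per lock acquisition */

      /* Shared allocator state; the counter stands in for a real lock. */
      typedef struct {
          size_t next_block;         /* next unused block number */
          size_t lock_acquisitions;  /* times the "lock" was taken */
      } SketchAllocator;

      /* Per-GC-thread cache of spare blocks, retained between GCs. */
      typedef struct {
          size_t spare[SKETCH_BATCH];
          size_t n_spare;
      } SketchThreadCache;

      /* Return one block, touching shared state only when the cache is empty. */
      static size_t sketch_alloc_block(SketchAllocator *a, SketchThreadCache *c)
      {
          if (c->n_spare == 0) {
              a->lock_acquisitions++;             /* lock would be taken here */
              for (size_t i = 0; i < SKETCH_BATCH; i++)
                  c->spare[i] = a->next_block++;  /* one contiguous chunk */
              c->n_spare = SKETCH_BATCH;          /* lock released here */
          }
          return c->spare[--c->n_spare];
      }
      ```

      In this model, 40 allocations take the lock 3 times rather than 40,
      at the cost of 8 blocks left sitting in the thread's cache.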
      Cache the size of part_list/scavd_list (#11783) · 5c4cd0e4
      Simon Marlow authored
      After a parallel GC, it is possible to have a long list of blocks in
      ws->part_list, if we did a lot of work stealing but didn't fill up the
      blocks we stole.  These blocks persist until the next load-balanced GC,
      which might be a long time, and during every GC we were traversing this
      list to find its size.  The fix is to maintain the size all the time, so
      we don't have to compute it.
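
      The general technique is to update a counter on every list mutation
      instead of walking the list on demand.  A minimal sketch, with
      hypothetical types that are not the RTS block descriptors:

      ```c
      #include <assert.h>
      #include <stddef.h>

      typedef struct SketchBlock {
          struct SketchBlock *link;
      } SketchBlock;

      typedef struct {
          SketchBlock *head;
          size_t n_blocks;  /* kept in sync on every push and pop */
      } SketchBlockList;

      static void sketch_push(SketchBlockList *l, SketchBlock *b)
      {
          b->link = l->head;
          l->head = b;
          l->n_blocks++;
      }

      static SketchBlock *sketch_pop(SketchBlockList *l)
      {
          SketchBlock *b = l->head;
          if (b != NULL) {
              l->head = b->link;
              b->link = NULL;
              l->n_blocks--;
          }
          return b;
      }

      /* O(n) traversal, kept only to cross-check the cached count. */
      static size_t sketch_count(const SketchBlockList *l)
      {
          size_t n = 0;
          for (const SketchBlock *b = l->head; b != NULL; b = b->link)
              n++;
          return n;
      }
      ```

      Reading `n_blocks` is O(1) however long the list has grown, and a
      debug build could assert that it matches `sketch_count()`.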
  9. 07 Feb, 2016 1 commit
  10. 18 Nov, 2015 1 commit
  11. 28 Jul, 2015 1 commit
      Eliminate zero_static_objects_list() · f83aab95
      Simon Marlow authored
      Summary:
      [Revised version of D1076 that was committed and then backed out]
      
      In a workload with a large amount of code, zero_static_objects_list()
      takes a significant amount of time, and furthermore it is in the
      single-threaded part of the GC.
      
      This patch uses a slightly fiddly scheme for marking objects on the
      static object lists, using a flag in the low 2 bits that flips between
      two states to indicate whether an object has been visited during this
      GC or not.  We also have to take into account objects that have not
      been visited yet, which might appear at any time due to runtime linking.
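
      One way to picture the flag-flipping scheme is the sketch below.  The
      names and the concrete flag values (1 and 2) are assumptions made for
      illustration; the message only says that a flag in the low 2 bits
      flips between two states, and that objects not yet seen by any GC
      (here, low bits 0) must also be handled.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdint.h>

      enum { SKETCH_FLAG_A = 1, SKETCH_FLAG_B = 2, SKETCH_FLAG_BITS = 3 };

      /* Flag meaning "visited in the current GC"; flipped once per GC. */
      typedef struct {
          uintptr_t current;
      } SketchGCState;

      /* A static object whose link field carries the flag in its low 2 bits. */
      typedef struct {
          uintptr_t tagged_link;
      } SketchStaticObj;

      static void sketch_start_gc(SketchGCState *g)
      {
          /* 1 ^ 3 == 2 and 2 ^ 3 == 1, so all old marks go stale at once
           * without touching any object. */
          g->current ^= SKETCH_FLAG_BITS;
      }

      static bool sketch_visited(const SketchGCState *g, const SketchStaticObj *o)
      {
          /* Low bits 0 (never seen, e.g. freshly linked code) and the
           * previous GC's flag both count as "not visited". */
          return (o->tagged_link & SKETCH_FLAG_BITS) == g->current;
      }

      static void sketch_mark(const SketchGCState *g, SketchStaticObj *o,
                              uintptr_t next)
      {
          assert((next & SKETCH_FLAG_BITS) == 0);  /* needs 4-byte alignment */
          o->tagged_link = next | g->current;
      }
      ```

      Because flipping the flag invalidates every mark in one step, no pass
      over the static object list is needed to reset it.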
      
      Test Plan: validate
      
      Reviewers: austin, ezyang, rwbarton, bgamari, thomie
      
      Reviewed By: bgamari, thomie
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D1106
  12. 27 Jul, 2015 1 commit
  13. 22 Jul, 2015 1 commit
      Eliminate zero_static_objects_list() · b949c96b
      Simon Marlow authored
      Summary:
      In a workload with a large amount of code, zero_static_objects_list()
      takes a significant amount of time, and furthermore it is in the
      single-threaded part of the GC.
      
      This patch uses a slightly fiddly scheme for marking objects on the
      static object lists, using a flag in the low 2 bits that flips between
      two states to indicate whether an object has been visited during this
      GC or not.  We also have to take into account objects that have not
      been visited yet, which might appear at any time due to runtime linking.
      
      Test Plan: validate
      
      Reviewers: austin, bgamari, ezyang, rwbarton
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D1076
  14. 07 Apr, 2015 1 commit
  15. 25 Nov, 2014 1 commit
  16. 21 Oct, 2014 1 commit
  17. 10 Oct, 2014 1 commit
  18. 29 Sep, 2014 1 commit
  19. 28 Jul, 2014 1 commit
  20. 10 Jul, 2014 1 commit
  21. 30 May, 2014 1 commit
  22. 27 Apr, 2014 1 commit
  23. 09 Dec, 2013 1 commit
  24. 30 Nov, 2013 1 commit
  25. 21 Nov, 2013 2 commits
      Allow the linker to be used without retaining CAFs unconditionally · 5874f13f
      Simon Marlow authored
      This creates a new C API:
      
         initLinker_ (int retain_cafs)
      
      The old initLinker() was left as-is for backwards compatibility.  See
      documentation in Linker.h.
      In the DEBUG rts, track when CAFs are GC'd · e82fa829
      Simon Marlow authored
      This resurrects some old code and makes it work again.  The idea is
      that we want to get an error message if we ever enter a CAF that has
      been GC'd, rather than following its indirection which will likely
      cause a segfault.  Without this patch, these bugs are hard to track
      down in gdb, because the IND_STATIC code overwrites R1 (the pointer to
      the CAF) with its indirectee before jumping into bad memory, so we've
      lost the address of the CAF that got GC'd.
      
      Some associated refactoring while I was here.
  26. 02 Oct, 2013 1 commit
  27. 01 Oct, 2013 1 commit
  28. 04 Sep, 2013 1 commit
      Don't move Capabilities in setNumCapabilities (#8209) · aa779e09
      Simon Marlow authored
      We have various problems with reallocating the array of Capabilities,
      due to threads in waitForReturnCapability that are already holding a
      pointer to a Capability.
      
      Rather than add more locking to make this safer, I decided it would be
      easier to ensure that we never move the Capabilities at all.  The
      capabilities array is now an array of pointers to Capability.  There
      are extra indirections, but it rarely matters - we don't often access
      Capabilities via the array, normally we already have a pointer to
      one.  I ran the parallel benchmarks and didn't see any difference.
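
      The idea of never moving a Capability can be sketched as follows.  The
      types and `sketch_set_num_caps()` are hypothetical and single-threaded;
      the sketch only shows that growing an array of pointers leaves each
      pointed-to object at its original address.

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdlib.h>

      typedef struct {
          uint32_t no;  /* capability number */
      } SketchCapability;

      typedef struct {
          SketchCapability **caps;  /* array of pointers, not of structs */
          uint32_t n;
      } SketchCapTable;

      /* Grow to new_n capabilities.  Returns 0 on success, -1 on failure. */
      static int sketch_set_num_caps(SketchCapTable *t, uint32_t new_n)
      {
          if (new_n <= t->n)
              return 0;  /* shrinking is not modelled here */

          /* Only the pointer array may move; the structs stay where they are. */
          SketchCapability **p = realloc(t->caps, new_n * sizeof *p);
          if (p == NULL)
              return -1;
          t->caps = p;

          for (uint32_t i = t->n; i < new_n; i++) {
              SketchCapability *c = malloc(sizeof *c);
              if (c == NULL) {
                  t->n = i;  /* keep the table consistent */
                  return -1;
              }
              c->no = i;
              t->caps[i] = c;
          }
          t->n = new_n;
          return 0;
      }
      ```

      A pointer taken from `t.caps[1]` before the table grows remains valid
      afterwards, which is the property a thread already holding a
      Capability pointer relies on.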
  29. 22 Aug, 2013 1 commit
      Really unload object code when it is safe to do so (#8039) · bdfefb3b
      Simon Marlow authored
      The next major GC after an unloadObj() will do a traversal of the heap
      to determine whether the object code can be removed from memory or
      not.  We'll keep doing these until it is safe to remove the object
      code.
      
      In my experiments with GHCi, the objects get unloaded immediately,
      which is a good sign: we're not accidentally holding on to any
      references anywhere in the GHC data structures.
      
      Changes relative to the patch earlier posted on the ticket:
       - fix two memory leaks discovered with Valgrind, after
         testing with tests/rts/linker_unload.c
  30. 21 Aug, 2013 1 commit
  31. 15 Jun, 2013 1 commit
  32. 13 May, 2013 1 commit
  33. 14 Feb, 2013 2 commits
      Separate StablePtr and StableName tables (#7674) · 7e7a4e4d
      Simon Marlow authored
      To improve performance of StablePtr.
      Simplify the allocation stats accounting · 65a0e1eb
      Simon Marlow authored
      We were doing it in two different ways and asserting that the results
      were the same.  In most cases they were, but I found one case where
      they weren't: the GC itself allocates some memory for running
      finalizers, and this memory was accounted for one way but not the
      other.
      
      It was simpler to remove the old way of counting allocation than to
      try to fix it up, so I did that.
  34. 17 Jan, 2013 1 commit
  35. 16 Nov, 2012 1 commit
      Add a write barrier for TVAR closures · 6d784c43
      Simon Marlow authored
      This improves GC performance when there are a lot of TVars in the
      heap.  For instance, a TChan with a lot of elements causes a massive
      GC drag without this patch.
      
      There's more to do - several other STM closure types don't have write
      barriers, so GC performance when there are a lot of threads blocked on
      STM isn't great.  But fixing the problem for TVar is a good start.
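
      A write barrier of this kind can be sketched as below, assuming it
      works like a conventional generational remembered set: the first write
      to a clean TVar records it, so the GC only needs to scan recorded
      TVars rather than every TVar in the heap.  The message does not
      describe the mechanism, so the types, the fixed-size list and the
      function names are all hypothetical.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      #define SKETCH_MUT_LIST_MAX 64

      typedef struct {
          void *current_value;
          bool  dirty;  /* already recorded since the last GC? */
      } SketchTVar;

      /* Objects the next GC must scan because they were written to. */
      typedef struct {
          SketchTVar *entries[SKETCH_MUT_LIST_MAX];
          size_t n;
      } SketchMutList;

      /* Write barrier: record the TVar on its first write since the last GC. */
      static void sketch_write_tvar(SketchMutList *ml, SketchTVar *tv, void *v)
      {
          if (!tv->dirty) {
              assert(ml->n < SKETCH_MUT_LIST_MAX);
              tv->dirty = true;
              ml->entries[ml->n++] = tv;
          }
          tv->current_value = v;
      }

      /* After the GC has scanned the recorded TVars, mark them clean again. */
      static void sketch_reset_after_gc(SketchMutList *ml)
      {
          for (size_t i = 0; i < ml->n; i++)
              ml->entries[i]->dirty = false;
          ml->n = 0;
      }
      ```

      Repeated writes to the same TVar add only one entry, and TVars that
      are never written add none.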