1. 13 Dec, 2016 1 commit
  2. 07 Dec, 2016 1 commit
    • Overhaul of Compact Regions (#12455) · 7036fde9
      Simon Marlow authored
      Summary:
      This commit makes various improvements and addresses some issues with
      Compact Regions (aka Compact Normal Forms).
      
      This was the most important thing I wanted to fix.  Compaction
      previously prevented GC from running until it was complete, which
      would be a problem in a multicore setting.  Now, we compact using a
      hand-written Cmm routine that can be interrupted at any point.  When a
      GC is triggered during a sharing-enabled compaction, the GC has to
      traverse and update the hash table, so this hash table is now stored
      in the StgCompactNFData object.
      
      Previously, compaction consisted of a deepseq using the NFData class,
      followed by a traversal in C code to copy the data.  This is now done
      in a single pass with hand-written Cmm (see rts/Compact.cmm). We no
      longer use the NFData instances; instead, the Cmm routine evaluates
      components directly as it compacts.
      
      The new compaction is about 50% faster than the old one with no
      sharing, and a little faster on average with sharing (the cost of the
      hash table dominates when we're doing sharing).
      
      Static objects that don't (transitively) refer to any CAFs don't need
      to be copied into the compact region.  In particular this means we
      often avoid copying Char values and small Int values, because these
      are static closures in the runtime.
      
      Each Compact# object can support a single compactAdd# operation at any
      given time, so the Data.Compact library now enforces mutual exclusion
      using an MVar stored in the Compact object.
      
      We now get exceptions rather than killing everything with a barf()
      when we encounter an object that cannot be compacted (a function, or a
      mutable object).  We now also detect pinned objects, which can't be
      compacted either.
      
      The Data.Compact API has been refactored and cleaned up.  A new
      compactSize operation returns the size (in bytes) of the compact
      object.
      
      Most of the documentation is in the Haddock docs for the compact
      library, which I've expanded and improved here.
      
      Various comments in the code have been improved, especially the main
      Note [Compact Normal Forms] in rts/sm/CNF.c.
      
      I've added a few tests, and expanded a few of the tests that were
      there.  We now also run the tests with GHCi, and in a new test way
      that enables sanity checking (+RTS -DS).
      
      There's a benchmark in libraries/compact/tests/compact_bench.hs for
      measuring compaction speed and comparing sharing vs. no sharing.
      
      The field totalDataW in StgCompactNFData was unnecessary.
      
      Test Plan:
      * new unit tests
      * validate
      * tested manually that we can compact Data.Aeson data
      
      Reviewers: gcampax, bgamari, ezyang, austin, niteria, hvr, erikd
      
      Subscribers: thomie, simonpj
      
      Differential Revision: https://phabricator.haskell.org/D2751
      
      GHC Trac Issues: #12455
  3. 06 Dec, 2016 1 commit
    • Overhaul GC stats · 24e6594c
      Simon Marlow authored
      Summary:
      Visible API changes:
      
      * The C struct `GCDetails` gives the stats about a single GC.  This is
        passed to the `gcDone()` callback if one is set via the
        RtsConfig. (previously we just passed a collection of values, so this
        is more extensible, at the expense of breaking the existing API)
      
      * `RTSStats` gives cumulative stats since the start of the program,
        and includes the `GCDetails` for the most recent GC.  This struct
        can be obtained via `getRTSStats()` (the old `getGCStats()` has been
        removed, and `getGCStatsEnabled()` has been renamed to
        `getRTSStatsEnabled()`)
      
      Improvements:
      
      * The per-GC stats and cumulative stats are now cleanly separated.
      
      * Inside the RTS we have a top-level `RTSStats` struct to keep all our
        stats in; previously this was just a collection of strangely-named
        variables.  This struct is mostly just copied in `getRTSStats()`, so
        the implementation of that function is a lot shorter.
      
      * Types are more consistent.  We use a uint64_t byte count for all
        memory values, and Time for all time values.
      
      * Names are more consistent.  We use a suffix `_bytes` for all byte
        counts and `_ns` for all time values.
      
      * We now collect information about the amount of memory in large
        objects and compact objects in `GCDetails`. (the latter was the reason
        I started doing this patch but it seems to have ballooned a bit!)
      
      * I fixed a bug in the calculation of the elapsed MUT time, and added
        an ASSERT to stop the calculations going wrong in the future.
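
      The split between per-GC and cumulative stats can be sketched roughly
      as below.  This is a simplified illustration, not the RTS code: the
      `Sketch*` types, their reduced field sets and `sketch_record_gc()` are
      hypothetical, and only the `_bytes` / `_ns` suffix convention and the
      idea of embedding the most recent GC's details are taken from the
      description above.

```c
#include <stdint.h>

/* Hypothetical, simplified mirrors of the per-GC and cumulative structs.
   The real structs carry more fields; only the naming convention
   (`_bytes` for byte counts, `_ns` for times) follows the commit text. */
typedef int64_t SketchTime;              /* assumed: time in nanoseconds */

typedef struct {
    uint32_t   gen;                      /* generation collected */
    uint64_t   copied_bytes;             /* bytes copied by this GC */
    uint64_t   live_bytes;               /* live data after this GC */
    uint64_t   large_objects_bytes;      /* memory in large objects */
    uint64_t   compact_bytes;            /* memory in compact objects */
    SketchTime elapsed_ns;               /* wall-clock time of this GC */
} SketchGCDetails;

typedef struct {
    uint32_t        gcs;                 /* number of GCs so far */
    uint64_t        copied_bytes;        /* total bytes copied */
    uint64_t        max_live_bytes;      /* high-water mark of live data */
    SketchTime      gc_elapsed_ns;       /* total wall-clock GC time */
    SketchGCDetails gc;                  /* details of the most recent GC */
} SketchRTSStats;

/* Fold one collection's details into the cumulative totals. */
static void sketch_record_gc(SketchRTSStats *s, const SketchGCDetails *d)
{
    s->gcs++;
    s->copied_bytes  += d->copied_bytes;
    s->gc_elapsed_ns += d->elapsed_ns;
    if (d->live_bytes > s->max_live_bytes) {
        s->max_live_bytes = d->live_bytes;
    }
    s->gc = *d;                          /* keep the latest per-GC record */
}
```

      Keeping the per-GC record as a value inside the cumulative struct
      means a reader gets both views from a single copy of the struct.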
      
      For now I kept the Haskell API in `GHC.Stats` the same, by
      impedance-matching with the new API.  We could either break that API
      and make it match the C API more closely, or we could add a new API
      and deprecate the old one.  Opinions welcome.
      
      This stuff is very easy to get wrong, and it's hard to test.  Reviews
      welcome!
      
      Test Plan:
      manual testing
      validate
      
      Reviewers: bgamari, niteria, austin, ezyang, hvr, erikd, rwbarton, Phyx
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2756
  4. 29 Nov, 2016 1 commit
  5. 27 Jul, 2016 1 commit
  6. 26 Jul, 2016 1 commit
  7. 19 May, 2016 1 commit
  8. 10 May, 2016 1 commit
    • Use stdint types for Stg{Word,Int}{8,16,32,64} · 260a5648
      wereHamster authored
      We can't define Stg{Int,Word} in terms of {,u}intptr_t because STG
      depends on them being the exact same size as void*, and {,u}intptr_t
      does not make that guarantee. Furthermore, we also need to define
      StgHalf{Int,Word}, so the preprocessor `#if` needs to stay. But we can at
      least keep it in a single place instead of repeating it in various
      files.
      
      Also define STG_{INT,WORD}{8,16,32,64}_{MIN,MAX} and use them in HsFFI.h,
      further reducing the need for CPP in other files.
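
      A minimal sketch of the arrangement described above, assuming a
      configure-style pointer-size macro.  All `Sketch*` / `SKETCH_*` names
      are hypothetical stand-ins, and falling back to the compiler's
      `__SIZEOF_POINTER__` is an assumption of this example, not something
      the patch does.

```c
#include <stdint.h>

/* Fixed-width types map directly onto <stdint.h>. */
typedef int8_t   SketchInt8;
typedef uint8_t  SketchWord8;
typedef int16_t  SketchInt16;
typedef uint16_t SketchWord16;
typedef int32_t  SketchInt32;
typedef uint32_t SketchWord32;
typedef int64_t  SketchInt64;
typedef uint64_t SketchWord64;

/* Limits are defined once, so other headers need no CPP of their own. */
#define SKETCH_INT8_MIN   INT8_MIN
#define SKETCH_WORD8_MAX  UINT8_MAX
#define SKETCH_INT16_MIN  INT16_MIN
#define SKETCH_WORD64_MAX UINT64_MAX

/* Assumption: the pointer size normally comes from configure. */
#if !defined(SKETCH_SIZEOF_VOID_P)
#  if defined(__SIZEOF_POINTER__)
#    define SKETCH_SIZEOF_VOID_P __SIZEOF_POINTER__
#  else
#    error "define SKETCH_SIZEOF_VOID_P (normally a configure result)"
#  endif
#endif

/* The one remaining preprocessor conditional: word and half-word types. */
#if SKETCH_SIZEOF_VOID_P == 8
typedef int64_t  SketchInt;
typedef uint64_t SketchWord;
typedef int32_t  SketchHalfInt;
typedef uint32_t SketchHalfWord;
#elif SKETCH_SIZEOF_VOID_P == 4
typedef int32_t  SketchInt;
typedef uint32_t SketchWord;
typedef int16_t  SketchHalfInt;
typedef uint16_t SketchHalfWord;
#else
#  error "unsupported pointer size"
#endif

/* Check at compile time that the word type is exactly pointer-sized. */
_Static_assert(sizeof(SketchWord) == sizeof(void *),
               "word type must be exactly pointer-sized");
_Static_assert(2 * sizeof(SketchHalfWord) == sizeof(SketchWord),
               "half-word type must be half a word");
```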
      
      Reviewers: austin, bgamari, simonmar, hvr, erikd
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2182
  9. 04 May, 2016 1 commit
  10. 07 Feb, 2016 1 commit
  11. 18 Nov, 2015 1 commit
  12. 07 Apr, 2015 1 commit
  13. 25 Nov, 2014 1 commit
    • Make clearNursery free · e22bc0de
      Simon Marlow authored
      Summary:
      clearNursery resets all the bd->free pointers of nursery blocks to
      make the blocks empty.  In profiles we've seen clearNursery taking
      significant amounts of time particularly with large -N and -A values.
      
      This patch moves the work of clearNursery to the point at which we
      actually need the new block, thereby introducing an invariant that
      blocks to the right of the CurrentNursery pointer still need their
      bd->free pointer reset.  This should make things faster overall,
      because we don't need to clear blocks that we don't use.
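
      The idea can be sketched as follows.  This is an illustration under
      simplifying assumptions, not the actual allocator: `SketchBlock`,
      `SketchNursery` and both functions are hypothetical, and the nursery
      is modelled as a plain singly linked list.

```c
#include <stddef.h>

/* Hypothetical, simplified block descriptor. */
typedef struct SketchBlock {
    char *start;                 /* first byte of the block */
    char *free;                  /* next free byte; stale until reused */
    struct SketchBlock *link;    /* next block in the nursery */
} SketchBlock;

typedef struct {
    SketchBlock *blocks;         /* head of the nursery's block list */
    SketchBlock *current;        /* analogue of the CurrentNursery pointer */
} SketchNursery;

/* Constant-time "clear": rewind to the first block and reset only that
   block.  Blocks to the right keep their stale `free` pointers. */
static void sketch_reset_nursery(SketchNursery *n)
{
    n->current = n->blocks;
    if (n->current != NULL) {
        n->current->free = n->current->start;
    }
}

/* Move to the next block, resetting its `free` pointer only now that the
   block is actually needed.  Returns NULL when the nursery is exhausted. */
static SketchBlock *sketch_next_block(SketchNursery *n)
{
    SketchBlock *next = (n->current != NULL) ? n->current->link : NULL;
    if (next != NULL) {
        next->free = next->start;
        n->current = next;
    }
    return next;
}
```

      The cost of resetting is paid per block actually used, rather than per
      block in the nursery at every GC.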
      
      Test Plan: validate
      
      Reviewers: AndreasVoellmy, ezyang, austin
      
      Subscribers: thomie, carter, ezyang, simonmar
      
      Differential Revision: https://phabricator.haskell.org/D318
  14. 21 Oct, 2014 1 commit
  15. 29 Sep, 2014 1 commit
  16. 28 Jul, 2014 2 commits
    • rts: add Emacs 'Local Variables' to every .c file · 39b5c1cb
      Austin Seipp authored
      This will hopefully help ensure some basic consistency going forward by
      overriding buffer variables. In particular, it sets the wrap length, the
      offset to 4, and turns off tabs.
      Signed-off-by: Austin Seipp <austin@well-typed.com>
    • Increase precision of timings reported by RTS · 57ed4101
      Herbert Valerio Riedel authored
      Summary:
      Today's hardware is much faster, so it makes sense to report timings
      with more precision, and possibly help reduce rounding-induced
      fluctuations in the nofib statistics.
      
      This commit increases the precision of all timings previously reported
      with a granularity of 10ms to 1ms. For instance, the `+RTS -S` output is
      now rendered as:
      
          Alloc    Copied     Live     GC     GC      TOT      TOT  Page Flts
          bytes     bytes     bytes   user   elap     user     elap
         641936     59944    158120  0.000  0.000    0.013    0.001    0    0  (Gen:  0)
         517672     60840    158464  0.000  0.000    0.013    0.002    0    0  (Gen:  0)
         517256     58800    156424  0.005  0.005    0.019    0.007    0    0  (Gen:  1)
         670208      9520    158728  0.000  0.000    0.019    0.008    0    0  (Gen:  0)
      
        ...
      
                                           Tot time (elapsed)  Avg pause  Max pause
        Gen  0        24 colls,     0 par    0.002s   0.002s     0.0001s    0.0002s
        Gen  1         3 colls,     0 par    0.011s   0.011s     0.0038s    0.0055s
      
        TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
      
        SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
      
        INIT    time    0.001s  (  0.001s elapsed)
        MUT     time    0.005s  (  0.006s elapsed)
        GC      time    0.014s  (  0.014s elapsed)
        EXIT    time    0.001s  (  0.001s elapsed)
        Total   time    0.032s  (  0.020s elapsed)
      
      Note that this change also requires associated changes in the nofib
      submodule.
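
      As a rough illustration of the 1ms granularity shown above, a
      nanosecond duration can be rendered with three decimal places as
      below.  The helper name is hypothetical, and the assumption that the
      duration is held as a nanosecond count is mine; this is not the
      formatting code used by the RTS.

```c
#include <stdint.h>
#include <stdio.h>

/* Render a nanosecond duration as seconds with millisecond granularity,
   e.g. for a "GC time" style line.  Returns snprintf's result. */
static int sketch_format_seconds(char *buf, size_t len, uint64_t ns)
{
    return snprintf(buf, len, "%.3fs", (double)ns / 1e9);
}
```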
      
      Test Plan: tested with modified nofib
      
      Reviewers: simonmar, nomeata, austin
      
      Subscribers: simonmar, relrod, carter
      
      Differential Revision: https://phabricator.haskell.org/D97
  17. 10 Jul, 2014 1 commit
  18. 05 Dec, 2013 1 commit
  19. 28 Oct, 2013 1 commit
  20. 04 Sep, 2013 1 commit
    • Don't move Capabilities in setNumCapabilities (#8209) · aa779e09
      Simon Marlow authored
      We have various problems with reallocating the array of Capabilities,
      due to threads in waitForReturnCapability that are already holding a
      pointer to a Capability.
      
      Rather than add more locking to make this safer, I decided it would be
      easier to ensure that we never move the Capabilities at all.  The
      capabilities array is now an array of pointers to Capability.  There
      are extra indirections, but it rarely matters - we don't often access
      Capabilities via the array, normally we already have a pointer to
      one.  I ran the parallel benchmarks and didn't see any difference.
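
      A minimal sketch of why the extra indirection helps, using
      hypothetical names (`SketchCap`, `sketch_caps`,
      `sketch_set_num_caps`).  It ignores all synchronisation, including
      concurrent readers of the pointer array itself, and is not the RTS
      implementation.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a Capability. */
typedef struct {
    unsigned no;                     /* index of this capability */
} SketchCap;

static SketchCap **sketch_caps   = NULL;   /* array of pointers */
static unsigned    sketch_n_caps = 0;

/* Grow to `n` capabilities.  Only the array of pointers is reallocated;
   each SketchCap stays at its original address, so a pointer obtained
   before the call remains valid.  Returns 0 on success, -1 on failure. */
static int sketch_set_num_caps(unsigned n)
{
    if (n <= sketch_n_caps) {
        return 0;                    /* shrinking is out of scope here */
    }
    SketchCap **grown = realloc(sketch_caps, n * sizeof *grown);
    if (grown == NULL) {
        return -1;
    }
    sketch_caps = grown;
    for (unsigned i = sketch_n_caps; i < n; i++) {
        SketchCap *c = malloc(sizeof *c);
        if (c == NULL) {
            sketch_n_caps = i;       /* keep what was allocated so far */
            return -1;
        }
        c->no = i;
        sketch_caps[i] = c;
    }
    sketch_n_caps = n;
    return 0;
}
```

      With an array of structs, the same growth step could move every
      element and leave previously obtained pointers dangling.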
  21. 14 Feb, 2013 1 commit
    • Simplify the allocation stats accounting · 65a0e1eb
      Simon Marlow authored
      We were doing it in two different ways and asserting that the results
      were the same.  In most cases they were, but I found one case where
      they weren't: the GC itself allocates some memory for running
      finalizers, and this memory was accounted for one way but not the
      other.
      
      It was simpler to remove the old way of counting allocation than to
      try to fix it up, so I did that.
  22. 14 Sep, 2012 1 commit
  23. 07 Sep, 2012 1 commit
    • Deprecate lnat, and use StgWord instead · 41737f12
      Simon Marlow authored
      lnat was originally "long unsigned int" but we were using it when we
      wanted a 64-bit type on a 64-bit machine.  This broke on Windows x64,
      where long == int == 32 bits.  Using types of unspecified size is bad,
      but what we really wanted was a type with N bits on an N-bit machine.
      StgWord is exactly that.
      
      lnat was mentioned in some APIs that clients might be using
      (e.g. StackOverflowHook()), so we leave it defined but with a comment
      to say that it's deprecated.
  24. 19 Jun, 2012 1 commit
  25. 26 Apr, 2012 2 commits
    • Win32 build fix · 0ff6dbc8
      Ian Lynagh authored
    • Fix warnings on Win64 · 1dbe6d59
      Ian Lynagh authored
      Mostly this meant getting pointer<->int conversions to use the right
      sizes. lnat is now size_t, rather than unsigned long, as that seems a
      better match for how it's used.
  26. 04 Apr, 2012 6 commits
    • Add the GC_GLOBAL_SYNC event marking that all caps are stopped for GC · c294d95d
      Mikolaj Konarski authored
      Quoting design rationale by dcoutts: The event indicates that we're doing
      a stop-the-world GC and all other HECs should be between their GC_START
      and GC_END events at that moment. We don't want to use GC_STATS_GHC
      for that, because GC_STATS_GHC is for extra GHC-specific info,
      not something we have to rely on to be able to match the GC pauses
      across HECs to a particular global GC.
    • Fix the timestamps in GC_START and GC_END events on the GC-initiating cap · 598109eb
      Mikolaj Konarski authored
      There was a discrepancy between GC times reported in +RTS -s
      and the timestamps of GC_START and GC_END events on the cap,
      on which +RTS -s stats for the given GC are based.
      This is fixed by posting the events with exactly the same timestamp
      as generated for the stat calculation. The calls posting the events
      are moved too, so that the events are emitted close to the time instant
      they claim to be emitted at. The GC_STATS_GHC was moved, too, ensuring
      it's emitted before the moved GC_END on all caps, which simplifies tools code.
    • Emit final heap alloc events and rearrange code to calculate alloc totals · 1f809ce6
      Duncan Coutts authored
      In stat_exit we want to emit a final EVENT_HEAP_ALLOCATED for each cap
      so that we get the same total allocation count as reported via +RTS -s.
      To do so we need to update the per-cap total_allocated counts.
      
      Previously we had a single calcAllocated(rtsBool) function that counted
      the large allocations and optionally the nurseries for all caps. The GC
      would always call it with false, and the stat_exit always with true.
      The reason for these two modes is that the GC counts the nurseries via
      clearNurseries() (which also updates the per-cap total_allocated
      counts), so it's only the stat_exit() path that needs to count them.
      
      We now split the calcAllocated() function into two: countLargeAllocated
      and updateNurseriesStats. As the name suggests, the latter now updates
      the per-cap total_allocated counts, in addition to returning a total.
    • Add new eventlog events for various heap and GC statistics · 65aaa9b2
      Duncan Coutts authored
      They cover much the same info as is available via the GHC.Stats module
      or via the '+RTS -s' textual output, but via the eventlog and with a
      better sampling frequency.
      
      We have three new generic heap info events and two very GHC-specific
      ones. (The hope is the general ones are usable by other implementations
      that use the same eventlog system, or indeed not so sensitive to changes
      in GHC itself.)
      
      The general ones are:
      
       * total heap mem allocated since prog start, on a per-HEC basis
       * current size of the heap (MBlocks reserved from OS for the heap)
       * current size of live data in the heap
      
      Currently these are all emitted by GHC at GC time (live data only at
      major GC).
      
      The GHC specific ones are:
      
       * an event giving various static heap parameters:
         * number of generations (usually 2)
         * max size if any
         * nursery size
         * MBlock and block sizes
       * an event emitted on each GC containing:
         * GC generation (usually just 0,1)
         * total bytes copied
         * bytes lost to heap slop and fragmentation
         * the number of threads in the parallel GC (1 for serial)
         * the maximum number of bytes copied by any par GC thread
         * the total number of bytes copied by all par GC threads
           (these last three can be used to calculate an estimate of the
            work balance in parallel GCs)
    • Change the presentation of parallel GC work balance in +RTS -s · cd930da1
      Duncan Coutts authored
      Also rename internal variables to make the names match what they hold.
      The parallel GC work balance is calculated using the total amount of
      memory copied by all GC threads, and the maximum copied by any
      individual thread. You have serial GC when the max is the same as
      copied, and perfectly balanced GC when total/max == n_caps.
      
      Previously we presented this as the ratio total/max and told users
      that the serial value was 1 and the ideal value N, for N caps, e.g.
      
        Parallel GC work balance: 1.05 (4045071 / 3846774, ideal 2)
      
      The downside of this is that the user always has to keep in mind the
      number of cores being used. Our new presentation uses a normalised
      scale 0--1 as a percentage. The 0% means completely serial and 100%
      is perfect balance, e.g.
      
        Parallel GC work balance: 4.56% (serial 0%, perfect 100%)
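
      One way to compute such a normalised figure is the linear map below,
      which sends total/max == 1 to 0% and total/max == n_caps to 100%.
      This is a sketch consistent with the endpoints described above; the
      function name and the guard for degenerate inputs are assumptions of
      this example.

```c
/* Normalised parallel GC work balance as a percentage:
   0 means completely serial, 100 means perfectly balanced.
   Returns 0 when there are fewer than two capabilities or nothing was
   copied (this guard is an assumption of the sketch). */
static double sketch_work_balance_pct(double total_copied,
                                      double max_copied,
                                      unsigned n_caps)
{
    if (n_caps < 2 || max_copied <= 0.0) {
        return 0.0;
    }
    return 100.0 * (total_copied / max_copied - 1.0) / (double)(n_caps - 1);
}
```

      Fed the figures from the older-style line above (4045071 / 3846774
      with 2 caps), this map gives roughly 5.15%; the two sample lines in
      this message need not come from the same run.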
    • Calculate the total memory allocated on a per-capability basis · 8536f09c
      Duncan Coutts authored
      In addition to the existing global method. For now we just do
      it both ways and assert they give the same grand total. At some
      stage we can simplify the global method to just take the sum of
      the per-cap counters.
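
      The "do it both ways and assert" approach can be sketched like this.
      The struct and function names are hypothetical; only the idea of
      summing per-capability counters and checking them against a global
      total comes from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-capability counter. */
typedef struct {
    uint64_t total_allocated;
} SketchCapCounters;

/* Sum the per-capability allocation counters. */
static uint64_t sketch_sum_per_cap(const SketchCapCounters *caps, unsigned n)
{
    uint64_t total = 0;
    for (unsigned i = 0; i < n; i++) {
        total += caps[i].total_allocated;
    }
    return total;
}

/* Cross-check the per-capability sum against a separately maintained
   global total, then return the agreed value. */
static uint64_t sketch_checked_total(const SketchCapCounters *caps,
                                     unsigned n, uint64_t global_total)
{
    uint64_t per_cap = sketch_sum_per_cap(caps, n);
    assert(per_cap == global_total);
    return per_cap;
}
```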
  27. 02 Mar, 2012 1 commit
    • Drop the per-task timing stats, give a summary only (#5897) · 085c7fe5
      Simon Marlow authored
      We were keeping around the Task struct (216 bytes) for every worker we
      ever created, even though we only keep a maximum of 6 workers per
      Capability.  These Task structs accumulate and cause a space leak in
      programs that do lots of safe FFI calls; this patch frees the Task
      struct as soon as a worker exits.
      
      One reason we were keeping the Task structs around is because we print
      out per-Task timing stats in +RTS -s, but that isn't terribly useful.
      What is sometimes useful is knowing how *many* Tasks there were.  So
      now I'm printing a single-line summary, this is for the program in
      
        TASKS: 2001 (1 bound, 31 peak workers (2000 total), using -N1)
      
      So although we created 2k tasks overall, there were only 31 workers
      active at any one time (which is exactly what we expect: the program
      makes 30 safe FFI calls concurrently).
      
      This also gives an indication of how many capabilities were being
      used, which is handy if you use +RTS -N without an explicit number.
  28. 15 Jan, 2012 2 commits
  29. 06 Jan, 2012 1 commit
  30. 06 Dec, 2011 1 commit
  31. 25 Nov, 2011 1 commit
    • Time handling overhaul · 6b109851
      Simon Marlow authored
      Terminology cleanup: the type "Ticks" has been renamed "Time", which
      is an StgWord64 in units of TIME_RESOLUTION (currently nanoseconds).
      The terminology "tick" is now used consistently to mean the interval
      between timer signals.
      
      The ticker now always ticks in realtime (actually CLOCK_MONOTONIC if
      we have it).  Before it used CPU time in the non-threaded RTS and
      realtime in the threaded RTS, but I've discovered that the CPU timer
      has terrible resolution (at least on Linux) and isn't much use for
      profiling.  So now we always use realtime.  This should also fix
      
      The default tick interval is now 10ms, except when profiling where we
      drop it to 1ms.  This gives more accurate profiles without affecting
      runtime too much (<1%).
      
      Lots of cleanups - the resolution of Time is now in one place
      only (Rts.h) rather than having calculations that depend on the
      resolution scattered all over the RTS.  I hope I found them all.
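
      Keeping the resolution in one place might look roughly like the
      sketch below.  The `SKETCH_*` macro names and the two tick constants
      are hypothetical; only the 64-bit `Time` in nanosecond units and the
      10ms / 1ms tick intervals come from the text above.

```c
#include <stdint.h>

/* A 64-bit time value in units of SKETCH_TIME_RESOLUTION per second. */
typedef uint64_t SketchTime;

/* The single place where the resolution is defined: nanoseconds. */
#define SKETCH_TIME_RESOLUTION 1000000000ULL

/* Conversions expressed in terms of the resolution, so that changing it
   above does not require touching any call sites. */
#define SKETCH_SECONDS_TO_TIME(s) ((SketchTime)(s) * SKETCH_TIME_RESOLUTION)
#define SKETCH_MS_TO_TIME(ms)     ((SketchTime)(ms) * (SKETCH_TIME_RESOLUTION / 1000))
#define SKETCH_TIME_TO_US(t)      ((t) / (SKETCH_TIME_RESOLUTION / 1000000))

/* Tick intervals: 10ms by default, 1ms when profiling. */
static const SketchTime sketch_default_tick   = SKETCH_MS_TO_TIME(10);
static const SketchTime sketch_profiling_tick = SKETCH_MS_TO_TIME(1);
```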
  32. 02 Nov, 2011 1 commit