1. 08 Sep, 2013 2 commits
  2. 04 Sep, 2013 1 commit
    • Simon Marlow's avatar
      Don't move Capabilities in setNumCapabilities (#8209) · aa779e09
      Simon Marlow authored
      We have various problems with reallocating the array of Capabilities,
      due to threads in waitForReturnCapability that are already holding a
      pointer to a Capability.
      Rather than add more locking to make this safer, I decided it would be
      easier to ensure that we never move the Capabilities at all.  The
      capabilities array is now an array of pointers to Capabaility.  There
      are extra indirections, but it rarely matters - we don't often access
      Capabilities via the array, normally we already have a pointer to
      one.  I ran the parallel benchmarks and didn't see any difference.
  3. 15 Jun, 2013 1 commit
  4. 08 Jun, 2013 1 commit
  5. 29 Mar, 2013 1 commit
    • nfrisby's avatar
      ticky enhancements · 460abd75
      nfrisby authored
        * the new StgCmmArgRep module breaks a dependency cycle; I also
          untabified it, but made no real changes
        * updated the documentation in the wiki and change the user guide to
          point there
        * moved the allocation enters for ticky and CCS to after the heap check
          * I left LDV where it was, which was before the heap check at least
            once, since I have no idea what it is
        * standardized all (active?) ticky alloc totals to bytes
        * in order to avoid double counting StgCmmLayout.adjustHpBackwards
          no longer bumps ALLOC_HEAP_ctr
        * I resurrected the SLOW_CALL counters
          * the new module StgCmmArgRep breaks cyclic dependency between
            Layout and Ticky (which the SLOW_CALL counters cause)
          * renamed them SLOW_CALL_fast_<pattern> and VERY_SLOW_CALL
        * added ALLOC_RTS_ctr and _tot ticky counters
          * eg allocation by Storage.c:allocate or a BUILD_PAP in stg_ap_*_info
          * resurrected ticky counters for ALLOC_THK, ALLOC_PAP, and
          * added -ticky and -DTICKY_TICKY in ways.mk for debug ways
        * added a ticky counter for total LNE entries
        * new flags for ticky: -ticky-allocd -ticky-dyn-thunk -ticky-LNE
          * all off by default
          * -ticky-allocd: tracks allocation *of* closure in addition to
             allocation *by* that closure
          * -ticky-dyn-thunk tracks dynamic thunks as if they were functions
          * -ticky-LNE tracks LNEs as if they were functions
        * updated the ticky report format, including making the argument
          categories (more?) accurate again
        * the printed name for things in the report include the unique of
          their ticky parent as well as if they are not top-level
  6. 14 Feb, 2013 1 commit
    • Simon Marlow's avatar
      Simplify the allocation stats accounting · 65a0e1eb
      Simon Marlow authored
      We were doing it in two different ways and asserting that the results
      were the same.  In most cases they were, but I found one case where
      they weren't: the GC itself allocates some memory for running
      finalizers, and this memory was accounted for one way but not the
      It was simpler to remove the old way of counting allocation that to
      try to fix it up, so I did that.
  7. 17 Jan, 2013 1 commit
    • Simon Marlow's avatar
      Hopefully fix breakage on OS X w/ LLVM · 0dcccf0c
      Simon Marlow authored
      Reordering of includes in GC.c broke on OS X because gctKey is
      declared in Task.h and is needed in the storage manager.  This is
      really the wrong place for it anyway, so I've moved the gctKey pieces
      to where they should be.
  8. 16 Nov, 2012 1 commit
    • Simon Marlow's avatar
      Add a write barrier for TVAR closures · 6d784c43
      Simon Marlow authored
      This improves GC performance when there are a lot of TVars in the
      heap.  For instance, a TChan with a lot of elements causes a massive
      GC drag without this patch.
      There's more to do - several other STM closure types don't have write
      barriers, so GC performance when there are a lot of threads blocked on
      STM isn't great.  But fixing the problem for TVar is a good start.
  9. 21 Sep, 2012 3 commits
  10. 07 Sep, 2012 3 commits
  11. 21 Aug, 2012 2 commits
  12. 10 Jul, 2012 1 commit
    • Simon Marlow's avatar
      Parallelise clearNurseries() in the parallel GC · 713cf473
      Simon Marlow authored
      The clearNurseries() operation resets the free pointer in each nursery
      block to the start of the block, emptying the nursery.  In the
      parallel GC this was done on the main GC thread, but that's bad
      because it accesses the bdescr of every nursery block, and move all
      those cache lines onto the CPU of the main GC thread.  With large
      nurseries, this can be especially bad.  So instead we want to clear
      each nursery in its local GC thread.
      Thanks to Andreas Voellmy <andreas.voellmy@gmail.com> for idenitfying
      the issue.
      After this change and the previous patch to make the last GC a major
      one, I see these results for nofib/parallel on 8 cores:
         blackscholes          +0.0%     +0.0%     -3.7%     -3.3%     +0.3%
                coins          +0.0%     +0.0%     -5.1%     -5.0%     +0.4%
                 gray          +0.0%     +0.0%     -4.5%     -2.1%     +0.8%
               mandel          +0.0%     -0.0%     -7.6%     -5.1%     -2.3%
              matmult          +0.0%     +5.5%     -2.8%     -1.9%     -5.8%
              minimax          +0.0%     +0.0%    -10.6%    -10.5%     +0.0%
                nbody          +0.0%     -4.4%     +0.0%      0.07     +0.0%
               parfib          +0.0%     +1.0%     +0.5%     +0.9%     +0.0%
              partree          +0.0%     +0.0%     -2.4%     -2.5%     +1.7%
                 prsa          +0.0%     -0.2%     +1.8%     +4.2%     +0.0%
               queens          +0.0%     -0.0%     -1.8%     -1.4%     -4.8%
                  ray          +0.0%     -0.6%    -18.5%    -17.8%     +0.0%
             sumeuler          +0.0%     -0.0%     -3.7%     -3.7%     +0.0%
            transclos          +0.0%     -0.0%    -25.7%    -26.6%     +0.0%
                  Min          +0.0%     -4.4%    -25.7%    -26.6%     -5.8%
                  Max          +0.0%     +5.5%     +1.8%     +4.2%     +1.7%
       Geometric Mean          +0.0%     +0.1%     -6.3%     -6.1%     -0.7%
  13. 30 Apr, 2012 1 commit
    • Ian Lynagh's avatar
      Fix maintenance of n_blocks in the RTS · 3457c6be
      Ian Lynagh authored
      It was causing assertion failures of
          ASSERT(countBlocks(nursery->blocks) == nursery->n_blocks)
          ghc-stage2: internal error: ASSERTION FAILED: file rts/sm/Sanity.c, line 878
  14. 04 Apr, 2012 3 commits
    • Duncan Coutts's avatar
      Emit final heap alloc events and rearrange code to calculate alloc totals · 1f809ce6
      Duncan Coutts authored
      In stat_exit we want to emit a final EVENT_HEAP_ALLOCATED for each cap
      so that we get the same total allocation count as reported via +RTS -s.
      To do so we need to update the per-cap total_allocated counts.
      Previously we had a single calcAllocated(rtsBool) function that counted
      the large allocations and optionally the nurseries for all caps. The GC
      would always call it with false, and the stat_exit always with true.
      The reason for these two modes is that the GC counts the nurseries via
      clearNurseries() (which also updates the per-cap total_allocated
      counts), so it's only the stat_exit() path that needs to count them.
      We now split the calcAllocated() function into two: countLargeAllocated
      and updateNurseriesStats. As the name suggests, the latter now updates
      the per-cap total_allocated counts, in additon to returning a total.
    • Duncan Coutts's avatar
      Add new eventlog events for various heap and GC statistics · 65aaa9b2
      Duncan Coutts authored
      They cover much the same info as is available via the GHC.Stats module
      or via the '+RTS -s' textual output, but via the eventlog and with a
      better sampling frequency.
      We have three new generic heap info events and two very GHC-specific
      ones. (The hope is the general ones are usable by other implementations
      that use the same eventlog system, or indeed not so sensitive to changes
      in GHC itself.)
      The general ones are:
       * total heap mem allocated since prog start, on a per-HEC basis
       * current size of the heap (MBlocks reserved from OS for the heap)
       * current size of live data in the heap
      Currently these are all emitted by GHC at GC time (live data only at
      major GC).
      The GHC specific ones are:
       * an event giving various static heap paramaters:
         * number of generations (usually 2)
         * max size if any
         * nursary size
         * MBlock and block sizes
       * a event emitted on each GC containing:
         * GC generation (usually just 0,1)
         * total bytes copied
         * bytes lost to heap slop and fragmentation
         * the number of threads in the parallel GC (1 for serial)
         * the maximum number of bytes copied by any par GC thread
         * the total number of bytes copied by all par GC threads
           (these last three can be used to calculate an estimate of the
            work balance in parallel GCs)
    • Duncan Coutts's avatar
      Calculate the total memory allocated on a per-capability basis · 8536f09c
      Duncan Coutts authored
      In addition to the existing global method. For now we just do
      it both ways and assert they give the same grand total. At some
      stage we can simplify the global method to just take the sum of
      the per-cap counters.
  15. 27 Feb, 2012 2 commits
  16. 13 Feb, 2012 1 commit
    • Simon Marlow's avatar
      Allocate pinned object blocks from the nursery, not the global · 67f4ab7e
      Simon Marlow authored
      Prompted by a benchmark posted to parallel-haskell@haskell.org by
      Andreas Voellmy <andreas.voellmy@gmail.com>.  This program exhibits
      contention for the block allocator when run with -N2 and greater
      without the fix:
      {-# LANGUAGE MagicHash, UnboxedTuples, BangPatterns #-}
      module Main where
      import Control.Monad
      import Control.Concurrent
      import System.Environment
      import GHC.IO
      import GHC.Exts
      import GHC.Conc
      main = do
       [m] <- fmap (fmap read) getArgs
       n <- getNumCapabilities
       ms <- replicateM n newEmptyMVar
       sequence [ forkIO $ busyWorkerB (m `quot` n) >> putMVar mv () | mv <- ms ]
       mapM takeMVar ms
      busyWorkerB :: Int -> IO ()
      busyWorkerB n_loops = go 0
        where go !n | n >= n_loops = return ()
                    | otherwise    =
                do p <- (IO $ \s ->
                          case newPinnedByteArray# 1024# s      of
                            { (# s', mbarr# #) ->
                                 (# s', () #)
                   go (n+1)
  17. 13 Dec, 2011 2 commits
  18. 06 Dec, 2011 1 commit
  19. 29 Nov, 2011 1 commit
    • Simon Marlow's avatar
      Make profiling work with multiple capabilities (+RTS -N) · 50de6034
      Simon Marlow authored
      This means that both time and heap profiling work for parallel
      programs.  Main internal changes:
        - CCCS is no longer a global variable; it is now another
          pseudo-register in the StgRegTable struct.  Thus every
          Capability has its own CCCS.
        - There is a new built-in CCS called "IDLE", which records ticks for
          Capabilities in the idle state.  If you profile a single-threaded
          program with +RTS -N2, you'll see about 50% of time in "IDLE".
        - There is appropriate locking in rts/Profiling.c to protect the
          shared cost-centre-stack data structures.
      This patch does enough to get it working, I have cut one big corner:
      the cost-centre-stack data structure is still shared amongst all
      Capabilities, which means that multiple Capabilities will race when
      updating the "allocations" and "entries" fields of a CCS.  Not only
      does this give unpredictable results, but it runs very slowly due to
      cache line bouncing.
      It is strongly recommended that you use -fno-prof-count-entries to
      disable the "entries" count when profiling parallel programs. (I shall
      add a note to this effect to the docs).
  20. 02 Nov, 2011 1 commit
    • Simon Marlow's avatar
      Overhaul of infrastructure for profiling, coverage (HPC) and breakpoints · 7bb0447d
      Simon Marlow authored
      User visible changes
      Flags renamed (the old ones are still accepted for now):
        OLD            NEW
        ---------      ------------
        -auto-all      -fprof-auto
        -auto          -fprof-exported
        -caf-all       -fprof-cafs
      New flags:
        -fprof-auto              Annotates all bindings (not just top-level
                                 ones) with SCCs
        -fprof-top               Annotates just top-level bindings with SCCs
        -fprof-exported          Annotates just exported bindings with SCCs
        -fprof-no-count-entries  Do not maintain entry counts when profiling
                                 (can make profiled code go faster; useful with
                                 heap profiling where entry counts are not used)
      Cost-centre stacks have a new semantics, which should in most cases
      result in more useful and intuitive profiles.  If you find this not to
      be the case, please let me know.  This is the area where I have been
      experimenting most, and the current solution is probably not the
      final version, however it does address all the outstanding bugs and
      seems to be better than GHC 7.2.
      Stack traces
      +RTS -xc now gives more information.  If the exception originates from
      a CAF (as is common, because GHC tends to lift exceptions out to the
      top-level), then the RTS walks up the stack and reports the stack in
      the enclosing update frame(s).
      Result: +RTS -xc is much more useful now - but you still have to
      compile for profiling to get it.  I've played around a little with
      adding 'head []' to GHC itself, and +RTS -xc does pinpoint the problem
      quite accurately.
      I plan to add more facilities for stack tracing (e.g. in GHCi) in the
      Coverage (HPC)
       * derived instances are now coloured yellow if they weren't used
       * likewise record field names
       * entry counts are more accurate (hpc --fun-entry-count)
       * tab width is now correct (markup was previously off in source with
      Internal changes
      In Core, the Note constructor has been replaced by
              Tick (Tickish b) (Expr b)
      which is used to represent all the kinds of source annotation we
      support: profiling SCCs, HPC ticks, and GHCi breakpoints.
      Depending on the properties of the Tickish, different transformations
      apply to Tick.  See CoreUtils.mkTick for details.
      This commit closes the following tickets, test cases to follow:
        - Close #2552: not a bug, but the behaviour is now more intuitive
          (test is T2552)
        - Close #680 (test is T680)
        - Close #1531 (test is result001)
        - Close #949 (test is T949)
        - Close #2466: test case has bitrotted (doesn't compile against current
          version of vector-space package)
  21. 17 Oct, 2011 1 commit
  22. 14 Apr, 2011 1 commit
    • Simon Marlow's avatar
      Avoid accumulating slop in the pinned_object_block. · cc2ea98a
      Simon Marlow authored
      The pinned_object_block is where we allocate small pinned ByteArray#
      objects.  At a GC the pinned_object_block was being treated like other
      large objects and promoted to the next step/generation, even if it was
      only partly full.  Under some ByteString-heavy workloads this would
      accumulate on average 2k of slop per GC, and this memory is never
      released until the ByteArray# objects in the block are freed.
      So now, we keep allocating into the pinned_object_block until it is
      completely full, at which point it is handed over to the GC as before.
      The pinned_object_block might therefore contain objects which a large
      range of ages, but I don't think this is any worse than the situation
      before.  We still have the fragmentation issue in general, but the new
      scheme can improve the memory overhead for some workloads
  23. 02 Feb, 2011 3 commits
    • Simon Marlow's avatar
      fix warning · df521c32
      Simon Marlow authored
    • Simon Marlow's avatar
      GC refactoring and cleanup · 18896fa2
      Simon Marlow authored
      Now we keep any partially-full blocks in the gc_thread[] structs after
      each GC, rather than moving them to the generation.  This should give
      us slightly better locality (though I wasn't able to measure any
      Also in this patch: better sanity checking with THREADED.
    • Simon Marlow's avatar
      Remove the per-generation mutable lists · 32907722
      Simon Marlow authored
      Now that we use the per-capability mutable lists exclusively.
  24. 21 Dec, 2010 1 commit
    • Simon Marlow's avatar
      Count allocations more accurately · db0c13a4
      Simon Marlow authored
      The allocation stats (+RTS -s etc.) used to count the slop at the end
      of each nursery block (except the last) as allocated space, now we
      count the allocated words accurately.  This should make allocation
      figures more predictable, too.
      This has the side effect of reducing the apparent allocations by a
      small amount (~1%), so remember to take this into account when looking
      at nofib results.
  25. 15 Dec, 2010 1 commit
    • Simon Marlow's avatar
      Implement stack chunks and separate TSO/STACK objects · f30d5273
      Simon Marlow authored
      This patch makes two changes to the way stacks are managed:
      1. The stack is now stored in a separate object from the TSO.
      This means that it is easier to replace the stack object for a thread
      when the stack overflows or underflows; we don't have to leave behind
      the old TSO as an indirection any more.  Consequently, we can remove
      ThreadRelocated and deRefTSO(), which were a pain.
      This is obviously the right thing, but the last time I tried to do it
      it made performance worse.  This time I seem to have cracked it.
      2. Stacks are now represented as a chain of chunks, rather than
         a single monolithic object.
      The big advantage here is that individual chunks are marked clean or
      dirty according to whether they contain pointers to the young
      generation, and the GC can avoid traversing clean stack chunks during
      a young-generation collection.  This means that programs with deep
      stacks will see a big saving in GC overhead when using the default GC
      A secondary advantage is that there is much less copying involved as
      the stack grows.  Programs that quickly grow a deep stack will see big
      In some ways the implementation is simpler, as nothing special needs
      to be done to reclaim stack as the stack shrinks (the GC just recovers
      the dead stack chunks).  On the other hand, we have to manage stack
      underflow between chunks, so there's a new stack frame
      (UNDERFLOW_FRAME), and we now have separate TSO and STACK objects.
      The total amount of code is probably about the same as before.
      There are new RTS flags:
         -ki<size> Sets the initial thread stack size (default 1k)  Egs: -ki4k -ki2m
         -kc<size> Sets the stack chunk size (default 32k)
         -kb<size> Sets the stack chunk buffer size (default 1k)
      -ki was previously called just -k, and the old name is still accepted
      for backwards compatibility.  These new options are documented.
  26. 27 Aug, 2010 1 commit
  27. 28 Jun, 2010 1 commit
  28. 04 Jun, 2010 1 commit