1. 21 Aug, 2012 2 commits
  2. 10 Jul, 2012 1 commit
    • Parallelise clearNurseries() in the parallel GC · 713cf473
      Simon Marlow authored
      The clearNurseries() operation resets the free pointer in each nursery
      block to the start of the block, emptying the nursery.  In the
      parallel GC this was done on the main GC thread, but that's bad
      because it accesses the bdescr of every nursery block, and moves all
      those cache lines onto the CPU of the main GC thread.  With large
      nurseries, this can be especially bad.  So instead we want to clear
      each nursery in its local GC thread.
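The idea can be sketched outside the RTS like this (a simplified illustration: bdescr_t, nursery_t and clear_nursery are stand-ins for GHC's real bdescr, nursery and clearNursery, not the actual definitions):

```c
#include <stddef.h>

/* Each nursery is a linked list of block descriptors; "clearing" a block
 * just resets its free pointer to the start of the block. */
typedef struct bdescr_ {
    char *start;              /* first byte of the block */
    char *free;               /* next free byte; free == start means empty */
    struct bdescr_ *link;     /* next block in this nursery */
} bdescr_t;

typedef struct {
    bdescr_t *blocks;
} nursery_t;

/* Run by each GC thread on its own nursery, so the bdescr cache lines
 * stay on that thread's CPU instead of migrating to the main GC thread. */
static void clear_nursery(nursery_t *n)
{
    for (bdescr_t *bd = n->blocks; bd != NULL; bd = bd->link) {
        bd->free = bd->start;
    }
}
```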
      
      Thanks to Andreas Voellmy <andreas.voellmy@gmail.com> for identifying
      the issue.
      
      After this change and the previous patch to make the last GC a major
      one, I see these results for nofib/parallel on 8 cores:
      
         blackscholes          +0.0%     +0.0%     -3.7%     -3.3%     +0.3%
                coins          +0.0%     +0.0%     -5.1%     -5.0%     +0.4%
                 gray          +0.0%     +0.0%     -4.5%     -2.1%     +0.8%
               mandel          +0.0%     -0.0%     -7.6%     -5.1%     -2.3%
              matmult          +0.0%     +5.5%     -2.8%     -1.9%     -5.8%
              minimax          +0.0%     +0.0%    -10.6%    -10.5%     +0.0%
                nbody          +0.0%     -4.4%     +0.0%      0.07     +0.0%
               parfib          +0.0%     +1.0%     +0.5%     +0.9%     +0.0%
              partree          +0.0%     +0.0%     -2.4%     -2.5%     +1.7%
                 prsa          +0.0%     -0.2%     +1.8%     +4.2%     +0.0%
               queens          +0.0%     -0.0%     -1.8%     -1.4%     -4.8%
                  ray          +0.0%     -0.6%    -18.5%    -17.8%     +0.0%
             sumeuler          +0.0%     -0.0%     -3.7%     -3.7%     +0.0%
            transclos          +0.0%     -0.0%    -25.7%    -26.6%     +0.0%
      --------------------------------------------------------------------------------
                  Min          +0.0%     -4.4%    -25.7%    -26.6%     -5.8%
                  Max          +0.0%     +5.5%     +1.8%     +4.2%     +1.7%
       Geometric Mean          +0.0%     +0.1%     -6.3%     -6.1%     -0.7%
      713cf473
  3. 30 Apr, 2012 1 commit
    • Fix maintenance of n_blocks in the RTS · 3457c6be
      Ian Lynagh authored
      It was causing assertion failures of
          ASSERT(countBlocks(nursery->blocks) == nursery->n_blocks)
      at
          ghc-stage2: internal error: ASSERTION FAILED: file rts/sm/Sanity.c, line 878
      3457c6be
  4. 04 Apr, 2012 3 commits
    • Emit final heap alloc events and rearrange code to calculate alloc totals · 1f809ce6
      Duncan Coutts authored
      In stat_exit we want to emit a final EVENT_HEAP_ALLOCATED for each cap
      so that we get the same total allocation count as reported via +RTS -s.
      To do so we need to update the per-cap total_allocated counts.
      
      Previously we had a single calcAllocated(rtsBool) function that counted
      the large allocations and optionally the nurseries for all caps. The GC
      would always call it with false, and stat_exit() always with true.
      The reason for these two modes is that the GC counts the nurseries via
      clearNurseries() (which also updates the per-cap total_allocated
      counts), so it's only the stat_exit() path that needs to count them.
      
      We now split the calcAllocated() function into two: countLargeAllocated
      and updateNurseriesStats. As the name suggests, the latter now updates
      the per-cap total_allocated counts, in addition to returning a total.
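The split described above might look roughly like this (a simplified sketch: cap_t and its fields are stand-ins for the RTS's Capability and its counters, not the real definitions):

```c
typedef struct {
    long total_allocated;   /* running per-cap total, as emitted in events */
    long nursery_used;      /* words used in this cap's nursery */
    long large_alloc;       /* large-object allocations on this cap */
} cap_t;

/* Counts only the large-object allocations (the path the GC uses). */
static long count_large_allocated(cap_t *caps, int n_caps)
{
    long sum = 0;
    for (int i = 0; i < n_caps; i++)
        sum += caps[i].large_alloc;
    return sum;
}

/* Folds each cap's nursery usage into its total_allocated and returns
 * the overall sum (the stat_exit-time path). */
static long update_nurseries_stats(cap_t *caps, int n_caps)
{
    long sum = 0;
    for (int i = 0; i < n_caps; i++) {
        caps[i].total_allocated += caps[i].nursery_used;
        sum += caps[i].nursery_used;
    }
    return sum;
}
```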
      1f809ce6
    • Add new eventlog events for various heap and GC statistics · 65aaa9b2
      Duncan Coutts authored
      They cover much the same info as is available via the GHC.Stats module
      or via the '+RTS -s' textual output, but via the eventlog and with a
      better sampling frequency.
      
      We have three new generic heap info events and two very GHC-specific
      ones. (The hope is the general ones are usable by other implementations
      that use the same eventlog system, or at least not so sensitive to
      changes in GHC itself.)
      
      The general ones are:
      
       * total heap mem allocated since prog start, on a per-HEC basis
       * current size of the heap (MBlocks reserved from OS for the heap)
       * current size of live data in the heap
      
      Currently these are all emitted by GHC at GC time (live data only at
      major GC).
      
      The GHC specific ones are:
      
       * an event giving various static heap parameters:
         * number of generations (usually 2)
         * max heap size, if any
         * nursery size
         * MBlock and block sizes
       * an event emitted on each GC containing:
         * GC generation (usually just 0,1)
         * total bytes copied
         * bytes lost to heap slop and fragmentation
         * the number of threads in the parallel GC (1 for serial)
         * the maximum number of bytes copied by any par GC thread
         * the total number of bytes copied by all par GC threads
           (these last three can be used to calculate an estimate of the
            work balance in parallel GCs)
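One way a consumer of these events could estimate the work balance from the last three fields (this exact formula is an assumption about how a tool might define balance, not necessarily what GHC's own tools compute):

```c
/* Rough parallel-GC work-balance estimate: 1.0 when every thread copied
 * an equal share of the total, 1/n when one thread did all the copying. */
static double par_gc_balance(long total_copied, long max_copied, int n_threads)
{
    if (max_copied <= 0 || n_threads <= 0)
        return 1.0;                    /* degenerate case: nothing copied */
    return (double)total_copied / ((double)max_copied * n_threads);
}
```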
      65aaa9b2
    • Calculate the total memory allocated on a per-capability basis · 8536f09c
      Duncan Coutts authored
      In addition to the existing global method. For now we just do
      it both ways and assert they give the same grand total. At some
      stage we can simplify the global method to just take the sum of
      the per-cap counters.
      8536f09c
  5. 27 Feb, 2012 2 commits
  6. 13 Feb, 2012 1 commit
    • Allocate pinned object blocks from the nursery, not the global allocator · 67f4ab7e
      Simon Marlow authored
      
      Prompted by a benchmark posted to parallel-haskell@haskell.org by
      Andreas Voellmy <andreas.voellmy@gmail.com>.  This program exhibits
      contention for the block allocator when run with -N2 and greater
      without the fix:
      
      {-# LANGUAGE MagicHash, UnboxedTuples, BangPatterns #-}
      module Main where
      
      import Control.Monad
      import Control.Concurrent
      import System.Environment
      import GHC.IO
      import GHC.Exts
      import GHC.Conc
      
      main :: IO ()
      main = do
        [m] <- fmap (fmap read) getArgs
        n <- getNumCapabilities
        ms <- replicateM n newEmptyMVar
        sequence_ [ forkIO $ busyWorkerB (m `quot` n) >> putMVar mv () | mv <- ms ]
        mapM_ takeMVar ms
      
      busyWorkerB :: Int -> IO ()
      busyWorkerB n_loops = go 0
        where go !n | n >= n_loops = return ()
                    | otherwise    = do
                _ <- IO (\s -> case newPinnedByteArray# 1024# s of
                                 (# s', _ #) -> (# s', () #))
                go (n+1)
      67f4ab7e
  7. 13 Dec, 2011 2 commits
  8. 06 Dec, 2011 1 commit
  9. 29 Nov, 2011 1 commit
    • Make profiling work with multiple capabilities (+RTS -N) · 50de6034
      Simon Marlow authored
      This means that both time and heap profiling work for parallel
      programs.  Main internal changes:
      
        - CCCS is no longer a global variable; it is now another
          pseudo-register in the StgRegTable struct.  Thus every
          Capability has its own CCCS.
      
        - There is a new built-in CCS called "IDLE", which records ticks for
          Capabilities in the idle state.  If you profile a single-threaded
          program with +RTS -N2, you'll see about 50% of time in "IDLE".
      
        - There is appropriate locking in rts/Profiling.c to protect the
          shared cost-centre-stack data structures.
      
      This patch does enough to get it working, but I have cut one big corner:
      the cost-centre-stack data structure is still shared amongst all
      Capabilities, which means that multiple Capabilities will race when
      updating the "allocations" and "entries" fields of a CCS.  Not only
      does this give unpredictable results, but it runs very slowly due to
      cache line bouncing.
      
      It is strongly recommended that you use -fno-prof-count-entries to
      disable the "entries" count when profiling parallel programs. (I shall
      add a note to this effect to the docs).
      50de6034
  10. 02 Nov, 2011 1 commit
    • Overhaul of infrastructure for profiling, coverage (HPC) and breakpoints · 7bb0447d
      Simon Marlow authored
      User visible changes
      ====================
      
      Profiling
      --------
      
      Flags renamed (the old ones are still accepted for now):
      
        OLD            NEW
        ---------      ------------
        -auto-all      -fprof-auto
        -auto          -fprof-exported
        -caf-all       -fprof-cafs
      
      New flags:
      
        -fprof-auto              Annotates all bindings (not just top-level
                                 ones) with SCCs
      
        -fprof-top               Annotates just top-level bindings with SCCs
      
        -fprof-exported          Annotates just exported bindings with SCCs
      
        -fprof-no-count-entries  Do not maintain entry counts when profiling
                                 (can make profiled code go faster; useful with
                                 heap profiling where entry counts are not used)
      
      Cost-centre stacks have a new semantics, which should in most cases
      result in more useful and intuitive profiles.  If you find this not to
      be the case, please let me know.  This is the area where I have been
      experimenting most, and the current solution is probably not the
      final version, however it does address all the outstanding bugs and
      seems to be better than GHC 7.2.
      
      Stack traces
      ------------
      
      +RTS -xc now gives more information.  If the exception originates from
      a CAF (as is common, because GHC tends to lift exceptions out to the
      top-level), then the RTS walks up the stack and reports the stack in
      the enclosing update frame(s).
      
      Result: +RTS -xc is much more useful now - but you still have to
      compile for profiling to get it.  I've played around a little with
      adding 'head []' to GHC itself, and +RTS -xc does pinpoint the problem
      quite accurately.
      
      I plan to add more facilities for stack tracing (e.g. in GHCi) in the
      future.
      
      Coverage (HPC)
      --------------
      
       * derived instances are now coloured yellow if they weren't used
       * likewise record field names
       * entry counts are more accurate (hpc --fun-entry-count)
       * tab width is now correct (markup was previously off in source with
         tabs)
      
      Internal changes
      ================
      
      In Core, the Note constructor has been replaced by
      
              Tick (Tickish b) (Expr b)
      
      which is used to represent all the kinds of source annotation we
      support: profiling SCCs, HPC ticks, and GHCi breakpoints.
      
      Depending on the properties of the Tickish, different transformations
      apply to Tick.  See CoreUtils.mkTick for details.
      
      Tickets
      =======
      
      This commit closes the following tickets, test cases to follow:
      
        - Close #2552: not a bug, but the behaviour is now more intuitive
          (test is T2552)
      
        - Close #680 (test is T680)
      
        - Close #1531 (test is result001)
      
        - Close #949 (test is T949)
      
        - Close #2466: test case has bitrotted (doesn't compile against current
          version of vector-space package)
      7bb0447d
  11. 17 Oct, 2011 1 commit
  12. 14 Apr, 2011 1 commit
    • Avoid accumulating slop in the pinned_object_block. · cc2ea98a
      Simon Marlow authored
      The pinned_object_block is where we allocate small pinned ByteArray#
      objects.  At a GC the pinned_object_block was being treated like other
      large objects and promoted to the next step/generation, even if it was
      only partly full.  Under some ByteString-heavy workloads this would
      accumulate on average 2k of slop per GC, and this memory is never
      released until the ByteArray# objects in the block are freed.
      
      So now, we keep allocating into the pinned_object_block until it is
      completely full, at which point it is handed over to the GC as before.
      The pinned_object_block might therefore contain objects with a large
      range of ages, but I don't think this is any worse than the situation
      before.  We still have the fragmentation issue in general, but the new
      scheme can improve the memory overhead for some workloads
      dramatically.
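The policy change can be sketched as follows (a minimal illustration: the block size, type and function names are stand-ins, not the RTS's):

```c
#include <stddef.h>

#define PINNED_BLOCK_BYTES 4096   /* stand-in for the real block size */

typedef struct {
    size_t used;       /* bytes allocated so far in the current block */
    int    retired;    /* handed over to the GC? */
} pinned_block_t;

/* Keep allocating from the same block until a request no longer fits;
 * only then retire it to the GC (instead of retiring every partly-full
 * block at each GC, which accumulated slop).  Returns 1 on success,
 * 0 when the caller must start a fresh block. */
static int alloc_pinned(pinned_block_t *b, size_t bytes)
{
    if (!b->retired && b->used + bytes <= PINNED_BLOCK_BYTES) {
        b->used += bytes;
        return 1;
    }
    b->retired = 1;
    return 0;
}
```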
      cc2ea98a
  13. 02 Feb, 2011 3 commits
    • fix warning · df521c32
      Simon Marlow authored
      df521c32
    • GC refactoring and cleanup · 18896fa2
      Simon Marlow authored
      Now we keep any partially-full blocks in the gc_thread[] structs after
      each GC, rather than moving them to the generation.  This should give
      us slightly better locality (though I wasn't able to measure any
      difference).
      
      Also in this patch: better sanity checking with THREADED.
      18896fa2
    • Remove the per-generation mutable lists · 32907722
      Simon Marlow authored
      Now that we use the per-capability mutable lists exclusively.
      32907722
  14. 21 Dec, 2010 1 commit
    • Count allocations more accurately · db0c13a4
      Simon Marlow authored
      The allocation stats (+RTS -s etc.) used to count the slop at the end
      of each nursery block (except the last) as allocated space, now we
      count the allocated words accurately.  This should make allocation
      figures more predictable, too.
      
      This has the side effect of reducing the apparent allocations by a
      small amount (~1%), so remember to take this into account when looking
      at nofib results.
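In outline, the change replaces "charge whole blocks" with "sum the used words per block" (a sketch under assumed simplifications; blk_t and word offsets stand in for the RTS's bdescr and pointers):

```c
#include <stddef.h>

typedef struct blk_ {
    long start;          /* first word of the block */
    long free;           /* next free word */
    long block_words;    /* block capacity in words */
    struct blk_ *link;   /* next nursery block */
} blk_t;

/* Old scheme: every block except the last was charged in full, so the
 * slop at the end of each block counted as allocation. */
static long allocated_old(blk_t *bs)
{
    long n = 0;
    for (blk_t *b = bs; b != NULL; b = b->link)
        n += b->link ? b->block_words : (b->free - b->start);
    return n;
}

/* New scheme: charge only the words actually used in each block. */
static long allocated_new(blk_t *bs)
{
    long n = 0;
    for (blk_t *b = bs; b != NULL; b = b->link)
        n += b->free - b->start;
    return n;
}
```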
      db0c13a4
  15. 15 Dec, 2010 1 commit
    • Implement stack chunks and separate TSO/STACK objects · f30d5273
      Simon Marlow authored
      This patch makes two changes to the way stacks are managed:
      
      1. The stack is now stored in a separate object from the TSO.
      
      This means that it is easier to replace the stack object for a thread
      when the stack overflows or underflows; we don't have to leave behind
      the old TSO as an indirection any more.  Consequently, we can remove
      ThreadRelocated and deRefTSO(), which were a pain.
      
      This is obviously the right thing, but the last time I tried to do it
      it made performance worse.  This time I seem to have cracked it.
      
      2. Stacks are now represented as a chain of chunks, rather than
         a single monolithic object.
      
      The big advantage here is that individual chunks are marked clean or
      dirty according to whether they contain pointers to the young
      generation, and the GC can avoid traversing clean stack chunks during
      a young-generation collection.  This means that programs with deep
      stacks will see a big saving in GC overhead when using the default GC
      settings.
      
      A secondary advantage is that there is much less copying involved as
      the stack grows.  Programs that quickly grow a deep stack will see big
      improvements.
      
      In some ways the implementation is simpler, as nothing special needs
      to be done to reclaim stack as the stack shrinks (the GC just recovers
      the dead stack chunks).  On the other hand, we have to manage stack
      underflow between chunks, so there's a new stack frame
      (UNDERFLOW_FRAME), and we now have separate TSO and STACK objects.
      The total amount of code is probably about the same as before.
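The clean/dirty chunk optimisation can be sketched like this (an illustrative model only; the real STACK objects, underflow frames and dirty bits in the RTS are more involved):

```c
#include <stddef.h>

/* Each STACK object holds one chunk; an UNDERFLOW frame at the bottom of
 * a chunk leads to the next (older) chunk. */
typedef struct stack_chunk_ {
    int dirty;                       /* may point into the young generation? */
    struct stack_chunk_ *underflow;  /* older chunk, or NULL at the bottom */
} stack_chunk_t;

/* A young-generation collection only needs to traverse dirty chunks;
 * clean chunks of a deep stack are skipped entirely. */
static int chunks_to_scan(stack_chunk_t *top)
{
    int n = 0;
    for (stack_chunk_t *c = top; c != NULL; c = c->underflow)
        if (c->dirty)
            n++;
    return n;
}
```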
      
      There are new RTS flags:
      
         -ki<size> Sets the initial thread stack size (default 1k), e.g. -ki4k or -ki2m
         -kc<size> Sets the stack chunk size (default 32k)
         -kb<size> Sets the stack chunk buffer size (default 1k)
      
      -ki was previously called just -k, and the old name is still accepted
      for backwards compatibility.  These new options are documented.
      f30d5273
  16. 27 Aug, 2010 1 commit
  17. 28 Jun, 2010 1 commit
  18. 04 Jun, 2010 1 commit
  19. 09 May, 2010 1 commit
  20. 06 May, 2010 1 commit
  21. 29 Mar, 2010 4 commits
    • Move a thread to the front of the run queue when another thread blocks on it · 2726a2f1
      Simon Marlow authored
      This fixes #3838, and was made possible by the new BLACKHOLE
      infrastructure.  To allow reordering of the run queue I had to make it
      doubly-linked, which entails some extra trickiness with regard to
      GC write barriers and suchlike.
      2726a2f1
    • tiny GC optimisation · 1373cd30
      Simon Marlow authored
      1373cd30
    • New implementation of BLACKHOLEs · 5d52d9b6
      Simon Marlow authored
      This replaces the global blackhole_queue with a clever scheme that
      enables us to queue up blocked threads on the closure that they are
      blocked on, while still avoiding atomic instructions in the common
      case.
      
      Advantages:
      
       - gets rid of a locked global data structure and some tricky GC code
         (replacing it with some per-thread data structures and different
         tricky GC code :)
      
       - wakeups are more prompt: parallel/concurrent performance should
         benefit.  I haven't seen anything dramatic in the parallel
         benchmarks so far, but a couple of threading benchmarks do improve
         a bit.
      
       - waking up a thread blocked on a blackhole is now O(1) (e.g. if
         it is the target of throwTo).
      
       - less sharing and better separation of Capabilities: communication
         is done with messages, the data structures are strictly owned by a
         Capability and cannot be modified except by sending messages.
      
       - this change will ultimately enable us to do more intelligent
         scheduling when threads block on each other.  This is what started
         off the whole thing, but it isn't done yet (#3838).
      
      I'll be documenting all this on the wiki in due course.
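The shape of a per-closure blocking queue can be sketched like this (names and types are illustrative stand-ins; the real RTS uses MessageBlackHole objects and TSO fields, not these structs):

```c
#include <stddef.h>

typedef struct waiter_ {
    int tso_id;              /* stand-in for the blocked thread */
    struct waiter_ *next;
} waiter_t;

/* Blocked threads hang off the blackholed closure itself rather than a
 * single global blackhole_queue, so the waiters of one closure can be
 * found in O(1) (e.g. by throwTo targeting a blocked thread). */
typedef struct {
    waiter_t *blocking_queue;
} blackhole_t;

static void block_on(blackhole_t *bh, waiter_t *w)
{
    w->next = bh->blocking_queue;
    bh->blocking_queue = w;
}

/* When the closure is finally updated, wake everyone on its local queue;
 * returns the number of threads woken. */
static int wake_all(blackhole_t *bh)
{
    int n = 0;
    for (waiter_t *w = bh->blocking_queue; w != NULL; w = w->next)
        n++;
    bh->blocking_queue = NULL;
    return n;
}
```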
      5d52d9b6
    • Simon Marlow's avatar
      e74936e3
  22. 25 Mar, 2010 1 commit
  23. 31 Dec, 2009 1 commit
  24. 07 Dec, 2009 1 commit
  25. 04 Dec, 2009 1 commit
  26. 03 Dec, 2009 2 commits
    • GC refactoring, remove "steps" · 214b3663
      Simon Marlow authored
      The GC had a two-level structure, G generations each of T steps.
      Steps are for aging within a generation, mostly to avoid premature
      promotion.  
      
      Measurements show that more than 2 steps is almost never worthwhile,
      and 1 step is usually worse than 2.  In theory fractional steps are
      possible, so the ideal number of steps is somewhere between 1 and 3.
      GHC's default has always been 2.
      
      We can implement 2 steps quite straightforwardly by having each block
      point to the generation to which objects in that block should be
      promoted, so blocks in the nursery point to generation 0, and blocks
      in gen 0 point to gen 1, and so on.
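A minimal sketch of the "block points at its promotion target" idea (types heavily simplified; in GHC the field lives in the real bdescr structure):

```c
typedef struct generation_ {
    int no;                          /* generation number */
    struct generation_ *to;          /* where live objects promote to */
} generation_t;

typedef struct {
    generation_t *dest;  /* objects in this block evacuate to dest */
} block_t;

/* Two-step aging falls out of the chain: nursery blocks point at gen 0,
 * gen 0 blocks at gen 1, and the oldest generation points at itself. */
static int promotion_target(const block_t *b)
{
    return b->dest->no;
}
```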
      
      This commit removes the explicit step structures, merging generations
      with steps, thus simplifying a lot of code.  Performance is
      unaffected.  The tunable number of steps is now gone, although it may
      be replaced in the future by a way to tune the aging in generation 0.
      214b3663
    • add a missing lock around allocGroup() · 8bc3f028
      Simon Marlow authored
      8bc3f028
  27. 02 Dec, 2009 2 commits
  28. 01 Dec, 2009 1 commit
    • Make allocatePinned use local storage, and other refactorings · 5270423a
      Simon Marlow authored
      This is a batch of refactoring to remove some of the GC's global
      state, as we move towards CPU-local GC.  
      
        - allocateLocal() now allocates large objects into the local
          nursery, rather than taking a global lock and allocating
          them in gen 0 step 0.
      
        - allocatePinned() was still allocating from global storage and
          taking a lock each time, now it uses local storage. 
          (mallocForeignPtrBytes should be faster with -threaded).
          
        - We had a gen 0 step 0, distinct from the nurseries, which are
          stored in a separate nurseries[] array.  This is slightly strange.
          I removed the g0s0 global that pointed to gen 0 step 0, and
          removed all uses of it.  I think now we don't use gen 0 step 0 at
          all, except possibly when there is only one generation.  Possibly
          more tidying up is needed here.
      
        - I removed the global allocate() function, and renamed
          allocateLocal() to allocate().
      
        - the alloc_blocks global is gone.  MAYBE_GC() and
          doYouWantToGC() now check the local nursery only.
      5270423a