1. 22 Jan, 2010 1 commit
      When acquiring a spinlock, yieldThread() every 1000 spins (#3553, #3758) · 65ac2f4c
      Simon Marlow authored
      This helps when the thread holding the lock has been descheduled,
      which is the main cause of the "last-core slowdown" problem.  With
      this patch, I get much better results with -N8 on an 8-core box,
      although some benchmarks are still worse than with 7 cores.
      
      I also added a yieldThread() into the any_work() loop of the parallel
      GC when it has no work to do. Oddly, this seems to improve performance
      on the parallel GC benchmarks even when all the cores are busy.
      Perhaps it is due to reducing contention on the memory bus.
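A minimal sketch of the spin-then-yield idea described in this commit, not the RTS's actual spinlock code. The names `spin_lock`, `spin_acquire`, `spin_release` and `SPIN_COUNT` are hypothetical, and POSIX `sched_yield()` stands in for the RTS's `yieldThread()`.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical constant: spins between yields, taken from the commit title. */
#define SPIN_COUNT 1000

/* Hypothetical lock type: 0 = free, 1 = held. */
typedef struct {
    atomic_int held;
} spin_lock;

static void spin_init(spin_lock *l) {
    atomic_init(&l->held, 0);
}

static void spin_acquire(spin_lock *l) {
    for (;;) {
        /* Busy-wait for a bounded number of attempts. */
        for (int i = 0; i < SPIN_COUNT; i++) {
            if (atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0) {
                return; /* previous value was 0, so we now own the lock */
            }
        }
        /* The holder may have been descheduled: give up the CPU so it can run. */
        sched_yield();
    }
}

static void spin_release(spin_lock *l) {
    assert(atomic_load(&l->held) == 1); /* releasing a lock that is not held is a bug */
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

Yielding only after a bounded number of failed attempts keeps the uncontended path free of system calls while still letting a descheduled lock holder make progress.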
  2. 31 Dec, 2009 1 commit
  3. 17 Dec, 2009 1 commit
      Fix #650: use a card table to mark dirty sections of mutable arrays · 0417404f
      Simon Marlow authored
      The card table is an array of bytes, placed directly following the
      actual array data.  This means that array reading is unaffected, but
      array writing needs to read the array size from the header in order to
      find the card table.
      
      We use a bytemap rather than a bitmap, because updating the card table
      must be multi-thread safe.  Each byte refers to 128 entries of the
      array, but this is tunable by changing the constant
      MUT_ARR_PTRS_CARD_BITS in includes/Constants.h.
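The layout described in this commit could be sketched roughly as follows. This is an illustration under assumptions, not the RTS's real `StgMutArrPtrs` definition: the struct `mut_array`, the helpers, and the constant name `CARD_BITS` are hypothetical, with the value 7 chosen so that one card byte covers 128 entries as the message states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for MUT_ARR_PTRS_CARD_BITS: 1 << 7 = 128 entries per card. */
#define CARD_BITS 7

/* Hypothetical array object: header, then the elements, then the card bytes. */
typedef struct {
    size_t n_elems;   /* array size, kept in the header */
    void  *payload[]; /* n_elems pointers, followed directly by the card table */
} mut_array;

/* Number of card bytes needed to cover n entries. */
static size_t card_count(size_t n) {
    return (n + ((size_t)1 << CARD_BITS) - 1) >> CARD_BITS;
}

/* The card table starts right after the last element, so finding it
   requires reading the size from the header. */
static uint8_t *card_table(mut_array *a) {
    return (uint8_t *)&a->payload[a->n_elems];
}

static mut_array *array_new(size_t n) {
    mut_array *a = calloc(1, sizeof *a + n * sizeof(void *) + card_count(n));
    if (a != NULL) {
        a->n_elems = n;
    }
    return a;
}

/* Reads index payload directly; only writes touch the card table. */
static void array_write(mut_array *a, size_t i, void *v) {
    assert(i < a->n_elems);
    a->payload[i] = v;
    /* Storing a whole byte needs no read-modify-write, unlike setting one
       bit in a shared word; this is the motivation given for a bytemap. */
    card_table(a)[i >> CARD_BITS] = 1;
}
```

With this layout, a collector only has to rescan the 128-entry sections whose card byte is set rather than the whole array.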
  4. 12 Dec, 2009 1 commit
      Expose all EventLog events as DTrace probes · 015d3d46
      chak@cse.unsw.edu.au. authored
      - Defines a DTrace provider, called 'HaskellEvent', that provides a probe
        for every event of the eventlog framework.
      - In contrast to the original eventlog, the DTrace probes are available in
        all flavours of the runtime system (DTrace probes have virtually no
        overhead if not enabled); when -DTRACING is defined both the regular
        event log as well as DTrace probes can be used.
      - Currently, Mac OS X only.  User-space DTrace probes are implemented
        differently on Mac OS X than in the original DTrace implementation.
        Nevertheless, it shouldn't be too hard to enable these probes on other
        platforms, too.
      - Documentation is at http://hackage.haskell.org/trac/ghc/wiki/DTrace
  5. 09 Dec, 2009 1 commit
  6. 08 Dec, 2009 1 commit
  7. 04 Dec, 2009 1 commit
  8. 03 Dec, 2009 1 commit
      GC refactoring, remove "steps" · 214b3663
      Simon Marlow authored
      The GC had a two-level structure, G generations each of T steps.
      Steps are for aging within a generation, mostly to avoid premature
      promotion.  
      
      Measurements show that more than 2 steps is almost never worthwhile,
      and 1 step is usually worse than 2.  In theory fractional steps are
      possible, so the ideal number of steps is somewhere between 1 and 3.
      GHC's default has always been 2.
      
      We can implement 2 steps quite straightforwardly by having each block
      point to the generation to which objects in that block should be
      promoted, so blocks in the nursery point to generation 0, and blocks
      in gen 0 point to gen 1, and so on.
      
      This commit removes the explicit step structures, merging generations
      with steps, thus simplifying a lot of code.  Performance is
      unaffected.  The tunable number of steps is now gone, although it may
      be replaced in the future by a way to tune the aging in generation 0.
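The promotion scheme described in this commit can be sketched as below. All names (`generation`, `block_desc`, `nursery_block`, `gen_block`, `evacuate_to`, `N_GENS`) are hypothetical and do not reflect the RTS's real block descriptor. Having blocks of the oldest generation point back to that same generation is an assumption of this sketch; the commit message does not say what happens there.

```c
#include <assert.h>
#include <stddef.h>

#define N_GENS 3 /* hypothetical number of generations */

typedef struct {
    int no; /* generation number */
} generation;

/* Hypothetical block descriptor: each block records where its survivors go. */
typedef struct {
    generation *dest;
} block_desc;

static generation gens[N_GENS];

static void init_gens(void) {
    for (int i = 0; i < N_GENS; i++) {
        gens[i].no = i;
    }
}

/* Nursery blocks point to generation 0. */
static block_desc nursery_block(void) {
    block_desc bd = { .dest = &gens[0] };
    return bd;
}

/* Blocks in generation g point to generation g + 1.
   Assumption: the oldest generation promotes into itself. */
static block_desc gen_block(int g) {
    assert(g >= 0 && g < N_GENS);
    int d = (g + 1 < N_GENS) ? g + 1 : g;
    block_desc bd = { .dest = &gens[d] };
    return bd;
}

/* In the GC inner loop the destination is a single field read. */
static generation *evacuate_to(const block_desc *bd) {
    return bd->dest;
}
```

An object allocated in the nursery is copied into generation 0 at its first collection and into generation 1 only if it survives a second one, which gives the two-step aging without a separate step structure.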
  9. 01 Dec, 2009 2 commits
      Make allocatePinned use local storage, and other refactorings · 5270423a
      Simon Marlow authored
      This is a batch of refactoring to remove some of the GC's global
      state, as we move towards CPU-local GC.  
      
        - allocateLocal() now allocates large objects into the local
          nursery, rather than taking a global lock and allocating
          them in gen 0 step 0.
      
        - allocatePinned() was still allocating from global storage and
          taking a lock each time, now it uses local storage. 
          (mallocForeignPtrBytes should be faster with -threaded).
          
        - We had a gen 0 step 0, distinct from the nurseries, which are
          stored in a separate nurseries[] array.  This is slightly strange.
          I removed the g0s0 global that pointed to gen 0 step 0, and
          removed all uses of it.  I think now we don't use gen 0 step 0 at
          all, except possibly when there is only one generation.  Possibly
          more tidying up is needed here.
      
        - I removed the global allocate() function, and renamed
          allocateLocal() to allocate().
      
        - the alloc_blocks global is gone.  MAYBE_GC() and
          doYouWantToGC() now check the local nursery only.
      063b822b
  10. 30 Nov, 2009 1 commit
      Implement a new heap-tuning option: -H · 32395093
      Simon Marlow authored
      -H alone causes the RTS to use a larger nursery, but without exceeding
      the amount of memory that the application is already using.  It trades
      off GC time against locality: the default setting is to use a
      fixed-size 512k nursery, but this is sometimes worse than using a very
      large nursery despite the worse locality.
      
      Not all programs get faster, but some programs that use large heaps do
      much better with -H.  e.g. this helps a lot with #3061 (binary-trees),
      though not as much as specifying -H<large>.  Typically using -H<large>
      is better than plain -H, because the runtime doesn't know ahead of
      time how much memory you want to use.
      
      Should -H be on by default?  I'm not sure, it makes some programs go
      slower, but others go faster.
  11. 29 Nov, 2009 1 commit
      Store a destination step in the block descriptor · f9d15f9f
      Simon Marlow authored
      At the moment, this just saves a memory reference in the GC inner loop
      (worth a percent or two of GC time).  Later, it will hopefully let me
      experiment with partial steps, and simplifying the generation/step
      infrastructure.
  12. 25 Nov, 2009 2 commits
      threadStackOverflow: check whether stack squeezing released some stack (#3677) · d2c874dc
      Simon Marlow authored
      In a stack overflow situation, stack squeezing may reduce the stack
      size, but we don't know whether it has been reduced enough for the
      stack check to succeed if we try again.  Fortunately stack squeezing
      is idempotent, so all we need to do is record whether *any* squeezing
      happened.  If we are at the stack's absolute -K limit, and stack
      squeezing happened, then we try running the thread again.
      
      We also want to avoid enlarging the stack if squeezing has already
      released some of it.  However, we don't want to get into a
      pathological situation where a thread has a nearly full stack (near
      its current limit, but not near the absolute -K limit), keeps
      allocating a little bit, squeezing removes a little bit, and then it
      runs again.  So to avoid this, if we squeezed *and* there is still
      less than BLOCK_SIZE_W words free, then we enlarge the stack anyway.
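The decision rule in this commit message can be summarised as a small pure function. This is a sketch of the logic as described, not the body of `threadStackOverflow`: the function `on_stack_overflow`, the enum, and the parameter names are hypothetical, the value given to `BLOCK_SIZE_W` is a placeholder, and raising an overflow when the thread is at the -K limit and nothing was squeezed is an assumption about the case the message leaves implicit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder value for illustration; the real constant is defined by the RTS. */
#define BLOCK_SIZE_W 512
static_assert(BLOCK_SIZE_W > 0, "free-space threshold must be positive");

typedef enum {
    RETRY_THREAD,   /* run the thread again without growing the stack */
    ENLARGE_STACK,  /* allocate a bigger stack */
    RAISE_OVERFLOW  /* assumed outcome: report a stack overflow */
} overflow_action;

/* squeezed:     did stack squeezing release any stack at all?
   at_max_limit: is the stack already at its absolute -K limit?
   free_words:   words currently free on the stack */
static overflow_action on_stack_overflow(bool squeezed, bool at_max_limit,
                                         size_t free_words) {
    if (at_max_limit) {
        /* Cannot grow: retrying only makes sense if squeezing freed something. */
        return squeezed ? RETRY_THREAD : RAISE_OVERFLOW;
    }
    if (squeezed && free_words >= BLOCK_SIZE_W) {
        /* Squeezing released a useful amount, so avoid enlarging. */
        return RETRY_THREAD;
    }
    /* Either nothing was squeezed, or too little is free: enlarging anyway
       avoids the squeeze-a-little, allocate-a-little loop. */
    return ENLARGE_STACK;
}
```

Because squeezing is idempotent, a single boolean recording whether any squeezing happened is enough input for the retry decision.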
      add a comment to TSO_MARKED · 69ba3e6b
      Simon Marlow authored
  13. 19 Nov, 2009 1 commit
  14. 18 Nov, 2009 1 commit
  15. 14 Nov, 2009 1 commit
  16. 12 Nov, 2009 1 commit
  17. 11 Nov, 2009 1 commit
      Second attempt to fix #1185 (forkProcess and -threaded) · 2d5e052d
      Simon Marlow authored
      Patch 1/2: second part of the patch is to libraries/base
      
      This time without dynamic linker hacks; instead I've expanded the
      existing rts/Globals.c to cache more CAFs, specifically those in
      GHC.Conc.  We were already using this trick for signal handlers; I
      should have realised before.
      
      It's still quite unsavoury, but we can do away with rts/Globals.c in
      the future when we switch to a dynamically-linked GHCi.
  18. 05 Nov, 2009 1 commit
  19. 15 Oct, 2009 1 commit
  20. 14 Oct, 2009 2 commits
      micro-opt: replace stmGetEnclosingTRec() with a field access · 0856ac59
      Simon Marlow authored
      While fixing #3578 I noticed that this function was just a field
      access to StgTRecHeader, so I inlined it manually.
      Fixes for cross-compiling to a different word size · 79e9cfa3
      Simon Marlow authored
      This patch eliminates a couple of places where we were assuming that
      the host word size is the same as the target word size.
      
      Also a little refactoring: Constants now exports the types TargetInt
      and TargetWord corresponding to the Int/Word type on the target
      platform, and I moved the definitions of tARGET_INT_MAX and friends
      from Literal to Constants.
      
      Thanks to Barney Stratford <barney_stratford@fastmail.fm> for helping
      track down the problem and fix it.  We now know that GHC can
      successfully cross-compile from 32-bit to 64-bit.
  21. 08 Oct, 2009 1 commit
  22. 29 Sep, 2009 1 commit
  23. 28 Sep, 2009 1 commit
  24. 25 Sep, 2009 1 commit
      Add a way to generate tracing events programmatically · 5407ad8e
      Simon Marlow authored
      added:
      
       primop  TraceEventOp "traceEvent#" GenPrimOp
         Addr# -> State# s -> State# s
         { Emits an event via the RTS tracing framework.  The contents
           of the event is the zero-terminated byte string passed as the first
           argument.  The event will be emitted either to the .eventlog file,
           or to stderr, depending on the runtime RTS flags. }
      
      and added the required RTS functionality to support it.  Also a bit of
      refactoring in the RTS tracing code.
  25. 18 Sep, 2009 2 commits
  26. 15 Sep, 2009 1 commit
      Event tracing: put the capability in the block marker, omit it from the events · e459b0d1
      Simon Marlow authored
      This makes events smaller and tracing quicker, and speeds up reading
      and sorting the trace file.
      
      HEADS UP: this changes the format of event log files.  Corresponding
      changes to the ghc-events package are required (and will be pushed
      soon).  Normally we would make backwards-compatible changes, but this
      changes the format of every event (to remove the capability) so I'm
      breaking the rules this time.  This will be the only time we can do
      this, since the format becomes public in 6.12.1.
  27. 13 Sep, 2009 1 commit
      Add event block markers · d7161575
      Simon Marlow authored
      These indicate the size and time span of a sequence of events in the
      event log, to make it easier to sort and navigate a large event log.
  28. 12 Sep, 2009 1 commit
  29. 15 Sep, 2009 1 commit
      Improve the default parallel GC settings, and sanitise the flags (#3340) · 53628e91
      Simon Marlow authored
      Flags (from +RTS -?):
      
        -qg[<n>]  Use parallel GC only for generations >= <n>
                  (default: 0, -qg alone turns off parallel GC)
        -qb[<n>]  Use load-balancing in the parallel GC only for generations >= <n>
                  (default: 1, -qb alone turns off load-balancing)
      
      These are good defaults for most parallel programs.  Single-threaded
      programs that want to make use of parallel GC will probably want +RTS
      -qg1 (this is documented).
      
      I've also updated the docs.
  30. 10 Sep, 2009 2 commits
  31. 09 Sep, 2009 1 commit
  32. 29 Aug, 2009 3 commits
      Unify event logging and debug tracing. · a5288c55
      Simon Marlow authored
        - tracing facilities are now enabled with -DTRACING, and -DDEBUG
          additionally enables debug-tracing.  -DEVENTLOG has been
          removed.
      
        - -debug now implies -eventlog
      
        - events can be printed to stderr instead of being sent to the
          binary .eventlog file by adding +RTS -v (which is implied by the
          +RTS -Dx options).
      
        - -Dx debug messages can be sent to the binary .eventlog file
          by adding +RTS -l.  This should help debugging by reducing
          the impact of debug tracing on execution time.
      
        - Various debug messages that duplicated the information in events
          have been removed.
      add RTS_PRIVATE attribute · 90d1dedd
      Simon Marlow authored
      Fix incorrectly hidden RTS symbols · 254528e3
      Simon Marlow authored
  33. 05 Aug, 2009 1 commit